Performance test on 2011-10-10

This test compares several settings in the job configuration of a
map/reduce program.

JOB DESCRIPTION:

The task is to count the word frequencies of a pre-processed input file of
around 17G. The words come from a local copy of the English edition of
Wikipedia. The output is only 92M in size. The program mr_wordfreq was used.
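
mr_wordfreq is one of the Plasma map/reduce example programs. Its source is
not reproduced here; the following OCaml fragment is only a rough,
hypothetical sketch of the computation (plain OCaml, not the Plasma job API),
just to make explicit what counting word frequencies means in map/reduce
terms:

  (* Hypothetical sketch only: plain OCaml, not the Plasma job API.
     "map" turns each word into a (key, value) pair; "reduce" sums the
     values that were emitted for the same key. *)

  let map_word w = (w, 1)

  let reduce pairs =
    let tbl = Hashtbl.create 65536 in
    List.iter
      (fun (w, n) ->
         let old = try Hashtbl.find tbl w with Not_found -> 0 in
         Hashtbl.replace tbl w (old + n))
      pairs;
    tbl

  (* Sequential stand-in for the whole job: in the real run, map is
     executed on the input fragments in parallel, and the sort/shuffle
     rounds bring equal keys together before they are summed. *)
  let count_words words =
    reduce (List.map map_word words)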

HARDWARE:

- Two nodes with an Opteron 1354 CPU (4-core, 2.2 GHz), and 8G of RAM
- The nodes store the volume data on a RAID-1 array. The disks are not
  particularly fast, but "normal" 7500 rpm disks connected via SATA-300.
- The PostgreSQL database resides on an SSD with 90G capacity
  (OCZ Agility 3). There is only one namenode.
- The two nodes are connected with a gigabit network.

SOFTWARE:

Plasma revision 437.

PARAMETERS:

        partitions  merge_limit  split_limit  enhanced_mapping       runtime (sec)
Run #1  4           64           4            4                      3200
Run #2  4           64           4            1                      3535
Run #3  16          64           4            16                     3256
Run #4  4           64           4            none (no Emap tasks)   3112

The parameters follow different ideas, and interestingly the number of data
copies performed differs considerably between the runs. Nevertheless, there
is almost no difference between runs #1, #3, and #4, so it can be assumed
that the positive and negative effects of each configuration roughly
compensate for each other. Only #2 is a bit slower.
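
The task counts in the round listings below can be roughly predicted from
merge_limit and from whether the intermediate files are already split by
partition. The following OCaml fragment is only a back-of-the-envelope
reconstruction from the observed numbers, not Plasma's actual scheduling
code; it assumes that a single shuffle task merges at most merge_limit files
and that already partitioned data is merged per partition:

  (* Rough estimate of the number of shuffle tasks in a merge round;
     reconstructed from the observed task counts, not Plasma's actual
     scheduling code. *)
  let shuffle_tasks ~files ~merge_limit ~partitions ~already_split =
    let ceil_div a b = (a + b - 1) / b in
    if already_split then
      (* files are already tagged with their partition, so each
         partition is merged independently *)
      partitions * ceil_div (files / partitions) merge_limit
    else
      (* unpartitioned files are merged together; each task splits its
         output into one file per partition *)
      ceil_div files merge_limit

  (* e.g. round 2 of run #1:
     shuffle_tasks ~files:1324 ~merge_limit:64 ~partitions:4
                   ~already_split:true
     = 4 * ceil(331 / 64) = 4 * 6 = 24 tasks *)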

-

Run #1: This configuration splits the data up into partitions directly
after the map and sort steps:

Round 1: 48 input fragments -> 48 * Emap -> 1324 intermediate files
Round 2: 1324 interm. files -> 24 * Shuffle (only merge) -> 24 interm. files
Round 3: 24 intermediate files -> 4 * Shuffle (only merge) -> 4 output files

-

Run #2: Just to check whether the splitting within Emap has an effect,
we omit it here. Basically, the splitting is moved to round 2.

Round 1: 48 input fragments -> 48 * Emap -> 331 intermediate files
Round 2: 331 interm. files -> 6 * Shuffle (merge+split) -> 24 interm. files
Round 3: 24 intermediate files -> 4 * Shuffle (only merge) -> 4 output files
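
(Assuming the fan-in rule sketched above, these counts are what one would
expect: ceil(331 / 64) = 6 merge+split tasks, each writing one file per
partition, i.e. 6 * 4 = 24 intermediate files; round 3 then merges the 6
files of each partition with one task per partition.)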

The difference here is that in run #2 rounds 2 and 3 cannot overlap with
each other: because every merge+split task contributes one file to every
partition, round 3 can only start once round 2 is completely finished.
During this transition the cluster is under-utilized. In run #1, however,
rounds 2 and 3 can partially overlap because the data is already split into
partitions (i.e. the processing for one partition can still be in round 2
while another partition has already progressed to round 3).

-

Run #3: We want to check here whether creating too many partitions costs
performance (compared with run #1). This has a negative effect because the
data is generally processed in smaller units (e.g. more namenode
interaction). There could also be a positive effect, though, because the
more fine-grained processing scheme could lead to higher utilization of the
CPU resources.

Round 1: 48 input fragments -> 48 * Emap -> 5280 intermediate files
Round 2: 5280 interm. files -> 96 * Shuffle (only merge) -> 96 interm. files
Round 3: 96 interm. files -> 16 * Shuffle (only merge) -> 16 output files
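
(The same pattern as before: 5280 / 16 = 330 files per partition,
ceil(330 / 64) = 6 merges per partition, i.e. 96 tasks in round 2; the
resulting 6 files per partition are then merged by one task per partition,
i.e. 16 tasks in round 3.)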

As expected, the average number of running tasks is slightly higher than in
run #1 (12.1 vs. 11.1), so there is indeed an effect leading to better CPU
utilization. However, the expected negative effects still slightly outweigh
this gain.

-

Run #4: This was the standard configuration before the introduction of
Emap tasks. Interestingly, it is still the fastest!

Round 1: 48 input fragments -> 48 * Map -> 379 intermediate files
Round 2: 379 interm. files -> 379 * Sort -> 379 interm. files
Round 3: 379 interm. files -> 6 * Shuffle (merge+split) -> 24 interm. files
Round 4: 24 intermediate files -> 4 * Shuffle (only merge) -> 4 output files
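
(The arithmetic is as in run #2: ceil(379 / 64) = 6 merge+split tasks
produce 6 * 4 = 24 intermediate files; the final round merges the 6 files
of each partition with one task per partition.)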

Note that rounds 3 and 4 have the same layout as rounds 2 and 3 in run #2.
Also, this run creates an additional set of intermediate files.
Nevertheless, it is the fastest configuration, so there must be a specific
advantage. As it turns out, this run avoids a problem shared by all previous
runs: there is no under-utilization between the map/sort rounds and the
shuffle rounds. While sorts are still running, the system can already start
shuffle tasks. This is a very good combination: sorting is purely CPU-bound,
whereas shuffling is usually I/O-bound.

THIS CONCLUSION IS QUITE QUESTIONABLE! THERE IS PROBABLY ANOTHER, STILL
UNNOTICED EFFECT!
