Performance test on 2011-10-10

This test compares several settings in the job configuration of a
map/reduce program.

JOB DESCRIPTION:

The task is to count the word frequencies of a pre-processed input file of
around 17G. The words come from a local copy of the English edition of
Wikipedia. The output is only 92M in size. The program mr_wordfreq was
used.

HARDWARE:

- Two nodes, each with an Opteron 1354 CPU (4-core, 2.2 GHz) and 8G of RAM.
- The nodes store the volume data on a RAID-1 array. The disks are not
  particularly fast, just "normal" 7500 rpm disks connected via SATA-300.
- The PostgreSQL database resides on an SSD with 90G capacity (OCZ
  Agility 3). There is only one namenode.
- The two nodes are connected with a gigabit network.

SOFTWARE:

Plasma revision 437.

PARAMETERS:

          partitions  merge_limit  split_limit  enhanced_mapping      runtime (sec)
  Run #1      4           64            4              4                  3200
  Run #2      4           64            4              1                  3535
  Run #3     16           64            4             16                  3256
  Run #4      4           64            4      none (no Emap tasks)       3112

The configurations follow different ideas, and interestingly the number of
data copies performed differs considerably between them. Nevertheless,
there is almost no difference in runtime between runs #1, #3, and #4, so it
can be assumed that the positive and negative effects of each configuration
compensate for each other. Only #2 is a bit slower.
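The task counts in the round listings below follow from partitions and
merge_limit. Here is a minimal plausibility-check sketch in OCaml. It rests
on an assumption of mine (consistent with the numbers below, but not taken
from the Plasma sources): a shuffle task merges at most merge_limit
intermediate files, and files belonging to different partitions are never
merged together.

  (* Assumption (mine, not from the Plasma sources): a shuffle task
     merges at most merge_limit files, one partition at a time. *)

  let shuffle_tasks ~partitions ~merge_limit n_files =
    (* files are spread evenly over the partitions; each partition needs
       ceil(files_per_partition / merge_limit) merge tasks *)
    let per_partition = (n_files + partitions - 1) / partitions in
    let tasks_per_partition =
      (per_partition + merge_limit - 1) / merge_limit in
    partitions * tasks_per_partition

  let () =
    (* run #1, round 2: 1324 files over 4 partitions -> 24 tasks *)
    Printf.printf "%d\n" (shuffle_tasks ~partitions:4 ~merge_limit:64 1324);
    (* run #3, round 2: 5280 files over 16 partitions -> 96 tasks *)
    Printf.printf "%d\n" (shuffle_tasks ~partitions:16 ~merge_limit:64 5280);
    (* run #4, round 3: 379 not-yet-split files -> 6 merge+split tasks *)
    Printf.printf "%d\n" (shuffle_tasks ~partitions:1 ~merge_limit:64 379)

Under this assumption the sketch reproduces the 24, 96, and 6 shuffle tasks
seen in the round listings of runs #1, #3, and #4.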
- Run #1: This configuration splits the data up into partitions directly
  after doing the map and sort:

    Round 1: 48 input fragments -> 48 * Emap -> 1324 intermediate files
    Round 2: 1324 interm. files -> 24 * Shuffle (only merge) -> 24 interm. files
    Round 3: 24 interm. files -> 4 * Shuffle (only merge) -> 4 output files

- Run #2: Just to check whether the splitting within Emap has an effect, we
  do without it here. Essentially, the splitting is moved to round 2:

    Round 1: 48 input fragments -> 48 * Emap -> 331 intermediate files
    Round 2: 331 interm. files -> 6 * Shuffle (merge+split) -> 24 interm. files
    Round 3: 24 interm. files -> 4 * Shuffle (only merge) -> 4 output files

  The difference is that in run #2 rounds 2 and 3 cannot overlap: round 3
  cannot start until round 2 is finished, and at that moment the cluster is
  under-utilized. In run #1, however, rounds 2 and 3 can partially overlap
  because the data is already split into partitions (i.e. the processing for
  one partition can still be in round 2 while another partition has already
  progressed to round 3).

- Run #3: Here we want to check whether it costs performance when too many
  partitions are created (compared with run #1). This has a negative effect
  because the data is generally processed in smaller units (e.g. more
  namenode interaction). There could also be a positive effect, though: the
  more fine-grained processing scheme could lead to higher utilization of
  the CPU resources.

    Round 1: 48 input fragments -> 48 * Emap -> 5280 intermediate files
    Round 2: 5280 interm. files -> 96 * Shuffle (only merge) -> 96 interm. files
    Round 3: 96 interm. files -> 16 * Shuffle (only merge) -> 16 output files

  As expected, the average number of running tasks is slightly higher than
  in run #1 (12.1 vs. 11.1), so there is indeed an effect leading to better
  CPU utilization. However, the negative effects still outweigh it slightly.

- Run #4: This was the standard configuration before the introduction of
  Emap tasks. Interestingly, it is still the fastest!

    Round 1: 48 input fragments -> 48 * Map -> 379 intermediate files
    Round 2: 379 interm. files -> 379 * Sort -> 379 interm. files
    Round 3: 379 interm. files -> 6 * Shuffle (merge+split) -> 24 interm. files
    Round 4: 24 interm. files -> 4 * Shuffle (only merge) -> 4 output files

  Note that rounds 3 and 4 have the same layout as in run #2, and that this
  configuration even creates an additional set of intermediate files.
  Nevertheless, it is the fastest, so there must be a specific advantage. As
  it turns out, this run avoids a problem common to all previous runs: there
  is no under-utilization between the map/sort rounds and the shuffle
  rounds. While sorts are still running, the system can already start
  shuffle tasks. This is a very good combination: sorting is absolutely
  CPU-bound, whereas shuffling is usually I/O-bound.

  THIS CONCLUSION IS QUITE QUESTIONABLE! THERE IS PROBABLY ANOTHER, STILL
  UNNOTICED EFFECT!
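For reference, the job itself is a plain word-frequency count. The
following is a minimal, generic sketch of the map and reduce steps in
OCaml; it only illustrates the kind of computation mr_wordfreq performs and
is not the actual mr_wordfreq code shipped with Plasma.

  (* Generic word-frequency sketch (illustration only, not mr_wordfreq).
     The map step emits (word, 1) pairs; the reduce step sums the counts
     per word. *)

  module SMap = Map.Make (String)

  (* map: one input line -> list of (word, count) pairs *)
  let map_line line =
    String.split_on_char ' ' line
    |> List.filter (fun w -> w <> "")
    |> List.map (fun w -> (w, 1))

  (* reduce: fold (word, count) pairs into per-word totals *)
  let reduce pairs =
    List.fold_left
      (fun acc (w, n) ->
         let old = try SMap.find w acc with Not_found -> 0 in
         SMap.add w (old + n) acc)
      SMap.empty pairs

  let () =
    reduce (map_line "the quick brown fox jumps over the lazy dog")
    |> SMap.iter (fun w n -> Printf.printf "%s %d\n" w n)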