Plasma GitLab Archive

Plasma Project:
PlasmaFS performance impressions


plasma rev 479 (corresponding to plasma-0.4.1)


- Two nodes (office3, office4) with an Opteron 1354 CPU (4-core, 2.2
  GHz), and 8G of RAM 
- The nodes store the volume data on a RAID-1 array. The disks are not
  particularly fast, but "normal" 7500 rpm disks.
- The PostgreSQL database resides on an SSD with 90G capacity (OCZ Agility 3). 
- The two nodes are connected with a gigabit network.  
- A third node (office1) is used for submitting test requests.


- There is only one namenode.  The security mode for namenode
  accesses is "privacy" (i.e. full protection: authentication plus
  encryption). The security mode for datanode accesses is "auth"
  (i.e. authentication only, no encryption).
- blocksize=1M

TEST 1: sequential r/w

- Create 10G test file on office1 (outside the cluster):

office1 $ dd if=/dev/zero of=testfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 76.9597 s, 140 MB/s

Note that PlasmaFS does not do any compression, so a file of zeroes
is as good as any other file.

- Copy this file to PlasmaFS - replication=1:

office1 $ time plasma put -rep 1 testfile /testfile1 
Copied: testfile -> /testfile1

real	2m28.382s
user	0m3.296s
sys	0m47.759s

(Corresponds to 69 MB/s. The CPU load on the cluster nodes varied between
5 and 15%.)
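The throughput figures quoted in parentheses throughout this report are
simply file size divided by wall-clock time; as a quick check of the
number above:

```python
def throughput_mb_s(size_mb, real_seconds):
    """Wall-clock throughput in MB/s, as quoted in this report."""
    return size_mb / real_seconds

# 10G file; "plasma put -rep 1" took real 2m28.382s = 148.382 s
print(round(throughput_mb_s(10240, 2 * 60 + 28.382)))  # 69
```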

- Copy this file to PlasmaFS - replication=2:

office1 $ time plasma put -rep 2 testfile /testfile2
Copied: testfile -> /testfile2

real	3m49.302s
user	0m6.300s
sys	1m20.801s

(Corresponds to 45 MB/s.  The CPU load on the cluster nodes varied between
10 and 20%.)

- Check blocklists:

office1 $ plasma blocks "/testfile*"
        0 -       127:[11264-11391]
      128 -       255:[3072-3199]
      256 -       383:[11392-11519]
      384 -       639:[3200-3455]
      640 -      1023:[11520-11903]
     1024 -      1151:[3456-3583]
     1152 -      1791:[11904-12543]
     1792 -      1919:[3584-3711]
     1920 -      2175:[12544-12799]
     2176 -      2431:[3712-3967]
     2432 -      2559:[12800-12927]
     2560 -      2687:[3968-4095]
     2688 -      2943:[12928-13183]
     2944 -      3327:[4096-4479]
     3328 -      3455:[13184-13311]
     3456 -      3583:[4480-4607]
     3584 -      3839:[13313-13568]
     3840 -      3967:[4608-4735]
     3968 -      4095:[13569-13696]
     4096 -      4351:[4736-4991]
     4352 -      4479:[13697-13824]
     4480 -      4863:[4992-5375]
     4864 -      5119:[13825-14080]
     5120 -      5247:[5376-5503]
     5248 -      5503:[14081-14336]
     5504 -      5759:[5504-5759]
     5760 -      5887:[14337-14464]
     5888 -      6143:[5760-6015]
     6144 -      6399:[14465-14720]
     6400 -      6527:[6016-6143]
     6528 -      6783:[14721-14976]
     6784 -      7039:[6145-6400]
     7040 -      7422:[14977-15359]
     7423 -      7423:[15363-15363]
     7424 -      7551:[6401-6528]
     7552 -      7679:[15364-15491]
     7680 -      7807:[6529-6656]
     7808 -      7935:[15492-15619]
     7936 -      8063:[6657-6784]
     8064 -      8191:[15620-15747]
     8192 -      8447:[6785-7040]
     8448 -      8575:[15748-15875]
     8576 -      8831:[7041-7296]
     8832 -      9087:[15876-16131]
     9088 -      9471:[7297-7680]
     9472 -     10111:[16132-16771]
    10112 -     10239:[7681-7808]
  blocks:                10240
  actual replication:    1
  requested replication: 1

        0 -      2060:[217088-219148][19456-21516]
     2061 -      4094:[219150-221183][21517-23550]
     4095 -      8202:[221185-225292][23551-27658]
     8203 -     10239:[225294-227330][27659-29695]
  blocks:                20480
  actual replication:    2
  requested replication: 2

As one can see, PlasmaFS allocates the blocks for /testfile1 in
alternating sequences of >= 128 blocks (i.e. 128M). The blocks are
placed contiguously on disk. The placement scheme ensures that
practically no random seeks are needed to read the file. Also,
random accesses will be balanced between the two nodes.

For /testfile2 we get even longer contiguous block sequences
(2000-4000 blocks).
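The placement can be checked mechanically. The sketch below parses the
blocklist format shown above (a file block range mapped to a contiguous
on-disk extent) and reports the extent lengths. Which node each extent
lives on is not part of this output, so the alternation between the two
disk regions only suggests the two nodes:

```python
import re

# "plasma blocks" prints one extent per line:
#   <first> - <last>:[<disk_first>-<disk_last>]
EXTENT = re.compile(r"(\d+)\s*-\s*(\d+):\[(\d+)-(\d+)\]")

def extents(listing):
    """Return (length, disk_start) per extent of a blocklist."""
    out = []
    for m in EXTENT.finditer(listing):
        first, last, dfirst, dlast = map(int, m.groups())
        assert last - first == dlast - dfirst  # extent is contiguous on disk
        out.append((last - first + 1, dfirst))
    return out

sample = """
        0 -       127:[11264-11391]
      128 -       255:[3072-3199]
      256 -       383:[11392-11519]
"""
print(extents(sample))  # [(128, 11264), (128, 3072), (128, 11392)]
```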

- Read /testfile1 back to office1:

office1 $ time plasma get /testfile1 testfile1
Copied: /testfile1 -> testfile1

real	1m39.445s
user	0m30.002s
sys	0m34.710s

(Corresponds to 103 MB/s. This probably maxes out the gigabit network.
The CPU load on the cluster nodes is < 10%.

An interesting minor point is that the plasma client utility now
consumes relatively more CPU time in user space than in kernel space,
compared to "plasma put" above. This is because it is now the client
that has to decode the RPC messages, and thus has to create an
additional copy of the data blocks.)

- Read /testfile2 back to office1:

office1 $ time plasma get /testfile2 testfile2
Copied: /testfile2 -> testfile2

real	1m32.232s
user	0m26.834s
sys	0m32.054s

(Corresponds to 104 MB/s. This probably maxes out the gigabit network.
The CPU load on the cluster nodes is < 10%.)

- Read /testfile1 back to office3 (i.e. _within_ the cluster):

office3 $ time plasma get /testfile1 testfile1
Copied: /testfile1 -> testfile1

real	2m49.437s
user	0m17.737s
sys	0m35.638s

(Corresponds to 60 MB/s. A concurrently running "dstat" shows that the
local disk is the bottleneck: data is read from the same disk as it is
written to.)

- Read /testfile1 back to office3 (i.e. _within_ the cluster),
  without storing it:

office3 $ time plasma cat /testfile1 >/dev/null

real	1m52.256s
user	0m29.146s
sys	0m21.961s

(Corresponds to 91 MB/s.)

- Read /testfile2 back to office3 (i.e. _within_ the cluster),
  without storing it:

office3 $ time plasma cat /testfile2 >/dev/null

real	1m12.416s
user	0m24.814s
sys	0m20.633s

(Corresponds to 141 MB/s. This is around the maximum the local disks
can provide. This test profits from the shared-memory transfer method:
when office3 reads blocks from itself, they do not travel over the
network, not even over the local loopback.)

TEST 2: parallel sequential reads

/testfile1 and /testfile2 are created as in test 1.

The only purpose of this test is to demonstrate that parallel reads
do not lock each other out. We "plasma cat" the files several times
on office1, i.e. outside the cluster. The expected effect is that
each "cat" gets 1/n of the network bandwidth.

- Read /testfile2 this way:

office1 $ time { for k in 1 2 3 4; do plasma cat /testfile2 >/dev/null & done; wait; }
[1] 28920
[2] 28921
[3] 28922
[4] 28923
[1]   Done                    plasma cat /testfile2 > /dev/null
[2]   Done                    plasma cat /testfile2 > /dev/null
[4]+  Done                    plasma cat /testfile2 > /dev/null
[3]+  Done                    plasma cat /testfile2 > /dev/null

real	6m7.030s
user	3m7.336s
sys	2m18.477s

This means 27.9 MB/s per "cat", and an excellent 111.6 MB/s in total.

Note that a similar test doing parallel writes to the same file would
fail miserably, because only one writer at a time gets exclusive
access to the file, and the other writers have to wait until all of
its data is written. Effectively, this serializes all writes.

TEST 3: random reads

/testfile1 and /testfile2 are created as in test 1.

This test demonstrates that one can keep a file open, and read blocks
from it in random order, without having to contact the namenode once
the block locations are known.

We use the test program in test/ (in the release tarball).

- Read /testfile2 this way (1024 blocks of 1M size):

office1 $ time ~/ps_random.opt -n 1024 /testfile2
Got 1048576 bytes from block 4544
Got 1048576 bytes from block 2205
Got 1048576 bytes from block 9302
Got 1048576 bytes from block 3717
Got 1048576 bytes from block 3746
Got 1048576 bytes from block 6397

real	0m26.563s
user	0m10.033s
sys	0m3.844s

This is a speed of 38.5 MB/s (measured in a second run - we do not
want to test the effectiveness of the page cache). The real point,
however, is that this test emits only a few requests to the namenode
(one can verify this by running the namenode server in debug mode).
This means that reads from a non-mutated file can be done without
contacting the namenode: effectively, the client loads the locations
of all blocks from the namenode once, and after that the namenode is
no longer needed.
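This access pattern can be sketched as a client that fetches the block
map once and afterwards talks only to the datanodes. The names
lookup_blockmap and read_block are hypothetical stand-ins, not the
actual PlasmaFS client API:

```python
class FakeNamenode:
    """Counts lookups so we can see how often it is contacted."""
    def __init__(self, blockmap):
        self.lookups = 0
        self._blockmap = blockmap
    def lookup_blockmap(self, path):
        self.lookups += 1
        return self._blockmap

class FakeDatanode:
    def read_block(self, disk_block):
        return b"\0" * 1048576  # one 1M block of zeroes

class RandomReader:
    """Fetch all block locations once, then read blocks in any order
    without further namenode traffic (valid while the file is not
    mutated)."""
    def __init__(self, namenode, path):
        self.blockmap = namenode.lookup_blockmap(path)
    def read(self, blockno):
        datanode, disk_block = self.blockmap[blockno]
        return datanode.read_block(disk_block)

node = FakeDatanode()
nn = FakeNamenode({b: (node, 4096 + b) for b in range(10240)})
reader = RandomReader(nn, "/testfile2")
for b in (4544, 2205, 9302, 3717):   # random order, as in the test run
    assert len(reader.read(b)) == 1048576
print(nn.lookups)  # 1 -- the namenode was contacted exactly once
```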

This is exactly the access pattern a read-only database would
generate. As the namenode is not in the query path, the performance
one can get from such a database is limited only by the number
of replicas of the database file.

- Read /testfile2 with two parallel ps_random.opt (2*1024 blocks of 1M size):

office1 $ time { ~/ps_random.opt -n 1024 /testfile2 &  ~/ps_random.opt -n 1024 /testfile2 & wait; }
Got 1048576 bytes from block 4544
Got 1048576 bytes from block 6397
Got 1048576 bytes from block 7153
Got 1048576 bytes from block 5355
Got 1048576 bytes from block 3717
Got 1048576 bytes from block 3746
Got 1048576 bytes from block 6397
[1]-  Done                    ~/ps_random.opt -n 1024 /testfile2
[2]+  Done                    ~/ps_random.opt -n 1024 /testfile2

real	0m32.703s
user	0m15.093s
sys	0m5.540s

This yields 63 MB/s.

- For three parallel ps_random.opt runs the result is 73 MB/s
- For four parallel ps_random.opt runs the result is 78 MB/s

TEST 4: Namenode read-only transactions

The speed of the namenode is crucial for a distributed filesystem.
Let's have a look at the speed of inode lookups.

The test program performs n lookups of inode attributes in a loop
(all in a single transaction). The inode is given as a number
(look this up with "plasma ls -i"):

office1 $ time ~/ps_lookup.opt -n 1 23301

real	0m0.153s
user	0m0.132s
sys	0m0.008s

This includes connection setup. There is some latency because of the
authentication (ps_lookup first connects to the authentication daemon,
and then to the namenode, which then involves RPC-level authentication,
and finally the impersonation).

This gives an important hint for application design: it is good to keep
the number of connections low. The PlasmaFS protocol, fortunately,
allows any number of transactions to run concurrently over the
same connection (transaction multiplexing).
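Transaction multiplexing can be pictured as tagging each request with
an ID and routing responses back by that ID, so many transactions can
be in flight on one connection at once. This is a toy model of the
idea, not the PlasmaFS wire protocol; the payload strings are made up:

```python
import itertools

class Txn:
    """A transaction that remembers the response it got."""
    def __init__(self):
        self.response = None
    def on_response(self, r):
        self.response = r

class MultiplexedConnection:
    """Many transactions share one connection; each request carries an
    ID, and responses are matched back by that ID, in any order."""
    def __init__(self, send):
        self._send = send                  # transport send callback
        self._ids = itertools.count(1)
        self._pending = {}                 # request id -> waiting txn
    def request(self, txn, payload):
        rid = next(self._ids)
        self._pending[rid] = txn
        self._send((rid, payload))
        return rid
    def deliver(self, rid, response):
        self._pending.pop(rid).on_response(response)

wire = []
conn = MultiplexedConnection(wire.append)
a, b = Txn(), Txn()
ra = conn.request(a, "GET_INODE 23301")    # hypothetical payloads
rb = conn.request(b, "CREATE /dir/file0")
conn.deliver(rb, "ok: created")            # responses arrive out of order
conn.deliver(ra, "ok: attrs")
print(a.response, b.response)  # ok: attrs ok: created
```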

- Multiple synchronous inode lookups:

office1 $ time ~/ps_lookup.opt -n 1000 23301

real	0m1.139s
user	0m0.380s
sys	0m0.068s

By doing the inode lookup 1000 times, we get a better impression of
what a single RPC request/response round-trip takes (ps_lookup is
synchronous). The inode data is cached in the namenode server, so
there is usually no need to do the lookup in the database. We get
878 requests/s here.
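From the wall-clock time we can also derive the per-RPC cost (numbers
taken from the run above):

```python
n, real = 1000, 1.139               # lookups and "real" seconds from above
print(round(n / real))              # 878 requests/s
print(round(real / n * 1000, 2))    # ~1.14 ms per synchronous round-trip
```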

- four parallel runs:

office1 $ time { ~/ps_lookup.opt -n 1000 23301 & ~/ps_lookup.opt -n 1000 23301 & ~/ps_lookup.opt -n 1000 23301 & ~/ps_lookup.opt -n 1000 23301 & wait; }
[1] 1914
[2] 1915
[3] 1916
[4] 1917
[1]   Done                    ~/ps_lookup.opt -n 1000 23301
[4]+  Done                    ~/ps_lookup.opt -n 1000 23301
[2]-  Done                    ~/ps_lookup.opt -n 1000 23301
[3]+  Done                    ~/ps_lookup.opt -n 1000 23301

real	0m1.553s
user	0m0.960s
sys	0m0.176s

This gives 2575 requests/s.

- eight parallel runs: results in 3949 requests/s.

We have done this test with the default configuration of the namenode,
which starts four worker processes. By increasing this number, one can
get even beyond 4000 requests/s.

TEST 5: Namenode write transactions

In this test we essentially check how many commits we can do per
second. A commit writes the changes to the database, and finally syncs
them there. Note that the test setup uses an SSD for the database, so
we expect high values here. However, this test reveals some problems
:-(.  Also note that 2-phase commit is off (we have only one namenode
- I don't have access to a second SSD, so I currently cannot repeat
the test with two namenodes).

The test program:
 - looks up an existing directory
 - creates n empty files in this directory

Each creation is immediately committed.

The database configuration is essential for this test. We use:

synchronous_commit = off     (but leave fsync on)
wal_buffers = 8MB
wal_writer_delay = 20ms
checkpoint_segments = 256

Also, we set the "noop" disk scheduler:

echo noop > /sys/block/sdc/queue/scheduler

The filesystem is (unfortunately) not SSD-aware (no TRIM support). So
the numbers are probably lower than they could be.

- Run with n=1

office1 $ time ~/ps_create.opt -n 1 /dir

real	0m0.186s
user	0m0.060s
sys	0m0.016s

Compare with the ps_lookup run above with n=1: Here we are only slightly
slower. Note that we run two transactions (one read transaction for the
directory lookup, and the write transaction creating the file).

- Run with n=1000

office1 $ time ~/ps_create.opt -n 1000 /dir

real	0m8.392s
user	0m1.136s
sys	0m0.236s

This would mean 119 commits/s. Let's see whether we can increase this
by running ps_create several times in parallel:

office1 $ time { ~/ps_create.opt -n 1000 /dir1 &  ~/ps_create.opt -n 1000 /dir2 & ~/ps_create.opt -n 1000 /dir3 &  ~/ps_create.opt -n 1000 /dir4 & wait; }
[1] 2947
[2] 2948
[3] 2949
[4] 2950
[1]   Done                    ~/ps_create.opt -n 1000 /dir1
[2]   Done                    ~/ps_create.opt -n 1000 /dir2
[3]-  Done                    ~/ps_create.opt -n 1000 /dir3
[4]+  Done                    ~/ps_create.opt -n 1000 /dir4

real	0m17.856s
user	0m5.448s
sys	0m1.148s

That is 224 commits/s.

If we turn fsync off in PostgreSQL, the speed can be further
increased to around 280 commits/s.

A potential bottleneck is that PlasmaFS serializes commits strictly.
This is usually overly pessimistic - most transactions could be
committed in parallel. The serialization is normally only required for
commits changing the same inode. This will be explored in more detail
in the future.

A final note: 224 commits/s is not so bad. A commit can do lots of
things at once and usually consists of more operations than the few we
used in the above test to create empty files. For instance, if a file
is written with "plasma put", a commit is only done after each
gigabyte of written data.
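For the 10G test file from test 1 this commit policy works out to very
few commits, far below the measured commit rate:

```python
# "plasma put" commits after each gigabyte of written data,
# so bulk writes put almost no pressure on the commit rate.
file_mb, commit_every_mb = 10240, 1024
print(file_mb // commit_every_mb)  # 10 commits for the whole upload
```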

This web site is published by Informatikbüro Gerd Stolpmann