/home/cityd/production/web-camlcity/back0/trunk/plasma/perf/perf111013.txt

======================================================================
PlasmaFS performance impressions
======================================================================

SOFTWARE:

plasma rev 479 (corresponding to plasma-0.4.1)

HARDWARE:

- Two nodes (office3, office4) with an Opteron 1354 CPU (4-core, 2.2
  GHz), and 8G of RAM 
- The nodes store the volume data on a RAID-1 array. The disks are not
  ultimately fast, but "normal" 7500 rpm disks, connected with
  SATA-300.
- The PostgreSQL database resides on an SSD with 90G capacity (OCZ Agility 3). 
- The two nodes are connected with a gigabit network.  
- A third node (office1) is used for submitting test requests

CLUSTER CONFIGURATION:

- There is only one namenode.  The security mode for namenode
  accesses is "privacy" (i.e. full). The security mode for datanode
  accesses is "auth" (i.e. only authentication, but no encryption).
- blocksize=1M

----------------------------------------------------------------------
TEST 1: sequential r/w
----------------------------------------------------------------------

- Create 10G test file on office1 (outside the cluster):

office1 $ dd if=/dev/zero of=testfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 76.9597 s, 140 MB/s

Note that PlasmaFS does not do any compression, so a file of zeroes
is as good as any other file.

- Copy this file to PlasmaFS - replication=1:

office1 $ time plasma put -rep 1 testfile /testfile1 
Copied: testfile -> /testfile1

real	2m28.382s
user	0m3.296s
sys	0m47.759s

(Corresponds to 69MB/s. The CPU load on the cluster nodes varied between
5 and 15%.)

- Copy this file to PlasmaFS - replication=2:

office1 $ time plasma put -rep 2 testfile /testfile2
Copied: testfile -> /testfile2

real	3m49.302s
user	0m6.300s
sys	1m20.801s

(Corresponds to 45MB/s.  The CPU load on the cluster nodes varied between
10 and 20%.)

- Check blocklists:

office1 $ plasma blocks "/testfile*"
/testfile1:
        0 -       127: 192.168.5.30:2728[11264-11391]
      128 -       255: 192.168.5.40:2728[3072-3199]
      256 -       383: 192.168.5.30:2728[11392-11519]
      384 -       639: 192.168.5.40:2728[3200-3455]
      640 -      1023: 192.168.5.30:2728[11520-11903]
     1024 -      1151: 192.168.5.40:2728[3456-3583]
     1152 -      1791: 192.168.5.30:2728[11904-12543]
     1792 -      1919: 192.168.5.40:2728[3584-3711]
     1920 -      2175: 192.168.5.30:2728[12544-12799]
     2176 -      2431: 192.168.5.40:2728[3712-3967]
     2432 -      2559: 192.168.5.30:2728[12800-12927]
     2560 -      2687: 192.168.5.40:2728[3968-4095]
     2688 -      2943: 192.168.5.30:2728[12928-13183]
     2944 -      3327: 192.168.5.40:2728[4096-4479]
     3328 -      3455: 192.168.5.30:2728[13184-13311]
     3456 -      3583: 192.168.5.40:2728[4480-4607]
     3584 -      3839: 192.168.5.30:2728[13313-13568]
     3840 -      3967: 192.168.5.40:2728[4608-4735]
     3968 -      4095: 192.168.5.30:2728[13569-13696]
     4096 -      4351: 192.168.5.40:2728[4736-4991]
     4352 -      4479: 192.168.5.30:2728[13697-13824]
     4480 -      4863: 192.168.5.40:2728[4992-5375]
     4864 -      5119: 192.168.5.30:2728[13825-14080]
     5120 -      5247: 192.168.5.40:2728[5376-5503]
     5248 -      5503: 192.168.5.30:2728[14081-14336]
     5504 -      5759: 192.168.5.40:2728[5504-5759]
     5760 -      5887: 192.168.5.30:2728[14337-14464]
     5888 -      6143: 192.168.5.40:2728[5760-6015]
     6144 -      6399: 192.168.5.30:2728[14465-14720]
     6400 -      6527: 192.168.5.40:2728[6016-6143]
     6528 -      6783: 192.168.5.30:2728[14721-14976]
     6784 -      7039: 192.168.5.40:2728[6145-6400]
     7040 -      7422: 192.168.5.30:2728[14977-15359]
     7423 -      7423: 192.168.5.30:2728[15363-15363]
     7424 -      7551: 192.168.5.40:2728[6401-6528]
     7552 -      7679: 192.168.5.30:2728[15364-15491]
     7680 -      7807: 192.168.5.40:2728[6529-6656]
     7808 -      7935: 192.168.5.30:2728[15492-15619]
     7936 -      8063: 192.168.5.40:2728[6657-6784]
     8064 -      8191: 192.168.5.30:2728[15620-15747]
     8192 -      8447: 192.168.5.40:2728[6785-7040]
     8448 -      8575: 192.168.5.30:2728[15748-15875]
     8576 -      8831: 192.168.5.40:2728[7041-7296]
     8832 -      9087: 192.168.5.30:2728[15876-16131]
     9088 -      9471: 192.168.5.40:2728[7297-7680]
     9472 -     10111: 192.168.5.30:2728[16132-16771]
    10112 -     10239: 192.168.5.40:2728[7681-7808]
  blocks:                10240
  actual replication:    1
  requested replication: 1

/testfile2:
        0 -      2060: 192.168.5.40:2728[217088-219148] 192.168.5.30:2728[19456-21516]
     2061 -      4094: 192.168.5.40:2728[219150-221183] 192.168.5.30:2728[21517-23550]
     4095 -      8202: 192.168.5.40:2728[221185-225292] 192.168.5.30:2728[23551-27658]
     8203 -     10239: 192.168.5.40:2728[225294-227330] 192.168.5.30:2728[27659-29695]
  blocks:                20480
  actual replication:    2
  requested replication: 2

As one can see, PlasmaFS allocates the blocks for /testfile1 in
alternating sequences of >= 128 blocks (i.e. 128M). The blocks are
placed contiguously on disk. The placement scheme ensures that
practically no random seeks are needed to read the file. Also,
random accesses will be balanced between the two nodes.

For /testfile2 we get even longer contiguous block sequences
(2000-4000 blocks).

- Read /testfile1 back to office1:

office1 $ time plasma get /testfile1 testfile1
Copied: /testfile1 -> testfile1

real	1m39.445s
user	0m30.002s
sys	0m34.710s

(Corresponds to 103MB/s. Probably maxes the gigabit network out. The CPU
load on the cluster nodes is < 10%.

An interesting minor point is that the plasma client utility consumes
now relatively more CPU time in user space than in kernel space,
compared to "plasma put" above. This has to do with the fact that it
is now the client that has to decode the RPC messages, and thus has to
create an additional copy of the data blocks.)


- Read /testfile2 back to office1:

office1 $ time plasma get /testfile2 testfile2
Copied: /testfile2 -> testfile2

real	1m32.232s
user	0m26.834s
sys	0m32.054s

(Corresponds to 104MB/s. Probably maxes the gigabit network out. The CPU
load on the cluster nodes is < 10%.)

- Read /testfile1 back to office3 (i.e. _within_ the cluster):

office3 $ time plasma get /testfile1 testfile1
Copied: /testfile1 -> testfile1

real	2m49.437s
user	0m17.737s
sys	0m35.638s

Corresponds to 60MB/s. A parallel running "dstat" shows that the local
disk is the bottleneck (data is read from the same disk as it is written
to).

- Read /testfile1 back to office3 (i.e. _within_ the cluster).
  without storing it:

$ time plasma cat /testfile1 >/dev/null

real	1m52.256s
user	0m29.146s
sys	0m21.961s

(Corresponds to 91MB/s.)

- Read /testfile1 back to office3 (i.e. _within_ the cluster).
  without storing it:

office3 $ time plasma cat /testfile2 >/dev/null

real	1m12.416s
user	0m24.814s
sys	0m20.633s

(Corresponds to 141MB/s. This is around the maximum the local disks
can provide. This test profits from the shared memory transfer method,
i.e. when office3 reads blocks from itself, these do not travel via
the network, not even via local-loopback.)


----------------------------------------------------------------------
TEST 2: parallel sequential reads
----------------------------------------------------------------------

/testfile1 and /testfile2 are created as in test 1.

The only purpose of this test is to demonstrate that parallel reads
do not lock each other out. We "plasma cat" the files several times
on office1, i.e. outside the cluster. The expected effect is that the
each "cat" gets 1/n of the network bandwidth.

- Read /testfile2 this way:

office1 $ time { for k in 1 2 3 4; do plasma cat /testfile2 >/dev/null & done; wait; }
[1] 28920
[2] 28921
[3] 28922
[4] 28923
[1]   Done                    plasma cat /testfile2 > /dev/null
[2]   Done                    plasma cat /testfile2 > /dev/null
[4]+  Done                    plasma cat /testfile2 > /dev/null
[3]+  Done                    plasma cat /testfile2 > /dev/null

real	6m7.030s
user	3m7.336s
sys	2m18.477s

This means 27.9 MB/s per "cat", and excellent 111.6 MB/s in total.

Note that a similar test doing parallel writes to the same file would
miserably fail, because only one writer at a time would get exclusive
access to the file, and the other writers would have to wait until
all data is written. Ideally, this serializes all writes.


----------------------------------------------------------------------
TEST 3: random reads
----------------------------------------------------------------------

/testfile1 and /testfile2 are created as in test 1.

This test demonstrates that one can keep a file open, and read blocks
from it in random order, without having to contact the namenode after
a while.

We use the test program in test/ps_random.ml (in the release tarball).

- Read /testfile2 this way (1024 blocks of 1M size):

office1 $ time ~/ps_random.opt -n 1024 /testfile2
Got 1048576 bytes from block 4544
Got 1048576 bytes from block 2205
Got 1048576 bytes from block 9302
...
Got 1048576 bytes from block 3717
Got 1048576 bytes from block 3746
Got 1048576 bytes from block 6397
Done

real	0m26.563s
user	0m10.033s
sys	0m3.844s

This is a speed of 38.5 MB/s (in the second run - we do not want to
test the effectiveness of the page cache). The real point, however, is
that this test emits only a few requests to the namenode (one can test
this by running the namenode server in debug mode). This means that
reads from an non-mutated file can be done without contacting the
namenode. Effectively, the client loads the location of all blocks
from the namenode, and once each block is known, the namenode is no
longer needed.

This is exactly the access pattern a read-only database would
generate.  As the namenode is not in the query path, the performance
one can get from such a database is only limited by the number
of replicas of the database file.

- Read /testfile2 with two parallel ps_random.opt (2*1024 blocks of 1M size):

office1 $ time { ~/ps_random.opt -n 1024 /testfile2 &  ~/ps_random.opt -n 1024 /testfile2 & wait; }
Got 1048576 bytes from block 4544
...
Got 1048576 bytes from block 6397
Done
Got 1048576 bytes from block 7153
Got 1048576 bytes from block 5355
Got 1048576 bytes from block 3717
Got 1048576 bytes from block 3746
Got 1048576 bytes from block 6397
Done
[1]-  Done                    ~/ps_random.opt -n 1024 /testfile2
[2]+  Done                    ~/ps_random.opt -n 1024 /testfile2

real	0m32.703s
user	0m15.093s
sys	0m5.540s

This yields 63 MB/s.

- For three parallel ps_random.opt runs the result is 73 MB/s
- For four parallel ps_random.opt runs the result is 78 MB/s

----------------------------------------------------------------------
TEST 4: Namenode read-only transactions
----------------------------------------------------------------------

The speed of the namenode is crucial for a distributed filesystem.
Let's have a look at speed of inode lookups.

The test program ps_lookup.ml performs n lookups of inode attributes
in a loop (all in a single transaction). The inode is given as number
(look this up with "plasma ls -i"):

office1 $ time ~/ps_lookup.opt -n 1 23301
Done

real	0m0.153s
user	0m0.132s
sys	0m0.008s

This includes connection setup. There is some latency because of the
authentication (ps_lookup first connects to the authentication daemon,
and then to the namenode, which involves then RPC-level authentication,
and finally the impersonation).

This gives an import hint for application design: It is good to keep
the number of connections low. The PlasmaFS protocol, fortunately,
allows it to run any number of transactions concurrently over the
same connection (transaction multiplexing).


- Multiple synchronous inode lookups:

office1 $ time ~/ps_lookup.opt -n 1000 23301
Done

real	0m1.139s
user	0m0.380s
sys	0m0.068s

By doing the inode lookup 1000 times, we get a better impression what
a single RPC request/response round-trip takes (ps_lookup is synchronous).
The inode data is cached in the namenode server, so there is usually no
need to do the lookup in the database. We get here 878 requests/s.

- four parallel runs:

office1 $ time { ~/ps_lookup.opt -n 1000 23301 & ~/ps_lookup.opt -n 1000 23301 & ~/ps_lookup.opt -n 1000 23301 & ~/ps_lookup.opt -n 1000 23301 & wait; }
[1] 1914
[2] 1915
[3] 1916
[4] 1917
Done
Done
[1]   Done                    ~/ps_lookup.opt -n 1000 23301
[4]+  Done                    ~/ps_lookup.opt -n 1000 23301
Done
[2]-  Done                    ~/ps_lookup.opt -n 1000 23301
Done
[3]+  Done                    ~/ps_lookup.opt -n 1000 23301

real	0m1.553s
user	0m0.960s
sys	0m0.176s

This gives 2575 requests/s.

- eight parallel runs: results in 3949 requests/s.

We have done this test with the default configuration of the namenode,
which starts four worker processes. By increasing this number, one can
get even beyond 4000 requests/s.


----------------------------------------------------------------------
TEST 5: Namenode write transactions
----------------------------------------------------------------------

In this test we essentially check how many commits we can do per
second. A commit writes the changes to the database, and finally syncs
them there. Note that the test setup uses an SSD for the database, so
we expect high values here. However, this test unveils some problems
:-(.  Also note that 2-phase commit is off (we have only one namenode
- I don't have access to a second SSD, so I currently cannot repeat
the test with two namenodes).

The test program ps_create.ml does:
 - Looks up an existing directory
 - Creates n empty files in this directory

Each creation is immediately committed.

The database configuration is essential for this test. We use:

synchronous_commit = off     (but leave fsync on)
wal_buffers = 8MB
wal_writer_delay = 20ms
checkpoint_segments = 256

Also, we set the "noop" disk scheduler:

echo noop > /sys/block/sdc/queue/scheduler

The filesystem is (unfortunately) not SSD-aware (no TRIM support). So
the numbers are probably lower than they could be.


- Run with n=1

office1 $ time ~/ps_create.opt -n 1 /dir
Done

real	0m0.186s
user	0m0.060s
sys	0m0.016s

Compare with the ps_lookup run above with n=1: Here we are only slightly
slower. Note that we run two transactions (one read transaction for the
directory lookup, and the write transaction creating the file).

- Run with n=1000

office1 $ time ~/ps_create.opt -n 1000 /dir
Done

real	0m8.392s
user	0m1.136s
sys	0m0.236s

This would mean 119 commits/s. Let's see whether we can increase this
by running ps_create several times in parallel:

office1 $ time { ~/ps_create.opt -n 1000 /dir1 &  ~/ps_create.opt -n 1000 /dir2 & ~/ps_create.opt -n 1000 /dir3 &  ~/ps_create.opt -n 1000 /dir4 & wait; }
[1] 2947
[2] 2948
[3] 2949
[4] 2950
Done
Done
Done
Done
[1]   Done                    ~/ps_create.opt -n 1000 /dir1
[2]   Done                    ~/ps_create.opt -n 1000 /dir2
[3]-  Done                    ~/ps_create.opt -n 1000 /dir3
[4]+  Done                    ~/ps_create.opt -n 1000 /dir4

real	0m17.856s
user	0m5.448s
sys	0m1.148s

These are 224 commits/s.

If we turn the fsync feature in PostgreSQL off, the speed can further
be increased to around 280 commits/s.

A potential bottleneck is that PlasmaFS serializes commits strictly.
This is usually overly pessimistic - most transactions could be
committed in parallel. The serialization is normally only required for
commits changing the same inode. This will be explored in more detail
in the future.

A final note: 224 commits/s is not so bad. A commit can do lots of
things at once and usually consists of more operations than the few we
used in the above test to create empty files. For instance, if a file
is written with "plasma put", a commit is only done after each
gigabyte of written data.
This web site is published by Informatikbüro Gerd Stolpmann
Plasma	GitLab	Archive
Projects	Blog	Knowledge