Features
- What is Plasma? - At its core is a distributed
filesystem, PlasmaFS, designed especially for large files. Right now,
there are two application sub-projects: Map/Reduce transforms
large files, e.g. by sorting the records in the files.
The key/value db stores key/value pairs for fast
read access. All parts of Plasma are highly scalable.
- What can I do with Map/Reduce? - Map/Reduce is a framework
for transforming large quantities of data (at least several
terabytes). The data is represented as a sequence of records. The
records have to include a key, but otherwise the user can include any
kind of data. The transformation is configurable by the
user. In particular, two parts of the transformation are defined by the
user: The map function is a preprocessing step that brings the raw
data into a shape from which the key is easily extractable. The
reduce function is a postprocessing step that can compress
(fold) the rearranged data. The core transformation between the map
and reduce functions is best imagined as sorting all data by key.
The execution of the job is done in a distributed way on many computers.
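To make the three stages concrete, here is a minimal single-process sketch in Python. It is not Plasma's API, and the names map_reduce, word_map and count_reduce are made up for illustration. It only shows the shape of the computation: map, sort by key, reduce.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, map_fn, reduce_fn):
    """Run map, sort by key, then reduce each group of equal keys."""
    # Map: turn every raw record into zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in map_fn(record)]
    # Core transformation: bring equal keys together by sorting on the key.
    pairs.sort(key=itemgetter(0))
    # Reduce: fold all values of one key into a single result.
    return [(key, reduce_fn(key, [value for _, value in group]))
            for key, group in groupby(pairs, key=itemgetter(0))]

def word_map(line):
    # Hypothetical map function: one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]

def count_reduce(key, values):
    # Hypothetical reduce function: add up the counts for one word.
    return sum(values)

result = map_reduce(["b a", "a c a"], word_map, count_reduce)
```

In a real job the sort step is the part that is distributed over many machines. Here everything simply happens in memory.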
- How can I use PlasmaFS? - PlasmaFS
is a POSIX-compliant filesystem. As of now, there are three ways
of accessing a PlasmaFS volume: by using the plasma
command-line client, by using the native API (a network protocol), or by mounting the volume via NFS.
- What kind of filesystem is PlasmaFS? - PlasmaFS differs in
many ways from a "normal" filesystem such as ext3 that is used for
local disks. PlasmaFS is distributed - this means it runs on many
computers, and can bundle all disks of these computers into a single
filesystem volume. Also, it can access all disks in parallel, even for
a single file, which may result in big speedups. The files can be
stored in a replicated way: All blocks of the files are stored on
several computers, so that they remain accessible when machines
crash. Best of all, no special hardware is needed - normal server
hardware will do. However, PlasmaFS is also a specialized
filesystem. It focuses on large files, and is typically run with
blocksizes of 64K to 1M. The latency for accessing files is higher
than for a local filesystem (it first has to figure out where the
files are stored in the network). Last but not least, PlasmaFS does
not like it when parts of files are overwritten frequently. Ideally, PlasmaFS
files are written once as a whole - this gives the filesystem a better
idea how to allocate blocks in the network. (So e.g. running a database
on top of PlasmaFS would be slow, although it is possible.)
PlasmaFS is implemented as a user-space program (no kernel module).
- Can I run a replicated NFS service on top of PlasmaFS? - Yes,
you can: Just install PlasmaFS, and as many Plasma NFS bridges as you need.
These bridges translate the NFS protocol into Plasma's own network
protocol. The result is a high-speed, scalable, and replicated
filesystem accessible via NFS.
- How safe is PlasmaFS? - The design of PlasmaFS protects
against many kinds of failures. All data and metadata can be
replicated, i.e. stored on more than one machine (protection against
machine crashes). The consistency is ensured at all times:
ACID-compliant transactions are used for modifying file data and
metadata. For example, it cannot happen that a modification is only
partly done (e.g. because a machine fails during a transaction).
For the metadata updates, PlasmaFS uses the two-phase commit protocol
to ensure safety. In general, the design allows for better data safety
than for a local filesystem with journaled updates (because of
replication).
- How correct is PlasmaFS? - The design of the filesystem
(the protocols, and the way the implementation is designed) is based on
techniques that have long been used for distributed systems, e.g. RPC
and two-phase commit. It is likely that all design bugs on this level
are eliminated now. Of course, it is nevertheless possible that the
implementation is incorrect at some point. However, a number of safety
measures have been applied: (1) The implementation is done in one of
the safest programming languages that are available (Ocaml). This
language uses intriguing techniques to keep the programmer from
making errors, e.g. by applying inference methods borrowed from the
field of artificial intelligence. (2) The lower half of the namenode
is actually an existing database system, namely PostgreSQL. By using
it, Plasma profits a lot from the work that has been put into this
product. In particular, PostgreSQL ensures that the namenode data is
stored safely. (3) PlasmaFS has been designed so that there are only a
few critical code paths (e.g. the commit code path). If these are
correct, it cannot happen anymore that wrong namenode data are
stored. During the development of PlasmaFS, this feature has often
been "tested" unintentionally - bugs in non-critical code paths just
throw exceptions that are caught at some point, or at most the
namenode server just stops working (which is still better than
destroying data). (4) There are test programs for the complicated data
structures (e.g. for the compressed block list representation). (5)
Beyond that, the system has been tested a lot by running sample
applications (e.g. map/reduce runs).
- How secure is PlasmaFS? - PlasmaFS is now highly secure
(since version 0.4). All accesses to the namenodes are authenticated,
encrypted and integrity-protected. Accesses to the datanodes are at
least authenticated, and higher levels of security can be requested.
A simple ticket-based authentication system is used.
- How many computers can be added to a PlasmaFS cluster? -
There is no limit in the design. With higher numbers of machines
the namenodes can become the bottleneck, though.
Benchmarks suggest that the effective limit with standard hardware is
somewhere in the range of 100-500 nodes, but this has not yet been explicitly
tested.
- Can PlasmaFS be extended? - Yes. Further machines can be
added without downtime.
- Plasma KV: What can I do with it? - The key/value database
is a so-called "NoSQL" database (not only SQL). Plasma KV is just a
library that can be linked to programs accessing databases, very much
like GDBM or Berkeley DB. Because the database is stored in a
PlasmaFS file, one gets a number of interesting properties: The
database is automatically replicated. It can be accessed from any
node in the PlasmaFS network. Read accesses to the database are highly
scalable (beyond any reasonable limit). As PlasmaFS provides
ACID transactions, the databases can also profit from it, and reads
and writes to the same database can be isolated from each other
(especially one can modify the database while there are readers).
Plasma KV targets applications that need ultra-high speed for read
accesses to large databases, and where writes do not occur too often.
- Map/Reduce: How large can the files be? - There is no
limit in the design. Unlike other implementations of Map/Reduce,
Plasma also handles all corner cases well. For example, it cannot
happen that a single machine is overloaded with data so that the job
fails because of this (e.g. when a large chunk of data is sorted on a
single machine only).
- Do I have to port all my Map/Reduce programs to Ocaml? -
No. There is a streaming mode as in Hadoop. The implementations of
the map and reduce functions can be done in any
programming language. Plasma simply spawns external programs, and
pipes the data through them.
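As an illustration of what such an external program might look like, here is a small Python sketch of a streaming-style map step. The record format is an assumption made for this example only (one record per line, key and value separated by a tab on output, as is common in Hadoop streaming). Check the Plasma documentation for the format the streaming mode really expects. The name map_lines is hypothetical.

```python
import sys

def map_lines(lines):
    """Yield one 'key<TAB>value' line per non-empty input line (assumed format)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Hypothetical rule: the first space-separated field is the key.
        key, _, rest = line.partition(" ")
        yield f"{key}\t{rest}"

def main():
    # When run as an external program: read records from stdin, write to stdout.
    for out in map_lines(sys.stdin):
        sys.stdout.write(out + "\n")
```

A script used as the map program would end with a call to main().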
- What is the connection between Map/Reduce and PlasmaFS? -
For running Map/Reduce jobs a filesystem is needed that can store
large files. The filesystem should be accessible from all computers
that run parts of the job. Typically, the same set of computers
is used for the filesystem and for running the job - this works
well because the filesystem mostly needs I/O resources whereas the
job needs the CPU.
The filesystem also provides additional query functions that are typically not found in filesystem implementations. In particular, the Map/Reduce job executor can find out where the files are stored. This allows for an important optimization: The parts of the job accessing certain parts of a file are executed on the machine storing these parts. This avoids network traffic ("move the algorithm, not the data"). There are further special access functions: Jobs can request that files are preferably stored on the local machine if there is still space. This avoids network traffic, too. PlasmaFS provides a fast access path via shared memory, so such local-only traffic does not even hit the network stack.
Although it would be possible to implement Map/Reduce on top of an arbitrary distributed filesystem, none of these optimizations would be available, and it would be much slower.
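The "move the algorithm, not the data" idea can be sketched in a few lines of Python. This is not Plasma's scheduler. The function pick_node and the shape of the location table (block number mapped to a list of node names) are assumptions made for illustration.

```python
from collections import Counter

def pick_node(block_locations, wanted_blocks):
    """Return the node holding most of the wanted blocks, or None if unknown."""
    counts = Counter()
    for block in wanted_blocks:
        for node in block_locations.get(block, ()):
            counts[node] += 1
    if not counts:
        return None
    # Prefer the node with the most local blocks; break ties by node name.
    return min(counts, key=lambda node: (-counts[node], node))

# Hypothetical location table: block number -> nodes storing a replica.
locations = {0: ["dn1", "dn2"], 1: ["dn2"], 2: ["dn2", "dn3"], 3: ["dn3"]}
best = pick_node(locations, [0, 1, 2])
```

Here dn2 stores all three requested blocks, so a task reading blocks 0 to 2 would ideally run there.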
Build
- How do I build Plasma? - Plasma requires a number of
prerequisite software packages. These are described briefly in
the INSTALL file of the
distributed tarball. Installing them can take some time. Note that the
prerequisite software is often not completely available from Linux
distributions (or is outdated).
A better way of building Plasma is to install GODI, a distribution of Ocaml that works on Linux and other operating systems, and that includes all the prerequisite software. If you select the 3.12 branch of GODI, Plasma will be offered there as package "godi-plasma". The build process in GODI is automated, so you do not have to type "configure" or "make".
For a few popular Linux distributions there is a script that fully automates the build via GODI, so you only have to start this script, and the software is dropped into a directory you specify.
- Are there binary downloads? - There is now a binary release, but only for Linux on AMD64 platforms (see download page). This release is restricted as no custom map/reduce programs can be compiled with it. It is nevertheless useful for exploring PlasmaFS and the streaming mode of map/reduce.
Installation
- How do I install Plasma? - After the build, Plasma needs
to be installed on the machines. Read the Deployment Guide.
- Do I need the portmapper? - Although Plasma uses SunRPC, it does so without the portmapper.
Configuration
- Can a machine be both namenode and datanode? - Sure it can.
Note, however, that the types of load put onto namenodes and datanodes
do not play well together. Both kinds of nodes are I/O-bound, but
namenodes need lots of disk seeks to do their work whereas datanodes access
the disks in a more sequential way. For small networks this probably
does not cause problems, but if the load of the namenodes grows this
can easily become a bottleneck.
- Can a machine be both datanode for PlasmaFS and tasknode for
map/reduce? - Sure it can,
and this is even recommended. There is an optimization that tries
to execute tasks on the datanodes storing most of the data
processed by those tasks. This avoids network traffic.
- Is it better to put host names or IP addresses into namenode.hosts?
- The file namenode.hosts lists the nodes that play the role of namenodes.
Since release 0.3 it is possible to put either host names
or IP addresses into this file (before 0.3, only host names were supported).
Host names and IP addresses serve different purposes in a network. Basically,
the difference is that a host name can be seen as the unique name of a machine
while an IP address says how a network packet reaches a network interface.
A machine usually has several network interfaces (at least from an OS
point of view, also counting "lo" and "ppp0"). Because of this an IP
address is a fragile identifier - it can happen that the same machine
is reachable by different IP addresses, and certain IP addresses like
127.0.0.1 do not identify anything at all. Also, sometimes IP addresses
need to be changed when a machine is physically moved to a new location.
Because of these difficulties there is a clear preference for using host names: Each machine has a unique identifier (returned by gethostname), and it can be ensured that this identifier resolves to the right IP address, depending on where the resolution is done.
Since version 0.3 IP addresses are also supported, provided that each machine has only one IP address (in addition to 127.0.0.1). This is useful as a fallback method if the host names cannot be set up properly (e.g. because of missing root privileges).
The other ".hosts" files (datanode.hosts, tasknode.hosts, nfsnode.hosts) are less sensitive to problems than namenode.hosts. It should be no problem to use either form of referencing machines here.
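The condition for using IP addresses (exactly one address besides the loopback address) can be checked with a few lines of Python. This is only a sketch. The function names are made up, and how you obtain the list of addresses for a machine is up to you.

```python
import ipaddress

def usable_addresses(addresses):
    """Drop loopback addresses such as 127.0.0.1 from a list of IP strings."""
    return [a for a in addresses if not ipaddress.ip_address(a).is_loopback]

def ip_is_safe_identifier(addresses):
    """True if exactly one non-loopback address remains."""
    return len(usable_addresses(addresses)) == 1
```

One possible source for the address list is socket.gethostbyname_ex(socket.gethostname()), whose third result is the list of addresses the host name resolves to.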
- How much RAM is needed for map/reduce? - There are a number
of configuration parameters that affect the RAM
consumption. Generally, it is the tasks that need a lot of RAM, and
much less the namenode and datanodes. Of course, it mainly depends
on how many tasks are run on a tasknode. As of Plasma 0.3, a
sophisticated mechanism prevents too many tasks from being
run on a node (so that swapping can usually be avoided). If memory
becomes tight the tasks switch to a RAM-saving mode, and if not enough
memory is available, tasks can even be delayed. This mechanism should
prevent slowdowns if too much load is put onto a single machine
because of misconfiguration.
Generally, the tasks allocate two sorts of RAM: normal private RAM, and shared memory. Private RAM is only partly covered by the mentioned mechanism, but at least the bigger buffers are taken into account. The shared buffers are all managed by this mechanism. One can set parameters that determine when memory is considered tight or even maxed out, and these parameters exist for both types of RAM (see the documentation for the module Mapred_config). For many operating systems Plasma knows how to get defaults for these parameters, so it is usually not necessary to set them. The default limits are: 75% of the available physical RAM can be used for private buffers, and 12% of the physical RAM can be used for shared memory.
Buffers are used for many things. Usually, however, there are only two dominating buffer types: sort buffers and file buffers. The size of the sort buffers can (and must) be set with a parameter (sort_size). Because of some auxiliary buffers, the actual amount of RAM required for a sort task is higher (if the lines of the files are very short, this can be up to 3 times the configured size, but this is an extreme case). A good value for sort_size is the amount of RAM divided by the maximum number of tasks running on the node, and this divided by 2 or 3.
The size of the file buffers is controlled by the parameter buffer_size and, if RAM is tight, by the smaller value buffer_size_tight. Both values must be at least as large as the bigblock size (a file buffer must hold at least one bigblock). The tasks needing the most file buffers are the shuffle tasks. This type of task reads from several input files (up to merge_limit) and writes to several output files (up to split_limit). For example, if the bigblocks are 64M, the shuffle tasks open up to 4 files, and you want to run 16 such tasks in parallel, you need 4G of RAM for the file buffers alone.
The recommended minimum of RAM is roughly calculated by:
8 * number of cores * bigblock size
For example, a quad-core system should be equipped with at least 2G of RAM if the bigblocks are 64M in size. But this is really a minimum. More RAM will make the system faster.
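The rules of thumb from this answer are easy to turn into a small calculator. The following Python sketch just restates the arithmetic. The function names are made up, and the results are estimates, not values Plasma computes itself.

```python
MIB = 1024 * 1024

def file_buffer_ram(bigblock, files_per_task, tasks):
    """RAM for file buffers: one bigblock-sized buffer per open file per task."""
    return bigblock * files_per_task * tasks

def minimum_ram(cores, bigblock):
    """Recommended minimum: 8 * number of cores * bigblock size."""
    return 8 * cores * bigblock

def suggested_sort_size(ram, max_tasks, divisor=3):
    """RAM divided by the number of tasks, and this divided by 2 or 3."""
    return ram // max_tasks // divisor

# 16 shuffle tasks, each with up to 4 open files, bigblocks of 64M
shuffle_ram = file_buffer_ram(64 * MIB, 4, 16)
# quad-core machine, bigblocks of 64M
min_ram = minimum_ram(4, 64 * MIB)
```

With these inputs shuffle_ram is 4G and min_ram is 2G, matching the figures given above.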
- How much RAM is needed for the namenode? - The namenode
runs a database, and of course it is advantageous if enough
RAM is available for the relevant parts of the db to fit into memory.
(If you have no idea yet, just assume 1G of RAM.)
Another factor is that the namenodes load the blockmaps into RAM. The blockmaps store which filesystem blocks are allocated and which are free. Fortunately, only 2 bits are needed per block, plus some management fields per block row. I haven't calculated an exact number yet, but it is in total probably not more than a byte per block.
So, e.g. for managing 1 million blocks you would need 1M of RAM. If the block size is 1M, this is enough for 1T of disk space. As you see, it is not that much memory, even for large clusters - for 1000T of disk space 1G of management RAM is needed.
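The same estimate as a Python one-liner, assuming the upper bound of one byte of blockmap RAM per block mentioned above (the real figure is probably lower). The function name is hypothetical.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4

def blockmap_ram_bytes(disk_bytes, block_bytes, bytes_per_block=1):
    """Estimate blockmap RAM: number of blocks times bytes needed per block."""
    blocks = -(-disk_bytes // block_bytes)  # ceiling division
    return blocks * bytes_per_block

small = blockmap_ram_bytes(1 * TIB, 1 * MIB)     # 1T of disk, 1M blocks
large = blockmap_ram_bytes(1000 * TIB, 1 * MIB)  # 1000T of disk, 1M blocks
```

The first call gives 1M of RAM, the second 1000M, i.e. roughly 1G.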
Design
- Why does Plasma focus on rather small block sizes like 1M? - Other
map/reduce implementations recommend much larger block sizes, such as
64M or even more. Of course, you can run Plasma also with such large
blocks, but it is not necessary. You should see this as a feature -
the support for smaller blocks allows a more fine-grained disk and
RAM management, so resources can be better allocated. This is especially
valuable when not all jobs process large amounts of data, but there
are jobs with different data volumes.
There are two features in Plasma that make smaller blocks possible. First, Plasma controls where the blocks are placed on disk. The block allocation algorithm tries to allocate the blocks of a file so that they end up close to each other on the disk. (This "control" works by pre-allocating the data space as a large file.) Ideally, if files are written sequentially, the blocks are placed sequentially on the disk, as far as possible. Practice shows that good block placements can often be achieved.
Second, the blocklists are stored in a compressed way (sometimes called "extent-based" by filesystem developers). For example, if a file consists of the blocks 10,11,12,13,14,18,19,20 this is just stored as "10-14,18-20".
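The idea behind such an extent-based representation fits into a few lines. The following Python sketch is only an illustration of the principle, not the data structure PlasmaFS uses internally, and the function names are made up.

```python
def compress_blocks(blocks):
    """Turn a sorted list of block numbers into (first, last) extents."""
    extents = []
    for b in blocks:
        if extents and b == extents[-1][1] + 1:
            # Block continues the current run: extend the last extent.
            extents[-1] = (extents[-1][0], b)
        else:
            # Gap (or first block): start a new extent.
            extents.append((b, b))
    return extents

def format_extents(extents):
    """Render extents as text, e.g. [(10, 14), (18, 20)] as '10-14,18-20'."""
    return ",".join(f"{a}-{b}" if a != b else str(a) for a, b in extents)

text = format_extents(compress_blocks([10, 11, 12, 13, 14, 18, 19, 20]))
```

The longer the runs of consecutive blocks are, the shorter the list gets, which is why good block placement and compact blocklists go hand in hand.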
In short, these optimizations make it possible to treat many relatively small blocks as one virtual large block. Often, PlasmaFS manages to treat sequences of around 1000 blocks as if they were a single virtual block. As you see, this strategy is far better than starting with big blocks like 64M but not controlling their placement. Of course, the algorithms managing the block allocation and the block lists are now quite complicated, but this is relatively easy to handle in a modern programming language like Ocaml.
There are more advantages of small blocks: Less disk space is wasted, or in other words, it becomes affordable to store both medium-sized and large files in the filesystem without wasting too much disk space. Small blocks also give the system more freedom to manage available RAM. The compatibility with other software is better (like NFS).
- Isn't the two-phase commit protocol too slow for
map/reduce? - Well, this protocol makes the db writes a bit
slower, but not too much. Also, you can turn it off. So the real
question is: Aren't databases too slow for map/reduce?
Reading from a database is generally fast enough, especially when all data required for executing a job can be cached in RAM. The bottleneck is clearly the write access. In order to guarantee data safety, we generally wait until the data of a db transaction is synced to disk, so the disk limits the speed of write transactions. (Btw, use SSDs to improve this - at least several hundred write transactions per second should be easily possible, and the data sheets promise even more.)
There is another factor, however. If we can ensure that db transactions occur only rarely, the whole problem might go away. The Plasma design allows many blocks to be allocated in a single db transaction. For example, if the bigblock size is 64M and the filesystem block size only 1M, Plasma allocates 64 blocks in one transaction when it writes out a bigblock. By improving the Plasma code, this ratio could be made even bigger, maybe 1024 blocks per transaction. (The code change is not dramatic - one only has to keep the transaction open from one bigblock write to the next.)
The whole point is that the network already limits the speed of data processing, and usually becomes the bottleneck before the namenode does. For example, if you have a switching capacity of 10Gbit/s, around 16 bigblocks can be written per second at most (if the bigblock is 64M in size). This translates to 16 transactions per second. Even my laptop allows this. For 100Gbit/s (around 100 nodes) only 160 transactions per second are needed, and this is easily achieved if the db resides on an SSD.
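This back-of-the-envelope estimate can be reproduced in Python. The sketch assumes one db transaction per bigblock written, as described above. Note that the figure of 16 corresponds to about 1 GiB/s of usable payload, whereas the raw line rate of 10Gbit/s gives a slightly higher bound. The function name is made up.

```python
MIB = 1024 * 1024

def bigblocks_per_second(payload_bytes_per_second, bigblock_bytes):
    """Upper bound on bigblock writes (= db transactions) per second."""
    return payload_bytes_per_second / bigblock_bytes

# Raw line rate of 10 Gbit/s, no protocol overhead: about 18.6 per second
raw = bigblocks_per_second(10e9 / 8, 64 * MIB)
# Assuming about 1 GiB/s of usable payload: 16 per second
usable = bigblocks_per_second(1024 * MIB, 64 * MIB)
```

Either way the number of transactions per second stays in a range that a single database can easily handle.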
So, the current design is fast enough even for larger clusters when you put the db onto an SSD. There is of course still a scalability problem - for several hundred nodes and ultrafast networks the bottleneck persists. So the question finally translates into: Is the design capable of supporting even these big deployments? Is there another parameter that can be used for scaling up?
The design allows at least this improvement: The database could be split so that we achieve a horizontal partitioning. Every db partition is stored on an independent disk (PostgreSQL allows this by defining tablespaces). The partitions are chosen so that typical writes only need to access a single partition. For example, there could be several partitions for storing the blocklists, and the partition is determined by hashing the file ID. Of course, it is not as simple as this - the namenode needs to know about the partitioning, and must drive the database so that writes to different partitions can happen in parallel. This is complicated but imaginable.
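To illustrate just the "hash the file ID" part of this idea, here is a tiny Python sketch. It is purely hypothetical: Plasma does not implement this partitioning, and both the function name and the choice of CRC32 as hash function are made up for the example.

```python
import zlib

def partition_for(file_id, partitions):
    """Map a file ID to a partition index by hashing (illustrative only)."""
    digest = zlib.crc32(str(file_id).encode("ascii"))
    return digest % partitions

# All blocklist writes for one file go to the same partition, so writes
# for different files can proceed in parallel on different disks.
assignment = {file_id: partition_for(file_id, 4) for file_id in range(1000)}
```

The hard part, as said above, is not the hashing but teaching the namenode to drive the partitions in parallel.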
Btw, the current Plasma release does not know anything about the network topology, so we probably have to first talk about this before scaling up the database.
Troubleshooting
- The namenode does not find any datanode. Why? - The symptom
is a log message like "Multicast request cannot be routed". Probably there
is a problem with multicasting. First, check whether the MULTICAST
flag is on for the network device (e.g. eth0, both on the namenode and
all datanodes). Do this by looking at the output of:
ifconfig eth0
This flag is e.g. off for the "lo" interface, so a local-loopback-only installation does not work (workaround: configure eth0 even if there is no cable attached). Second, check whether there is a multicast route or a default route on the namenode. Do this by running:
route -n
If there is a line starting with 224.0.0.0 or 0.0.0.0 everything is fine. Otherwise, add the route:
route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
Multicast should now work on the local LAN segment. You may have to do more if there are (level 3) routers between the namenode and the datanode (e.g. increase the multicast_ttl value in the namenode config, or configure a multicast group).
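If you want to automate the route check, the rule "a line starting with 224.0.0.0 or 0.0.0.0" can be applied to the text printed by route -n. The following Python sketch only inspects a string. The function name is made up, and the sample output is a typical example rather than output from a real Plasma machine.

```python
def has_multicast_route(route_output):
    """True if `route -n` output has a 224.0.0.0 or 0.0.0.0 destination line."""
    for line in route_output.splitlines():
        fields = line.split()
        if fields and fields[0] in ("224.0.0.0", "0.0.0.0"):
            return True
    return False

# Illustrative sample: only a local network route, no default route.
sample = """Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
"""
needs_route = not has_multicast_route(sample)
```

For this sample the check fails, so the multicast route would have to be added as shown above.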
- How do I check whether the PlasmaFS cluster has started up
successfully? - Check whether you can access it with the plasma
command-line utility, and whether this utility reports that all datanodes are alive. For example, call it like
plasma fsstat -namenode HOST:PORT -cluster NAME -auth proot
Replace HOST, PORT, and NAME with the real values that apply to your cluster. You will be asked for a password. This can be found in clusterconfig/instances/NAME/password_proot, and is a long hex number (copy and paste it with the mouse). The output is something like:
BLOCK STATS:
Total: 10000 blocks
Used: 4496 blocks
Transitional: 0 blocks
Free: 5504 blocks
DATANODE STATS:
Enabled nodes: 2
Alive nodes: 2
If you get this output, but with numbers that are too low, it is likely that not all datanodes have been found (check the namenode log file). Otherwise, this utility normally complains that it cannot find the coordinator. The message may contain further hints:
- Authentication failed (Auth_failed): This means that the password is wrong. Check whether you have installed the right instance.
- Authentication too weak (Auth_too_weak): This error is normally not possible when you call plasma with the -auth switch. If you do it without, you need the help of the authentication daemon, which is probably not running on the machine.
- RPC program not available (Unavailable_program): The startup of the namenode is incomplete. The namenode port is already listening for TCP connections, but the startup procedure has not reached the stage where all RPC functions are enabled on it. Look into the namenode log file for errors.
- ".plasmafs: No such file or directory": You did not specify both -namenode and -cluster when calling plasma. In this case, the utility looks for a config file .plasmafs for missing pieces of configuration.
- EFAILED error code: This is normally a serious error in the namenode. Check the namenode log files.
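For scripted health checks, the "all datanodes alive" test on the fsstat output can be done with two regular expressions. This Python sketch works on the text only and assumes the labels "Enabled nodes:" and "Alive nodes:" appear as in the sample output above. The function name is made up.

```python
import re

def datanodes_ok(fsstat_output):
    """True if the output reports at least one datanode and all are alive."""
    enabled = re.search(r"Enabled nodes:\s*(\d+)", fsstat_output)
    alive = re.search(r"Alive nodes:\s*(\d+)", fsstat_output)
    if not enabled or not alive:
        return False
    n_enabled = int(enabled.group(1))
    return n_enabled > 0 and int(alive.group(1)) == n_enabled

sample = ("BLOCK STATS: Total: 10000 blocks Used: 4496 blocks "
          "Transitional: 0 blocks Free: 5504 blocks "
          "DATANODE STATS: Enabled nodes: 2 Alive nodes: 2")
healthy = datanodes_ok(sample)
```

An error message instead of the statistics makes the function return False, so the hints in the list above still have to be read by a human.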
- I cannot mount PlasmaFS via NFS. How to debug? - First,
you should check whether the NFS bridge is running (with ps).
If so, the next source of errors is an incorrect mount command.
For Linux clients, the full mount command should look like:
mount -t nfs -o proto=tcp,port=2801,mountport=2800,mountproto=tcp,nfsvers=3,nolock,noacl,nordirplus,sec=sys 127.0.0.1:/test /mnt
Let's go through the options:
- proto=tcp and mountproto=tcp: The NFS bridge only supports TCP-based mounts. Normally, the client should figure this out automatically, though.
- port=2801 and mountport=2800: The distributed config file for the NFS bridge uses non-standard ports. The NFS client needs these settings. Also, note that the NFS bridge does not notify portmapper about its port numbers.
- nfsvers=3: The NFS bridge only implements NFS version 3.
- nolock: There is no support for the locking add-on protocol.
- noacl: There is no support for the ACL add-on protocol.
- nordirplus: There is no support for the READDIRPLUS procedure (normally, the client should figure this out automatically).
- sec=sys: The client should use "system authentication", i.e. it only transmits the user ID without any credentials. Note that the NFS bridge only accepts this style of authentication if the client uses a port number < 1024.
- 127.0.0.1: This is the IP address where the NFS bridge is running.
- /test: This is the name of the PlasmaFS cluster to access, preceded by a slash. (The background of this is that future versions of the bridge will support accessing several clusters simultaneously, so one needs to tell the bridge which cluster to mount.)
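If you mount from scripts, it can help to assemble the command from its parts so that no option gets lost. The following Python sketch builds the argument list for the mount command shown above. The function name and its parameters are made up, and the default ports are the non-standard ones from the distributed config file.

```python
def nfs_mount_command(host, cluster, mountpoint, port=2801, mountport=2800):
    """Build the argument list for mounting a PlasmaFS cluster via the NFS bridge."""
    options = [
        "proto=tcp", f"port={port}", f"mountport={mountport}",
        "mountproto=tcp", "nfsvers=3", "nolock", "noacl",
        "nordirplus", "sec=sys",
    ]
    # The export path is the cluster name preceded by a slash.
    return ["mount", "-t", "nfs", "-o", ",".join(options),
            f"{host}:/{cluster}", mountpoint]

command = " ".join(nfs_mount_command("127.0.0.1", "test", "/mnt"))
```

The list form can be passed directly to subprocess.run, which avoids quoting problems. The joined string is the same command line as given above.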
More Questions
- Is there a mailing list? - Yes.