FAQ

an incomplete list - by gerd, 2011-10-12

A list of FAQs.

Features

  • What is Plasma? - At its core is a distributed filesystem, PlasmaFS, designed especially for large files. Right now, there are two application sub-projects: Map/Reduce transforms large files, e.g. by sorting the records in the files. The key/value db stores key/value pairs for fast read access. All parts of Plasma are highly scalable.
     
  • What can I do with Map/Reduce? - Map/Reduce is a framework for transforming large quantities of data (at least several terabytes). The data is represented as a sequence of records. The records have to include a key, but otherwise the user can include any kind of data. The transformation is configurable by the user. Two parts of the transformation in particular are defined by the user: The map function is a preprocessing step where the raw data can be brought into a shape so that the key is easily extractable. The reduce function is a postprocessing step that can compress (fold) the rearranged data. The core transformation between the map and reduce functions can best be imagined as sorting all data by key.
     
    The execution of the job is done in a distributed way on many computers.
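
    The three steps (map, sort by key, reduce) can be sketched on a single machine as follows. This is only an illustration of the idea, not Plasma's API: the names map_fn, reduce_fn and run_job, and the "word count" record format, are made up for this example.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Hypothetical map step: bring a raw line into (key, value) shape.
    word, count = line.split()
    return (word, int(count))

def reduce_fn(key, values):
    # Hypothetical reduce step: fold all values of one key into one record.
    return (key, sum(values))

def run_job(lines):
    mapped = [map_fn(line) for line in lines]      # map
    mapped.sort(key=itemgetter(0))                 # core step: sort by key
    return [reduce_fn(key, [value for _, value in group])   # reduce
            for key, group in groupby(mapped, key=itemgetter(0))]

print(run_job(["b 2", "a 1", "b 5", "a 3"]))  # [('a', 4), ('b', 7)]
```

    In a real job the sort step is the part that is distributed over many computers; the sketch only shows how the user-defined functions fit around it.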
     
  • How can I use PlasmaFS? - PlasmaFS is a POSIX-compliant filesystem. As of now, there are three ways of accessing a PlasmaFS volume: by using the plasma command-line client, by using the native API (a network protocol), or by mounting the volume via NFS.
     
  • What kind of filesystem is PlasmaFS? - PlasmaFS differs in many ways from a "normal" filesystem such as ext3 that is used for local disks. PlasmaFS is distributed - this means it runs on many computers, and can bundle all disks of these computers into a single filesystem volume. Also, it can access all disks in parallel, even for a single file, which may result in big speedups. The files can be stored in a replicated way: All blocks of the files are stored on several computers, so that they remain accessible when machines crash. Best of all, no special hardware is needed - normal server hardware will do. However, PlasmaFS is also a specialized filesystem. It focuses on large files, and is typically run with block sizes of 64K to 1M. The latency for accessing files is higher than for a local filesystem (it first has to figure out where the files are stored in the network). Last but not least, PlasmaFS does not like it when parts of files are overwritten frequently. Ideally, PlasmaFS files are written once as a whole - this gives the filesystem a better idea of how to allocate blocks in the network. (So e.g. running a database on top of PlasmaFS would be slow, although it is possible.)
     
    PlasmaFS is implemented as a user-space program (no kernel module).
     
  • Can I run a replicated NFS service on top of PlasmaFS? - Yes, you can: Just install PlasmaFS, and as many Plasma NFS bridges as you need. These bridges translate the NFS protocol into Plasma's own network protocol. The result is a high-speed, scalable, and replicated filesystem accessible via NFS.
     
  • How safe is PlasmaFS? - The design of PlasmaFS protects against many kinds of failures. All data and metadata can be replicated, i.e. stored on more than one machine (protection against machine crashes). The consistency is ensured at all times: ACID-compliant transactions are used for modifying file data and metadata. For example, it cannot happen that a modification is only partly done (e.g. because a machine fails during a transaction). For the metadata updates, PlasmaFS uses the two-phase commit protocol to ensure safety. In general, the design allows for better data safety than for a local filesystem with journaled updates (because of replication).
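
    The "all or nothing" property of two-phase commit can be sketched generically as follows. This is the textbook form of the protocol, not PlasmaFS code; the Participant class and its methods are hypothetical.

```python
class Participant:
    """Hypothetical stand-in for one machine taking part in a commit."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: promise to commit, or refuse.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: ask the participants to prepare; stop at the first refusal.
    if all(p.prepare() for p in participants):
        # Phase 2a: everybody agreed, so the change becomes visible everywhere.
        for p in participants:
            p.commit()
        return True
    # Phase 2b: at least one refusal, so nobody applies the change.
    for p in participants:
        p.abort()
    return False

nodes = [Participant("nn1"), Participant("nn2", can_commit=False)]
print(two_phase_commit(nodes), [p.state for p in nodes])
```

    The point of the sketch is that a modification is either applied on all participants or on none of them.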
     
  • How correct is PlasmaFS? - The design of the filesystem (the protocols, and the way the implementation is designed) is based on techniques that have long been used for distributed systems, e.g. RPC and two-phase commit. It is likely that all design bugs on this level are eliminated now. Of course, it is nevertheless possible that the implementation is incorrect at some point. However, a number of safety measures have been applied: (1) The implementation is done in one of the safest programming languages available (Ocaml). This language uses intriguing techniques to keep the programmer from making errors, e.g. by applying inference methods borrowed from the field of artificial intelligence. (2) The lower half of the namenode is actually an existing database system, namely PostgreSQL. By using it, Plasma profits a lot from the work that has been put into this product. In particular, PostgreSQL ensures that the namenode data is stored safely. (3) PlasmaFS has been designed so that there are only a few critical code paths (e.g. the commit code path). If these are correct, wrong namenode data can no longer be stored. During the development of PlasmaFS, this feature has often been "tested" unintentionally - bugs in non-critical code paths just throw exceptions that are caught at some point, or at most the namenode server just stops working (which is still better than destroying data). (4) There are test programs for the complicated data structures (e.g. for the compressed block list representation). (5) Beyond that, the system has been tested a lot by running sample applications (e.g. map/reduce runs).
     
  • How secure is PlasmaFS? - PlasmaFS is now highly secure (since version 0.4). All accesses to the namenodes are authenticated, encrypted and integrity-protected. Accesses to the datanodes are at least authenticated, and higher levels of security can be requested. A simple ticket-based authentication system is used.
     
  • How many computers can be added to a PlasmaFS cluster? - There is no limit in the design. With higher numbers of machines, however, the namenodes can become the bottleneck. Benchmarks suggest that the effective limit with standard hardware is somewhere in the range of 100-500 nodes, but this has not yet been explicitly tested.
     
  • Can PlasmaFS be extended? - Yes. Further machines can be added without downtime.
     
  • Plasma KV: What can I do with it? - The key/value database is a so-called "NoSQL" database (not only SQL). Plasma KV is just a library that can be linked to programs accessing databases, very much like GDBM or Berkeley DB. Because the database is stored in a PlasmaFS file, one gets a number of interesting properties: The database is automatically replicated. It can be accessed from any node in the PlasmaFS network. Read accesses to the database are highly scalable (beyond any reasonable limit). As PlasmaFS provides ACID transactions, the databases can also profit from them, and reads and writes to the same database can be isolated from each other (in particular, one can modify the database while there are readers). Plasma KV targets applications that need ultra-high speed for read accesses to large databases, and where writes do not occur too often.
     
  • Map/Reduce: How large can the files be? - There is no limit in the design. Unlike other implementations of Map/Reduce, Plasma also handles all corner cases well. For example, it cannot happen that a single machine is overloaded with data so that the job fails because of this (e.g. when a large chunk of data is sorted on a single machine only).
     
  • Do I have to port all my Map/Reduce programs to Ocaml? - No. There is a streaming mode as in Hadoop. The implementations of the map and reduce functions can be done in any programming language. Plasma simply spawns external programs, and pipes the data through them.
     
  • What is the connection between Map/Reduce and PlasmaFS? - For running Map/Reduce jobs a filesystem is needed that can store large files. The filesystem should be accessible from all computers that run parts of the job. Typically, the same set of computers is used for the filesystem and for running the job - this works well because the filesystem mostly needs I/O resources whereas the job needs the CPU.
     
    The filesystem also provides additional query functions that are typically not found in filesystem implementations. In particular, the Map/Reduce job executor can find out where the files are stored. This allows for an important optimization: The parts of the job accessing certain parts of a file are executed on the machine storing these parts. This avoids network traffic ("move the algorithm, not the data"). There are further special access functions: Jobs can request that files are preferably stored on the local machine if there is still space. This avoids network traffic, too. PlasmaFS provides a fast access path via shared memory, so such local-only traffic does not even hit the network stack.
     
    Although it would be possible to implement Map/Reduce on top of an arbitrary distributed filesystem, none of these optimizations would be available, and the jobs would run much slower.

Build

  • How do I build Plasma? - Plasma requires a number of prerequisite software packages. In short, this is described in the INSTALL file of the distributed tarball. Doing that can take some time. Note that the prerequisite software is often not completely available from Linux distributions (or outdated).
     
    A better way of building Plasma is to install GODI, a distribution of Ocaml that works on Linux and other operating systems, and that includes all the prerequisite software. If you select the 3.12 branch of GODI, Plasma will be offered there as package "godi-plasma". The build process in GODI is automated, so you do not have to type "configure" or "make".
     
    For a few popular Linux distributions there is a script that fully automates the build via GODI, so you only have to start this script, and the software is installed into a directory you specify.
     
  • Are there binary downloads? - There is now a binary release, but only for Linux on AMD64 platforms (see download page). This release is restricted as no custom map/reduce programs can be compiled with it. It is nevertheless useful for exploring PlasmaFS and the streaming mode of map/reduce.

Installation

  • How do I install Plasma? - After the build, Plasma needs to be installed on the machines. Read the Deployment Guide.
     
  • Do I need the portmapper? - Although Plasma uses SunRPC, it does so without the portmapper.

Configuration

  • Can a machine be both namenode and datanode? - Sure it can. Note, however, that the types of load put onto namenodes and datanodes do not play well together. Both kinds of nodes are I/O-bound, but namenodes need lots of disk seeks to do their work whereas datanodes access the disks in a more sequential way. For small networks this probably does not cause problems, but if the load of the namenodes grows, this can easily become a bottleneck.
     
  • Can a machine be both datanode for PlasmaFS and tasknode for map/reduce? - Sure it can, and this is even recommended. An optimization tries to execute tasks on the datanodes that store most of the data processed by these tasks. This avoids network traffic.
     
  • Is it better to put host names or IP addresses into namenode.hosts? - The file namenode.hosts lists the nodes that play the role of namenodes. Starting with release 0.3, either host names or IP addresses can be put into this file (before 0.3, only host names were supported). Host names and IP addresses have different tasks in a network. Basically, the difference is that a host name can be seen as the unique name of a machine while an IP address says how a network packet reaches a network interface. A machine usually has several network interfaces (at least from an OS point of view, also counting "lo" and "ppp0"). Because of this, an IP address is a fragile identifier - it can happen that the same machine is reachable by different IP addresses, and certain IP addresses like 127.0.0.1 do not identify any particular machine at all. Also, sometimes IP addresses need to be changed when a machine is physically moved to a new location.

    Because of these difficulties there is a clear preference for using host names: Each machine has a unique identifier (returned by gethostname), and it can be ensured that this identifier resolves to the right IP address, depending on where the resolution is done.

    Since version 0.3 IP addresses are also supported, provided that each machine has only one IP address (in addition to 127.0.0.1). This is useful as a fallback method if the host names cannot be set up properly (e.g. missing root privileges).

    The other ".hosts" files (datanode.hosts, tasknode.hosts, nfsnode.hosts) are less sensitive to problems than namenode.hosts. It should be no problem to use either form of referencing machines here.
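
    Whether the host name of a machine resolves to a usable address can be checked with a few lines of code. The following sketch is not part of Plasma; the function name check_host_name is made up, and treating a loopback result as "not ok" is an assumption derived from the remark about 127.0.0.1 above.

```python
import ipaddress
import socket

def check_host_name(resolve=socket.gethostbyname, name=None):
    """Return (name, address, ok); ok is False if the name resolves to loopback."""
    name = name or socket.gethostname()
    address = resolve(name)
    ok = not ipaddress.ip_address(address).is_loopback
    return name, address, ok

if __name__ == "__main__":
    # Run this on each machine listed in namenode.hosts.
    print(check_host_name())
```

    The resolve parameter only exists so that the check can be tried out with a fake resolver; by default the system resolver is used.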
     

  • How much RAM is needed for map/reduce? - There are a number of configuration parameters that affect the RAM consumption. Generally, it is the tasks that need a lot of RAM, and much less the namenode and datanodes. Of course, it mainly depends on how many tasks are run on a tasknode. As of Plasma 0.3, there is a sophisticated mechanism in place that prevents too many tasks from being run on a node (so that swapping can usually be avoided). If memory becomes tight, the tasks switch to a RAM-saving mode, and if not enough memory is available, tasks can even be delayed. This mechanism should prevent slowdowns if too much load is put onto a single machine because of misconfiguration.

    Generally, the tasks allocate two sorts of RAM: normal private RAM, and shared memory. Private RAM is only partly covered by the mentioned mechanism, but at least the bigger buffers are taken into account. The shared buffers are all managed by this mechanism. One can set parameters that define when memory is considered tight or even maxed out, and these parameters exist for both types of RAM (see the documentation for the module Mapred_config). For many operating systems Plasma knows how to get defaults for these parameters, so it is usually not necessary to set them. The default limits are: 75% of the available physical RAM can be used for private buffers, and 12% of the physical RAM can be used for shared memory.

    Now, buffers are used for many things. Usually, however, there are only two dominating buffer types: Sort buffers, and file buffers. The size of the sort buffers can (and must) be set with a parameter (sort_size). Because of some auxiliary buffers, the actual amount of RAM required for a sort task is higher (if the lines of the files are very short, this can be up to 3 times this configured size, but this is an extreme case). A good value for sort_size is the amount of RAM divided by the maximum number of tasks running on the node, and this divided by 2 or 3.
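
    The rule of thumb for sort_size can be written down as a small calculation. The function name and the example figures (16G of RAM, 8 tasks) are made up for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def suggested_sort_size(ram_bytes, max_tasks, divisor=3):
    # RAM divided by the maximum number of tasks, and this divided by 2 or 3.
    if divisor not in (2, 3):
        raise ValueError("the rule of thumb divides by 2 or 3")
    return ram_bytes // max_tasks // divisor

# Hypothetical node: 16G of RAM, at most 8 tasks running at the same time.
print(suggested_sort_size(16 * GIB, 8, divisor=2) // MIB, "MiB")  # 1024 MiB
```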

    The size of the file buffers is controlled by the parameter buffer_size, and if RAM is tight, by the smaller value buffer_size_tight. Both values must be at least as large as the bigblock size (a file buffer must at least hold a bigblock). Now, the tasks needing the most file buffers are the shuffle tasks. This type of task reads from several input files (up to merge_limit) and writes to several output files (up to split_limit). For example, if the bigblocks are 64M, and the shuffle tasks open up to 4 files, and you want to run 16 such tasks in parallel, you need 4G of RAM for the file buffers alone (64M * 4 * 16).
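
    The file buffer estimate is a plain product, which can be sketched as follows. The function name is made up; the figures are the ones from the example above, assuming one buffer of bigblock size per open file.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def file_buffer_ram(buffer_size, open_files_per_task, parallel_tasks):
    # One file buffer per open file, for every task running in parallel.
    return buffer_size * open_files_per_task * parallel_tasks

print(file_buffer_ram(64 * MIB, 4, 16) // GIB, "GiB")  # 4 GiB
```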

    The recommended minimum of RAM is roughly calculated by:

    8 * number of cores * bigblock size

    For example, a quad-core system should be equipped with at least 2G of RAM if the bigblocks are 64M in size (8 * 4 * 64M = 2G). But this is really a minimum. More RAM will make the system faster.
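
    The same minimum as a small helper, in case you want to evaluate it for several machine types. The function name is made up.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def minimum_ram(cores, bigblock_size):
    # Recommended minimum: 8 * number of cores * bigblock size.
    return 8 * cores * bigblock_size

print(minimum_ram(4, 64 * MIB) // GIB, "GiB")  # 2 GiB
```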
     
  • How much RAM is needed for the namenode? - The namenode runs a database, and of course it is advantageous if enough RAM is available that the relevant parts of the db fit into memory. (If you don't have an idea now, just assume 1G of RAM.)

    Another factor is that the namenodes load the blockmaps into RAM. The blockmaps store which filesystem blocks are allocated and which are free. Fortunately, only 2 bits are needed per block, plus some management fields per block row. I haven't calculated an exact number yet, but it is in total probably not more than a byte per block.

    So, e.g. for managing 1 million blocks you would need 1M of RAM. If the block size is 1M, this is enough for 1T of disk space. As you see, it is not that much memory, even for large clusters - for 1000T of disk space 1G of management RAM is needed.
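
    This estimate can be reproduced with the following sketch. The figure of one byte per block is the rough upper bound mentioned above, not an exact number, and the function name is made up.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4

def blockmap_ram(disk_bytes, block_size, bytes_per_block=1):
    # Rough upper bound: about one byte of namenode RAM per filesystem block.
    blocks = disk_bytes // block_size
    return blocks * bytes_per_block

print(blockmap_ram(1 * TIB, 1 * MIB) // MIB, "MiB")     # 1 MiB for 1T of disk
print(blockmap_ram(1000 * TIB, 1 * MIB) // MIB, "MiB")  # 1000 MiB, i.e. about 1G
```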

Design

  • Why does Plasma focus on rather small block sizes like 1M? - Other map/reduce implementations recommend much larger block sizes, such as 64M or even more. Of course, you can run Plasma also with such large blocks, but it is not necessary. You should see this as a feature - the support for smaller blocks allows a more fine-grained disk and RAM management, so resources can be better allocated. This is especially valuable when not all jobs process large amounts of data, but there are jobs with different data volumes.

    There are two features in Plasma that make smaller blocks possible. First, Plasma controls where the blocks are placed on disk. The block allocation algorithm tries to allocate the blocks of a file so that they end up close to each other on the disk. (This "control" works by pre-allocating the data space as a large file.) Ideally, if files are written sequentially, the blocks are placed sequentially on the disk, as far as possible. Practice shows that good block placements can often be achieved.

    Second, the blocklists are stored in a compressed way (sometimes called "extent-based" by filesystem developers). For example, if a file consists of the blocks 10,11,12,13,14,18,19,20 this is just stored as "10-14,18-20".
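
    The idea of the compressed representation can be sketched as follows. This only illustrates the principle; the actual data structure in PlasmaFS is more involved, and the function name is made up.

```python
def compress_blocklist(blocks):
    """Turn an ascending list of block numbers into 'first-last' extents."""
    extents = []
    for block in blocks:
        if extents and block == extents[-1][1] + 1:
            extents[-1][1] = block           # block continues the current extent
        else:
            extents.append([block, block])   # block starts a new extent
    return ",".join(str(first) if first == last else f"{first}-{last}"
                    for first, last in extents)

print(compress_blocklist([10, 11, 12, 13, 14, 18, 19, 20]))  # 10-14,18-20
```

    The better the block placement on disk, the longer the extents become, and the less space the blocklist needs.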

    In short, these optimizations make it possible to treat many relatively small blocks as a virtual large block. Often, PlasmaFS manages to treat sequences of around 1000 blocks as if they were a single virtual block. As you see, this strategy is far better than starting with big blocks like 64M but not controlling their placement. Of course, the algorithms managing the block allocation and the block lists are now quite complicated, but this is relatively easy to handle in a modern programming language like Ocaml.

    There are more advantages of small blocks: Less disk space is wasted, or in other words, it becomes affordable to store both medium-sized and large files in the filesystem without wasting too much disk space. Small blocks also give the system more freedom to manage available RAM. The compatibility with other software is better (like NFS).

  • Isn't the two-phase commit protocol too slow for map/reduce? - Well, this protocol makes the db writes a bit slower, but not too much. Also, you can turn it off. So the real question is: Aren't databases too slow for map/reduce?

    Reading from a database is generally fast enough, especially when all data required for executing a job can be cached in RAM. The bottleneck is clearly the write access. In order to guarantee data safety, we generally wait until the data of a db transaction is synced to disk, so the disk limits the speed of write transactions. (Btw, use SSDs to improve this - at least several hundred write transactions per second should be easily possible, and the data sheets promise even more.)

    There is another factor, however. If we can achieve that db transactions only occur rarely, the whole problem might go away. The Plasma design allows many blocks to be allocated in a single db transaction. For example, if the bigblock size is 64M and the filesystem block size only 1M, Plasma allocates 64 blocks in one transaction when it writes out a bigblock. By improving the Plasma code, this ratio could even be made bigger, maybe 1024 blocks per transaction. (The code change is not dramatic - one only has to keep the transaction open from one bigblock write to the next.)

    The whole point is now that the network already limits the speed of data processing, and usually becomes the bottleneck before the namenode. For example, if you have a switching capacity of 10Gbit/s, at most 16 bigblocks are written per second (if the bigblock is 64M in size). This translates to 16 transactions per second. Even my laptop allows this. For 100Gbit/s (around 100 nodes) only 160 transactions per second are needed, and this is easily achieved if the db resides on an SSD.
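
    The transaction rates can be reproduced with a small calculation. The sketch assumes that 10Gbit/s carries roughly 1G of payload per second, which is the approximation that leads to the figures above; the function names are made up.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def blocks_per_transaction(bigblock_size, block_size):
    # One db transaction per bigblock write.
    return bigblock_size // block_size

def transactions_per_second(link_gbit, bigblock_size=64 * MIB):
    # Assumption: 10 Gbit/s carries roughly 1 GiB of payload per second.
    payload_bytes_per_second = link_gbit / 10 * GIB
    return payload_bytes_per_second / bigblock_size

print(blocks_per_transaction(64 * MIB, 1 * MIB))  # 64
print(transactions_per_second(10))                # 16.0
print(transactions_per_second(100))               # 160.0
```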

    So, the current design is fast enough for even larger clusters when you put the db onto an SSD. There is of course some scalability problem - for several hundred nodes and ultrafast networks the bottleneck still persists. So the question finally translates into: Is the design capable of supporting even these big deployments? Is there some more parameter that can be used for scaling up?

    The design allows at least this improvement: The database could be split so that we achieve a horizontal partitioning. Every db partition is stored on an independent disk (PostgreSQL allows this by defining tablespaces). The partitions are chosen so that typical writes only need to access a single partition. For example, there could be several partitions for storing the blocklists, and the partition is determined by hashing the file ID. Of course, it is not as simple as this - the namenode needs to know about the partitioning, and must drive the database so that writes to different partitions can happen in parallel. This is complicated but imaginable.
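
    Selecting the partition by hashing the file ID could look like the following sketch. Nothing of this exists in Plasma; the function name, the choice of SHA-256, and the number of partitions are assumptions for illustration.

```python
import hashlib

def partition_for(file_id, num_partitions):
    # Use a stable hash so that every namenode process picks the same partition
    # (Python's built-in hash() for strings changes from run to run).
    digest = hashlib.sha256(str(file_id).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# All blocklist rows of one file end up in one partition, so a typical
# write only needs to touch a single disk.
for file_id in (1001, 1002, 1003):
    print(file_id, "-> partition", partition_for(file_id, 4))
```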

    Btw, the current Plasma release does not know anything about the network topology, so we probably have to first talk about this before scaling up the database.

Troubleshooting

  • The namenode does not find any datanode. Why? - The symptom is a log message like "Multicast request cannot be routed". Probably there is a problem with multicasting. First, check whether the MULTICAST flag is on for the network device (e.g. eth0, both on the namenode and all datanodes). Do this by looking at the output of:
    ifconfig eth0
    
    This flag is e.g. off for the "lo" interface, so a local-loopback-only installation does not work (workaround: configure eth0 even if there is no cable attached). Second, check whether there is a multicast route or a default route on the namenode. Do this by running:
    route -n
    
    If there is a line starting with 224.0.0.0 or 0.0.0.0 everything is fine. Otherwise, add the route:
    route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
    
    Multicast should now work on the local LAN segment. You may have to do more if there are (layer 3) routers between the namenode and the datanode (e.g. increase the multicast_ttl value in the namenode config, or configure a multicast group).
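
    If you want to automate the route check, the output of "route -n" can be scanned for the two destinations mentioned above. The following sketch is not part of Plasma; the function name and the sample routing table are made up.

```python
def has_multicast_route(route_output):
    """True if a line of `route -n` output starts with 224.0.0.0 or 0.0.0.0."""
    for line in route_output.splitlines():
        fields = line.split()
        if fields and fields[0] in ("224.0.0.0", "0.0.0.0"):
            return True
    return False

# Made-up sample without a multicast or default route.
sample = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
"""
print(has_multicast_route(sample))  # False
```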
     
  • How do I check whether the PlasmaFS cluster has started up successfully? - Check whether you can access it with the plasma command-line utility, and whether this utility reports that all datanodes are alive. For example, call it like
    plasma fsstat -namenode HOST:PORT -cluster NAME -auth proot
    
    Replace HOST, PORT, and NAME here with the real values that apply to your cluster. You will be asked for a password. This can be found in clusterconfig/instances/NAME/password_proot, and is a long hex number (copy and paste it with the mouse). The output is something like:
    BLOCK STATS:
    Total:                       10000 blocks
    Used:                         4496 blocks
    Transitional:                    0 blocks
    Free:                         5504 blocks
    
    DATANODE STATS:
    Enabled nodes:   2
    Alive nodes:     2
    
    If you get this output, but with too low numbers, it is likely that not all datanodes have been found (check the namenode log file). Otherwise, this utility normally complains that it cannot find the coordinator. The message may contain further hints:
    • Authentication failed (Auth_failed): This means that the password is wrong. Check whether you have installed the right instance.
    • Authentication too weak (Auth_too_weak): This error is normally not possible when you call plasma with the -auth switch. If you do it without, you need the help of the authentication daemon, which is probably not running on the machine.
    • RPC program not available (Unavailable_program): The startup of the namenode is incomplete. The namenode port is already listening for TCP connections, but the startup procedure has not reached the stage where all RPC functions are enabled on it. Look into the namenode log file for errors.
    • ".plasmafs: No such file or directory": You did not specify both -namenode and -cluster when calling plasma. In this case, the utility looks for a config file .plasmafs for missing pieces of configuration.
    • EFAILED error code: This is normally a serious error in the namenode. Check the namenode log files.
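
    For monitoring scripts, the datanode counts can be extracted from the fsstat output. This sketch relies only on the output format shown above; the function names are made up, and the format may differ in other releases.

```python
import re

def datanode_stats(fsstat_output):
    """Return (enabled, alive) as printed in the DATANODE STATS section."""
    enabled = re.search(r"Enabled nodes:\s*(\d+)", fsstat_output)
    alive = re.search(r"Alive nodes:\s*(\d+)", fsstat_output)
    if not (enabled and alive):
        raise ValueError("no DATANODE STATS found in output")
    return int(enabled.group(1)), int(alive.group(1))

def all_datanodes_alive(fsstat_output):
    enabled, alive = datanode_stats(fsstat_output)
    return enabled > 0 and alive == enabled

sample = """\
DATANODE STATS:
Enabled nodes:   2
Alive nodes:     2
"""
print(datanode_stats(sample), all_datanodes_alive(sample))  # (2, 2) True
```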

     
  • I cannot mount PlasmaFS via NFS. How to debug? - First, you should check whether the NFS bridge is running (with ps). If so, the next source of errors is an incorrect mount command. For Linux clients, the full mount command should look like:
    mount -t nfs -o proto=tcp,port=2801,mountport=2800,mountproto=tcp,nfsvers=3,nolock,noacl,nordirplus,sec=sys 127.0.0.1:/test /mnt
    
    Let's go through the options:
    • proto=tcp and mountproto=tcp: The NFS bridge only supports TCP-based mounts. Normally, the client should figure this out automatically, though.
    • port=2801 and mountport=2800: The distributed config file for the NFS bridge uses non-standard ports. The NFS client needs these settings. Also, note that the NFS bridge does not notify portmapper about its port numbers.
    • nfsvers=3: The NFS bridge only implements NFS version 3.
    • nolock: There is no support for the locking add-on protocol.
    • noacl: There is no support for the ACL add-on protocol.
    • nordirplus: There is no support for the READDIRPLUS procedure. Normally, the client should figure this out automatically, though.
    • sec=sys: The client should use "system authentication", i.e. it only transmits the user ID without any credentials. Note that the NFS bridge only accepts this style of authentication if the client uses a port number < 1024.
    • 127.0.0.1: This is the IP address where the NFS bridge is running.
    • /test: This is the name of the PlasmaFS cluster to access, preceded by a slash. (The background of this is that future versions of the bridge will support accessing several clusters simultaneously, so one needs to tell the bridge which cluster to mount.)
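
    If the mount command is generated by a deployment script, it can be assembled from the options discussed above. The following sketch is not part of Plasma; the function name is made up, and the default port numbers are the ones from the example command.

```python
import shlex

def nfs_mount_command(bridge_host, cluster, mount_point,
                      nfs_port=2801, mount_port=2800):
    options = [
        "proto=tcp", f"port={nfs_port}", f"mountport={mount_port}",
        "mountproto=tcp", "nfsvers=3", "nolock", "noacl", "nordirplus",
        "sec=sys",
    ]
    # The cluster name is given as the exported path, preceded by a slash.
    return ["mount", "-t", "nfs", "-o", ",".join(options),
            f"{bridge_host}:/{cluster}", mount_point]

print(shlex.join(nfs_mount_command("127.0.0.1", "test", "/mnt")))
```

    The function returns an argument list, so it can be passed directly to subprocess.run without going through a shell.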

More Questions

  • Is there a mailing list? - Yes.
This web site is published by Informatikbüro Gerd Stolpmann