This document assumes that PlasmaFS is already built and installed. It explains how to deploy it on the various nodes that are part of a PlasmaFS cluster.
Planning
Minimum
A PlasmaFS cluster needs at least one namenode and one datanode. For getting started it is possible to put both kinds of servers on the same machine.
The smallest reasonable blocksize is 64K. Blocksizes of 1M already give a measurable speed improvement (a better ratio of metadata operations per amount of accessed data). Blocksizes beyond 64M may be broken in the current code base.
Big blocksizes also mean big buffers. With a blocksize of 64M, an application already consumes 6.4G of memory for buffers if it "only" runs 10 processes where each process accesses 10 PlasmaFS files at the same time.
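The buffer-memory estimate is simple arithmetic; a sketch in shell, with the numbers mirroring the 64M example above:

```shell
# Buffer memory = processes x open files per process x blocksize.
blocksize=$((64 * 1024 * 1024))   # 64M blocksize
procs=10                          # concurrently running processes
files_per_proc=10                 # PlasmaFS files open per process
total=$(( procs * files_per_proc * blocksize ))
echo "$(( total / (1024 * 1024) ))M of buffer memory"   # prints 6400M, i.e. the ~6.4G from the text
```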
If NFS write performance is important, one should not configure blocksizes bigger than the NFS clients can support on the wire. For example, Linux supports up to 1M in standard kernels. It is better to stay below this value.
If you are undecided, choose a blocksize of 1M; it is a good compromise.
Namenode considerations
The number of namenodes can reasonably be increased to two or three,
but it is unwise to go beyond this number. The more namenodes are
included in the system the more complicated the commit
operation
becomes. At a certain point, there is no additional safety from adding
more namenodes.
The namenode database is best put on SSD "disks" (if possible even use several SSDs in RAID 1 or RAID 10 arrays). This database typically occupies only a few gigabytes, and its typical access pattern profits enormously from the low latencies provided by SSDs.
Datanode considerations
The number of datanodes is theoretically unlimited. For now it is unwise to add more datanodes than can be connected to the same switch, because there is no notion of network distance in PlasmaFS yet.
There is the limitation that there can be only one datanode identity per machine. Because of this it makes sense to create one big data volume from the available disks (using RAID 0, or JBOD as provided by volume managers like LVM), and to put the block data onto this volume.
It is advisable to use an extent-based filesystem for this volume, such as XFS. However, this is not a requirement - any filesystem with Unix semantics will do.
Raw partitions are not supported, and probably will never be.
Network
PlasmaFS has been developed for Gigabit networks. One should prefer switches with high bandwidth (there is often a limit on the total bandwidth a switch can handle before data packets are dropped). Ideally, all ports of the switch can simultaneously be run at full speed (e.g. 32 Gbit/s for a 16 port switch). SOHO switches often do not deliver this!
PlasmaFS uses multicasting to discover and monitor the datanodes. This should normally work "out of the box" with modern routers and switches. If routers are used, it might be required to adapt the dn_multicast_ttl configuration to the maximum number of routers between two nodes, plus 1 (e.g. with at most two routers between any pair of nodes, set it to 3).
It is assumed that hostnames resolve to IP addresses. It is allowed that the hostname resolves to a loopback IP address on the local machine. PlasmaFS never tries to perform reverse lookups.
PlasmaFS avoids DNS lookups as much as possible. However, when
PlasmaFS clients start up, DNS lookups are unavoidable. A well
performing DNS infrastructure is advisable. Actually, any alternate
directory system can also be used, because PlasmaFS only uses the
system resolver for looking up names. Even /etc/hosts
is ok.
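Since PlasmaFS goes through the system resolver, one can verify in advance that every node name resolves, regardless of whether DNS or /etc/hosts provides the answer. A sketch (m1, m2, m3 are placeholder hostnames):

```shell
# Check that all cluster hostnames resolve via the system resolver
# (NSS: DNS, /etc/hosts, or whatever else is configured).
# m1 m2 m3 are placeholder names; substitute your real nodes.
for h in m1 m2 m3; do
  if getent hosts "$h" >/dev/null; then
    echo "ok: $h"
  else
    echo "UNRESOLVABLE: $h"
  fi
done
```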
Starting point
The machine where PlasmaFS was built and installed is called the operator node in the following text. From the operator node the software is deployed. It is easy to use one of the namenodes or datanodes as operator node.
We assume all daemons on all machines are running as the same user. This user should not be root! It is also required that the nodes can be reached via ssh from the operator node, and that ssh logins are possible without password.
On the namenodes there must be PostgreSQL running (version 8.2 or newer). It must be configured as follows:
- PlasmaFS connects to the database over local (Unix domain) sockets with "ident" authentication. A line in pg_hba.conf of

  local all all ident

  will do the trick (often there by default). This permits logins if the hostname in connection configs is omitted.
- Create a database user for the PlasmaFS daemons: log in as postgres, and run:

  createuser <username>

  This will ask a few questions. The user must be able to create databases. Other privileges are not required.
- Check that this user can log in without a password:

  psql template1 </dev/null

  If it does not ask for a password, and if it does not emit an error, everything is fine.
- Set max_prepared_transactions in postgresql.conf to a positive value. Restart PostgreSQL.

The deployment is done from the clusterconfig directory. This directory is installed together with the other software. It is advisable not to change this version of clusterconfig directly, but to copy it to somewhere else, and to run the scripts on the copy.
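Putting the PostgreSQL requirements together, the relevant fragments of the two configuration files look like this (the value 20 is only an example; any positive value works):

```
# pg_hba.conf: permit ident-authenticated logins over local (Unix domain) sockets
local   all   all   ident

# postgresql.conf: PlasmaFS needs prepared transactions
max_prepared_transactions = 20
```

Restart PostgreSQL after changing either file.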
Overview over the deployment
The clusterconfig directory contains a few scripts, and a directory instances. In instances there is a subdirectory template with some files. So this looks like:
- clusterconfig: The scripts available here are run on the operator node to control the whole cluster
- clusterconfig/instances: This directory contains all the instances that are controlled from here, and templates for creating new instances. Actually, any instance can also be used as template - when being instantiated, the template is simply copied.
- clusterconfig/instances/template: This is the default template. It contains recommended starting points for config files.
Creating an instance
Create a new instance with name <inst> with:

./new_inst.sh <inst> <prefix> <blocksize>
The prefix
is the absolute path on the cluster nodes where the PlasmaFS
software is to be installed. The deploy_inst.sh
script (see below) will
create a directory hierarchy:
- <prefix>/bin: for binaries
- <prefix>/etc: config files, rc scripts, and passwords
- <prefix>/log: for log files
- <prefix>/data: data files (datanodes only)

The deploy_inst.sh script will create these directories only if they do not exist yet, either as directories or as symlinks to directories.
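Because symlinks are honored, the data area can be put on a separate (e.g. RAID-0/LVM) volume before deploying. A sketch with made-up paths (/tmp/plasma_prefix stands in for the real <prefix>, /tmp/bigdata for the big volume):

```shell
# Example paths only; adapt to your real prefix and data volume.
prefix=/tmp/plasma_prefix        # stand-in for <prefix>, e.g. /opt/plasma
volume=/tmp/bigdata              # stand-in for a big RAID-0/LVM volume

mkdir -p "$prefix" "$volume/plasma_data"
# deploy_inst.sh will follow this symlink instead of creating <prefix>/data:
ln -sfn "$volume/plasma_data" "$prefix/data"
```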
The blocksize
can be given in bytes, or as a number with suffix "K" or "M".
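The blocksize notation can be illustrated with a small helper (to_bytes is a hypothetical function for illustration, not part of the distribution):

```shell
# Hypothetical helper mirroring the accepted blocksize notation:
# plain bytes, or a number suffixed with "K" or "M".
to_bytes() {
  case "$1" in
    *K) echo $(( ${1%K} * 1024 )) ;;
    *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
    *)  echo "$1" ;;
  esac
}

to_bytes 64K   # prints 65536
to_bytes 1M    # prints 1048576
```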
Optionally, new_inst.sh
may be directed to use an alternate template:
./new_inst.sh -template <templ> <inst> <prefix> <blocksize>
templ
can be another instance that is to be copied instead of template
.
Configuring an instance
After new_inst.sh
there is a new directory <inst>
in instances
.
Go there and edit the files:
- namenode.hosts: Put here the hostnames of the namenodes
- datanode.hosts: Put here the hostnames of the datanodes
- authnode.hosts: Put here the hostnames of the authnodes (i.e. where the authentication daemon needs to run). The authnodes are required for password-less access to the filesystem. Usually they are only started on nodes where commands to PlasmaFS (or map/reduce) are entered. It is recommended to add at least the machine from which you normally start jobs.
- nfsnode.hosts: Put here the hostnames of the nodes that will run NFS bridges (may remain empty)
The new_inst.sh
script has also created passwords for the user IDs
"proot" and "pnobody" (which are used by the RPC subsystem for authentication).
The passwords are contained in the files password_proot
and password_pnobody
(in case you need them).
Deploying an instance
With
./deploy_inst.sh <inst>
one can install the files under prefix
on the cluster nodes.
The installation procedure uses ssh
to copy the files.
The option -only-config
restricts the copy to configuration files
and scripts, but omits executables. This is useful when the
configuration of a running cluster needs to be changed (when running,
the executable files cannot be overwritten).
Initializing the namenodes
This step creates the PostgreSQL databases on the namenodes:
./initdb_inst.sh <inst>
If the databases already exist, this step will fail. Use the
switch -drop
to delete databases that were created in error.
The databases have the name plasma_<inst>
.
After initializing the namenodes, it is possible to start the namenode server with
./rc_nn.sh start <inst>
This must be done before continuing with the initialization of the datanodes.
Initializing the datanodes
This step creates the data area for the block files on the datanodes:
./initdn_inst.sh <inst> <size> all
Here, size
is the size of the data area in bytes. size
can be
followed by "K", "M", or "G". If size
is not a multiple of the
blocksize, it is rounded down to the next lower multiple.
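The rounding rule can be written out in shell arithmetic (the 1 GB / 64K numbers are just an illustration):

```shell
# Round a requested data-area size down to the next multiple of the blocksize.
size=1000000000                  # requested size: 1 GB (decimal), example only
blocksize=$((64 * 1024))         # 64K blocksize
rounded=$(( size / blocksize * blocksize ))
echo "$rounded"                  # prints 999948288, i.e. 15258 full blocks
```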
The keyword "all" means that all datanodes are initialized with the same size. Alternately, one can also initialize the nodes differently, e.g.
./initdn_inst.sh inst1 100G m1 m2 m3
./initdn_inst.sh inst1 200G m4 m5 m6
This would initialize the hosts m1
, m2
, and m3
with a data area
of 100G, and m4
, m5
, and m6
with a data area of 200G.
The initdn_inst.sh
script also starts the datanode server on the
hosts, and registers the datanodes with the namenode.
Populating with users and groups
One needs to copy /etc/passwd
and /etc/group
to PlasmaFS before
a user can authenticate. To do so just run
./initauth_inst.sh inst1
It is possible to take the passwd
and group
files from different
locations by providing the options -passwd
and -group
, respectively.
Note that the password field in passwd
is ignored.
The authentication daemon must also be started before authentication
can succeed (./rc_an.sh start inst1
). On machines where this daemon
is running any logged-in user can access PlasmaFS without providing
credentials. Note that PlasmaFS only matches on user and group names,
but not on the corresponding numerical IDs (which are allowed to be
different on each connected machine).
Starting and stopping the cluster
You may have noticed that during initialization the cluster nodes were started in this order: first the namenodes, then the datanodes, and finally the authnodes.
There are scripts rc_dn.sh
, rc_nn.sh
, rc_an.sh
, and rc_nfsd.sh
to start
and stop datanodes, namenodes, authnodes, and NFS bridges, resp., on the whole
cluster. These scripts take the host names of the nodes to ssh
to
from the configured *.hosts
files.
So the right order of startup is: first the namenodes (collectively), then the datanodes, and finally the authnodes and NFS bridges. There is also a script rc_all.sh to start/stop all configured services in one go.
What is a "collective" startup? As the namenode servers elect the coordinator at startup, they need to be started within a short time period (60 seconds).
Starting and stopping individual nodes
On the cluster nodes, there are also scripts rc_dn.sh
, rc_nn.sh
,
rc_an.sh
, and rc_nfsd.sh
in <prefix>/etc
. Actually, the scripts in
clusterconfig
on the operator node call the scripts with the same
name on the cluster nodes to perform their tasks. One can also call
the scripts on the cluster nodes directly to start/stop selectively
a single server, e.g.
ssh m3 <prefix>/etc/rc_dn.sh stop
would stop the datanode server on m3
.
The distribution does not contain a script that would be directly suited for inclusion into the system boot. However, the existing rc scripts could be called by a system boot script. (In particular, one would need to switch to the user PlasmaFS runs as before calling the rc scripts.)
Testing with the plasma client
One can use the plasma
client for testing the system, e.g.
plasma ls -namenode m1:2730 -cluster inst1 /
would list the contents of /
of PlasmaFS. (This requires that the
authentication daemon runs on this machine.)
In order to avoid the switches, one can put the namenode and cluster
data into a configuration file ~/.plasmafs
:
plasmafs {
cluster {
clustername = "inst1";
node { addr = "m1:2730" };
(* optionally more nodes *)
}
}
Once this is done, a simple
plasma ls /
works. Refer to Plasma_client_config
for details about the configuration
file, and to Cmd_plasma
for details about the plasma
utility.
By default, the root directory /
has the file mode bits 1777, i.e.
everybody can create files there. You can change this with plasma chmod
and lock down the access.
Datanode administration
Please have a look at Plasmafs_dn_admin
.
Namenode administration
Please have a look at Plasmafs_nn_admin
.
Admin tasks not yet supported at all