Class type Mapred_def.mapred_job_config

class type mapred_job_config = object .. end

method name : string

A name for identifying this job, e.g. in log files

method job_id : string

An automatically generated name. It can be treated as unique and cannot be overridden.

method input_dir : string

This plasma directory contains the input files

method input_dir_designation : designation

How input_dir is interpreted

method output_dir : string

This plasma directory receives the output files. It should exist, and it should be empty

method work_dir : string

This plasma directory is used for temporary files. It should exist, and it should be empty

method log_dir : string

This plasma directory is used for log files. It should exist, and it should be empty

method task_files : string list

These files are copied at job start to the "local" directory for this job on all task nodes. They should be regular files only. The location of this local directory can be queried with Mapred_def.get_job_local_dir.

method bigblock_size : int

Map/reduce processes files in units of bigblocks. The size of bigblocks can be chosen as multiples of the size of filesystem blocks. The value bigblock_size is in bytes. The maximum size of records (line length) is also bigblock_size. Reasonable values are in the multi-megabyte range, e.g. 16M.

Certain file formats also require that bigblocks are multiples of 64K.
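The size constraints above can be checked before a job is submitted. The following is a minimal sketch, not part of Mapred_def: the helper name check_bigblock_size and the fs_blocksize parameter are assumptions made for illustration, and the 64K rule is applied strictly here even though only certain file formats need it.

```ocaml
(* Hypothetical helper, not part of Mapred_def: checks a candidate
   bigblock size (in bytes) against the constraints described above. *)
let kib64 = 64 * 1024

let check_bigblock_size ~fs_blocksize size =
  if size <= 0 || fs_blocksize <= 0 then
    Error "sizes must be positive"
  else if size mod fs_blocksize <> 0 then
    Error "bigblock_size must be a multiple of the filesystem block size"
  else if size mod kib64 <> 0 then
    (* Only certain file formats need this; the sketch is strict. *)
    Error "bigblock_size is not a multiple of 64K"
  else
    Ok size

let () =
  (* 16M bigblocks with an assumed filesystem block size of 1M *)
  match check_bigblock_size ~fs_blocksize:(1024 * 1024) (16 * 1024 * 1024) with
  | Ok n -> Printf.printf "bigblock_size = %d bytes\n" n
  | Error msg -> prerr_endline msg
```

Because bigblock_size is also the maximum record length, a check like this is only a lower bar: the input data must still have no record longer than the chosen value.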

method map_tasks : int

The number of map tasks that should not be exceeded. The system tries to reach this number, but it may not be possible to generate that many map tasks.

method merge_limit : int

How many files are merged at most by a shuffle task

method split_limit : int

How many files are created by a shuffle task. This is not a strict limit: the scheduler plans with slightly larger split limits near the end of the job.

method partitions : int

The number of partitions, which is also the number of reduce tasks

method enhanced_mapping : int

If >0, enhanced map tasks are created. This type of task also sorts and pre-partitions data. The int is the number of pre-partitions (must be <= partitions).

Increasing the number of pre-partitions reduces network traffic but makes it more likely that random disk seeks are needed to read data from disk.
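The rule for enhanced_mapping can be written down as a small interpretation function. This is a sketch under assumptions: the type map_mode, its constructors, and the function interpret_enhanced_mapping are hypothetical names introduced here and do not exist in Mapred_def.

```ocaml
(* Hypothetical names, for illustration only. *)
type map_mode =
  | Plain              (* enhanced_mapping is not >0: ordinary map tasks *)
  | Enhanced of int    (* number of pre-partitions *)

let interpret_enhanced_mapping ~partitions n =
  if n <= 0 then Ok Plain
  else if n <= partitions then Ok (Enhanced n)
  else Error "enhanced_mapping must be <= partitions"

let () =
  match interpret_enhanced_mapping ~partitions:8 4 with
  | Ok Plain -> print_endline "plain map tasks"
  | Ok (Enhanced k) -> Printf.printf "enhanced map tasks, %d pre-partitions\n" k
  | Error msg -> prerr_endline msg
```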

method phases : phases

Which phases are enabled

method map_whole_files : bool

Whether only whole files are passed to a map job. If false, files to map can be split into parts

method custom : string -> string

Get a custom parameter or raise Not_found
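Since custom raises Not_found for unknown parameters, callers will often want a fallback value. The sketch below shows one way to do that. The helper custom_or, the object demo_cfg, and the parameter names "separator" and "max_retries" are made up for illustration; demo_cfg is a stand-in that implements only the custom method, not a real job configuration.

```ocaml
(* Hypothetical helper: look up a custom parameter, falling back to
   [default] when the lookup raises Not_found. The open object type
   accepts any object that has a [custom] method of this type. *)
let custom_or ~default (cfg : < custom : string -> string; .. >) key =
  try cfg#custom key with Not_found -> default

(* Stand-in object with a single known parameter. *)
let demo_cfg =
  object
    method custom key =
      match key with
      | "separator" -> "\t"
      | _ -> raise Not_found
  end

let () =
  (* "max_retries" is unknown to demo_cfg, so the default is printed. *)
  print_endline (custom_or ~default:"10" demo_cfg "max_retries")
```

The helper only relies on the custom method, so it should also accept any object whose type includes custom : string -> string.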

This web site is published by Informatikbüro Gerd Stolpmann
