Tuesday, February 11, 2014

Running Hadoop Example

In the last post we've installed Hadoop 2.2.0 on Ubuntu. Now we'll see how to launch an example mapreduce task on Hadoop.

In the Hadoop directory (which you should find at /opt/hadoop/2.2.0) you can find a JAR containing some examples: the exact path is $HADOOP_COMMON_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar .
This JAR contains different examples of mapreduce programs. We'll launch the WordCount program, which is the equivalent of "Hello, world" for MapReduce. This programs just count the occurrences of every single word of the file given as the input.
To run this example we need to prepare something. We assume that we have the HDFS service running; if we didn't create a user directory, we have to do it now (assuming the hadoop user we're using is mapred):
$ hadoop fs -mkdir -p /user/mapred
When we pass "fs" as the first argument to the hadoop command, we're telling hadoop to work on HDFS filesystem; in this case, we used the mkdir command as a switch to create a new directory on HDFS.
Now that our user has a home directory, we can create a directory that we'll use lo load the input file for the mapreduce programs:
$ hadoop fs -mkdir inputdir
We can check the result issuing a "ls" command on HDFS:
$ hadoop fs -ls 
Found 1 items
drwxr-xr-x   - mapred mrusers        0 2014-02-11 22:54 inputdir
Now we can decide which file we'll count the words of; in this example, I'll use the text of the novella Flatland by Edwin Abbot, which is freely available on gutemberg project for download:
$ wget http://www.gutenberg.org/cache/epub/201/pg201.txt
Now we can put this file onto the HDFS, more precisely into the inputdir dir we created a moment ago:
$ hadoop fs -put pg201.txt inputdir
The switch "-put" tells Hadoop to get the file from the machine's file system and to put it onto the HDFS filesystem. We can check that the file is really there:
$ hadoop fs -ls inputdir
Found 1 items
drwxr-xr-x   - mapred mrusers        227368 2014-02-11 22:59 inputdir/pg201.txt

Now we're ready to execute the MapReduce program. Hadoop tarball comes with a JAR containing the WordCount example; we can launch Hadoop with these parameters:
  • jar: we're telling Hadoop we want to execute a mapreduce program contained in a JAR
  • /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar: this is the absolute path and filename of the JAR
  • wordcount: tells Hadoop which of the many examples contained in the JAR to run
  • inputdir: the directory on HDFS in which Hadoop can find the input file(s)
  • outputdir: the directory on HDFS in which Hadoop must write the result of the program
$ hadoop jar /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount inputdir outputdir
and the output is:
14/02/11 23:16:19 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/02/11 23:16:20 INFO input.FileInputFormat: Total input paths to process : 1
14/02/11 23:16:20 INFO mapreduce.JobSubmitter: number of splits:1
14/02/11 23:16:21 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/02/11 23:16:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1392155226604_0001
14/02/11 23:16:22 INFO impl.YarnClientImpl: Submitted application application_1392155226604_0001 to ResourceManager at /0.0.0.0:8032
14/02/11 23:16:23 INFO mapreduce.Job: The url to track the job: http://hadoop-VirtualBox:8088/proxy/application_1392155226604_0001/
14/02/11 23:16:23 INFO mapreduce.Job: Running job: job_1392155226604_0001
14/02/11 23:16:38 INFO mapreduce.Job: Job job_1392155226604_0001 running in uber mode : false
14/02/11 23:16:38 INFO mapreduce.Job:  map 0% reduce 0%
14/02/11 23:16:47 INFO mapreduce.Job:  map 100% reduce 0%
14/02/11 23:16:57 INFO mapreduce.Job:  map 100% reduce 100%
14/02/11 23:16:58 INFO mapreduce.Job: Job job_1392155226604_0001 completed successfully
14/02/11 23:16:58 INFO mapreduce.Job: Counters: 43
 File System Counters
  FILE: Number of bytes read=121375
  FILE: Number of bytes written=401139
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
  HDFS: Number of bytes read=227485
  HDFS: Number of bytes written=88461
  HDFS: Number of read operations=6
  HDFS: Number of large read operations=0
  HDFS: Number of write operations=2
 Job Counters 
  Launched map tasks=1
  Launched reduce tasks=1
  Data-local map tasks=1
  Total time spent by all maps in occupied slots (ms)=7693
  Total time spent by all reduces in occupied slots (ms)=7383
 Map-Reduce Framework
  Map input records=4239
  Map output records=37680
  Map output bytes=366902
  Map output materialized bytes=121375
  Input split bytes=117
  Combine input records=37680
  Combine output records=8341
  Reduce input groups=8341
  Reduce shuffle bytes=121375
  Reduce input records=8341
  Reduce output records=8341
  Spilled Records=16682
  Shuffled Maps =1
  Failed Shuffles=0
  Merged Map outputs=1
  GC time elapsed (ms)=150
  CPU time spent (ms)=5490
  Physical memory (bytes) snapshot=399077376
  Virtual memory (bytes) snapshot=1674149888
  Total committed heap usage (bytes)=314048512
 Shuffle Errors
  BAD_ID=0
  CONNECTION=0
  IO_ERROR=0
  WRONG_LENGTH=0
  WRONG_MAP=0
  WRONG_REDUCE=0
 File Input Format Counters 
  Bytes Read=227368
 File Output Format Counters 
  Bytes Written=88461
The last part of the output is a summary of the execution of the mapreduce program; just before this, we can spot the "Job job_1392155226604_0001 completed successfully" line, which tells us the mapreduce program has been executed successfully. As told, Hadoop wrote the output onto the outputdir on HDFS; let's see what's inside this dir:
$ hadoop fs -ls outputdir
Found 2 items
-rw-r--r--   1 mapred mrusers          0 2014-02-11 23:16 outputdir/_SUCCESS
-rw-r--r--   1 mapred mrusers      88461 2014-02-11 23:16 outputdir/part-r-00000
The presence of the _SUCCESS file confirms us the successful execution of the job; in the part-r-00000 Hadoop wrote the result of the execution. We can bring the file up to the filesystem of our machine using the "get" switch:
$ hadoop fs -get outputdir/part-r-00000 .
Now we can see the content of the file (this is a small subset of the whole file):
...
leading 2
leagues 1
leaning 1
leap    1
leaped  1
learn   7
learned 1
least   23
least.  1
leave   3
leaves  3
leaving 2
lecture 1
led     4
left    9
...
The wordcount program just count the occurrences of every single word and outputs it.
Well, we've successfully run our first mapreduce job on our Hadoop installation!