The Hadoop directory (which you should find at /opt/hadoop/2.2.0) contains a JAR with some examples; its exact path is $HADOOP_COMMON_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar.
This JAR contains several example MapReduce programs. We'll launch WordCount, which is the MapReduce equivalent of "Hello, world". The program simply counts the occurrences of each word in the file given as input.
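To get a feel for the result before involving Hadoop, the same kind of count can be approximated on a local machine with standard Unix tools. This is only a sketch, not how the WordCount example is implemented: it assumes that words are separated by whitespace, and the function name local_wordcount and the sample text are made up for illustration.

```shell
# local_wordcount (hypothetical helper): reads text on stdin and prints
# "word<TAB>count" lines sorted by word, splitting on whitespace only.
local_wordcount() {
  tr -s '[:space:]' '\n' |                # one word per line, squeezing repeated whitespace
    grep -v '^$' |                        # drop the empty line left by leading whitespace
    LC_ALL=C sort |                       # bring identical words together (byte order)
    uniq -c |                             # prefix each distinct word with its count
    awk '{ printf "%s\t%s\n", $2, $1 }'   # reorder to word<TAB>count
}

# Sample input invented for this sketch.
printf 'the cat saw the dog\nthe end\n' | local_wordcount
```

On the sample input this prints five lines, one per distinct word, the last one being "the" with a count of 3.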
Before running this example we need to prepare a few things. We assume that the HDFS service is running; if the user directory doesn't exist yet, we have to create it now (assuming the Hadoop user we're using is mapred):
$ hadoop fs -mkdir -p /user/mapred
Passing "fs" as the first argument tells the hadoop command to operate on the HDFS filesystem; in this case, we used the "-mkdir" switch to create a new directory on HDFS.
Now that our user has a home directory, we can create a directory to hold the input files for the MapReduce programs:
$ hadoop fs -mkdir inputdir
We can check the result by issuing an "ls" command on HDFS:
$ hadoop fs -ls
Found 1 items
drwxr-xr-x - mapred mrusers 0 2014-02-11 22:54 inputdir
Now we can decide which file to count the words of; in this example, I'll use the text of the novella Flatland by Edwin A. Abbott, which is freely available for download from Project Gutenberg:
$ wget http://www.gutenberg.org/cache/epub/201/pg201.txt
Now we can put this file onto HDFS, more precisely into the inputdir directory we created a moment ago:
$ hadoop fs -put pg201.txt inputdir
The "-put" switch tells Hadoop to take the file from the local machine's filesystem and copy it onto HDFS. We can check that the file is really there:
$ hadoop fs -ls inputdir
Found 1 items
drwxr-xr-x - mapred mrusers 227368 2014-02-11 22:59 inputdir/pg201.txt
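The preparation steps above (create the input directory, upload the file, list the directory) can be collected in a small shell function. This is a sketch under a few assumptions: the function name stage_input is made up, the hadoop command is on the PATH, and the file doesn't already exist in the target directory (a plain "-put" doesn't overwrite an existing file).

```shell
# stage_input (hypothetical helper): copy a local file into an HDFS directory,
# creating the directory first if it doesn't exist yet, then list the directory.
stage_input() {
  local_file=$1
  hdfs_dir=$2
  # Fail early if the local file is missing.
  [ -f "$local_file" ] || { echo "no such local file: $local_file" >&2; return 1; }
  hadoop fs -mkdir -p "$hdfs_dir" &&
    hadoop fs -put "$local_file" "$hdfs_dir" &&
    hadoop fs -ls "$hdfs_dir"
}

# Example call, using the names from this walkthrough:
# stage_input pg201.txt inputdir
```

Each step runs only if the previous one succeeded, so a failed upload is not hidden by a successful listing.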
Now we're ready to execute the MapReduce program. The Hadoop tarball comes with a JAR containing the WordCount example; we can launch Hadoop with these parameters:
- jar: we're telling Hadoop we want to execute a mapreduce program contained in a JAR
- /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar: this is the absolute path and filename of the JAR
- wordcount: tells Hadoop which of the many examples contained in the JAR to run
- inputdir: the directory on HDFS in which Hadoop can find the input file(s)
- outputdir: the directory on HDFS in which Hadoop must write the result of the program
$ hadoop jar /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount inputdir outputdir
The output is:
14/02/11 23:16:19 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/02/11 23:16:20 INFO input.FileInputFormat: Total input paths to process : 1
14/02/11 23:16:20 INFO mapreduce.JobSubmitter: number of splits:1
14/02/11 23:16:21 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/02/11 23:16:21 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/02/11 23:16:21 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/02/11 23:16:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1392155226604_0001
14/02/11 23:16:22 INFO impl.YarnClientImpl: Submitted application application_1392155226604_0001 to ResourceManager at /0.0.0.0:8032
14/02/11 23:16:23 INFO mapreduce.Job: The url to track the job: http://hadoop-VirtualBox:8088/proxy/application_1392155226604_0001/
14/02/11 23:16:23 INFO mapreduce.Job: Running job: job_1392155226604_0001
14/02/11 23:16:38 INFO mapreduce.Job: Job job_1392155226604_0001 running in uber mode : false
14/02/11 23:16:38 INFO mapreduce.Job: map 0% reduce 0%
14/02/11 23:16:47 INFO mapreduce.Job: map 100% reduce 0%
14/02/11 23:16:57 INFO mapreduce.Job: map 100% reduce 100%
14/02/11 23:16:58 INFO mapreduce.Job: Job job_1392155226604_0001 completed successfully
14/02/11 23:16:58 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=121375
        FILE: Number of bytes written=401139
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=227485
        HDFS: Number of bytes written=88461
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=7693
        Total time spent by all reduces in occupied slots (ms)=7383
    Map-Reduce Framework
        Map input records=4239
        Map output records=37680
        Map output bytes=366902
        Map output materialized bytes=121375
        Input split bytes=117
        Combine input records=37680
        Combine output records=8341
        Reduce input groups=8341
        Reduce shuffle bytes=121375
        Reduce input records=8341
        Reduce output records=8341
        Spilled Records=16682
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=150
        CPU time spent (ms)=5490
        Physical memory (bytes) snapshot=399077376
        Virtual memory (bytes) snapshot=1674149888
        Total committed heap usage (bytes)=314048512
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=227368
    File Output Format Counters
        Bytes Written=88461
The last part of the output is a summary of the execution of the MapReduce program; just before it, we can spot the line "Job job_1392155226604_0001 completed successfully", which tells us that the program ran successfully. As requested, Hadoop wrote the output to outputdir on HDFS; let's see what's inside this directory:
$ hadoop fs -ls outputdir
Found 2 items
-rw-r--r-- 1 mapred mrusers 0 2014-02-11 23:16 outputdir/_SUCCESS
-rw-r--r-- 1 mapred mrusers 88461 2014-02-11 23:16 outputdir/part-r-00000
The presence of the _SUCCESS file confirms that the job executed successfully; Hadoop wrote the result of the execution to the part-r-00000 file. We can copy that file to our machine's local filesystem using the "-get" switch:
$ hadoop fs -get outputdir/part-r-00000 .
Now we can look at the content of the file (this is a small subset of the whole file):
...
leading 2
leagues 1
leaning 1
leap 1
leaped 1
learn 7
learned 1
least 23
least. 1
leave 3
leaves 3
leaving 2
lecture 1
led 4
left 9
...
The wordcount program simply counts the occurrences of each word and writes out the totals.
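The result file is sorted by word, so the most frequent words are scattered throughout it. A quick way to find them is to sort the local copy by the count column. This is a minimal sketch: the function name top_words is made up, and it assumes every line holds one word followed by whitespace and its count, as in the excerpt above.

```shell
# top_words (hypothetical helper): print the N most frequent words from a
# WordCount result file whose lines look like "word<whitespace>count".
top_words() {
  result_file=$1
  n=${2:-10}   # default to the top 10 when no count is given
  # Sort numerically on the count (descending), then by word to break ties.
  LC_ALL=C sort -k2,2nr -k1,1 "$result_file" | head -n "$n"
}

# Example call on the file fetched with "-get":
# top_words part-r-00000 5
```

Note that in the excerpt "least." is counted separately from "least", so punctuation evidently stays attached to the words; the ranking will reflect that.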
Well, we've successfully run our first mapreduce job on our Hadoop installation!
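One thing to keep in mind when running the job a second time: Hadoop normally refuses to start a job whose output directory already exists, so outputdir has to be removed (or a different name chosen) first. The sketch below wraps that in a function. The name rerun_wordcount is made up, the JAR path is the one used in the command above, and it assumes that deleting the previous results is acceptable.

```shell
# Path taken from the command used in this walkthrough.
EXAMPLES_JAR=/opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar

# rerun_wordcount (hypothetical helper): run the wordcount example again,
# removing a previous output directory first.
rerun_wordcount() {
  in_dir=$1
  out_dir=$2
  # Remove the old output directory, if any (this deletes earlier results).
  if hadoop fs -test -d "$out_dir"; then
    hadoop fs -rm -r "$out_dir" || return 1
  fi
  hadoop jar "$EXAMPLES_JAR" wordcount "$in_dir" "$out_dir" || return 1
  # Succeed only if the _SUCCESS marker is present in the output directory.
  hadoop fs -test -e "$out_dir/_SUCCESS"
}

# Example call, using the names from this walkthrough:
# rerun_wordcount inputdir outputdir
```

The function returns a non-zero status if the job fails or the _SUCCESS file is missing, which makes it easy to use from other scripts.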