Word Count MapReduce Program in Java

The Word Count program is the “Hello World” of Big Data: we read an input text and count the number of occurrences of each word. In this sample program we will read input from a file uploaded to HDFS, and the final word count result will again be saved to HDFS. A working Hadoop setup is the basic prerequisite for this example. Now let’s get started with the coding.
Input File
First create an input data file, input.txt, to feed text to our program. You can use the vi editor to create this file on Linux. The following sample text is used for testing our application.
this is a word count program to test basic map reduce operations in hadoop
hadoop rocks and lets hope this program will run successfully
Upload the file to HDFS, e.g. at the path ‘/user/vaibhavsharma806060/hadoop-wc/’, using the copyFromLocal command.
hadoop fs -copyFromLocal input.txt /user/vaibhavsharma806060/hadoop-wc/
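You can confirm that the file landed in the right place by listing the directory:
$ hadoop fs -ls /user/vaibhavsharma806060/hadoop-wc/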
Dependency
Our program depends on some classes from the Hadoop core jar file, so first download the jar from Maven Central using the wget command.
$ wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-core/1.2.1/hadoop-core-1.2.1.jar
--2018-09-24 06:23:48-- http://central.maven.org/maven2/org/apache/hadoop/hadoop-core/1.2.1/hadoop-core-1.2.1.jar
Resolving central.maven.org (central.maven.org)... 151.101.200.209
Connecting to central.maven.org (central.maven.org)|151.101.200.209|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4203713 (4.0M) [application/java-archive]
Saving to: ‘hadoop-core-1.2.1.jar’
100%[===========================================================================>] 4,203,713 --.-K/s in 0.06s
2018-09-24 06:23:48 (69.4 MB/s) - ‘hadoop-core-1.2.1.jar’ saved [4203713/4203713]
$ ls
hadoop-core-1.2.1.jar
Set the CLASSPATH variable to include the downloaded jar (adjust the path to wherever you keep the jar; here it has been moved to ~/hadoop/lib).
$ export CLASSPATH=$CLASSPATH:/home/vaibhavsharma806060/hadoop/lib/hadoop-core-1.2.1.jar
Imports
Now we will implement the Java classes for the Mapper, Reducer and Driver. The purpose of each of them is covered separately below.
Following are the relevant imports from the Hadoop core jar used by these classes.
import java.io.IOException;
import java.lang.ClassNotFoundException;
import java.lang.InterruptedException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper.Context;
Mapper
Implement the Mapper class, which splits each line into words. The regex ‘\\W+’ breaks the line on runs of non-word characters, i.e. on whitespace and punctuation. The mapper then emits each word as a key with the value ‘1’, so its output is a stream of key-value pairs where the keys are the individual words and every value is ‘1’.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable inputkey, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString(); // converting the Text line to a String
        String[] splits = line.split("\\W+"); // splitting on runs of non-word characters
        for (String outputkey : splits) {
            context.write(new Text(outputkey), new IntWritable(1));
        }
    }
}
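If you want to see what the split produces before running a job, the tokenization can be tried in a plain Java snippet. This is just an illustrative sketch; the class name and sample line are made up for the demonstration.
public class SplitCheck {
    public static void main(String[] args) {
        String line = "this is a word count program";
        // "\\W+" matches one or more non-word characters, so the line
        // breaks on whitespace and punctuation alike
        for (String token : line.split("\\W+")) {
            System.out.println(token); // prints one word per line
        }
    }
}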
Reducer
Implement the Reducer class, which aggregates the number of occurrences of each word by summing the ‘1’ values emitted for that key.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context output) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum = sum + val.get(); // add up the 1s emitted for this word
        }
        output.write(key, new IntWritable(sum));
    }
}
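Because summing is associative and commutative, this reducer can optionally double as a combiner, pre-aggregating counts on the map side to cut shuffle traffic. If you want that, add one line when configuring the job in the driver below (optional; the example works without it):
job.setCombinerClass(WordCountReducer.class); // optional: run the reducer as a map-side combiner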
Driver
Implement the driver class, which holds the ‘main’ method. The class defines a MapReduce job which wires in the Mapper and Reducer classes. The program also needs the input file with its full HDFS path. If unclear, get the hostname details for HDFS from the Ambari console; the port normally has the default value 8020. After the hostname, provide the folder path where you have kept the input file. Similarly, specify the output folder, but ensure that the output folder does not already exist, since the job will fail if it does.
public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration(); // tells where the namenode & resource manager reside
        Job job = new Job(conf, "WordCountDriver");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class); // specifying the mapper class
        job.setReducerClass(WordCountReducer.class); // specifying the reducer class
        job.setNumReduceTasks(1); // specifying the no. of reduce tasks
        job.setInputFormatClass(TextInputFormat.class); // default input format is TextInputFormat
        job.setOutputKeyClass(Text.class); // specifying the reducer's output key
        job.setOutputValueClass(IntWritable.class); // specifying the reducer's output value
        FileInputFormat.addInputPath(job, new Path("hdfs://ip-172-31-35-141.ec2.internal:8020/user/vaibhavsharma806060/hadoop-wc/input.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://ip-172-31-35-141.ec2.internal:8020/user/vaibhavsharma806060/hadoop-wc/result"));
        job.waitForCompletion(true);
    }
}
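If you would rather pass the paths as command-line arguments instead of hardcoding them, the two Path lines can be swapped for a variant along these lines (a sketch assuming the input path is the first argument and the output path the second):
FileInputFormat.addInputPath(job, new Path(args[0])); // first argument: input path
FileOutputFormat.setOutputPath(job, new Path(args[1])); // second argument: output path
The job would then be launched as ‘hadoop jar WordCount.jar WordCountDriver <input path> <output path>’.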
Compile, Package and Execute
Compile the classes with the javac command. Compiling the driver is enough when the Mapper and Reducer sources sit in the same directory, because javac compiles referenced classes automatically.
$ javac WordCountDriver.java
$ ls
WordCountDriver.class WordCountDriver.java WordCountMapper.class WordCountReducer.class
Create a JAR from the compiled files
$ jar cf WordCount.jar WordCount*.class
$ ls
WordCountDriver.class WordCountDriver.java WordCount.jar WordCountMapper.class WordCountReducer.class
Verify that the generated jar has all the required classes
$ jar tf WordCount.jar
META-INF/
META-INF/MANIFEST.MF
WordCountDriver.class
WordCountMapper.class
WordCountReducer.class
Execute the driver class to process the input data. The input and output paths are hardcoded in the driver above, so no further arguments are needed.
$ hadoop jar WordCount.jar WordCountDriver
The result will be saved in a part file under the destination folder we provided in the driver class.
$ hadoop fs -cat /user/vaibhavsharma806060/hadoop-wc/result/part-r-00000
a 1
and 1
basic 1
count 1
hadoop 2
hope 1
...
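To pull the result back to the local filesystem, copyToLocal works in the opposite direction of the upload:
$ hadoop fs -copyToLocal /user/vaibhavsharma806060/hadoop-wc/result/part-r-00000 .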
The main challenge in executing your first program is ensuring that the Hadoop setup is correct and all the dependencies have been resolved. In subsequent blogs we will look at more ways to write a MapReduce program.