Word Count MapReduce Program in Java

The Word Count program is the “Hello World” of Big Data: we read an input text and count the number of occurrences of each word. In this sample program we will read input from a file uploaded to HDFS, and the final word count result will again be saved to HDFS. A working Hadoop setup is the basic prerequisite for this example. Now let’s get started with the coding.
Input File
First create an input data file, input.txt, to feed text to our program. You can use the vi editor to create this file on Linux. The following sample text is used for testing our application.
this is a word count program to test basic map reduce operations in hadoop
hadoop rocks and lets hope this program will run successfully
Upload the file to HDFS, e.g. at the path ‘/user/vaibhavsharma806060/hadoop-wc/’, using the copyFromLocal command.
hadoop fs -copyFromLocal input.txt /user/vaibhavsharma806060/hadoop-wc/
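You can confirm that the file landed in the right place by listing the directory:
$ hadoop fs -ls /user/vaibhavsharma806060/hadoop-wc/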
Dependency
Our program depends on some classes from the Hadoop core jar file, so first download the jar from Maven Central using the wget command.
$ wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-core/1.2.1/hadoop-core-1.2.1.jar
--2018-09-24 06:23:48-- http://central.maven.org/maven2/org/apache/hadoop/hadoop-core/1.2.1/hadoop-core-1.2.1.jar
Resolving central.maven.org (central.maven.org)... 151.101.200.209
Connecting to central.maven.org (central.maven.org)|151.101.200.209|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4203713 (4.0M) [application/java-archive]
Saving to: ‘hadoop-core-1.2.1.jar’
100%[===========================================================================>] 4,203,713 --.-K/s in 0.06s
2018-09-24 06:23:48 (69.4 MB/s) - ‘hadoop-core-1.2.1.jar’ saved [4203713/4203713]
$ ls
hadoop-core-1.2.1.jar
Set the CLASSPATH variable to include the downloaded jar (adjust the path to wherever you keep the jar; here it has been moved to ~/hadoop/lib).
$ export CLASSPATH=$CLASSPATH:/home/vaibhavsharma806060/hadoop/lib/hadoop-core-1.2.1.jar
Imports
Now we will implement the Java classes for the Mapper, Reducer and Driver. The purpose of each of them is covered separately below.
Following are the relevant imports from the Hadoop core jar used by these classes.
import java.io.IOException;
import java.lang.ClassNotFoundException;
import java.lang.InterruptedException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper.Context;
Mapper
Implement the Mapper class, which splits each line into words. The regex ‘\\W+’ breaks the line on runs of non-word characters, i.e. on whitespace and punctuation. The mapper then emits each word as a key with the value ‘1’, so its output is a stream of key-value pairs where the keys are the individual words and every value is ‘1’.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable inputkey, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString(); // converting the Text line to a String
        String[] splits = line.split("\\W+"); // splitting on runs of non-word characters
        for (String outputkey : splits) {
            context.write(new Text(outputkey), new IntWritable(1));
        }
    }
}
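If you want to see what the split produces before running a job, the tokenization can be tried in a plain Java snippet. This is just an illustrative sketch; the class name and sample line are made up for the demonstration.
public class SplitCheck {
    public static void main(String[] args) {
        String line = "this is a word count program";
        // "\\W+" matches one or more non-word characters, so the line
        // breaks on whitespace and punctuation alike
        for (String token : line.split("\\W+")) {
            System.out.println(token); // prints one word per line
        }
    }
}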
Reducer
Implement the Reducer class, which aggregates the number of occurrences of each word by summing the ‘1’ values emitted for that key.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context output) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum = sum + val.get(); // add up the 1s emitted for this word
        }
        output.write(key, new IntWritable(sum));
    }
}
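Because summing is associative and commutative, this reducer can optionally double as a combiner, pre-aggregating counts on the map side to cut shuffle traffic. If you want that, add one line when configuring the job in the driver below (optional; the example works without it):
job.setCombinerClass(WordCountReducer.class); // optional: run the reducer as a map-side combiner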
Driver
Implement the driver class, which holds the ‘main’ method. The class defines a MapReduce job which wires in the Mapper and Reducer classes. The program also needs the input file with its full HDFS path. If unclear, get the hostname details for HDFS from the Ambari console; the port normally has the default value 8020. After the hostname, provide the folder path where you have kept the input file. Similarly, specify the output folder, but ensure that the output folder does not already exist, since the job will fail if it does.
public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration(); // tells where the namenode & resource manager reside
        Job job = new Job(conf, "WordCountDriver");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class); // specifying the mapper class
        job.setReducerClass(WordCountReducer.class); // specifying the reducer class
        job.setNumReduceTasks(1); // specifying the no. of reduce tasks
        job.setInputFormatClass(TextInputFormat.class); // default input format is TextInputFormat
        job.setOutputKeyClass(Text.class); // specifying the reducer's output key
        job.setOutputValueClass(IntWritable.class); // specifying the reducer's output value
        FileInputFormat.addInputPath(job, new Path("hdfs://ip-172-31-35-141.ec2.internal:8020/user/vaibhavsharma806060/hadoop-wc/input.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://ip-172-31-35-141.ec2.internal:8020/user/vaibhavsharma806060/hadoop-wc/result"));
        job.waitForCompletion(true);
    }
}
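If you would rather pass the paths as command-line arguments instead of hardcoding them, the two Path lines can be swapped for a variant along these lines (a sketch assuming the input path is the first argument and the output path the second):
FileInputFormat.addInputPath(job, new Path(args[0])); // first argument: input path
FileOutputFormat.setOutputPath(job, new Path(args[1])); // second argument: output path
The job would then be launched as ‘hadoop jar WordCount.jar WordCountDriver <input path> <output path>’.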
Compile, Package and Execute
Compile the classes with the javac command. Compiling the driver is enough when the Mapper and Reducer sources sit in the same directory, because javac compiles referenced classes automatically.
$ javac WordCountDriver.java
$ ls
WordCountDriver.class WordCountDriver.java WordCountMapper.class WordCountReducer.class
Create a JAR from the compiled files
$ jar cf WordCount.jar WordCount*.class
$ ls
WordCountDriver.class WordCountDriver.java WordCount.jar WordCountMapper.class WordCountReducer.class
Verify that the generated jar has all the required classes
$ jar tf WordCount.jar
META-INF/
META-INF/MANIFEST.MF
WordCountDriver.class
WordCountMapper.class
WordCountReducer.class
Execute the driver class to process the input data. The input and output paths are hardcoded in the driver above, so no further arguments are needed.
$ hadoop jar WordCount.jar WordCountDriver
The result will be saved in a part file under the destination folder we provided in the driver class.
$ hadoop fs -cat /user/vaibhavsharma806060/hadoop-wc/result/part-r-00000
a 1
and 1
basic 1
count 1
hadoop 2
hope 1
...
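To pull the result back to the local filesystem, copyToLocal works in the opposite direction of the upload:
$ hadoop fs -copyToLocal /user/vaibhavsharma806060/hadoop-wc/result/part-r-00000 .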
The main challenge in executing your first program is ensuring that the Hadoop setup is correct and all the dependencies have been resolved. In subsequent blogs we will look at more ways to write a MapReduce program.