Apache Hadoop MapReduce program - First Word Count

Updated: 01/08/2022 by Computer Hope

This article gives a basic illustration of a MapReduce program that counts how many times each word appears in a text file. This famous Hadoop MapReduce example, often known as the "Word Count" example, serves as an introduction to the concept. Assume you have a small text file called input.txt; for illustration, suppose it contains:
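
Hello Hadoop
Hello MapReduce

When the job completes, the expected output is one line per distinct word with its count, tab-separated and sorted by word:

Hadoop	1
Hello	2
MapReduce	1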

Let's implement this using Hadoop MapReduce.

You need to create three files:

  • Reduce.java
  • Map.java
  • WordCount.java

Reduce.java

In Hadoop, the Reducer takes the output of the Mapper (the intermediate key-value pairs) and processes each group of values to produce the final output of our MapReduce program. Here the computation is an aggregation: for each word, we sum up its counts.


package com.developer.code.examples.hadoop.mapred.wordcount;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException
    {
        // Sum all the counts emitted for this word by the mappers.
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        // Emit the word together with its total count.
        output.collect(key, new IntWritable(sum));
    }
}
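
To see the reducer in action with the sample input above: after the map and shuffle phases, reduce() is called once per distinct word with an iterator over that word's counts. For the key "Hello" it receives the values (1, 1) and emits ("Hello", 2); for "Hadoop" and "MapReduce" it receives a single 1 each.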

Map.java

The Mapper is the first stage of the job: it reads the data stored in HDFS blocks and turns it into intermediate key-value pairs. Here, each line of text is split into words, and the pair (word, 1) is emitted for every word.


package com.developer.code.examples.hadoop.mapred.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException
    {
        // Split the line into whitespace-separated tokens and emit (word, 1) for each.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
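
For the first line of the sample input, "Hello Hadoop", this map() method emits the pairs ("Hello", 1) and ("Hadoop", 1). Note the common Hadoop idiom of reusing a single Text object (word) and a single IntWritable constant (one) rather than allocating new objects for every token; map() may be called millions of times, so this keeps garbage-collection pressure low.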

WordCount.java

WordCount.java is the driver class that sets up and starts the word-counting Hadoop MapReduce job. The driver defines the mapper and reducer classes, sets a number of job parameters, and specifies the input and output paths. Finally, it submits the job for execution.


package com.developer.code.examples.hadoop.mapred.wordcount;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount
{
    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the final output key-value pairs.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Wire up the mapper and reducer; the reducer also serves as the combiner.
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // Plain text in, plain text out.
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input and output paths come from the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
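
Reduce can safely double as the combiner (conf.setCombinerClass(Reduce.class)) because summing is associative and commutative: partial sums computed on the map side can be added again on the reduce side without changing the result, and the combiner cuts down the amount of data shuffled across the network.

The code above uses the older org.apache.hadoop.mapred API. For reference, here is a minimal sketch of the same job written against the newer org.apache.hadoop.mapreduce API; the class names NewApiWordCount, TokenMapper, and SumReducer are illustrative, not part of the original program.

package com.developer.code.examples.hadoop.mapred.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of word count with the newer MapReduce API; class names are our own.
public class NewApiWordCount
{
    // Mapper: splits each line into tokens and emits (word, 1).
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
        {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens())
            {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
        {
            int sum = 0;
            for (IntWritable v : values)
            {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(NewApiWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}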

Below are the steps to execute the program.

  • Step 1: Create the JAR file using Eclipse

    Creating a Hadoop JAR (Java Archive) in Eclipse (another tool can be used as well) involves developing the Hadoop MapReduce program, exporting it as a JAR file, and then executing it on the Hadoop cluster. A command-line alternative is sketched after this list.

  • Step 2: Create the input file and copy it into HDFS:

    hdfs dfs -mkdir -p ~/wordcount/input
    sudo vi input_one
    hdfs dfs -copyFromLocal input_one ~/wordcount/input/
  • Step 3: Run your job:

    $HADOOP_HOME/bin/hadoop jar wordcounter.jar com.developer.code.examples.hadoop.mapred.wordcount.WordCount ~/wordcount/input ~/wordcount/output

    The fully qualified main class name is needed when the JAR's manifest does not specify one, and the input path should match the HDFS directory created in step 2. The output directory must not already exist; Hadoop creates it.
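
If you prefer the command line to Eclipse for step 1, the JAR can also be built directly with javac and jar (a sketch, assuming the three source files are in the current directory and $HADOOP_HOME points at your Hadoop installation):

    # Compile against the Hadoop libraries, then package the classes into a JAR.
    mkdir -p classes
    javac -classpath "$($HADOOP_HOME/bin/hadoop classpath)" -d classes Map.java Reduce.java WordCount.java
    jar cvf wordcounter.jar -C classes .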

Conclusion

Hadoop is open-source software developed by the Apache Software Foundation (ASF). You can download Hadoop directly from the project website at http://hadoop.apache.org. Cloudera is a company that provides support, consulting, and management tools for Hadoop. Cloudera also ships a distribution of software called Cloudera’s Distribution Including Apache Hadoop (CDH).
This article walked through a word count example of a Hadoop MapReduce program. In such a program, the term "driver" refers to the main class that orchestrates and configures the entire MapReduce job: it is responsible for setting up the job parameters, the input and output paths, the mapper and reducer classes, and other job-related configuration.