Apache Hadoop MapReduce Word Count Tutorial

Updated: 01/Feb/2025 by Shubham Mishra

Introduction

Hadoop MapReduce is a powerful framework for processing large-scale data in a distributed environment. This article provides a step-by-step guide to implementing a Word Count program using Hadoop MapReduce. This is one of the simplest yet most fundamental examples of how MapReduce processes large datasets.


Understanding the Word Count Program

In this tutorial, we will create a Hadoop MapReduce program to count the number of occurrences of words in a text file. The implementation consists of three Java files:

  1. Mapper (Map.java) – Converts input text into key-value pairs.
  2. Reducer (Reduce.java) – Aggregates the word count.
  3. Driver (WordCount.java) – Manages the job configuration and execution.

Example Input File (input.txt)

Using Hadoop MapReduce Let’s implement.

Code Implementation

1. Mapper Class (Map.java)

The Mapper processes input data, tokenizes words, and assigns each word a count of 1.

package com.developer.code.examples.hadoop.mapred.wordcount;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    // Reusable Writable instances avoid allocating a new object for every token.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // The key is the byte offset of the line in the file; the value is the line itself.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Emit a (word, 1) pair for every whitespace-separated token.
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
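
For the sample input.txt above, the Mapper would emit one pair per token:

(Using, 1)
(Hadoop, 1)
(MapReduce, 1)
(Let’s, 1)
(implement., 1)

Note that StringTokenizer splits on whitespace only, so punctuation such as the trailing period stays attached to "implement".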

2. Reducer Class (Reduce.java)

The Reducer aggregates the word counts produced by the Mapper.

package com.developer.code.examples.hadoop.mapred.wordcount;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // Sum every count emitted for this word across all Mappers.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        // Emit the word with its total count.
        output.collect(key, new IntWritable(sum));
    }
}
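
Between the map and reduce phases, the framework sorts and groups the intermediate pairs by key, so the Reducer receives each word together with all of its counts. Every word in the sample input appears once, so each call is of the form (Hadoop, [1]) and produces (Hadoop, 1); if "Hadoop" had appeared three times, the call would be (Hadoop, [1, 1, 1]) and produce (Hadoop, 3).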

3. Driver Class (WordCount.java)

The Driver configures and runs the MapReduce job.

package com.developer.code.examples.hadoop.mapred.wordcount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the final output key-value pairs.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // The Reducer also serves as the Combiner (see the note below).
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input and output paths are taken from the command-line arguments.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
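
Note that Reduce is registered as both the Combiner and the Reducer. This is safe here because summing counts is associative and commutative: partial sums computed on the map side yield the same final totals while reducing the amount of data shuffled across the network.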

Steps to Execute the Program

1. Creating a JAR File

Compile the Java files and create a JAR file to run the Hadoop job.

javac -classpath `hadoop classpath` -d . Map.java Reduce.java WordCount.java
jar -cvf wordcounter.jar -C . .
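
To confirm that the compiled classes were packaged correctly, you can list the JAR contents:

jar -tf wordcounter.jar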

2. Creating an Input File in HDFS

hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -copyFromLocal input.txt /wordcount/input/
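
You can verify the upload before running the job:

hdfs dfs -ls /wordcount/input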

3. Running the MapReduce Job

hadoop jar wordcounter.jar com.developer.code.examples.hadoop.mapred.wordcount.WordCount /wordcount/input /wordcount/output
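
The output directory must not already exist, or the job will fail at startup rather than overwrite previous results. If you are rerunning the job, remove the old output first:

hdfs dfs -rm -r /wordcount/output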

4. Viewing Output

Because this program uses the older mapred API, the reducer output file is named part-00000 (the newer mapreduce API names it part-r-00000):

hdfs dfs -cat /wordcount/output/part-00000
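
For the sample input.txt, the output should look like this, with each word and its count separated by a tab (keys are sorted in Text's byte order, so capitalized words come first):

Hadoop	1
Let’s	1
MapReduce	1
Using	1
implement.	1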

Conclusion

Hadoop MapReduce is a scalable and efficient way to process large datasets. This Word Count example demonstrates the basic structure of a MapReduce program, including the Mapper, Reducer, and Driver components.

For more Hadoop tutorials and big data insights, stay tuned to DeveloperIndian.com!
