Top Hadoop Interview Questions and Answers 2023

Updated: 01/01/2023 by Shubham Mishra

Top 50 Hadoop Interview Questions


Q1. What are HDFS and YARN?
Ans: HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows a master-slave topology. YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop; it manages cluster resources and provides an execution environment for the processes.

Q2. How does NameNode tackle DataNode failures?
Ans: The NameNode periodically receives a heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly. A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send a heartbeat message for a specific period of time, it is marked as dead.

Q3. Is there any way to change the replication of files on HDFS after they are already written to HDFS?
Ans: Setting the dfs.replication property to a particular number in the
$HADOOP_HOME/conf/hdfs-site.xml file (older releases used a single hadoop-site.xml)
changes the default replication factor, but only for new content written after the change. To change the replication factor of files that are already stored in HDFS, use the hadoop fs -setrep command, as shown below.
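A minimal sketch, assuming a hypothetical path /user/hadoop/data and a target replication factor of 2 (the -w flag waits until re-replication completes):

hadoop fs -setrep -w 2 /user/hadoop/data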

Q4. What benefits did YARN bring in Hadoop 2.0 and how did it solve the issues of MapReduce v1?
Ans: In Hadoop v2, the following features are available:
  • Scalability - You can have a cluster size of more than 10,000 nodes and you can run more than 100,000 concurrent tasks.
  • Compatibility - The applications developed for Hadoop v1 run on YARN without any disruption or availability issues.
  • Resource utilization - YARN allows the dynamic allocation of cluster resources to improve resource utilization.
  • Multitenancy - YARN can use open-source and proprietary data access engines, as well as perform real-time analysis and run ad-hoc queries.

Q5. How can you restart NameNode and all the daemons in Hadoop?
Ans: You can restart the NameNode with the following commands:
  • ./sbin/hadoop-daemon.sh stop namenode
  • ./sbin/hadoop-daemon.sh start namenode
You can stop and restart all the daemons with:
  • ./sbin/stop-all.sh
  • ./sbin/start-all.sh

Q6. Which of the following has replaced JobTracker from MapReduce v1?
Ans: ResourceManager

Q7. Write the YARN commands to check the status of an application and kill an application.
Ans: yarn application -status ApplicationID
yarn application -kill ApplicationID
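For example, with a hypothetical application ID (the format is what YARN generates; the value below is made up):

yarn application -status application_1672531200000_0001
yarn application -kill application_1672531200000_0001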

Q8. What are the two types of metadata that a NameNode server holds?
Ans: The two types of metadata that a NameNode server holds are:
  • Metadata on disk - This contains the edit log and the FsImage.
  • Metadata in RAM - This contains the information about the DataNodes.

Q9. How is formatting done in HDFS?
Ans: The Hadoop Distributed File System (HDFS) is formatted using the bin/hadoop namenode -format command.
This command formats HDFS via the NameNode and is normally run only once, when the cluster is first set up.
Formatting the file system means initializing the directory specified by the dfs.name.dir property (dfs.namenode.name.dir in Hadoop 2.x and later).
If you execute this command on an existing file system, you will delete all the metadata stored on the NameNode, making the data in HDFS inaccessible.
Formatting the NameNode does not format the DataNodes.

Q10. Write the three modes in which Hadoop can run.
Ans:
  • Standalone (local) mode - The default mode; Hadoop runs as a single Java process and uses the local file system instead of HDFS.
  • Pseudo-distributed mode - All daemons run on a single machine, each in its own JVM, simulating a cluster on one node.
  • Fully distributed mode - This is the production phase of Hadoop, where data is distributed across several nodes of a Hadoop cluster and different nodes are allotted as masters and slaves.
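As an illustration of pseudo-distributed mode, the standard single-node setup points HDFS at localhost and sets replication to 1 (a minimal sketch; the port and values may differ in your distribution):

In core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

In hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>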

Q11. Explain rack awareness in Hadoop.
Ans: HDFS replicates blocks onto multiple machines. In order to have higher fault tolerance against rack failures (network or physical), HDFS is able to distribute replicas across multiple racks.
Hadoop obtains network topology information by either invoking a user-defined script or by loading a Java class which should be an implementation of the DNSToSwitchMapping interface. It’s the administrator’s responsibility to choose the method, to set the right configuration, and to provide the implementation of said method.
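For the script-based approach, the script is registered in core-site.xml. A minimal sketch (the script path is a hypothetical example; older releases used the property name topology.script.file.name instead of net.topology.script.file.name):

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>

The script receives one or more IP addresses or host names as arguments and must print the corresponding rack names (for example /rack1) to standard output.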

Q12. Is the NameNode a single point of failure in HDFS?
Ans: Even though data is distributed amongst multiple DataNodes, the NameNode is the central authority for file metadata and replication, and as a result it is a single point of failure (in clusters without NameNode High Availability). The configuration parameter dfs.namenode.replication.min defines the number of replicas a block must reach in order for the write to be reported as successful.

Q13. Explain the input and output data format of the Hadoop framework.
Ans: The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
See the flow mentioned below:
(input) <k1, v1> -> map -> <k2, v2> -> combine/sorting -> <k2, v2> -> reduce -> <k3, v3> (output)

Q14. Which interface needs to be implemented to create a Mapper and Reducer for Hadoop?
Ans: In the old MapReduce API, org.apache.hadoop.mapred.Mapper and org.apache.hadoop.mapred.Reducer are interfaces to implement. In the new API, the corresponding types are abstract classes that you extend:
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Reducer
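A minimal word-count sketch (an illustrative example with hypothetical class names; the driver/job setup is omitted) showing classes that extend the new-API Mapper and Reducer:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper<input key, input value, output key, output value>
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The key is the byte offset of the line; the value is the line itself.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    // The Reducer sums the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);   // emit (word, total count)
        }
    }
}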

Q15. Explain the shuffle.
Ans: Input to the Reducer is the sorted output of the mappers. In this phase, the framework fetches the relevant partition of the output of all the mappers via HTTP.

Q16. How many instances of JobTracker can run on a Hadoop cluster?
Ans: Only one.

Q17. What is fault tolerance?
Ans: Suppose a file is stored on a single system and, due to some technical problem, that file gets destroyed; there is then no way to recover the data it contained. To avoid such situations, Hadoop introduced fault tolerance in HDFS. When a file is stored in HDFS, it automatically gets replicated to two other locations as well (with the default replication factor of 3). So even if one or two of the systems fail, the file is still available on another system.

Q18. How many daemon processes run on a Hadoop cluster?
Ans: Hadoop 1.x is comprised of five separate daemons, and each of them runs in its own JVM.
The following 3 daemons run on master nodes:
  • NameNode - Stores and maintains the metadata for HDFS.
  • Secondary NameNode - Performs housekeeping functions for the NameNode.
  • JobTracker - Manages MapReduce jobs and distributes individual tasks to machines running the TaskTracker.
The following 2 daemons run on each slave node:
  • DataNode - Stores actual HDFS data blocks.
  • TaskTracker - Responsible for instantiating and monitoring individual Map and Reduce tasks.

Q19. What is a JobTracker?
Ans: JobTracker is a daemon that runs on the master node for submitting and tracking MapReduce jobs in Hadoop. It assigns tasks to the different TaskTrackers. In a Hadoop cluster, there is only one JobTracker but many TaskTrackers. It is the single point of failure for the Hadoop MapReduce service: if the JobTracker goes down, all running jobs are halted. It receives heartbeats from the TaskTrackers, based on which the JobTracker decides whether an assigned task is completed or not.

Q20. What is a TaskTracker?
Ans: TaskTracker is also a daemon, and it runs on the DataNodes. TaskTrackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the JobTracker initializes the job, divides the work, and assigns the pieces to different TaskTrackers to perform the MapReduce tasks. While performing these tasks, each TaskTracker continually communicates with the JobTracker by sending heartbeats. If the JobTracker does not receive a heartbeat from a TaskTracker within the specified time, it assumes that the TaskTracker has crashed and assigns its tasks to another TaskTracker in the cluster.

Q22. Explain the indexing process in HDFS.
Ans: Indexing in HDFS depends on the block size. HDFS stores the last part of each data block
in a way that points to the address where the next part of the data chunk is stored.

Q23. Why do we need Hadoop?
Ans:
  • Storage - Since the data is very large, storing such a huge amount of data is difficult.
  • Security - Since the data is huge in size, keeping it secure is another challenge.
  • Analytics - In Big Data, most of the time we are unaware of the kind of data we are dealing with, so analyzing that data is even more difficult.
  • Data quality - In the case of Big Data, data is often messy, inconsistent, and incomplete.
  • Discovery - Using powerful algorithms to find patterns and insights is very difficult.

Q24. What happens to a NameNode that has no data?
Ans: A NameNode without data does not exist. If it is a NameNode, it must hold some sort of data (the file system metadata).

Q25. What is Hadoop Streaming?
Ans: The Hadoop distribution provides a generic application programming interface for writing Map and Reduce jobs in any desired programming language, such as Python, Perl, or Ruby. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the mapper or the reducer.
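A minimal sketch of a streaming job in the style of the standard example, using shell utilities as the mapper and reducer (the jar location varies by version and distribution, so the path below is an assumption):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc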

Q26. What is a block and block scanner in HDFS?
Ans: Block - The minimum amount of data that can be read or written is generally referred to as a "block" in HDFS. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.

Block scanner - The block scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block scanners use a throttling mechanism to conserve disk bandwidth on the DataNode.
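The default block size can be overridden per cluster in hdfs-site.xml; a sketch, where the 128 MB value (in bytes) is just an example and older releases used the property name dfs.block.size:

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>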

Q27. Explain what a heartbeat is in HDFS.
Ans: A heartbeat is a signal sent periodically from a DataNode to the NameNode and from a
TaskTracker to the JobTracker. If the NameNode or JobTracker stops receiving these signals,
it considers that there is some issue with the corresponding DataNode or TaskTracker.

Q28. What happens when a DataNode fails?
Ans: When a DataNode fails:
  • The JobTracker and the NameNode detect the failure.
  • All tasks running on the failed node are re-scheduled.
  • The NameNode replicates the user's data to another node.

Q29. Explain what happens in TextInputFormat.
Ans: In TextInputFormat, each line of the text file is a record. The value is the content of the line,
while the key is the byte offset of the line. For instance, key: LongWritable, value: Text.
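A small driver fragment (assumed to sit inside a job driver's main method) showing where the input format is set; TextInputFormat is in fact the default, so the call is shown only for clarity:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "text-input-example");
// Keys delivered to the mapper are LongWritable byte offsets; values are Text lines.
job.setInputFormatClass(TextInputFormat.class);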

Q30. Explain what Sqoop is in Hadoop.
Ans: Sqoop is a tool used to transfer data between relational database management systems (RDBMS)
and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as
MySQL or Oracle into HDFS, and data can also be exported from HDFS back to an RDBMS.
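A minimal sketch of an import (the JDBC URL, credentials, table name, and target directory are all hypothetical placeholders):

sqoop import \
    --connect jdbc:mysql://localhost/salesdb \
    --username dbuser -P \
    --table customers \
    --target-dir /user/hadoop/customers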

Q31. Consider the following case in a MapReduce system:
  • The HDFS block size is 64 MB
  • The input format is FileInputFormat
  • There are 3 files of size 64 KB, 70 MB, and 120 MB
How many input splits will be made by the Hadoop job?
Ans:
    Hadoop will make the splits as follows -
  • 1 split for the 64 KB file
  • 2 splits for the 70 MB file
  • 2 splits for the 120 MB file
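A short worked check (assuming the plain block-boundary calculation and ignoring the small split-slop tolerance FileInputFormat applies): the 64 KB file fits within one block, so 1 split; 70 MB = 64 MB + 6 MB, so 2 splits; 120 MB = 64 MB + 56 MB, so 2 splits; 5 input splits in total.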