What is Hadoop? Introduction, Modules, Architecture

Hadoop Introduction

Hadoop is an Apache open-source framework for storing, processing, and analyzing massive amounts of data in a distributed manner across large clusters of commodity hardware. Apache Hadoop is written in Java and is primarily used for batch processing. Large data sets are spread across clusters of commodity computers, and applications built with Hadoop run on those clusters. Commodity computers are inexpensive and readily available, and they are used mainly to increase computing power at low cost. Learn how to install Hadoop on Ubuntu.

Hadoop consists of four modules:
  1. Hadoop Distributed File System or HDFS: HDFS is a distributed file system designed to run on low-cost hardware. HDFS outperforms conventional file systems in terms of data throughput. It’s a Java-based distributed file system that lets you store large amounts of data across multiple nodes (machines).
  2. MapReduce: It is a framework that aids Java programs in parallel data computation. It consists of Map and Reduce tasks. The map task transforms input data into a dataset that can be computed using key-value pairs. The output of the Map task is consumed by the Reduce task, which then produces the final result.
  3. YARN: Short for Yet Another Resource Negotiator, YARN schedules jobs and manages the resources of a Hadoop cluster.
  4. Hadoop Common: Provides a collection of shared Java libraries and utilities used by all the other Hadoop modules.

Hadoop Modules

As mentioned above, there are four main Modules of Hadoop. Let us look at the Modules in depth.

Hadoop Distributed File System or HDFS

The Hadoop Distributed File System (HDFS) is the foundation of the Hadoop ecosystem. HDFS provides the storage layer for Hadoop applications. It makes several replicas of each data block and distributes them across the nodes of a Hadoop cluster. This replication provides fault tolerance and lets computation run close to the data, which makes processing both reliable and fast.

There is a master/slave architecture in HDFS. This architecture is made up of a single NameNode that serves as the master and multiple DataNodes that serve as slaves.
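
To make this concrete, here is a minimal sketch of writing and reading a file through HDFS using the org.apache.hadoop.fs.FileSystem client API. The NameNode address and the path are illustrative assumptions; in a real cluster the file system URI normally comes from core-site.xml (fs.defaultFS).

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsQuickstart {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally picked up from core-site.xml (fs.defaultFS).
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
          Path file = new Path("/user/demo/hello.txt");  // illustrative path

          // Write a small file: the NameNode records the metadata,
          // the DataNodes store the replicated blocks.
          try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
          }

          // Read the file back and print it to stdout.
          try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
          }
        }
      }
    }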

MapReduce

MapReduce is a programming model and processing framework for computing large quantities of data in parallel. A MapReduce job has two phases: Map and Reduce. In the Map phase, the input data is split and mapped; in the Reduce phase, the intermediate data is shuffled and reduced.

In the Map phase, a block of data is read and processed, producing key-value pairs as intermediate output. The output of the Map tasks (the Mappers) becomes the input of the Reduce tasks (the Reducers). Each Reducer receives key-value pairs from multiple Map tasks, grouped by key, and condenses them into a smaller set of key-value pairs that forms the final output.

Hadoop can run MapReduce programs written in a variety of languages, including Java, Python, C++, and Ruby. MapReduce programs are inherently parallel in design, which makes them ideal for large-scale data processing across the many machines in a cluster.
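
As an illustration, below is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API. The class name and the input/output paths are illustrative; in practice the job is packaged as a JAR and submitted to the cluster (for example with the hadoop jar command).

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map task: split each input line into words and emit (word, 1) pairs.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce task: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /user/demo/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }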

YARN

YARN is an acronym for “Yet Another Resource Negotiator.” It was implemented in Hadoop 2.0 to overcome the issues that existed in Hadoop 1.0’s Job Tracker. It enables various data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS (Hadoop Distributed File System), resulting in a much more effective system.

YARN can dynamically assign resources and schedule application processing through its components. Properly managing the available resources is important for large-scale data processing, so that every application can benefit from them. The main components of YARN are described below.

Components of YARN

  1. Resource Manager: The Resource Manager is YARN's master daemon. It is in charge of allocating resources such as CPU and memory across all applications and arbitrating how the cluster's resources are divided among them. The Resource Manager has two main components:
    • Scheduler: The Scheduler allocates resources to the running applications. It does not monitor or track application status, and it does not guarantee that tasks that fail due to application or hardware errors will be restarted.
    • Application Manager: The Application Manager manages the cluster's running Application Masters: it starts them, tracks them, and restarts them on other nodes if they fail.
  2. Node Manager: The Node Manager is YARN's slave daemon, running on each worker node. It launches and manages the containers that run application tasks on its node, tracks their resource utilization, and reports it to the Resource Manager. The Node Manager also monitors the health of the node it is running on.
  3. Application Master: The Application Master negotiates resources with the Resource Manager and tracks the status and progress of a single application. To obtain a container, it sends the Node Manager a Container Launch Context (CLC), which describes everything the application needs in order to run. Once the application is running, the Application Master periodically sends a health report to the Resource Manager.
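
To see these components from a client's point of view, the sketch below uses the YarnClient API (org.apache.hadoop.yarn.client.api) to ask the Resource Manager for the applications it is tracking and for resource reports from the running Node Managers. It assumes yarn-site.xml is on the classpath so the client can locate the Resource Manager.

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnClusterReport {
      public static void main(String[] args) throws Exception {
        // YarnConfiguration loads yarn-site.xml, which points at the Resource Manager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Applications known to the Resource Manager.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
          System.out.println(app.getApplicationId() + "  "
              + app.getName() + "  " + app.getYarnApplicationState());
        }

        // Node Managers currently reporting as RUNNING, with their capacity and usage.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
          System.out.println(node.getNodeId() + "  capability=" + node.getCapability()
              + "  used=" + node.getUsed());
        }

        yarnClient.stop();
      }
    }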

Hadoop Common

Hadoop Common is a collection of utilities and libraries that support the other Hadoop modules. Along with the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce, it is an integral module of the Apache Hadoop framework. Hadoop Common, like all other modules, assumes that hardware failures are common and should be handled automatically in software.
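
For example, the Configuration and Path classes that HDFS, MapReduce, and YARN code all rely on come from Hadoop Common. The small sketch below loads the standard configuration files from the classpath and reads two well-known properties; the fallback values shown are assumptions used only when the properties are not set.

    import org.apache.hadoop.conf.Configuration;

    public class CommonConfigDemo {
      public static void main(String[] args) {
        // Configuration (from Hadoop Common) loads core-default.xml and core-site.xml
        // from the classpath; other modules layer their own *-site.xml files on top.
        Configuration conf = new Configuration();

        // fs.defaultFS names the default file system (e.g. an HDFS NameNode URI);
        // "file:///" is only a fallback used when the property is absent.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));

        // dfs.replication is the default block replication factor; 3 is a fallback
        // used here when hdfs-site.xml is not on the classpath.
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
      }
    }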


Hadoop Architecture

The Hadoop architecture is based on a master/slave model, with a single NameNode (master node) and multiple DataNodes (slave nodes). HDFS can run on any machine that supports Java. Although several DataNodes can run on a single machine, in practice the DataNodes are spread across many machines; together they form a Hadoop multi-node cluster.

[Figure: Hadoop Architecture]

NameNode

In the Hadoop architecture, the NameNode is the master node that maintains and manages the blocks stored on the DataNodes (slave nodes). The NameNode manages the file system namespace and regulates client access to files. The HDFS architecture is designed so that user data is never stored on the NameNode; data is stored only on the DataNodes.

  • It is the only master server that is present in the HDFS cluster.
  • Because it is a single node, it can become a single point of failure.
  • It maintains the file system namespace by performing operations such as file opening, renaming, and closing.
  • NameNode keeps track of the metadata of all the files in the cluster. The metadata has two files associated with it.
    • FsImage: A snapshot of the complete file system namespace at a point in time.
    • EditLogs: A record of every change made to the file system since the most recent FsImage.
  • To confirm that all DataNodes in the cluster are alive, it regularly receives a Heartbeat and a Block Report from each DataNode.
  • It is also in charge of maintaining the replication factor of all blocks.
  • In the event of a DataNode failure, the NameNode selects new DataNodes for new replicas, balances disk utilization, and manages the communication traffic to the DataNodes.
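
This metadata is what clients receive when they ask where a file's blocks live. The sketch below queries the NameNode, via the FileSystem API, for a file's replication factor, block size, and the DataNodes hosting each block; the path is an illustrative assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        try (FileSystem fs = FileSystem.get(conf)) {
          Path file = new Path("/user/demo/hello.txt");  // illustrative path
          FileStatus status = fs.getFileStatus(file);

          // Replication factor and block size are part of the NameNode's metadata.
          System.out.println("replication = " + status.getReplication());
          System.out.println("block size  = " + status.getBlockSize());

          // For each block, the NameNode reports which DataNodes hold a replica.
          BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
          for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + " len " + block.getLength()
                + " hosts " + String.join(",", block.getHosts()));
          }
        }
      }
    }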

DataNode

In the Hadoop architecture, the DataNodes are the slave nodes. Unlike the NameNode, a DataNode runs on commodity hardware: an inexpensive system that is not required to be of high quality or high availability. The DataNode is a block server that stores blocks on the local file system of its node (for example, ext3 or ext4).

  • DataNodes are the slave daemons that run on every slave machine in a Hadoop cluster.
  • The actual data is stored on the DataNodes.
  • They send heartbeats to the NameNode on a regular basis to report on the overall health of HDFS; the default interval is 3 seconds.
  • DataNodes serve read and write requests from clients and perform block creation, deletion, and replication when instructed by the NameNode.

Secondary NameNode

In addition to the two daemons above, there is a third daemon or process called the Secondary NameNode. The Secondary NameNode runs alongside the primary NameNode as a helper daemon. It is not a backup for the primary NameNode.

  • The Secondary NameNode regularly reads the file system state and metadata from the NameNode’s RAM and writes it to the hard disk.
  • It is in charge of merging the EditLogs with the NameNode’s FsImage.
  • It periodically downloads the EditLogs from the NameNode and applies them to the FsImage to produce a new, compacted FsImage.
  • The new FsImage is copied back to the NameNode, and it will be used the next time the NameNode is started.
  • Secondary NameNode is also known as CheckpointNode because it conducts routine checkpoints in HDFS.
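
How often these checkpoints happen is driven by configuration. The sketch below reads the two checkpoint-related properties as commonly defined in hdfs-default.xml (a time-based trigger and a transaction-count trigger); the property names and fallback defaults are stated here from memory and should be verified against your Hadoop version.

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");  // checkpoint settings live in the HDFS config

        // Checkpoint at least this often (seconds); 3600 is used here only as a fallback.
        long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);

        // ...or sooner, once this many uncheckpointed edit-log transactions accumulate.
        long txnThreshold = conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000);

        System.out.println("checkpoint period (s): " + periodSeconds);
        System.out.println("checkpoint txn threshold: " + txnThreshold);
      }
    }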
