How to Install Hadoop in Ubuntu

Introduction

Apache Hadoop is an open-source framework for distributed storage and processing of large datasets. In this tutorial, we will learn how to install Hadoop on Ubuntu. Here, we are using a cloud platform (Amazon Web Services, to be specific), but you can follow the same steps on a local Ubuntu system as well. This will be a single-node cluster; in other words, we are installing Hadoop in pseudo-distributed mode.

We will be installing Hadoop 3.2.2, the latest version as of this writing; it was released on 9th January 2021. So, without wasting any further time, let us look at the installation steps.

Install Hadoop in Ubuntu

We have used a t1.micro free-tier Ubuntu Amazon EC2 instance for this setup. As mentioned earlier, you can follow the same steps on your local Ubuntu system as well.

Installing Java

Step 1. The very first step in installing Hadoop is to install Java, because Hadoop is written in Java and needs a Java runtime to work. We will install OpenJDK 8. Execute the below commands:

sudo apt update
sudo apt install openjdk-8-jdk -y
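
To confirm that the JDK was installed correctly, you can check the version. The exact build string will vary, but it should report version 1.8:

java -version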

Step 2. Create a new user.

sudo su
useradd hiberstack -m -d /home/hiberstack -s /bin/bash

Step 3. Add the new user to the sudoers configuration so that it can execute administrative commands without errors. This grants root privileges to our user.

echo -e 'hiberstack ALL=(ALL)  NOPASSWD:  ALL' > /etc/sudoers.d/hiberstack
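
Optionally, validate the syntax of the new sudoers drop-in with visudo; a broken sudoers file can lock you out of sudo, so this check is cheap insurance:

visudo -cf /etc/sudoers.d/hiberstack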

Step 4. Switch to the new user:

su - hiberstack

Configure SSH

Step 5. We need to configure ssh for our new user so that it can ssh to localhost without a password. Execute the below command to create a new key pair:

ssh-keygen

A new key pair is generated. Append the public key to the authorized_keys file with the below command to enable passwordless ssh access.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
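
Depending on your system's default umask, you may also need to tighten the permissions on the .ssh directory and the authorized_keys file, since sshd rejects keys that are group- or world-readable:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys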

Stop and start your instance now so that the changes take effect. Then execute the below command to verify that passwordless ssh works:

ssh localhost

Download and install Hadoop in Ubuntu

Step 6. Create a new directory and download the Hadoop binary distribution package into it.

mkdir hadoop
cd hadoop/
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz

Note: If you want to install a different version of Hadoop, you can download its binary distribution from the Apache Hadoop downloads site.
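
Optionally, verify the integrity of the downloaded package against the checksum published alongside it (the file names below assume version 3.2.2); sha512sum should report OK:

wget https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz.sha512
sha512sum -c hadoop-3.2.2.tar.gz.sha512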

Step 7. Extract the Hadoop binary package.

tar -xvf hadoop-3.2.2.tar.gz

Step 8. Configure the Hadoop environment variables. First, change to the home directory so that we can edit the .bashrc file:

cd
nano .bashrc

Add the below lines in the file so that the environment variables are configured.

export HADOOP_HOME=/home/hiberstack/hadoop/hadoop-3.2.2
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Save and exit.

Step 9. Apply the changes made in the .bashrc file to the current shell session with the below command.

source .bashrc
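
If the variables are set correctly, the hadoop command is now on your PATH. A quick way to verify:

hadoop version

The first line of the output should read Hadoop 3.2.2.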

Edit the hadoop-env.sh file

Step 10. Add the Java path in the hadoop-env.sh file so that Hadoop knows which Java installation to use. Execute the below command to open the file:

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

The export JAVA_HOME line will already be present in the file, commented out. Uncomment it and change it as below. If you cannot find the line, simply add the below line as-is at the end of the file.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Save and exit.
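
The path above is where Ubuntu places OpenJDK 8 on amd64 machines. If you are unsure of the correct path on your system, you can resolve it from the java binary itself; for Java 8, stripping the trailing jre/bin/java from the resolved path gives the JDK home:

readlink -f /usr/bin/java | sed 's:/jre/bin/java::'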

Edit the core-site.xml file

Step 11. Edit the core-site.xml file to set the default NameNode URL and the Hadoop temporary directory. Hadoop uses this directory as the base for its temporary data, including intermediate map and reduce output.

nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the below lines between the <configuration> and </configuration> tags.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hiberstack/tmpdata</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://172.31.25.86:9000</value>
</property>

Note: For the fs.default.name property, change the private IP address to match your own cloud instance. If you are using your local system, provide the value hdfs://127.0.0.1:9000 instead. (fs.default.name is the legacy name of this property; current Hadoop releases prefer fs.defaultFS, but the old name still works.)

Edit the hdfs-site.xml file

Step 12. Edit the hdfs-site.xml file to set the NameNode directory, the DataNode directory, and the default HDFS replication factor.

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Copy-paste the below lines between the <configuration> and </configuration> tags.

<property>
  <name>dfs.name.dir</name>
  <value>/home/hiberstack/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hiberstack/dfsdata/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Save and exit.
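
Hadoop will normally create these directories on first use, but creating them up front as the hiberstack user avoids any permission surprises later:

mkdir -p /home/hiberstack/tmpdata /home/hiberstack/dfsdata/namenode /home/hiberstack/dfsdata/datanode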

Edit the mapred-site.xml file

Step 13. Edit the mapred-site.xml file to define the MapReduce framework.

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the below property between the <configuration> and </configuration> tags.

<property> 
  <name>mapreduce.framework.name</name> 
  <value>yarn</value> 
</property>

Save and exit.

Edit the yarn-site.xml file

Step 14. Edit the yarn-site.xml file to set the configurations for the NodeManager, the ResourceManager, and the Application Master.

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Paste the below lines between the <configuration> and </configuration> tags.

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>172.31.25.86</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>

Note: Change the private IP address as per your cloud instance for the yarn.resourcemanager.hostname property. If you are using your local system, then provide the value as 127.0.0.1

Save and exit.

Format the NameNode

Step 15. Before starting the Hadoop services for the first time, we need to format the NameNode. Execute the below command for the same:

hdfs namenode -format

Once the formatting is complete, a SHUTDOWN_MSG log line is printed, indicating that the format finished and the NameNode process exited.


Start Hadoop services

Step 16. Start the Hadoop services now with the below command:

$HADOOP_HOME/sbin/start-all.sh

Verify that the services are running with the below command.

jps
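
If everything started correctly, jps lists the five Hadoop daemons along with itself. The output should look similar to the following (the process IDs will differ on your machine):

2049 NameNode
2186 DataNode
2398 SecondaryNameNode
2601 ResourceManager
2745 NodeManager
3012 Jps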

Done. We have successfully installed Hadoop in Ubuntu.


Accessing Hadoop NameNode UI

You can also access the Hadoop NameNode UI in your browser at the URL <public-ip>:9870, or, if you have done the setup on your local system, at localhost:9870. 9870 is the default port of the NameNode web UI.
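
If the page does not load from your browser, you can first confirm from the instance itself that the web UI is up (this assumes curl is installed):

curl -sI http://localhost:9870

A successful HTTP response here combined with no access from the browser usually means port 9870 is not open in your instance's security group or firewall.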

