
Hadoop Multi Node Cluster Setup

Published On: 19 September 2022

What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Installation Steps

Prerequisites are:

  • Make sure both your Master and Slave systems have SSH installed and active.
  • To check whether it is installed and active, open a terminal and type:
     sudo systemctl status ssh
  • If it is not installed, install it with:
    sudo apt install openssh-server openssh-client
  • Java should be installed (see the quick check below).
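A quick check for the Java prerequisite (a minimal sketch, assuming an Ubuntu/Debian system; openjdk-8-jdk is only one option, and any Java 8 JDK whose path you later put in hadoop-env.sh will do):

# Check whether Java is already available
java -version
# If it is missing, install a JDK (assumption: Ubuntu/Debian package name)
sudo apt install openjdk-8-jdk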

Steps to install the multi-node cluster are:

  • We have a Master (server) with IP 192.168.0.186 and a Slave (worker) node with IP 192.168.0.119.

Generation of SSH Keys

  • Generate an SSH key on each node and add every node's public key to /home/username/.ssh/authorized_keys on all nodes (create this file if it does not exist).
  • Generate the key with ssh-keygen -t rsa and press Enter four times to accept the defaults.
  • Exchange the keys between the Master node and the Slave node. For example, paste the Slave node's key into the authorized_keys file on the Master node, and vice versa.
  • You will find your public key in the /home/username/.ssh/id_rsa.pub file. Copy it and paste it on the Master node, and vice versa. A sketch of this exchange follows this list.
  • The key looks like: slave1 –   ssh-rsa AAAAB3N…
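As a sketch of this key exchange (assuming the auriga user that appears in the paths later in this guide; substitute your own username), ssh-copy-id appends the local public key to the remote node's authorized_keys file:

# On the Master (192.168.0.186): generate a key pair and copy the public key to the Slave
ssh-keygen -t rsa
ssh-copy-id auriga@192.168.0.119

# On the Slave (192.168.0.119): generate a key pair and copy the public key to the Master
ssh-keygen -t rsa
ssh-copy-id auriga@192.168.0.186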

Download Hadoop & Untar the File

  • Download Hadoop from the official website, or open the terminal and use this command:
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
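  • Untar the archive. A minimal sketch, assuming the tarball was downloaded to the current directory; note that the environment variables below point to /home/auriga/hadoop-3.2.3, so extract the matching release or adjust the paths to the version you downloaded:
tar -xzf hadoop-3.2.1.tar.gz -C /home/auriga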
  • Configure Hadoop Environment Variables (bashrc)
  • Edit the .bashrc shell configuration file using a text editor of your choice (we will be using nano) and add the following lines:
export HADOOP_HOME=/home/auriga/hadoop-3.2.3
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Once you add the variables, save and exit the .bashrc file.

It is vital to apply the changes to the current running environment by using the following command:

source ~/.bashrc
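To confirm the variables took effect, the hadoop binary should now resolve on the PATH:

hadoop version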

Edit hadoop-env.sh File

  • The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and Hadoop-related project settings.
  • When setting up the Hadoop cluster, you need to define which Java installation is to be used. Use the previously created $HADOOP_HOME variable to access the hadoop-env.sh file:
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Uncomment the export JAVA_HOME line (i.e., remove the # sign) and add the full path to the JDK installation on your system, for example:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_311
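If you are not sure where the JDK lives on your machine, one way to find the path (assuming the JDK's javac is on the PATH) is:

readlink -f $(which javac) | sed "s:/bin/javac::"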

Edit core-site.xml File

  • The core-site.xml file defines HDFS and Hadoop core properties.
  • Open the core-site.xml file in a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration on both the Master and Slave nodes, using the Master node's IP:

<configuration>
<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.0.186:9000</value>
  <description>The name of the default file system</description>
</property>
</configuration>
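Note that fs.default.name is the older, deprecated name of this property; on recent Hadoop releases the equivalent (and preferred) setting is fs.defaultFS with the same value:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://192.168.0.186:9000</value>
</property>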

Edit hdfs-site.xml File

  • The properties in the hdfs-site.xml file govern the location for storing node metadata, the fsimage file, and the edit log file.
  • Configure the file by defining the NameNode and DataNode storage directories.
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/auriga/hadoop-3.2.3/namenode-dir</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/auriga/hadoop-3.2.3/datanode-dir</value>
</property>
<property>
  <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
  <value>false</value>
</property>
</configuration>
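It is a good idea to create the storage directories referenced above on each node up front, so they exist with the right ownership (a sketch, assuming the /home/auriga/hadoop-3.2.3 layout used in this guide):

mkdir -p /home/auriga/hadoop-3.2.3/namenode-dir
mkdir -p /home/auriga/hadoop-3.2.3/datanode-dir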

Edit mapred-site.xml File

  • Use the following command to access the mapred-site.xml file and define the MapReduce values:
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
</configuration>
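On Hadoop 3.x, MapReduce jobs submitted to YARN may also need the framework classpath; the Apache setup guide adds a property along these lines (a sketch; it is not required just to start the daemons):

<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>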

Edit yarn-site.xml File

  • The yarn-site.xml file is used to define settings relevant to YARN.
  • It contains configurations for the Node Manager, Resource Manager, Containers, and Application Master.
  • Open the yarn-site.xml file in a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Append the following configuration to the file:

<configuration>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>192.168.0.186</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>192.168.0.186:8032</value>
</property>
</configuration>

Now edit the workers file and add the IPs of all the nodes that you want to act as workers (slaves). This is done on the Master node only.

192.168.0.186
192.168.0.119
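The workers file lives in the Hadoop configuration directory and can be opened with:

nano $HADOOP_HOME/etc/hadoop/workers

Because the Master's own IP (192.168.0.186) is listed as well, the Master will also run a DataNode and a NodeManager alongside the NameNode and ResourceManager.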

Now format the NameNode on the Master node:

bin/hadoop namenode -format
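The hadoop namenode -format form still works but prints a deprecation warning on recent releases; the current equivalent is:

bin/hdfs namenode -format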

Now start Hadoop by running the following command from the sbin folder of the Hadoop directory:

sbin/start-all.sh
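start-all.sh is marked as deprecated in recent Hadoop releases; the equivalent split form, which starts HDFS and YARN separately, is:

sbin/start-dfs.sh
sbin/start-yarn.sh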

Type this simple command to check if all the daemons are active and running as Java processes:

jps

If everything is working as intended, the resulting list of running Java processes contains all the HDFS and YARN daemons.

We can also access the Hadoop NameNode web UI at the following URL:

http://192.168.0.186:9870

The YARN Resource Manager is accessible on port 8088:

http://192.168.0.186:8088

Conclusion

  • You have successfully installed Hadoop on Ubuntu and deployed it in distributed mode.
  • A multi-node Hadoop deployment is an excellent starting point to explore basic HDFS commands and gain hands-on experience.
