Hadoop Cluster (HDFS) Configuration Using Ansible

Yash Hirulkar
4 min read · Dec 16, 2020


Whenever a task has to be performed over and over again, we need automation, and when it comes to configuration management, an automation tool like Ansible is the first thing that comes to mind.

What is Ansible?

Ansible is an IT automation tool. It can configure systems, deploy software, and orchestrate more advanced IT tasks such as continuous deployments or zero downtime rolling updates.
Ansible’s main goals are simplicity and ease-of-use. It also has a strong focus on security and reliability.

Ansible is mainly used for:

Provisioning: Set up the various servers you need in your infrastructure.
Configuration management: Change the configuration of an application, OS, or device; start and stop services; install or update applications; implement a security policy; or perform a wide variety of other configuration tasks.
Application deployment: Make DevOps easier by automating the deployment of internally developed applications to your production systems.

Ansible Architecture:

Ansible Playbooks: Ordered list of tasks, saved so that you can run those tasks in that order repeatedly. Playbooks are written in YAML format.

Inventory: A list of managed nodes. An inventory file is sometimes also called a hostfile. Your inventory can specify information such as the IP address of each managed node.

Control Node: Any machine with Ansible installed can act as a control node. You can run ad hoc commands with the ansible command and playbooks with the ansible-playbook command from any control node.

Managed Node: The machines you manage with Ansible are called managed nodes or hosts.

What is Hadoop?

Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.
Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets more quickly.

Hadoop Distributed File System (HDFS) Architecture:

NameNode: NameNode is the centerpiece of HDFS and is also known as the Master. NameNode only stores the metadata of HDFS.

DataNode: DataNode is also known as the Slave. DataNode is responsible for storing the actual data in HDFS.

Steps to Configure Hadoop Using Ansible:

1. Install Ansible on the control node:
<yum install ansible> (command for Red Hat Linux)
<ansible --version> (to check the version)

2. Edit the Ansible inventory file:
<vim /inventory.txt>

and the Ansible configuration file:
<vim /etc/ansible/ansible.cfg>
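The two files from step 2 might look like the sketch below. It writes them into a scratch directory instead of /inventory.txt and /etc/ansible/ansible.cfg so that it can be tried without root access. The group names namenode and datanode, the 192.0.2.x addresses and the root login are placeholders assumed for illustration, not values taken from this setup.

```shell
# Sketch only: group names, IP addresses and the remote user are placeholders.
demo_dir="$(mktemp -d)"

# Inventory: one group per Hadoop role, one managed node per line.
cat > "$demo_dir/inventory.txt" <<'EOF'
[namenode]
192.0.2.10 ansible_user=root

[datanode]
192.0.2.11 ansible_user=root
EOF

# Configuration: tell Ansible where the inventory lives.
# host_key_checking = False skips SSH host key prompts (lab convenience only).
cat > "$demo_dir/ansible.cfg" <<EOF
[defaults]
inventory = $demo_dir/inventory.txt
host_key_checking = False
EOF

echo "wrote $demo_dir/inventory.txt and $demo_dir/ansible.cfg"
```

Ansible reads an ansible.cfg in the current directory before /etc/ansible/ansible.cfg, so running the later commands from inside that directory would use this inventory.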

3. To list the managed nodes/hosts in the inventory:
<ansible all --list-hosts>

4. To check connectivity with all Managed Nodes:
<ansible all -m ping>

Playbooks:
5. Playbook for NameNode configuration (namenode.yml):
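The playbook itself is not reproduced in the text, so here is a minimal sketch of what namenode.yml could contain. It is not the author's playbook. It assumes that Hadoop is already installed on the managed node with the hadoop and hadoop-daemon.sh commands on the PATH, that the Hadoop config directory is /etc/hadoop, and that the inventory has a group called namenode. The /nn directory, port 9001 and the Hadoop 1.x/2.x property names are likewise assumptions.

```shell
# Sketch only: paths, port, group name and property names are assumptions
# (older Hadoop naming; newer releases use dfs.namenode.name.dir and fs.defaultFS).
cat > namenode.yml <<'EOF'
- hosts: namenode
  vars:
    hadoop_conf_dir: /etc/hadoop
    nn_dir: /nn
    nn_port: 9001
  tasks:
    - name: Create the directory that holds the HDFS metadata
      file:
        path: "{{ nn_dir }}"
        state: directory

    - name: Tell the NameNode where to keep its metadata
      copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        content: |
          <configuration>
            <property>
              <name>dfs.name.dir</name>
              <value>{{ nn_dir }}</value>
            </property>
          </configuration>

    - name: Set the address the NameNode listens on
      copy:
        dest: "{{ hadoop_conf_dir }}/core-site.xml"
        content: |
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://0.0.0.0:{{ nn_port }}</value>
            </property>
          </configuration>

    - name: Format the NameNode once (skipped if it was formatted before)
      command: hadoop namenode -format
      args:
        creates: "{{ nn_dir }}/current/VERSION"

    - name: Start the NameNode daemon
      command: hadoop-daemon.sh start namenode
EOF

echo "wrote namenode.yml"
```

Before running it against real hosts, ansible-playbook --syntax-check namenode.yml is a cheap way to catch YAML mistakes. The last task is not idempotent in this sketch: starting a daemon that is already running may make the task fail.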

6. To run the NameNode playbook (namenode.yml):
<ansible-playbook namenode.yml>

The playbook output shows that the NameNode has been started successfully.

7. Playbook for DataNode configuration (datanode.yml):
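As with the NameNode, the DataNode playbook is not reproduced in the text. The sketch below shows one possible shape for datanode.yml, under the same assumptions: Hadoop already installed, config directory /etc/hadoop, Hadoop 1.x/2.x property names, and inventory groups named datanode and namenode. The /dn directory and port 9001 are placeholders, and the NameNode address is taken from the first host of the assumed namenode group.

```shell
# Sketch only: paths, port, group names and property names are assumptions
# (older Hadoop naming; newer releases use dfs.datanode.data.dir and fs.defaultFS).
cat > datanode.yml <<'EOF'
- hosts: datanode
  vars:
    hadoop_conf_dir: /etc/hadoop
    dn_dir: /dn
    nn_port: 9001
    # First host listed in the assumed "namenode" inventory group.
    nn_host: "{{ groups['namenode'][0] }}"
  tasks:
    - name: Create the directory that holds the HDFS data blocks
      file:
        path: "{{ dn_dir }}"
        state: directory

    - name: Tell the DataNode where to store its blocks
      copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        content: |
          <configuration>
            <property>
              <name>dfs.data.dir</name>
              <value>{{ dn_dir }}</value>
            </property>
          </configuration>

    - name: Point the DataNode at the NameNode
      copy:
        dest: "{{ hadoop_conf_dir }}/core-site.xml"
        content: |
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://{{ nn_host }}:{{ nn_port }}</value>
            </property>
          </configuration>

    - name: Start the DataNode daemon
      command: hadoop-daemon.sh start datanode
EOF

echo "wrote datanode.yml"
```

The port in core-site.xml has to match the one the NameNode was configured with, otherwise the DataNode will start but never register.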

8. To run the DataNode playbook (datanode.yml):
<ansible-playbook datanode.yml>

The playbook output shows that the DataNode has been started successfully.

The DataNode has also connected to the NameNode successfully.

9. We can also check this in the NameNode web UI from a browser:
<NameNode_IP:50070>
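The same check can be scripted from a terminal. The helper functions below are hypothetical names made up for this sketch. They assume curl is installed and that the web UI listens on port 50070 as above (Hadoop 3.x moved it to 9870, which can be passed as a second argument).

```shell
# Build the NameNode web UI URL; the port defaults to 50070 as used above.
namenode_ui_url() {
    printf 'http://%s:%s/\n' "$1" "${2:-50070}"
}

# Succeed only if the web UI answers within 5 seconds (requires curl).
check_namenode_ui() {
    curl -fsS --max-time 5 -o /dev/null "$(namenode_ui_url "$@")"
}

# 192.0.2.10 is a placeholder address, not a host from this setup.
namenode_ui_url 192.0.2.10
```

Calling check_namenode_ui with the real NameNode IP returns success only when the UI is reachable. On the NameNode itself, hadoop dfsadmin -report lists the DataNodes that have registered.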

So, Hadoop (HDFS) has been configured successfully!

Hope you learnt something new today :)

Thank You!

