Hadoop Cluster – Pre Maintenance procedure

In IT, it’s inevitable that all servers go through monthly security and vulnerability patching, and Hadoop servers are no exception.

Usually a separate Systems team performs the OS-related patching, security updates, etc. Your role is to bring the cluster down and back up, and to ensure the application is healthy after patching.

You have to schedule a maintenance window accordingly and notify end users in advance that the cluster will be unavailable. The cluster shouldn’t be running during the activity, because the servers will be rebooted, and rebooting nodes under a running cluster can cause a huge impact.

That said, you can’t simply stop it either. You have to follow precautionary steps, take backups, etc., so that if the cluster doesn’t start after the activity, you can recover it from the backups.

In Hadoop Admin interviews, you can expect this question: how do you perform maintenance on the cluster?

Let’s see the steps involved in preparing the cluster for maintenance. These steps apply only to small, non-production clusters. I’ll write a separate post for production clusters.

1. Plan the downtime window and send notifications to Users.

You have to schedule the maintenance window, preferably on weekends or after business hours on weekdays, and inform the users one week in advance.

This helps users reschedule jobs that would otherwise run during the maintenance window. If you stop the cluster without notifying users, then when their jobs fail, they’ll come back to you with incidents and mails.

2. Place servers in maintenance mode

Before the patching activity starts, you have to place the servers in maintenance mode, to avoid triggering unnecessary alerts when the servers are rebooted.

Depending on your company’s process, you’ll either place them in maintenance mode (MM) yourself or share the list of servers with the concerned team, and they’ll take care of it.

3. Capture the cluster status

Since a Hadoop cluster has multiple services and each service may have some warnings/errors, it’s not possible to remember the status of all of them.

So take screenshots of the CM (Cloudera Manager) page, the warnings/alerts in each service, the namenode UI, etc., so that you can compare the status of the cluster when you bring it up after maintenance.

If any service doesn’t come up after the maintenance and you’re not sure whether that service is supposed to be up or down, you can compare with the screenshots and confirm.

Ideally, all the services in the cluster should be in green status, but when the number of servers is high, it’s difficult to keep everything green.
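
Screenshots are hard to compare line by line, so it can help to save some text output next to them. Below is a minimal sketch, assuming the hdfs client is on the PATH and you run it as a user with HDFS superuser rights while the cluster is still up. The function name and the snapshot directory are made up for illustration.

```shell
# Sketch: save a text snapshot of HDFS health before stopping the cluster.
# capture_cluster_status and the snapshot directory are hypothetical names.
capture_cluster_status() {
    local snap_dir="$1"
    mkdir -p "$snap_dir" || return 1
    # Live/dead datanodes, capacity and usage
    hdfs dfsadmin -report > "$snap_dir/dfsadmin_report.txt" 2>&1
    # Missing, corrupt and under-replicated blocks
    hdfs fsck / > "$snap_dir/fsck.txt" 2>&1
    # Whether the namenode is in safe mode
    hdfs dfsadmin -safemode get > "$snap_dir/safemode.txt" 2>&1
}

# Example: capture_cluster_status "/hadoop/backups/status_$(date +%Y%m%d)"
```

After the maintenance you could run the same function into a second directory and diff the two, which makes it easier to spot a datanode that didn’t come back.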

4. Stop the Cluster Services

Once you’re done with maintenance mode and the screenshots, you can stop the cluster services.

Go to CM – Cluster dropdown – stop Cluster.

Now CM will stop all services one by one in dependency order, so that a service is stopped before the services it depends on (for example, HDFS and ZooKeeper are among the last to go down). Once all the services are stopped, go to the Cloudera Management Services dropdown and stop those services as well.

Now log in to the CM server and stop the Cloudera SCM Server service.

# service cloudera-scm-server stop

From the CM server, use passwordless ssh to log in to the rest of the nodes (you can do this in a shell script) and stop the Cloudera SCM Agent service on each of them.

# service cloudera-scm-agent stop
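
The ssh loop mentioned above could look something like this. It’s a sketch, not a definitive script: it assumes passwordless ssh as root from the CM server, and the function name and the hosts file (one hostname per line) are hypothetical.

```shell
# Sketch: stop the Cloudera SCM agent on every node listed in a hosts file.
# stop_agents and the hosts file path are hypothetical names.
stop_agents() {
    local hosts_file="$1"
    local failed=0
    while read -r host; do
        # Skip blank lines and comment lines
        case "$host" in ''|'#'*) continue ;; esac
        echo "Stopping agent on $host"
        # -n keeps ssh from consuming the rest of the hosts file on stdin;
        # BatchMode makes ssh fail instead of prompting for a password
        if ! ssh -n -o BatchMode=yes "$host" "service cloudera-scm-agent stop"; then
            echo "FAILED on $host" >&2
            failed=1
        fi
    done < "$hosts_file"
    return "$failed"
}

# Example: stop_agents /root/cluster_hosts.txt
```

The function keeps going when one node fails and returns non-zero at the end, so you get the full list of problem nodes in one run.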

 

5. Take Database and Namenode backups

You should take the database and namenode backups and store them on a remote server, so that in the worst case, if the MySQL/namenode servers crash during maintenance, you can set them up on another server using the backups.

You shouldn’t proceed with any activity without taking the backups.

Database backup: The backup command differs from one database to another. Check your database’s documentation for the exact command.

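For MySQL (-p on its own prompts for the password; with a space after -p, the next word would be read as a database name, not as the password):

mysqldump -u username -p --databases db1 db2 db3 > /hadoop/backups/mysqldump.sql

To take all:
mysqldump -u username -p --all-databases > /hadoop/backups/mysqldumpfull.sql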
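
Before relying on a dump, it’s worth checking that it actually finished. With default settings mysqldump ends the file with a “-- Dump completed” comment, so a rough check might look like the sketch below. It assumes comments weren’t disabled (for example with --skip-comments), and the function name is hypothetical.

```shell
# Sketch: sanity-check a mysqldump output file.
# verify_mysqldump is a hypothetical helper, not part of MySQL.
verify_mysqldump() {
    local dump_file="$1"
    # -s is true only if the file exists and is not empty
    if [ ! -s "$dump_file" ]; then
        echo "missing or empty: $dump_file" >&2
        return 1
    fi
    # mysqldump writes a completion comment as one of the last lines
    if tail -n 5 "$dump_file" | grep -q -- "-- Dump completed"; then
        echo "OK: $dump_file"
    else
        echo "no completion marker in $dump_file, dump may be truncated" >&2
        return 1
    fi
}

# Example: verify_mysqldump /hadoop/backups/mysqldumpfull.sql
```

This only tells you the dump ran to the end. It doesn’t replace restoring the dump on a test server once in a while.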

 

Namenode backup:

You can take the namenode backup using Linux commands. Here namenode means the namenode data directory, i.e. where the NN stores its fsimage and edit log files.

The directory is on the NN server, and you can find the exact filesystem/directory by checking the property “dfs.name.dir” in hdfs-site.xml (in Hadoop 2 and later the property is named “dfs.namenode.name.dir”, and the old name is deprecated).

Log in to the NN server and run the command below.

# tar -cvf /hadoop/backups/nn_backup.tar /hadoop/dfs/nn/

tar is a command that archives multiple directories and files into a single file. It doesn’t compress them unless you add an option such as -z (gzip).

Syntax: tar -options destination source
-cvf: c -> create, v -> verbose, f -> file (the archive file name follows)
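
A quick way to make sure the archive is usable is to list it back and check that it contains fsimage files. The sketch below assumes the usual Hadoop 2+ naming, where fsimage files are called fsimage_<transaction id>. The function name is hypothetical.

```shell
# Sketch: confirm the namenode backup archive is readable and holds fsimage files.
# verify_nn_backup is a hypothetical helper name.
verify_nn_backup() {
    local archive="$1"
    local count
    # -t lists the archive contents; a damaged archive makes tar fail here
    if ! tar -tf "$archive" > /dev/null 2>&1; then
        echo "unreadable archive: $archive" >&2
        return 1
    fi
    # Count entries whose name contains fsimage_ (this includes the .md5 files)
    count=$(tar -tf "$archive" | grep -c 'fsimage_' || true)
    echo "$archive contains $count fsimage file(s)"
    [ "$count" -gt 0 ]
}

# Example: verify_nn_backup /hadoop/backups/nn_backup.tar
```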

Now that we have taken the backups, let’s copy them to the remote server.

# scp -rp /hadoop/backups/  user@remoteserver:/hadoop/clustername_backup
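
To be sure the copy on the remote server matches the original, you can write a checksum manifest into the backup directory before running scp. This is a sketch that assumes sha256sum (GNU coreutils) is available on both servers. The function name and the manifest name SHA256SUMS are just conventions picked for this example.

```shell
# Sketch: write a checksum manifest for every file in the backup directory.
# write_checksums is a hypothetical helper name.
write_checksums() {
    local backup_dir="$1"
    # Work inside the directory so the manifest stores relative paths,
    # which lets it be verified from wherever the directory is copied to
    (
        cd "$backup_dir" || exit 1
        find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS
    )
}

# Example: write_checksums /hadoop/backups
```

After the copy, running sha256sum -c SHA256SUMS from inside the copied directory on the remote server should report OK for every file.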

 

Once it’s copied, we can release the servers to the Systems team for the patching activity.

That covers the pre-maintenance steps for bringing down our Hadoop cluster. I’ll write a separate post for the post-maintenance steps.

Use the comments section below to post your doubts, questions and feedback.

Please follow my blog to get notified of more Hadoop Administration posts, certifications and interview topics etc.

 


 
