Resolve performance problems/errors in cluster operation

This is another scenario-based topic.

Common performance problems include jobs running slowly and services crashing because they run out of memory.

Example:

While I was adding YARN roles, one of the NodeManagers failed to start.

In this case, we have to identify the cause of the failure. Select the Log files dropdown and click Role log.

As the logs show, the service encountered an I/O error when trying to access the /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/LOCK file: Permission denied.

To fix this, log in to the server slave1, check the permissions on the /var/lib/ folder, and then go down each level to check the subfolder permissions.
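The level-by-level check described above can also be scripted. Here is a minimal Python sketch, assuming Python 3 is available on the node; the function name permissions_along_path is my own and not part of any Hadoop or Cloudera tool. It reports the octal mode of every directory from / down to the path you pass in.

```python
import os
import stat


def permissions_along_path(path):
    """Return (directory, octal mode) for every level from / down to path."""
    path = os.path.abspath(path)

    # Collect the path and all of its parents, deepest first.
    levels = []
    current = path
    while True:
        levels.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent

    report = []
    for level in reversed(levels):  # walk from the root downwards
        try:
            mode = stat.S_IMODE(os.stat(level).st_mode)
            report.append((level, format(mode, "03o")))
        except OSError as exc:
            # A parent without search permission makes deeper levels unreadable.
            report.append((level, f"unreadable: {exc.strerror}"))
    return report
```

For the failure in this example, you could call permissions_along_path("/var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state") and print each pair. Any level showing 000 is a candidate for the Permission denied error.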

Quite strangely, many of the folders under /var/lib have 000 permissions. That is why the NodeManager service was unable to access the LOCK file inside the hadoop-yarn directory and failed to start.

Let’s change the permissions of directories to 755 recursively.
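If you prefer to script the fix, the following is a rough Python sketch of a recursive permission change that touches directories only and leaves regular files alone. The helper name chmod_dirs is hypothetical, and the script has to run as a user allowed to change these modes (typically root). It is an illustration of the idea, not the exact command used on slave1.

```python
import os


def chmod_dirs(root, mode=0o755):
    """Set `mode` on root and every directory below it; return the paths changed."""
    changed = []

    # Fix the top directory first so it can be listed and traversed.
    os.chmod(root, mode)
    changed.append(root)

    # os.walk is top-down by default: each subdirectory is fixed here
    # before the walk descends into it, so 000 directories become readable.
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # do not follow or modify symlink targets
            os.chmod(full, mode)
            changed.append(full)
    return changed
```

Calling chmod_dirs("/var/lib/hadoop-yarn") would limit the change to the YARN directory tree instead of everything under /var/lib, which may be the safer choice when only one service is affected.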

After fixing the permissions issue and starting the role again, the NodeManager service came up in the cluster.

So whenever you face a service failure or errors in the cluster, go to the service role logs, find the step where an ERROR or FATAL entry occurs, read the lines that follow the error message, and troubleshoot accordingly.
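When a role log is long, a small script can pull out the ERROR and FATAL entries together with the lines that follow them. This is only a sketch: it assumes the log level appears as the plain word ERROR or FATAL on the line, and the function name find_errors and the context parameter are my own.

```python
import re

# Assumption: the log level is written as a standalone word on the line.
LEVEL_RE = re.compile(r"\b(ERROR|FATAL)\b")


def find_errors(lines, context=3):
    """Return (line_number, [matching line + following context lines]) per hit."""
    hits = []
    for index, line in enumerate(lines):
        if LEVEL_RE.search(line):
            # Keep the matching line and the next `context` lines,
            # which usually hold the exception message and stack trace.
            block = [text.rstrip("\n") for text in lines[index:index + 1 + context]]
            hits.append((index + 1, block))
    return hits
```

To use it, read the role log into a list with readlines() and pass the list to find_errors. The first hit is usually the best place to start reading.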

Problem Scenario:

· One of the services failed to start after being added. Identify the cause, fix the issue and start the service.

Thus we covered how to address and troubleshoot performance problems/errors in cluster operation.

Use the comments section below to post your doubts, questions and feedback.

Please follow my blog to get notified of more certification related posts, exam tips, etc.

 


 
