Configure HDFS ACLs

Every file/folder in Linux is owned by an owner and a group. If a user needs to access the file (read, write, modify), either the user has to be part of the group or the file has to have appropriate “others” permissions. In this model, we can’t set different permissions per user or per group to cater to our requirements.

ACLs control access to HDFS files by providing a way to set different permissions for specific named users or named groups.

They enhance the traditional permissions model by allowing users to define access control for various combinations of users and groups instead of a single owner/user or a single group.

 

Enabling HDFS ACLs Using Cloudera Manager

  1. Go to the CM – HDFS service.
  2. Click the Configuration tab.
  3. Select Scope > Service_name (Service-Wide)
  4. Locate the Enable Access Control Lists property and select its checkbox to enable HDFS ACLs.
  5. Click Save Changes to commit the changes.

Without enabling HDFS ACLs, we can’t perform ACL operations in HDFS.
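If ACLs are not enabled, a setfacl attempt (covered below) fails with an error along these lines (the exact wording may vary by version):

### Attempting setfacl with ACLs disabled ###
hdfs dfs -setfacl -m user:stonecold:rw- /user/cold/file
setfacl: The ACL operation has been rejected.  Support for ACLs has been disabled by setting dfs.namenode.acls.enabled to false.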

 

Enabling HDFS ACLs Using the Command Line

To enable ACLs using the command line, set the dfs.namenode.acls.enabled property to true in the NameNode’s hdfs-site.xml and restart the NameNode.

<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>

Commands

To set and get file access control lists (ACLs), use the file system shell commands, setfacl and getfacl.

getfacl

hdfs dfs -getfacl [-R] <path>

<!-- COMMAND OPTIONS
<path>: Path to the file or directory for which ACLs should be listed.
-R: Use this option to recursively list ACLs for all files and directories.
-->

Examples:

<!-- To list all ACLs for the file located at /user/kannan -->
hdfs dfs -getfacl /user/kannan

<!-- To recursively list ACLs for everything under /user/kannan -->
hdfs dfs -getfacl -R /user/kannan
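
For reference, getfacl output looks roughly like this (the owner, group, and named-user entries below are illustrative):

# file: /user/kannan
# owner: kannan
# group: kannan
user::rwx
user:stonecold:rw-
group::r-x
mask::rwx
other::r-x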

Note: A directory, its subdirectories, and the files inside them can each have different ACLs.

setfacl

hdfs dfs -setfacl [-R] [-b|-k -m|-x <acl_spec> <path>]|[--set <acl_spec> <path>]

<!-- COMMAND OPTIONS
<path>: Path to the file or directory for which ACLs should be set.
-R: Use this option to recursively apply the operation to all files and directories.
-b: Remove all ACL entries except the base entries for user, group and others.
-k: Remove the default ACL.
-m: Modify the ACL: new entries are added and existing entries are retained.
-x: Remove only the ACL entries specified. Other entries are retained.
<acl_spec>: Comma-separated list of ACL entries.
--set: Use this option to completely replace the existing ACL for the path specified.
       Previous ACL entries will no longer apply.
-->

Examples:

### To give user stonecold read and write permissions on /user/cold/file ###
hdfs dfs -setfacl -m user:stonecold:rw- /user/cold/file

### To remove user undertaker's ACL entry for /user/taker/file ###
hdfs dfs -setfacl -x user:undertaker /user/taker/file
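
Two further sketches (the paths here are hypothetical): a default ACL on a directory is inherited by new files created inside it, and --set replaces the whole ACL in one step.

### To set a default ACL on /user/cold/dir, inherited by new children ###
hdfs dfs -setfacl -m default:user:stonecold:rwx /user/cold/dir

### To replace the entire ACL of /user/cold/file (the spec must include the base user, group and other entries) ###
hdfs dfs -setfacl --set user::rw-,user:stonecold:rw-,group::r--,other::r-- /user/cold/file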

 

Set up a local CDH repository

 

This post explains how to set up a local YUM/CDH repository for your network.

In Linux, /etc/yum.repos.d is the path where the server’s yum repo files live. Every repo has a baseurl value, which contains the link to the repository path.

When you execute “yum install packagename”, yum goes through each repo and contacts its baseurl over the internet to check the availability of the package you asked for. If there’s no internet connectivity, the baseurl can’t be reached and the command fails. In many organizations, downloading packages from external sites/repositories is prohibited, so a satellite repository is created with all the necessary packages/RPMs in it, and the servers download packages from there.

In this task, we are going to download the CDH repos to our server and create a local repository on it, so that the other servers in our network can contact this local repo instead of Cloudera when installing CDH packages.

You need an internet connection the first time, to download the packages and set up the repository.

 

Step 1: Download the repo to your machine

RHEL / CentOS 6:

# wget https://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo

RHEL / CentOS 7:

# wget https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/cloudera-cdh5.repo

After the download completes, move cloudera-cdh5.repo to /etc/yum.repos.d/ so that yum recognizes the repo.
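
For example:

# mv cloudera-cdh5.repo /etc/yum.repos.d/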

 

Step 2: Install webserver

We need a web server installed on this server so that the other machines can access the RPMs over HTTP.

# yum install httpd -y

This creates the /var/www/html directory. Whatever files you place under this directory can be accessed via HTTP.

# service httpd start
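
On RHEL / CentOS 7, where httpd runs under systemd, the equivalent is:

# systemctl start httpd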

 

Step 3: Install yum-utils and createrepo

The yum-utils package includes the reposync command, which is required to pull down the repository contents, and the createrepo package provides the createrepo command, which generates the yum metadata for the local repository.

# yum install yum-utils createrepo -y

 

Step 4: Fetch the rpms of CDH5 repo to your server

# reposync -r cloudera-cdh5

This command downloads all the RPMs available in the cloudera-cdh5 repo (the repo file wget’d in Step 1) to your server.

Copy the RPMs inside the downloaded directory to /var/www/html/cdh/5/rpms/ folder.
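
A sketch of that copy step (reposync downloads into a directory named after the repo ID; the exact subdirectory layout inside it may differ):

# mkdir -p /var/www/html/cdh/5/rpms
# find cloudera-cdh5/ -name "*.rpm" -exec cp {} /var/www/html/cdh/5/rpms/ \;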

Now you should be able to access the RPMs in a browser via the URL “http://servername/cdh/5/rpms”.

 

Step 5: Create a repo file

Inside /var/www/html/cdh/5/ folder, run the below command.

# createrepo .

This creates or updates the metadata required by the yum command to recognize the directory as a repository. The command creates a new directory called repodata.

Edit the repo file you downloaded in Step 1 and change the line starting with baseurl to baseurl=http://servername/cdh/5/ (the directory that now contains repodata, one level above the rpms URL from Step 4). Save the file back to /etc/yum.repos.d/.
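
After the edit, the repo file might look roughly like this (the section name and the name line come from the file you downloaded; gpgcheck=0 is an assumption for a local mirror without GPG keys):

[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://servername/cdh/5/
gpgcheck=0
enabled=1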

 

Step 6: Local CDH repository created

Distribute /etc/yum.repos.d/cloudera-cdh5.repo to all of your servers. Now they can download the RPMs from this machine without needing to connect to the internet.
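
For example (the hostname is illustrative):

# scp /etc/yum.repos.d/cloudera-cdh5.repo node2.example.com:/etc/yum.repos.d/

Then, on the target server:

# yum clean all
# yum install hadoop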

S3 – Versioning

Versioning is a method of keeping multiple modifications/versions of an object in the same bucket. You can use versioning to preserve, retrieve, and restore every version of every object stored in your Amazon S3 bucket.

With versioning, you can easily recover from unintended user actions, accidental deletes, and application failures.

How Versioning works:

In Windows/Linux, if you try to save or copy a file under a name that already exists, you’ll get a pop-up or an error saying the file already exists. With versioning, the behavior is different:

  • If you overwrite an object, it results in a new object version in the bucket. Remember, overwriting does not discard the old object; versioning preserves the old version beneath the new current version, and you can always restore the previous version.

In versioning-enabled buckets, by default, GET requests retrieve the most recently written version. To retrieve an older version, you have to specify the version ID of the object (see the CLI sketch at the end of this section).

  • If you delete an object, instead of removing it permanently, Amazon S3 inserts a delete marker, which becomes the current object version.
  • Each version takes up the full size of the object. If a 1 GB file is uploaded to an S3 bucket, then the source file is trimmed down to 500 MB and re-uploaded under the same name, the 500 MB file becomes the current version of the object and the total object storage is 1.5 GB.
  • Once enabled, versioning can only be suspended; it can’t be disabled.
  • Important: You can add extra protection to your S3 bucket by enabling versioning together with multi-factor authentication (MFA Delete).

With respect to versioning, a bucket can only be in one of 3 states:

  • unversioned (the default)
  • versioning-enabled
  • versioning-suspended.
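
These states map directly onto the S3 API. A sketch of the operations above with the AWS CLI (the bucket name my-bucket and key file.txt are hypothetical):

### Enable versioning on a bucket (use Status=Suspended to suspend it) ###
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled

### List all versions (and delete markers) of an object ###
aws s3api list-object-versions --bucket my-bucket --prefix file.txt

### Retrieve an older version by its version ID ###
aws s3api get-object --bucket my-bucket --key file.txt --version-id <version_id> file-old.txt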