cca131 · Hadoop

Install and configure Sentry

Before adding Sentry, complete the general prerequisites below; the problem description may mention them. Confirm the Hive warehouse directory setting in the /etc/hive/conf/hive-site.xml file. The Hive warehouse directory (/user/hive/warehouse) must be owned by the Hive user and group and should have 771 permissions. # sudo -u hdfs hadoop fs…

cca131 · Hadoop

Enable/configure log and query redaction

Data redaction is the suppression of sensitive data, such as personally identifiable information (PII): credit card numbers, email addresses, and social security numbers. Cloudera has a data redaction feature that masks credit card numbers and email addresses with random or custom strings (which we specify), so that in queries and log files those random strings will…

Hadoop

Efficiently copy data within a cluster/between clusters

DistCp (distributed copy) is a tool used for large inter/intra-cluster copying of HDFS data. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.…

Hadoop

Perform OS-level configuration for Hadoop installation

Before installing CDH on our server, we have to make the following OS-level configuration changes for a successful installation. Disable SELinux: "Security-Enhanced Linux (SELinux) is a Linux kernel security module that provides a mechanism for supporting access control security policies." If SELinux is enabled, the Cloudera server installation will fail. To disable…