Install Apache Hive 2.x in CentOS


Introduction
Apache Hive is a data warehouse software project that provides data query and analysis on top of Apache Hadoop. Hive offers a SQL-like interface for querying data stored in databases and file systems that integrate with Hadoop. Without it, SQL-style applications and queries over large datasets must be implemented in the low-level MapReduce Java API. In this article, we will look at how to install Apache Hive 2.x in CentOS.
Hive provides the SQL abstraction needed to express SQL-like queries, known as HiveQL, without requiring them to be written against the low-level Java API. Because most data warehouse systems use SQL-based query languages, Hive makes SQL-based applications easier to port to Hadoop. Though Apache Hive was created by Facebook, it is now used and developed by other companies such as Netflix.
Features of Apache Hive
Apache Hive allows you to analyze massive datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3. It has a SQL-like query language called HiveQL that transparently translates queries into MapReduce, Apache Tez, or Spark jobs. We use Hive on top of Hadoop to process structured data available in tabular format, and it can query petabytes of data with ease. Following are some of the key features of Hive:
- SQL-like Queries: Queries are written in HiveQL, a SQL-like language, and executed using MapReduce, Apache Tez, or Apache Spark, which opens large-scale data processing to anyone familiar with SQL. HiveQL is declarative rather than procedural, just like SQL (see the sketch after this list).
- Data Storage: Hive allows users to access files stored in HDFS, Apache HBase, Amazon S3, and other storage systems.
- Supports different Data Structures: Hive metadata is exposed to other data processing tools, such as Apache Pig and MapReduce, using HCatalog, a table and storage management layer for Hadoop. Users can read and write data without having to think about where it’s processed, what format it’s in, or redefining the structure for each method.
- ETL support: ETL (Extract, Transform, and Load) is supported by Apache Hive. Previously, such pipelines required custom code, for example in Python.
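To illustrate HiveQL’s declarative style, below is a minimal sketch of an aggregate query. The employees table and its columns are hypothetical and exist only for illustration; Hive compiles the statement into MapReduce (or Tez/Spark) jobs without any procedural code from the user.
-- Hypothetical table for illustration: employees(id, name, dept, salary)
-- Hive plans and runs the distributed jobs for this query automatically.
SELECT dept,
       COUNT(*)    AS num_employees,
       AVG(salary) AS avg_salary
FROM employees
GROUP BY dept
ORDER BY avg_salary DESC;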
Install Apache Hive 2.x in CentOS
Follow the below steps to install Hive 2.x in CentOS.
1. Create a directory in which to download and extract the Hive tar.gz file. Execute the below commands,
mkdir hive
cd hive
wget https://archive.apache.org/dist/hive/hive-2.3.8/apache-hive-2.3.8-bin.tar.gz
tar -xvf apache-hive-2.3.8-bin.tar.gz
2. Edit the .bashrc file to set Hive environment variables
cd
vi .bashrc
Press i to enter insert mode and add the below 2 lines in the file,
export HIVE_HOME=/home/hiberstack/hive/apache-hive-2.3.8-bin
export PATH=$PATH:/home/hiberstack/hive/apache-hive-2.3.8-bin/bin


Save and exit. Execute the below command to apply changes,
source .bashrc
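To confirm the changes took effect, you can check that the variables resolve; the paths shown assume the same /home/hiberstack layout used above.
echo $HIVE_HOME   # should print /home/hiberstack/hive/apache-hive-2.3.8-bin
which hive        # should resolve to the bin directory that was added to PATH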
3. Create Hive directory in HDFS
hdfs dfs -mkdir -p /user/hive/warehouse
4. Assign appropriate permission to this directory
hdfs dfs -chmod g+w /user/hive/warehouse
5. Assign appropriate permission to the /tmp directory in hdfs
hdfs dfs -chmod g+w /tmp
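You can verify both directories with an ls; the exact owner and mode shown depend on your user and umask, but the group bits should now include write permission.
hdfs dfs -ls /user/hive   # the warehouse entry should show group write, e.g. drwxrwxr-x
hdfs dfs -ls /            # the /tmp entry should likewise show group write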
6. Add Hadoop directory in hive config
vi $HIVE_HOME/bin/hive-config.sh
Add the below line in the file
export HADOOP_HOME=/usr/lib/hadoop-2.8.1
Note: Change the path and Hadoop directory name as per your setup.
Save and exit.
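As a quick sanity check that the path you set actually points at a Hadoop installation (adjust it to match your own setup):
ls /usr/lib/hadoop-2.8.1/bin/hadoop   # an error here means the HADOOP_HOME path is wrong for your setup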
7. Create hive-site.xml file
cd
cd hive/apache-hive-2.3.8-bin/conf/
vi hive-site.xml
Add the below lines in the file,
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>admin</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoCreateTables</name>
    <value>true</value>
  </property>
</configuration>
Note: Change the MySQL password in the above file as per your setup, in the property shown below,
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>admin</value>
</property>
Save and exit.
8. Add MySQL connector jar in hive lib folder
cd
cd hive/apache-hive-2.3.8-bin/lib
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.28/mysql-connector-java-5.1.28.jar
9. Create hive schema in MySQL
cd
sudo service mysqld start
cd hive/apache-hive-2.3.8-bin/scripts/metastore/upgrade/mysql/
mysql -u root -p
Once you are logged in to MySQL, execute the below queries
drop database IF EXISTS metastore;
create database metastore;
use metastore;
source hive-schema-2.3.0.mysql.sql;
Exit from the MySQL shell by executing the exit command.
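If you want to confirm the schema loaded before exiting, listing the metastore tables is a quick check; the names below are typical of the Hive 2.3 schema.
use metastore;
show tables;
-- expect metastore tables such as DBS, TBLS, and COLUMNS_V2 in the listing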
10. Done. Hive 2.x is installed. Check the Hive version with the below command,
hive --version
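As a further smoke test, you can open the Hive shell and list databases; a fresh installation should show only the built-in default database.
hive              # starts the Hive CLI
SHOW DATABASES;   # run inside the Hive shell; expect only "default" on a fresh install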


Apache Hive Example
Let us now look at a simple example in Apache Hive. Here, we will create a database, create a table in it, and add data to the table from a file saved in HDFS.
Suppose we have a file in HDFS named emp.txt, and the contents of the file are as below,
1,mark,twain
2,robert,fisher
3,dennis,ritchie


We will create a table in Hive to add the data from this file.
1. Create database
CREATE DATABASE hiberstack;
2. Create table
USE hiberstack;
CREATE TABLE employee (id int, first_name string, last_name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
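You can confirm the table definition before loading any data:
DESCRIBE employee;
-- should list id (int), first_name (string), and last_name (string)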
3. Load data in the table
LOAD DATA INPATH '/emp.txt' INTO TABLE employee;
Note: If your data is in the local file system instead of HDFS, use LOAD DATA LOCAL INPATH instead, as in the sketch below.
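For reference, a local-load variant might look like the following; the /home/hiberstack/emp.txt path is only an assumed example location.
-- hypothetical local path; adjust to wherever emp.txt lives on your machine
LOAD DATA LOCAL INPATH '/home/hiberstack/emp.txt' INTO TABLE employee;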
4. Print table
SELECT * FROM employee;
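Given the emp.txt contents above, the output should resemble:
1	mark	twain
2	robert	fisher
3	dennis	ritchie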