Install Apache Hive 2.x in CentOS

Introduction

Apache Hive is a data warehouse software project that provides data query and analysis on top of Apache Hadoop. Hive offers a SQL-like interface for querying data stored in databases and file systems that integrate with Hadoop. Without it, SQL-style applications and queries over large, distributed datasets would have to be implemented against the low-level MapReduce Java API. In this article, we will look at how to install Apache Hive 2.x in CentOS.

Hive provides the SQL abstraction needed to integrate SQL-like queries, known as HiveQL, with the underlying Java, without requiring queries to be implemented in the low-level Java API. Because most data warehouse systems use SQL-based query languages, Hive makes SQL-based applications easier to port to Hadoop. Though Apache Hive was created at Facebook, it is now used and developed by other companies such as Netflix.


Features of Apache Hive

Apache Hive allows you to analyze massive datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3. It offers a SQL-like query language called HiveQL whose queries are transparently translated into MapReduce, Apache Tez, or Spark jobs. We use Hive on top of Hadoop to process structured data available in tabular format, and it can query petabytes of data with ease. The following are some of the key features of Hive:

  1. SQL-like Queries: Queries are written in HiveQL, a SQL-like language, and executed as MapReduce, Apache Tez, or Apache Spark jobs, allowing more people to process and analyze massive quantities of data. Like SQL, HiveQL is declarative rather than procedural (see the short example after this list).
  2. Data Storage: Hive allows users to access files stored in HDFS, Apache HBase, Amazon S3, and other storage systems.
  3. Supports different Data Structures: Hive metadata is exposed to other data processing tools, such as Apache Pig and MapReduce, using HCatalog, a table and storage management layer for Hadoop. Users can read and write data without having to think about where it’s processed, what format it’s in, or redefining the structure for each method.
  4. ETL support: Apache Hive supports ETL (Extract, Transform, and Load) workloads that would previously have been implemented as custom code, for example in Python.
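
For illustration, here is a minimal HiveQL query in the declarative style described above. The table and column names (page_views, country, view_date) are hypothetical and only serve to show the syntax; Hive plans and runs such a query as MapReduce, Tez, or Spark jobs behind the scenes.

-- Hypothetical page_views table; the query states what to compute, not how
SELECT country, COUNT(*) AS views
FROM page_views
WHERE view_date = '2021-05-01'
GROUP BY country
ORDER BY views DESC
LIMIT 10;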

Install Apache Hive 2.x in CentOS

Follow the steps below to install Hive 2.x in CentOS.

1. Create a directory in which to download and extract the Hive tar.gz file, then execute the commands below:

mkdir hive
cd hive
wget https://archive.apache.org/dist/hive/hive-2.3.8/apache-hive-2.3.8-bin.tar.gz
tar -xvf apache-hive-2.3.8-bin.tar.gz

2. Edit the .bashrc file to set Hive environment variables

cd
vi .bashrc

Press i to enter insert mode and add the two lines below to the file:

export HIVE_HOME=/home/hiberstack/hive/apache-hive-2.3.8-bin
export PATH=$PATH:/home/hiberstack/hive/apache-hive-2.3.8-bin/bin
Setting Hive environment variables

Save and exit, then execute the command below to apply the changes:

source .bashrc
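
As an optional sanity check (assuming the paths above), confirm that the variables are set and the hive binary is on the PATH:

echo $HIVE_HOME
which hive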

3. Create Hive directory in HDFS

hdfs dfs -mkdir -p /user/hive/warehouse

4. Assign appropriate permission to this directory

hdfs dfs -chmod g+w /user/hive/warehouse

5. Assign appropriate permission to the /tmp directory in hdfs

hdfs dfs -chmod g+w /tmp

6. Add the Hadoop directory to the Hive config

vi $HIVE_HOME/bin/hive-config.sh

Add the below line in the file

export HADOOP_HOME=/usr/lib/hadoop-2.8.1

Note: Change the path and Hadoop directory name as per your setup.

Save and exit.

7. Create hive-site.xml file

cd
cd hive/apache-hive-2.3.8-bin/conf/
vi hive-site.xml

Add the lines below to the file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>admin</value>
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>true</value>
</property>
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>
<property>
  <name>datanucleus.autoCreateTables</name>
  <value>true</value>
</property>
</configuration>

Note: Change the MySQL password in the above file to match your setup, in the property shown below:

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>admin</value>
</property>

Save and exit.

8. Add the MySQL connector jar to the Hive lib folder

cd
cd hive/apache-hive-2.3.8-bin/lib
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.28/mysql-connector-java-5.1.28.jar

9. Create the Hive metastore schema in MySQL

cd
sudo service mysqld start
cd hive/apache-hive-2.3.8-bin/scripts/metastore/upgrade/mysql/
mysql -u root -p

Once you are logged in to MySQL, execute the queries below:

drop database IF EXISTS metastore;
create database metastore;
use metastore;
source hive-schema-2.3.0.mysql.sql;

Exit the MySQL shell by executing the exit command.
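
Alternatively, Hive 2.x ships a schematool utility that can initialize the metastore schema using the connection settings from hive-site.xml. Against an empty metastore database, a rough equivalent of the manual steps above would be:

$HIVE_HOME/bin/schematool -dbType mysql -initSchema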

10. Done. Hive 2.x is installed. Check the Hive version with the command below:

hive --version
Hive version
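
As a further check that Hive can reach the metastore configured in hive-site.xml, you can run a simple query directly from the command line; it should list the default database:

hive -e "show databases;"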

Apache Hive Example

Let us now look at a simple example in Apache Hive. Here, we will create a database and a table in it, and load some data into the table from a file stored in HDFS.

Suppose we have a file in HDFS named emp.txt, with the following contents:

1,mark,twain
2,robert,fisher
3,dennis,ritchie
Input file in HDFS

We will create a table in Hive to add the data from this file.

1. Create database

CREATE DATABASE hiberstack;

2. Create table

USE hiberstack;
CREATE TABLE employee (id int, first_name string, last_name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

3. Load data into the table

LOAD DATA INPATH '/emp.txt' INTO TABLE employee;

Note: If your data is in the local file system, use LOAD DATA LOCAL INPATH instead of LOAD DATA INPATH.
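
For example, if emp.txt were in the local home directory instead of HDFS (the path below is only illustrative), the load would look like this:

-- local path is illustrative; adjust it to wherever the file actually lives
LOAD DATA LOCAL INPATH '/home/hiberstack/emp.txt' INTO TABLE employee;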

4. Print the table contents

SELECT * FROM employee;
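
With the emp.txt contents shown earlier, the query should return output along these lines:

1	mark	twain
2	robert	fisher
3	dennis	ritchie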
