Apache Hadoop is an open-source project that provides a new way of storing and processing large amounts of data. The Java-based framework is designed for distributed storage and processing of very large data sets on clusters of commodity hardware. In this article, we will look at Hadoop commands with examples.
Basic Hadoop Commands
The basic Hadoop commands are the HDFS commands. With HDFS commands, we can perform many tasks in HDFS, such as creating a directory, creating a file, transferring a file or directory from the local file system to HDFS and vice versa, and so on. We will look at the most basic and important commands you should know if you are, or will be, working on the Hadoop Distributed File System.
The ls command lists the files and directories at a given path in HDFS.
hdfs dfs -ls /
Note: A single forward slash ( / ) refers to the root of HDFS.
The mkdir command is used to create a directory in HDFS.
hdfs dfs -mkdir /hdfs_path/directory_name
The touchz command is used to create an empty (zero-length) file in HDFS.
hdfs dfs -touchz /hdfs_path/filename.ext
The put command is used to copy a file or directory from the local file system to HDFS.
hdfs dfs -put /local_path/filename.ext /hdfs_path
The copyFromLocal command is identical to the put command, i.e. it also copies a file or directory from the local file system to HDFS.
hdfs dfs -copyFromLocal /path/filename.ext /hdfs_path
The get command is the opposite of the put command. It is used to copy a file or directory from HDFS to the local file system.
hdfs dfs -get /hdfs_path/filename.ext /local_path
The copyToLocal command is identical to the get command, i.e. it also copies a file or directory from HDFS to the local file system.
hdfs dfs -copyToLocal /hdfs_path/filename.ext /local_path
The cat command is used to display the contents of a file in HDFS.
hdfs dfs -cat /hdfs_path/filename.ext
The cp command is used to copy a file or directory from one location to another within HDFS.
hdfs dfs -cp /hdfs_path/filename.ext /hdfs_target_path
The mv command is used to move a file or directory from one location to another within HDFS.
hdfs dfs -mv /hdfs_path/filename.ext /hdfs_target_path
The rm command is used to delete a file or directory in HDFS.
hdfs dfs -rm /hdfs_path/filename.ext
To delete a directory, add the -r (recursive) flag to the command. For example: hdfs dfs -rm -r /directory_name
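Putting the commands above together, a typical round trip between the local file system and HDFS looks like the following sketch. The paths (/user/demo) and file names (sample.txt) are placeholders, and the commands assume a running HDFS daemon with the hdfs binary on the PATH.

```shell
# Create a working directory in HDFS
hdfs dfs -mkdir -p /user/demo

# Create a small local file and copy it into HDFS
echo "hello hadoop" > sample.txt
hdfs dfs -put sample.txt /user/demo

# Verify the upload and inspect the contents
hdfs dfs -ls /user/demo
hdfs dfs -cat /user/demo/sample.txt

# Copy the file back out under a new name, then clean up in HDFS
hdfs dfs -get /user/demo/sample.txt sample_copy.txt
hdfs dfs -rm -r /user/demo
```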
Hive commands in Hadoop are executed to perform SQL-like operations on big data. Apache Hive is data-warehouse software with a SQL-like querying capability, initially developed at Facebook. In this section, let us look at some examples of Hive queries.
(a) Hive DDL Commands
DDL stands for Data Definition Language; the DDL commands are used to define and modify databases and table structures. Below are the DDL commands with definitions and examples.
|Command|Description|
|---|---|
|CREATE|Used to create a database or a table.|
|USE|Used to select the database we need to work on.|
|DESCRIBE|Used to display the structure of a table.|
|ALTER|Used to modify a database or a table.|
|TRUNCATE|Used to delete the contents of a table.|
|DROP|Used to delete a database or a table.|
|SHOW|Used to display all the databases, or the tables in a database.|
create database hiberstack;
CREATE TABLE <table_name> ( <column_name1> <data_type>, <column_name2> <data_type>, ...) ROW FORMAT DELIMITED FIELDS TERMINATED BY '[delimiter]';
create table emp ( id int, name string, salary int) row format delimited fields terminated by '\t';
ALTER DATABASE <database_name> SET DBPROPERTIES ('property_name'='property_value', ...);
alter database hiberstack set OWNER ROLE admin;
To rename a table, the syntax and command are as below:
ALTER TABLE <table_name> RENAME TO <new_table_name>;
alter table emp rename to employee;
DROP DATABASE [IF EXISTS] <database_name>;
drop database if exists hiberstack;
DROP TABLE [IF EXISTS] <table_name>;
drop table if exists emp;
(b) Hive DML Commands
DML stands for Data Manipulation Language. As the name suggests, DML commands are used to manipulate the data in a table: loading data into it, displaying its contents, and so on. DML commands can be executed once the table has been created in Hive using the DDL commands. The main Hive DML commands are listed below:
|Command|Description|
|---|---|
|LOAD|Used to load data into a table from a file in either the local file system or HDFS.|
|SELECT|Used to display the contents of a table.|
|INSERT|Used to insert data into a table directly from the command itself (rather than from a file, as with LOAD).|
|UPDATE|Used to update the data in a table, i.e. change the values in columns/rows.|
|DELETE|Used to delete rows from a table.|
As mentioned earlier, the LOAD command is used to add the data to the table. You can add the data from a text file, CSV file, or any other format. If your input file is present in the local file system, then the command will be as below,
LOAD DATA LOCAL INPATH 'local_path/file.ext' [OVERWRITE] INTO TABLE <table_name>;
load data local inpath '/home/ubuntu/emp.txt' overwrite into table emp;
If your input file is present in HDFS, then the command will be as below,
LOAD DATA INPATH 'hdfs_path/file.ext' [OVERWRITE] INTO TABLE <table_name>;
load data inpath '/emp.txt' overwrite into table emp;
The SELECT command is used to show the contents of a table. If the table contains a huge number of rows and you want to display only the first few, add the LIMIT clause to the command.
SELECT [ * | column ] FROM <table_name> [LIMIT n];
select * from emp; -- display all the records in the emp table
select name from emp limit 15; -- display only the 'name' column, and only the first 15 rows
It is executed to insert the data in the table. The data to be inserted is provided at the time of executing the command itself.
INSERT INTO <table_name> (column1, column2, …) VALUES (row1), (row2), …;
insert into emp (id, name, salary) values (1, 'abc', 1000), (2, 'xyz', 5000);
The UPDATE command is used to update the data in a table. We can update a particular column using a WHERE clause. This command can be executed only on tables that support ACID transactions.
UPDATE <table_name> SET <column_name>=<value> WHERE <column_name>=<value>;
update emp set name='pqr' where salary=5000;
The DELETE command is executed to delete particular rows from a table. Like the UPDATE command, it can be executed only on tables that support ACID transactions.
DELETE FROM <table_name> WHERE <condition>;
delete from emp where id=1;
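The DML commands above can likewise be collected into one script. A sketch, assuming the emp table from the DDL section already exists and, for the UPDATE and DELETE steps, was created as a transactional (ACID) table; dml_demo.hql is a placeholder file name.

```shell
# A DML walk-through against the emp table from the DDL section
cat > dml_demo.hql <<'EOF'
INSERT INTO emp (id, name, salary) VALUES (1, 'abc', 1000), (2, 'xyz', 5000);
SELECT * FROM emp;
-- The next two statements need a transactional table, e.g. one created
-- with TBLPROPERTIES ('transactional'='true') on an ACID-enabled cluster
UPDATE emp SET name='pqr' WHERE salary=5000;
DELETE FROM emp WHERE id=1;
EOF

# Run the script (assumes the hive CLI is installed and configured)
hive -f dml_demo.hql
```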
Apache Pig is a platform for analyzing large data sets on Apache Hadoop. Pig Latin is the language of this platform. When Pig commands are executed, MapReduce jobs run in the background. Apache Pig was originally created at Yahoo to let researchers run MapReduce jobs on huge datasets.
Now let us see some of the basic commands in Apache Pig.
(c) Basic Pig commands
Pig commands are executed in the Grunt shell, the native shell provided by Apache Pig. Execute the command pig in the terminal to start the Grunt shell.
Note: All the below commands are executed in the default pig mode i.e. MapReduce mode.
The fs command is used to run HDFS shell commands from within the Grunt shell, for example to list the files present in HDFS:
fs -ls /
You can also execute the mkdir command within the grunt shell to create a directory in HDFS.
fs -mkdir temp
The clear command is used to clear the grunt shell screen. This will put the cursor at the top of the screen.
The history command, as the name suggests, shows the list of commands that have been executed so far.
(d) Reading/loading data in Pig
emp = LOAD 'hdfs://localhost:9000/pig_data/emp_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, designation:chararray);
The PigStorage() option is used to define the delimiter in the dataset.
(e) Storing data
STORE emp INTO 'hdfs://localhost:9000/pig_output/';
The output file will be stored in a directory named 'pig_output' in HDFS.
The dump command is used to execute the Pig statements and display the output in the Grunt shell itself, without saving it to HDFS. For example: dump emp;
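The LOAD, DUMP, and STORE operations above can also be combined into a script and run non-interactively. A sketch, reusing the emp relation from the examples above; the HDFS paths (hdfs://localhost:9000/pig_data/..., /pig_output/) are placeholders, and the command assumes pig is installed and HDFS is running.

```shell
# A minimal Pig Latin script tying LOAD, DUMP, and STORE together
cat > emp_demo.pig <<'EOF'
-- Load the comma-separated employee data (path is a placeholder)
emp = LOAD 'hdfs://localhost:9000/pig_data/emp_data.txt'
      USING PigStorage(',')
      AS (id:int, firstname:chararray, lastname:chararray, designation:chararray);
-- Print the relation to the console
DUMP emp;
-- Persist it back to HDFS
STORE emp INTO 'hdfs://localhost:9000/pig_output/';
EOF

# Run in the default MapReduce mode
pig -x mapreduce emp_demo.pig
```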