In this tutorial, we will focus on Hadoop Hive for processing big data.
Hive is a component of the Hadoop stack. It is an open-source data warehouse tool that runs on top of Hadoop. It was developed by Facebook and later donated to the Apache Software Foundation. It reads, writes, and manages large tables stored in HDFS or other data sources.
Hive is not designed for row-level insert, delete, and update operations; instead, it is used to perform analytics, mining, and report generation on large data warehouses. Hive uses the Hive Query Language (HQL), which is similar to SQL; most of its syntax resembles that of the MySQL database. It is used for OLAP (Online Analytical Processing) purposes.
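For example, a typical OLAP-style HQL query reads just like SQL. The sales table here is hypothetical, used only to illustrate the syntax:
hive> select region, sum(amount) from sales group by region;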
In 2006, Facebook was generating 10 GB of data per day; by 2007 this had grown to 1 TB per day, and soon after to 15 TB per day. Initially, Facebook used the Scribe server, an Oracle database, and Python scripts to process these large data sets. As the volume of data grew, Facebook shifted to Hadoop as its key tool for data analysis and processing.
Facebook used Hadoop to manage its big data but ran into problems with ETL operations, because even small operations required writing Java MapReduce programs. That demanded many Java developers, who were hard to find, and Java is not easy for analysts to learn. So Facebook developed Hive, which uses SQL-like syntax that is easy to learn and write. Hive makes Hadoop accessible to anyone who already knows SQL from other RDBMS tools.
The following are the features of Hive.
[cloudera@quickstart ~]$ hive
hive>
hive> show databases;
hive> create database emp;
hive> use emp;
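To confirm that emp is now the active database, you can call the built-in current_database() function (available in Hive 0.13 and later):
hive> select current_database();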
hive> create table employee(
> emp_id int,
> name string,
> location string,
> dep string,
> designation string,
> salary int)
> row format delimited fields terminated by ',';
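Before loading data, you can verify the table's schema with the describe command:
hive> describe employee;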
Load the data from the employee.txt file:
employee.txt
101,Alice,New York,IT,Soft Engg,4000
102,Ali,Atlanta,Data Science,Sr Soft Engg,4500
103,Chang,New York,Data Science,Lead,6000
104,Robin,Chicago,IT,Manager,7000
hive> load data local inpath '/home/cloudera/employee.txt' into table employee;
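The local keyword tells Hive to read the file from the local file system. If the file were already in HDFS, you would omit local; the HDFS path below is only an example, and note that in this case Hive moves (rather than copies) the file into the table's directory:
hive> load data inpath '/user/cloudera/employee.txt' into table employee;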
hive> select * from employee;
Output:
hive> create table project(
> emp_id int,
> project_id int,
> pname string)
> row format delimited fields terminated by ',';
Load the data from the project.txt file:
project.txt
101,2001,Web Portal
102,2002,NER Model
103,2003,OCR Model
104,2004,Web Portal
hive> load data local inpath '/home/cloudera/project.txt' into table project;
hive> select * from project;
Output:
The tables you have created in the above subsections are internal (managed) tables, which Hive creates by default. To create an external table, you have to use the external keyword, as shown in the following syntax (a new name, project_ext, is used here because the project table already exists):
hive> create external table project_ext(
> emp_id int,
> project_id int,
> pname string)
> row format delimited fields terminated by ',';
An external table is also known as a non-managed table. The key difference is what happens on drop table: for an internal table, Hive deletes both the table metadata and the underlying data files, whereas for an external table it deletes only the metadata and leaves the data files in place.
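You can check which type a table is with describe formatted, which reports MANAGED_TABLE or EXTERNAL_TABLE in its Table Type field:
hive> describe formatted project_ext;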
A join is used to combine two or more tables based on a common column. Let's perform a JOIN operation on the employee and project tables:
hive> select * from employee join project on employee.emp_id=project.emp_id;
Output:
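The query above is an inner join, so an employee without any project would be dropped from the result. To keep every employee regardless of whether a matching project exists, you can use a left outer join:
hive> select e.name, p.pname from employee e left outer join project p on e.emp_id = p.emp_id;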
Group By is used to group the data based on a given field or column in a table. Let's see an example of Group By in the following query:
hive> select location, avg(salary) from employee group by location;
Output:
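To filter on an aggregated value, you can add a having clause after group by. For example, to show only the locations where the average salary exceeds 4500:
hive> select location, avg(salary) from employee group by location having avg(salary) > 4500;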
A subquery is a query within another query (a nested query): the output of the inner query becomes the input of the outer query. Let's see an example of a subquery in the following query:
hive> select * from employee where employee.emp_id in (select emp_id from project where pname='Web Portal');
Output:
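The same result can be obtained by rewriting the subquery as a join, which older Hive versions often execute more efficiently (on this data set each employee has a single project, so the two queries return the same rows):
hive> select e.* from employee e join project p on e.emp_id = p.emp_id where p.pname = 'Web Portal';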
The following are the limitations of Hive.
In this tutorial, we have discussed Apache Hive features, architecture, components, and limitations. We have also compared Hive vs. SQL, covered various operations (such as Order By, Sort By, Distribute By, and Cluster By), and contrasted partitions vs. buckets. Finally, we executed HQL in the Hive shell and performed various operations such as loading data, Join, Group By, and subqueries.