In this tutorial, we will focus on scripting language Apache PIG for processing big data.
Apache Pig is a scripting platform that runs top on Hadoop. It is a high-level and declarative language. It is designed for non-java programmers. Pig uses Latin scripts data flow language.
Hadoop is written in Java and initially, most of the developers write map-reduce jobs in Java. It means till then Java is the only language to interact with the Hadoop system. Yahoo came with one scripting language known as Pig in 2009. Here are few other reasons why Yahoo developed the Pig.
Yahoo’s managers have faced problems in performing small tasks. For each small and big change, they need to call the programming team. Also, programmers need to write lengthy codes for small tasks. To overcome these problems, yahoo developed a scripting platform Pig. Pig help researchers to analyze data with simple and few lines of declarative syntax.
Apache Pig offers two core components:
Pig works in the following steps:
We can execute Pig in two modes: Local and MapReduce mode.
file1.csv
EmpNr,location
5,The Hague
3,Amsterdam
9,Rotterdam
10,Amsterdam
12,The Haguee
file2.csv
EmpNr,Salary
5,10000
3,5000
9,2500
10,6000
12,4000
Lets perform operations in to the Pig environment:
grunt> file1 = LOAD ‘file1’ Using PigStorage(‘,’) as (emp_no:int, location:chararray);
grunt> file2 = LOAD ‘file2’ Using PigStorage(‘,’) as (emp_no:int, salary:int);
grunt> joined = JOIN file1 BY(emp_no), file2 by (emp_no);
grunt> STORE joined INTO ‘out’;
grunt> DUMP joined;
Output:
(3,Amsterdam,3,5000)
(5,The Hague,5,10000)
(9,Rotterdam,9,2500)
(10,Amsterdam,10,6000)
(12,The Hague,12,4000)
grunt> file2 = LOAD ‘file2’ Using PigStorage(‘,’) as (emp_no:int, salary:int);
grunt> filtered= FILTER file2 BY salary>5000;
grunt> STORE filtered INTO ‘out1’;
grunt> dump filtered;
Output:
(5,10000)
(10,6000)
grunt> grp = GROUP joined BY location;
grunt> STORE grp INTO ‘out1’;
grunt> DUMP grp;
Output:
(Amsterdam,{(10,Amsterdam,10,6000),(3,Amsterdam,3,5000)})
(Rotterdam,{(9,Rotterdam,9,2500)})
(The Hague,{(12,The Hague,12,4000),(5,The Hague,5,10000)})
grunt> out = FOREACH grp GENERATE group, COUNT(joined.location) as count;
grunt> STORE out INTO ‘out1’;
grunt> DUMP out;
Output:
(Amsterdam,2)
(Rotterdam,1)
(The Hague,2)
grunt> LIMIT file1 2
Output: (5,The Hague)
(3,Amsterdam)
We can count the occurrence of words in a given input file in PiG. Lets see the script below:
lines = LOAD ‘file_demo.txt’ AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
Output:
(PM,1)
(in,1)
(of,1)
(on,1)
(to,1)
(CMs,1)
(and,1)
(Modi,1)
(India,2)
(fresh,1)
(Health,1)
(issues,1)
(speaks,1)
(states,1)
(Vaccine,1)
(hotspot,1)
(Lockdown,1)
(Ministry,1)
(Shortage,1)
(Extension,1)
(guidelines,1)
(management,1)
(Coronavirus,3)
(containment,1)
In this tutorial, we have discussed Hadoop PIG Features, Architecture, Components, and its working. Also, we have discussed the scripting in Pig and experimented with operators, and implemented a word count program in Pig.
In this tutorial, we will focus on MapReduce Algorithm, its working, example, Word Count Problem,…
Learn how to use Pyomo Packare to solve linear programming problems. In recent years, with…
In today's rapidly evolving technological landscape, machine learning has emerged as a transformative discipline, revolutionizing…
Analyze employee churn, Why employees are leaving the company, and How to predict, who will…
Airflow operators are core components of any workflow defined in airflow. The operator represents a…
Machine Learning Operations (MLOps) is a multi-disciplinary field that combines machine learning and software development…