Pig: Analyzing data sets
Daniel Lopes, B.Eng
Agenda
- Big Data and Analytics, the cool guys!
- Hadoop?! You heard, but…
- ETL? It is Extract, Transform and Load
- So? What’s Pig?
- Pig Latin
- Pig Latin Basics. Let’s get started! :)
- Pig vs SQL
- UDFs, the real magic!
Big Data and Analytics
You don’t have 1PB of data, so you don’t have Big Data. Serious?!
Everyone has Big Data, but most don’t store it!
The most important thing is not having a lot of data, but having a good
question for your data, one it can answer with data you trust.
Sometimes the trust comes from your model, from your good question.
When you don’t know your question, a large amount of data can help you
find both the question and the answer.
Big Data Landscape
(http://www.ongridventures.com/wp-content/uploads/2012/10/Big-Data-Landscape1.jpg)
Hadoop
Apache Hadoop is an open-source software framework written in Java
for distributed storage and distributed processing.
At the center of Hadoop are HDFS (Hadoop Distributed File System) for
distributed storage and MapReduce for distributed processing.
Hadoop was created in 2005 by Doug Cutting, then working at Yahoo!,
and Mike Cafarella; Hadoop 1.0 was released as an Apache project in
December of 2011.
HDFS
HDFS stores files across multiple machines. It achieves reliability by
replicating the data across multiple hosts, and hence theoretically does
not require RAID storage on hosts.
It’s “like” RAID in the cloud.
With the default replication value, 3, data is stored on three nodes:
two on the same rack, and one on a different rack. Data nodes can talk
to each other to rebalance data, to move copies around, and to keep
the replication of data high. (WIKIPEDIA, 2015)
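You can poke at this from Pig’s Grunt shell (introduced later in this talk), which passes fs commands straight through to HDFS. A minimal sketch, assuming a hypothetical path; -ls shows each file’s replication factor, and -setrep -w changes it and waits for the blocks to comply:
grunt> fs -ls /data/events.tsv
grunt> fs -setrep -w 3 /data/events.tsv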
MapReduce
MapReduce is a programming model and an associated
implementation for processing and generating data sets with a parallel,
distributed algorithm on a cluster.
Basically there are three steps: Map, Shuffle and Reduce.
Map, Shuffle and Reduce steps
Map step: Each worker node applies the "map()" function to the local
data and writes the output to temporary storage; a master node ensures
that only one copy of the redundant input data is processed.
Shuffle step: Worker nodes redistribute data based on the output keys
(produced by the "map()" function), such that all data belonging to one
key is located on the same worker node.
Reduce step: Worker nodes now process each group of output data,
per key, in parallel.
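In Pig you rarely write these steps by hand; the compiler decides which statements run in which phase. A rough, illustrative sketch (the file and field names here are hypothetical):
-- Map step: LOAD and row-wise operations run on each worker's local data
logs = LOAD '/tmp/access_logs' AS (user:chararray, bytes:long);
-- Shuffle step: GROUP sends all rows with the same key to the same worker
by_user = GROUP logs BY user;
-- Reduce step: the per-key aggregation runs in parallel across groups
totals = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total;
STORE totals INTO '/tmp/bytes_per_user';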
MapReduce Process
(http://mm-tom.s3.amazonaws.com/blog/MapReduce.png)
ETL - Extract, Transform and Load
ETL refers to a process in database usage and especially in data
warehousing that:
Extracts data from homogeneous or heterogeneous data sources,
Transforms the data into the proper format or structure for querying
and analysis purposes,
Loads it into the final target (database; more specifically, an operational
data store, data mart, or data warehouse).
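Pig is a natural fit for the transform step. A minimal sketch, assuming a hypothetical tab-separated input and hypothetical paths and fields:
-- Extract: read raw events from HDFS
raw = LOAD '/data/raw/events.tsv' USING PigStorage('\t')
      AS (ts:chararray, user:chararray, amount:double);
-- Transform: drop malformed rows and reshape for analysis
clean = FILTER raw BY user IS NOT NULL AND amount >= 0;
shaped = FOREACH clean GENERATE SUBSTRING(ts, 0, 10) AS day, user, amount;
-- Load: write the result to the target store
STORE shaped INTO '/data/warehouse/events_clean';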
What’s Pig?
Pig is a platform for creating MapReduce jobs on Hadoop.
You can use it through a .pig script file or interactively in its shell,
called Grunt.
It uses the Pig Latin language, which is similar to SQL.
Pig Latin can be extended using UDF (User Defined Functions) which
the user can write in Java, Python, JavaScript, Ruby or Groovy and
then call directly from the language.
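A quick interactive session in Grunt might look like this (the file name is hypothetical); DUMP is what actually kicks off a MapReduce job and prints the result:
grunt> lines = LOAD '/tmp/data.txt' AS (line:chararray);
grunt> first_rows = LIMIT lines 5;
grunt> DUMP first_rows;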
Pig Latin (Word Count Example)
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
Pig Latin Basic Statements
LOAD [file]: reads a file into a dataset,
FOREACH [dataset] GENERATE [statement]: iterates over the dataset,
FILTER [dataset] BY [condition]: filters the data,
GROUP [dataset] BY [column]: creates groups in the dataset,
ORDER [dataset] BY [column] [ASC|DESC]: orders the records,
LIMIT [dataset] [integer]: extracts a number of rows,
STORE [dataset] INTO [file]: saves the new dataset,
REGISTER [file]: loads external libraries (see the sketch below).
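A tiny end-to-end script stringing most of these together, with hypothetical paths, fields, and library name: the ten users with the most clicks.
REGISTER '/libs/my-udfs.jar';  -- hypothetical external library
events = LOAD '/data/events.tsv' USING PigStorage('\t')
         AS (user:chararray, action:chararray);
clicks = FILTER events BY action == 'click';
by_user = GROUP clicks BY user;
counts = FOREACH by_user GENERATE group AS user, COUNT(clicks) AS n;
ranked = ORDER counts BY n DESC;
top10 = LIMIT ranked 10;
STORE top10 INTO '/data/top10_users';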
Pig vs SQL
Pig Latin is procedural, where SQL is declarative.
Pig Latin allows pipeline developers to decide where to checkpoint data
in the pipeline.
Pig Latin allows the developer to select specific operator
implementations directly rather than relying on the optimizer.
Pig Latin supports splits in the pipeline.
Pig Latin allows developers to insert their own code almost anywhere in
the data pipeline.
Alan Gates, Pig Development Team, Yahoo!
(https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html)
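To make the contrast concrete: in SQL, "users with more than 100 clicks" is a single declarative query (SELECT user, COUNT(*) FROM clicks GROUP BY user HAVING COUNT(*) > 100); in Pig Latin you spell out the pipeline step by step, and can checkpoint or split it wherever you like. A sketch with hypothetical paths:
clicks = LOAD '/data/clicks' AS (user:chararray, url:chararray);
grouped = GROUP clicks BY user;
counts = FOREACH grouped GENERATE group AS user, COUNT(clicks) AS n;
-- checkpoint: persist the intermediate result before continuing
STORE counts INTO '/data/checkpoints/user_counts';
-- split: the same relation also feeds a second branch of the pipeline
heavy = FILTER counts BY n > 100;
STORE heavy INTO '/data/heavy_users';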
UDFs
Quite often it is necessary to add your own functions to Pig Latin for
specific jobs.
Sudar Muthu (http://sudarmuthu.com/blog/writing-pig-udf-functions-using-python/)
# udf.py: a Jython UDF that returns the length of a field
@outputSchema("num:long")
def get_length(data):
    str_data = ''.join([chr(x) for x in data])
    return len(str_data)

-- register the UDF and call it from Pig Latin
REGISTER '/bkf-pig-live-talk/udf.py' USING jython AS pyudf;
A = LOAD '/bkf-pig-live-talk/data.txt' USING PigStorage();
B = FOREACH A GENERATE $0, pyudf.get_length($0);
DUMP B;
Questions?
Let’s play a little?
Thanks!
Daniel Lopes
Computer Engineer
@dannyeuu
daniel@bankfacil.com.br
about.me/dannyeuu
We are hiring!
bankfacil.com.br/dev
