1. Apache Pig
● What is it?
● How does it work?
● Why use it?
● Pig Latin Data Types
● Pig Latin Maths
● Pig Latin Example
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
2. Pig – What is it?
● A high level language
● Used to analyse large data sets
● Used to create MapReduce jobs
● Abstracts definition of jobs
● Uses Pig Latin to define jobs
● Less code needed
● Compiles to MapReduce code
3. Pig – How does it work?
● Three ways to use it
– Grunt – Pig's interactive shell
– Write Pig Latin in a script file
– Embed Pig commands in another language
● Run modes
– Local mode – single machine
– Hadoop – run on a Hadoop/MapReduce cluster
● Creates MapReduce code automatically
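A minimal sketch of the script-file route. The file names wordlen.pig and input.txt are hypothetical; the -x flag selects the run mode, and the same statements could be typed one at a time at the grunt> prompt.

```pig
-- wordlen.pig (hypothetical script name)
-- Local mode:      pig -x local wordlen.pig
-- MapReduce mode:  pig -x mapreduce wordlen.pig

-- 'input.txt' is an assumed input file; TextLoader reads each line as one chararray
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- SIZE returns the number of characters in a chararray
lengths = FOREACH lines GENERATE line, SIZE(line) AS len;

-- DUMP triggers execution and prints the result to the console
DUMP lengths;
```

Nothing runs until DUMP or STORE is reached; at that point Pig compiles the statements into a job for the selected mode.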
4. Pig – Why use it?
● It is quicker to write than native MapReduce code
● It is data omnivorous – handles structured and unstructured data
● It is easy to learn
● It is widely used
● Minor performance loss
– Compared to native code
● It can be extended via user defined functions (UDFs)
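As a sketch of how a UDF is typically called: the jar myudfs.jar, the function myudfs.UPPER and the file student_data are all hypothetical placeholders, and the UDF itself would be written separately (for example in Java).

```pig
-- myudfs.jar and myudfs.UPPER are hypothetical; substitute your own jar and class
REGISTER myudfs.jar;

-- 'student_data' is an assumed tab-delimited input file
students = LOAD 'student_data' AS (name:chararray, age:int, gpa:float);

-- call the UDF by its fully qualified name, like a built-in function
upper_names = FOREACH students GENERATE myudfs.UPPER(name);

DUMP upper_names;
```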
5. Pig Latin Data Types
● int
● long
● float
● double
● chararray
● bytearray
● tuple
● bag
● map
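The types above usually appear in the schema of a LOAD statement. The sketch below assumes a hypothetical file students.txt whose fields match the declared schema; the field names are illustrative only.

```pig
-- 'students.txt' and every field name here are hypothetical
students = LOAD 'students.txt' AS (
    id:long,
    name:chararray,
    age:int,
    gpa:double,
    address:tuple(street:chararray, city:chararray),       -- ordered set of fields
    courses:bag{course:tuple(code:chararray, grade:float)}, -- collection of tuples
    extras:map[]                                            -- key/value pairs
);

-- print the schema that Pig has associated with the relation
DESCRIBE students;
```

A field declared without a type defaults to bytearray.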
6. Pig Latin Maths
Some of the built-in maths functions
● ABS
● CEIL
● EXP
● FLOOR
● LOG
● ROUND
● SIN
● TAN
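A short sketch of some of these functions applied per record. The file readings.txt and its two fields are assumptions made for illustration.

```pig
-- 'readings.txt' is an assumed tab-delimited file of (sensor, reading) pairs
readings = LOAD 'readings.txt' AS (sensor:chararray, reading:double);

-- apply several built-in maths functions to each record
stats = FOREACH readings GENERATE
    sensor,
    ABS(reading)   AS magnitude,
    CEIL(reading)  AS ceiling,
    FLOOR(reading) AS floored,
    ROUND(reading) AS rounded;

DUMP stats;
```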
7. Pig Latin Example
Example borrowed from Wikipedia
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
8. Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours you need to solve your problems