Introduction to Pig

This is the slide deck that I used for my presentation at the Houston Hadoop Meetup Group.

Published in: Technology

  • 1. Apache Pig – Introduction and Hands-on
      Ravi Mutyala, Systems Architect, Hortonworks
      Twitter: @rmutyala
      © Hortonworks Inc. 2012
  • 2. Big Data Platforms
      [Chart: cost per TB vs. adoption; size of bubble = cost effectiveness of solution. Source:]
  • 3. Topics
      • What is Pig?
      • Why Pig?
      • Language Features
      • Labs
      • 0.10.0 Features
      • Features in the pipeline
      • Q & A
  • 4. What is Pig?
      • System for processing large unstructured data
      • Uses HDFS and MapReduce
      • Data flow language
      • Directed Acyclic Graph (DAG)
      • Started at Yahoo! Research
      • Joined the Apache Incubator in 2007
      • Graduated to a subproject of Hadoop in 2008
      • Top-level Apache project since 2010
  • 5. Pig Philosophy
      • Pigs eat anything
      • Pigs live anywhere
      • Pigs are domesticated animals
      • Pigs can fly
  • 6. Components
      • Pig Engine – Parser, Optimizer and distributed query execution
      • Grunt – CLI shell
      • Pig Latin – Procedural language
  • 7. Why Pig?
      • High-level language that increases programmer productivity
      • Designed for parallel data flow
      • Reduces complexity by abstracting low-level Map and Reduce jobs and MapReduce job chaining
      • Can be run on a client/gateway machine with no configuration on the cluster
      • Multiple versions of Pig can co-exist as long as they are compatible with the Hadoop version
  • 8. Running Pig
      A Pig Latin script executes in 3 modes:
      • MapReduce: code executes as MapReduce on a Hadoop cluster
          $ pig myscript.pig
      • Local: code executes locally in a single JVM using local data
          $ pig -x local myscript.pig
      • Interactive: running pig with no script starts the Grunt shell, where commands can be run interactively
  • 9. Grunt shell
      • fs -ls
      • fs -cat filename
      • fs -copyFromLocal localfile hdfsfile
  • 10. Data Types
      • Scalar types – int, long, float, double, chararray, bytearray, boolean, datetime
      • Complex types
          – Map: collection of key-value pairs – [name#alan, age#30]
          – Tuple: ordered set of values – (alan,40,engineering)
          – Bag: unordered collection of tuples – {(alan,40,engineering),(bob,45,sales)}
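The types above can be declared in a load schema. The following is a minimal sketch, not taken from the deck: the file name `employees.txt` and every field name are assumptions made up for illustration.

```pig
-- Hypothetical file and field names, assumed for illustration only.
-- The default loader (PigStorage) expects tab-separated fields.
employees = LOAD 'employees.txt' AS (
    name:chararray,
    age:int,
    details:map[],
    address:tuple(street:chararray, zip:chararray),
    skills:bag{t:tuple(skill:chararray)}
);

-- '#' looks up a map value by key; '.' projects a field out of a tuple or bag.
projected = FOREACH employees GENERATE name, details#'dept', address.zip, skills.skill;
DUMP projected;
```

In the input text, a map is written as [dept#engineering], a tuple as (1 Main St,77001), and a bag as {(java),(pig)}, matching the notation on the slide.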
  • 11.
      • Relations and a set of operations that work on relations
      • Schema for relations is optional
      • $0 … $n can be used for fields in relations
      • null means the data is undefined
      • Any missing or invalid fields are loaded as null
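A short sketch of how optional schemas, positional references, and nulls might look in practice. The file `scores.txt` and its fields are hypothetical and do not come from the deck.

```pig
-- 'scores.txt' is a hypothetical tab-separated file.
-- Without a schema, fields are addressed by position and default to bytearray.
raw = LOAD 'scores.txt';
first_two = FOREACH raw GENERATE $0, $1;

-- With a schema, values that cannot be cast to the declared type load as null.
typed = LOAD 'scores.txt' AS (name:chararray, score:int);
valid = FILTER typed BY score IS NOT NULL;
DUMP valid;
```

Filtering with IS NOT NULL early is one way to keep missing or malformed rows out of later aggregations.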
  • 12. Input and Output
      • A = LOAD 'file' USING PigStorage(',') AS (data1:datatype1, data2:datatype2, ...);
      • STORE A INTO 'file2' USING PigStorage(',');
      • DUMP A;
      • DESCRIBE A;
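The statements above can be combined into a small script. This is a sketch under assumptions: `students.csv`, `students_out`, and the field names are placeholders chosen for illustration.

```pig
-- 'students.csv' and the field names are hypothetical.
students = LOAD 'students.csv' USING PigStorage(',')
           AS (name:chararray, age:int, gpa:float);

-- DESCRIBE prints the schema without running a job.
DESCRIBE students;

-- DUMP runs the pipeline and prints the tuples to the console.
DUMP students;

-- STORE runs the pipeline and writes tab-separated output to a directory.
STORE students INTO 'students_out' USING PigStorage('\t');
```

DUMP is convenient for small test inputs, while STORE is the usual choice for real output because results go to the file system rather than the console.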
  • 13. Relational Operations
      • GROUP A BY age;
      • FOREACH A GENERATE $1 - $3;
      • FILTER A BY $1 > 10;
      • ORDER A BY $1 DESC, $2;
      • JOIN A BY $1, B BY $5;
      • JOIN A BY ($1, $5) LEFT OUTER, B BY ($2, $3);
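The operators listed on this slide are usually chained, with each statement assigning its result to a new alias. The sketch below assumes two hypothetical comma-separated inputs, `emp.csv` and `dept.csv`; none of the names come from the deck.

```pig
-- Hypothetical inputs: file names and fields are assumptions for illustration.
emp  = LOAD 'emp.csv'  USING PigStorage(',')
       AS (name:chararray, age:int, dept_id:int, salary:double);
dept = LOAD 'dept.csv' USING PigStorage(',')
       AS (dept_id:int, dept_name:chararray);

-- Keep only the rows of interest before grouping.
older   = FILTER emp BY age > 30;

-- After GROUP, each tuple is (group, {bag of matching 'older' tuples}).
by_dept = GROUP older BY dept_id;
stats   = FOREACH by_dept GENERATE group AS dept_id,
                                   COUNT(older) AS n,
                                   AVG(older.salary) AS avg_salary;

-- Left outer join keeps departments' stats even when no name is found.
joined  = JOIN stats BY dept_id LEFT OUTER, dept BY dept_id;

-- After a join, fields are disambiguated with the alias::field form.
sorted  = ORDER joined BY stats::avg_salary DESC, dept::dept_name;
DUMP sorted;
```

Note that fields are referenced by bare name or position inside an operator (for example `BY age`), and the `alias::field` form is only needed once a join has produced duplicate field names.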
  • 14.
      • LIMIT A 10;
      • SAMPLE A 0.1;
      • GROUP A BY $1 PARALLEL 10;
      • User Defined Functions and Piggybank
          register your_path_to_piggybank/piggybank.jar;
          divs = load 'NYSE_dividends';
          backwards = foreach divs generate org.apache.pig.piggybank.evaluation.string.Reverse($1);
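A slightly fuller sketch of the same Piggybank example, with LIMIT, SAMPLE, and PARALLEL added. The schema given for `NYSE_dividends` is an assumption, and `your_path_to_piggybank` remains the placeholder from the slide.

```pig
register your_path_to_piggybank/piggybank.jar;

-- DEFINE gives the fully qualified UDF a short alias.
define Reverse org.apache.pig.piggybank.evaluation.string.Reverse();

-- Assumed schema for the dividends file; adjust to the real data.
divs = load 'NYSE_dividends'
       as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);

-- Keep roughly 10% of the rows, then at most 10 of those.
sampled = sample divs 0.1;
few     = limit sampled 10;
backwards = foreach few generate Reverse(symbol);
dump backwards;

-- PARALLEL sets the number of reducers for this operator.
grpd = group divs by symbol parallel 10;
```

SAMPLE is probabilistic, so the rows returned can differ from run to run.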
  • 15.
      • Invoking static Java methods
      • FLATTEN
      • TOKENIZE
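The three features on this slide can be illustrated with a word count. This is a sketch, assuming a hypothetical text file `input.txt`; the alias `hex` and the choice of `java.lang.Integer.toHexString` are only examples of a static Java method.

```pig
-- 'input.txt' is a hypothetical file with one line of text per record.
lines = LOAD 'input.txt' AS (line:chararray);

-- TOKENIZE splits a line into a bag of words; FLATTEN turns that bag
-- into one output record per word.
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;

-- Built-in invoker for static Java methods: method name, then argument types.
DEFINE hex InvokeForString('java.lang.Integer.toHexString', 'int');
in_hex = FOREACH counts GENERATE word, hex((int)n);
DUMP in_hex;
```

The invoker avoids writing a UDF for a one-off call. Because it locates the method by reflection, a dedicated UDF may be the better choice for heavy use.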
  • 16. 0.10.0 Features
      • Ruby UDFs
      • PigStorage with schemas
      • Additional UDF improvements
      • Language improvements
          – Boolean type
          – otherwise
          – Maps, Bags and Tuples can be generated without UDFs
          – Register collection of jars
      • Performance improvements
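A rough sketch of how some of these 0.10.0 language features might be used together. The file names and fields are hypothetical, and reading "otherwise" as the OTHERWISE branch of SPLIT is an assumption about what the slide refers to.

```pig
-- 'users.csv' and its fields are hypothetical.
users = LOAD 'users.csv' USING PigStorage(',')
        AS (name:chararray, age:int, active:boolean);

-- OTHERWISE catches every record that matches none of the other conditions.
SPLIT users INTO current IF active == true, inactive OTHERWISE;

-- Type construction without UDFs: ( ) builds a tuple, { } a bag, [ ] a map.
shaped = FOREACH current GENERATE (name, age), {(name, age)}, [name, age];

-- The '-schema' option asks PigStorage to write the schema next to the data.
STORE shaped INTO 'users_shaped' USING PigStorage(',', '-schema');
```

Storing the schema alongside the data lets a later LOAD with PigStorage pick up field names and types without repeating the AS clause.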
  • 17. Current work in progress
      • DateTime datatype
      • CUBE, ROLLUP and RANK operators
      • Native support for Windows
      • Lower memory footprint
  • 18. References
      • Labs are from
          –
          –
      • 0.10.0 Features and current WIP
          – by Alan Gates
  • 19. Hortonworks Training
      The expert source for Apache Hadoop training & certification
      • Role-based Developer and Administration training
          – Coursework built and maintained by the core Apache Hadoop development team
          – The "right" course, with the most extensive and realistic hands-on materials
          – Provide an immersive experience into real-world Hadoop scenarios
          – Public and private courses available
      • Comprehensive Apache Hadoop Certification
          – Become a trusted and valuable Apache Hadoop expert
  • 20. Thank You! Questions & Answers
      Ravi Mutyala, Systems Architect, Hortonworks
      Twitter: @rmutyala