
Introduction to Pig


This is the slide deck that I used for my presentation at the Houston Hadoop Meetup Group.

Published in: Technology

  1. Apache Pig – Introduction and Hands-on
     Ravi Mutyala, Systems Architect, Hortonworks
     Twitter: @rmutyala
     © Hortonworks Inc. 2012
  2. Big Data Platforms
     [Chart: cost per TB vs. adoption; size of bubble = cost effectiveness of solution]
  3. Topics
     • What is Pig?
     • Why Pig?
     • Language Features
     • Labs
     • 0.10.0 Features
     • Features in the pipeline
     • Q&A
  4. What is Pig?
     • System for processing large unstructured data
     • Uses HDFS and MapReduce
     • Data flow language
     • Scripts describe a Directed Acyclic Graph (DAG) of operations
     • Started at Yahoo! Research
     • Joined the Apache Incubator in 2007
     • Graduated to a subproject of Hadoop in 2008
     • Top-level Apache project since 2010
  5. Pig Philosophy
     • Pigs eat anything
     • Pigs live anywhere
     • Pigs are domesticated animals
     • Pigs can fly
  6. Components
     • Pig Engine – Parser, Optimizer and distributed query execution
     • Grunt – CLI shell
     • Pig Latin – Procedural Language
  7. Why Pig?
     • High-level language that increases programmer productivity
     • Designed for parallel data flow
     • Reduces complexity by abstracting low-level Map and Reduce jobs and MapReduce job chaining
     • Can be run on a client/gateway machine with no configuration on the cluster
     • Multiple versions of Pig can co-exist as long as they are compatible with the cluster's Hadoop version
  8. Running Pig
     A Pig Latin script can be executed in 3 modes:
     • MapReduce: code executes as MapReduce jobs on a Hadoop cluster
         $ pig myscript.pig
     • Local: code executes locally in a single JVM using local data
         $ pig -x local myscript.pig
     • Interactive: running pig with no script starts the Grunt shell, where commands can be run interactively
  9. Grunt shell
     • fs -ls
     • fs -cat filename
     • fs -copyFromLocal localfile hdfsfile
  10. Data Types
      • Scalar Types
        – int, long, float, double, chararray, bytearray, boolean, datetime
      • Complex Types
        – Map: collection of key/value pairs, e.g. [name#alan, age#30]
        – Tuple: ordered set of values, e.g. (alan,40,engineering)
        – Bag: unordered collection of tuples, e.g. {(alan,40,engineering),(bob,45,sales)}
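
As a rough analogy only, the three complex types on the Data Types slide can be modeled with built-in Python containers. This is Python rather than Pig Latin and says nothing about how Pig stores data internally; the names `pig_map`, `pig_tuple`, `pig_bag`, and `same_bag` are made up for this sketch.

```python
# Plain-Python stand-ins for Pig's complex types (an analogy, not Pig's implementation).

# Map: key/value pairs, written [name#alan, age#30] in Pig.
pig_map = {"name": "alan", "age": 30}

# Tuple: an ordered set of values, written (alan,40,engineering) in Pig.
pig_tuple = ("alan", 40, "engineering")

# Bag: an unordered collection of tuples, written {(...),(...)} in Pig.
# A list is used here; the order of its elements carries no meaning.
pig_bag = [("alan", 40, "engineering"), ("bob", 45, "sales")]

# Fields of a tuple are addressed by position, like $0, $1, $2 in Pig.
name, age, dept = pig_tuple


def same_bag(a, b):
    """Two bags match if they hold the same tuples, regardless of order."""
    return sorted(a) == sorted(b)


print(pig_map["name"], age, same_bag(pig_bag, list(reversed(pig_bag))))
```

The list-based bag is a simplification: a real Pig bag can be far larger than memory, which a Python list cannot be.
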
  11. • Relations and a set of operations that work on relations
      • Schema for relations is optional
      • $0 … $n can be used to refer to fields in relations by position
      • null means the data is undefined
      • Any missing or invalid fields are loaded as null
  12. Input and Output
      • A = LOAD 'file' USING PigStorage(',') AS (data1:datatype1, data2:datatype2.. );
      • STORE A INTO 'file2' USING PigStorage(',');
      • DUMP A;
      • DESCRIBE A;
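
To make the loading rules from the last two slides concrete, here is a small Python model of what a delimiter-based load with a schema does to each line: split on the delimiter, cast each field, and fall back to null (`None` here) when a field is missing or invalid. The function `load_line` and the sample schema are hypothetical and exist only for this sketch; this is not Pig's loader code.

```python
# Illustrative model of LOAD ... USING PigStorage(',') AS (...); not Pig's actual code.

def load_line(line, schema, delimiter=","):
    """Split one text line into a tuple, casting each field per the schema.

    `schema` is a list of (name, cast) pairs. A field that is missing, empty,
    or cannot be cast becomes None, mirroring how Pig loads it as null.
    """
    raw = line.rstrip("\n").split(delimiter)
    fields = []
    for position, (_name, cast) in enumerate(schema):
        try:
            value = raw[position]
            fields.append(cast(value) if value != "" else None)
        except (IndexError, ValueError):
            fields.append(None)
    return tuple(fields)


# Hypothetical schema and input lines for the sketch.
schema = [("name", str), ("age", int), ("dept", str)]
lines = ["alan,40,engineering", "bob,n/a", "carol,38,sales"]

rows = [load_line(line, schema) for line in lines]

for row in rows:
    print(row[0], row[1])  # positional access, like $0 and $1
```

The second input line shows both failure cases at once: `n/a` is not a valid int and the third field is absent, so both load as null instead of aborting the load.
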
  13. Relational Operations
      • GROUP A BY age;
      • FOREACH B GENERATE $1 - $3;
      • FILTER A BY $1 > 10;
      • ORDER A BY $1 DESC, $2;
      • JOIN A BY $1, B BY $5;
      • JOIN A BY ($1, $5) LEFT OUTER, B BY ($2, $3);
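
The shape of the output of these operators is easier to see on a tiny data set. The following is a plain-Python model of GROUP, FILTER, ORDER, and JOIN on in-memory lists; it illustrates the results only and not how Pig executes them as MapReduce jobs. The relations `A` and `B` and the helper names are hypothetical, and the sketch ignores Pig's special handling of null keys.

```python
# Plain-Python model of what GROUP, FILTER, ORDER and JOIN produce (illustrative only).
from collections import defaultdict

# Hypothetical sample relations.
A = [("alan", 40, 1), ("bob", 45, 2), ("carol", 40, 1)]   # (name, age, dept_id)
B = [(1, "engineering"), (2, "sales"), (3, "support")]    # (dept_id, dept_name)


def group_by(relation, key):
    """GROUP: one output tuple per key, holding (key, bag of matching rows)."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key]].append(row)
    return [(k, bag) for k, bag in groups.items()]


def join(left, left_key, right, right_key, left_outer=False):
    """JOIN: concatenate rows whose keys match; LEFT OUTER pads with nulls."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    width = len(right[0]) if right else 0
    out = []
    for row in left:
        matches = index.get(row[left_key], [])
        if matches:
            out.extend(row + match for match in matches)
        elif left_outer:
            out.append(row + (None,) * width)
    return out


grouped = group_by(A, 1)                              # GROUP A BY $1
filtered = [row for row in A if row[1] > 40]          # FILTER A BY $1 > 40
ordered = sorted(A, key=lambda r: (-r[1], r[0]))      # ORDER A BY $1 DESC, $0
joined = join(A, 2, B, 0)                             # JOIN A BY $2, B BY $0

print(grouped)
print(joined)
```

Note that GROUP does not aggregate by itself: each group keeps its full bag of rows, and an aggregate such as a count would be computed in a following FOREACH.
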
  14. • LIMIT A 10;
      • SAMPLE A 0.1;
      • GROUP A BY $1 PARALLEL 10;
      • User Defined Functions (UDFs) and Piggybank
          register your_path_to_piggybank/piggybank.jar;
          divs = load 'NYSE_dividends';
          backwards = foreach divs generate org.apache.pig.piggybank.evaluation.string.Reverse($1);
  15. • Invoking static Java methods
      • FLATTEN
      • TOKENIZE
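
TOKENIZE and FLATTEN are usually seen together, for example when splitting lines of text into words. The sketch below models the pair in plain Python: `tokenize` turns a string into a bag of one-field tuples, and `flatten` un-nests a bag so that each inner tuple becomes its own output row. Both helper names and the sample data are hypothetical, and splitting on whitespace only is a simplification of Pig's TOKENIZE, which recognizes more delimiters.

```python
# Plain-Python model of TOKENIZE and FLATTEN (illustrative, not Pig's implementation).

def tokenize(text):
    """TOKENIZE: split a string into a bag of single-field tuples, one per word."""
    return [(word,) for word in text.split()]


def flatten(rows, bag_position):
    """FLATTEN on a bag: emit one output row per tuple in the bag (un-nesting)."""
    out = []
    for row in rows:
        before = row[:bag_position]
        bag = row[bag_position]
        after = row[bag_position + 1:]
        for inner in bag:
            out.append(before + inner + after)
    return out


# Hypothetical input: (line_id, text)
lines = [(1, "pigs eat anything"), (2, "pigs fly")]

tokenized = [(line_id, tokenize(text)) for line_id, text in lines]
words = flatten(tokenized, 1)

for row in words:
    print(row)
```

A row whose bag is empty produces no output rows in this model, so flattening can shrink a relation as well as grow it.
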
  16. 0.10.0 Features
      • Ruby UDFs
      • PigStorage with schemas
      • Additional UDF improvements
      • Language Improvements
        – Boolean type
        – otherwise
        – Maps, Bags and Tuples can be generated without UDFs
        – Register collection of jars
      • Performance Improvements
  17. Current work in progress
      • DateTime datatype
      • CUBE, ROLLUP and RANK operators
      • Native support for Windows
      • Lower memory footprint
  18. References
      • Labs are from – –
      • 0.10.0 Features and current WIP – by Alan Gates
  19. Hortonworks Training
      The expert source for Apache Hadoop training & certification
      Role-based Developer and Administration training
        – Coursework built and maintained by the core Apache Hadoop development team
        – The “right” course, with the most extensive and realistic hands-on materials
        – Provide an immersive experience into real-world Hadoop scenarios
        – Public and Private courses available
      Comprehensive Apache Hadoop Certification
        – Become a trusted and valuable Apache Hadoop expert
  20. Thank You!
      Questions & Answers
      Ravi Mutyala, Systems Architect, Hortonworks
      Twitter: @rmutyala