
Hive and Pig for .NET User Group


Introduction to Pig and Hive

Published in: Software, Technology


  1. Pig and Hive
     Csaba Toth
     Central California .NET User Group Meeting
     Date: April 17th, 2014
     Location: Bitwise Industries, Fresno
  2. Agenda
     • A short recap of Hadoop and MapReduce
     • Pig and Hive
     • Recommendation engine
     • Demos 1: Exercises with on-premise Hadoop emulator
     • Demos 2: Azure HDInsight
  3. Hadoop
     • Hadoop is an open-source software framework that supports data-intensive distributed applications.
     • It has two main pieces:
       – Storing large amounts of data: HDFS, the Hadoop Distributed File System
       – Processing large amounts of data: an implementation of the MapReduce programming model
  4. Hadoop
     • All of this is done in a cost-effective way: Hadoop manages a cluster of commodity computers.
     • The cluster is composed of a single master node and multiple worker nodes.
     • It is written in Java and runs on JVMs.
  5. HDFS visually
     [Diagram: a name node holding the metadata store, and three data nodes (Node 1, Node 2, Node 3), each shown with Block A and Block B.]
  6. Job / task management
     [Diagram: the name node with the Jobtracker, and three data nodes, each with a Tasktracker running one map and one reduce task (Map 1 / Reduce 1, Map 2 / Reduce 2, Map 3 / Reduce 3). Labeled: heartbeat signals and communication.]
  7. MapReduce
     • Hadoop leverages the functional programming model of map/reduce.
     • It moves away from shared resources and the related synchronization and contention issues.
     • It is thus inherently scalable and suited to processing large data sets with distributed computing on clusters of nodes.
     • The goal of MapReduce is to break huge data sets into smaller pieces, distribute those pieces to various worker nodes, and process the data in parallel.
     • Hadoop leverages a distributed file system to store the data on various nodes.
  8. MapReduce
     • It is about two functions, map and reduce, with a shuffle step between them:
     1. Map step:
        – Processes input key/value pairs and generates a set of intermediate key/value pairs from them
     2. Shuffle step:
        – Groups all intermediate values associated with the same intermediate key into one set
     3. Reduce step:
        – Processes the intermediate values associated with the same intermediate key and produces a set of values based on the groups (usually some kind of aggregate)
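
The three steps can be sketched in a few lines of plain Python. This is a single-process illustration of the word-count idea only, not Hadoop code; the function names and the sample sentences are made up for the example.

```python
from collections import defaultdict

def map_step(line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_step(pairs):
    """Shuffle: group all intermediate values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_step(key, values):
    """Reduce: aggregate the grouped values, here by summing the counts."""
    return key, sum(values)

def word_count(lines):
    # On a real cluster the map and reduce calls would run in parallel on worker nodes.
    intermediate = [pair for line in lines for pair in map_step(line)]
    grouped = shuffle_step(intermediate)
    return dict(reduce_step(key, values) for key, values in grouped.items())

# Illustrative input, not taken from the slides.
counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(counts)
```

Because each map call depends only on its own line and each reduce call only on its own key, the work can be spread across nodes without shared state.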
  9. Word count
     [Figure: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png]
  10. Map, Shuffle, and Reduce
     [Figure: https://mm-tom.s3.amazonaws.com/blog/MapReduce.png]
  11. Hadoop Ecosystem and MapReduce
     • Writing MapReduce jobs in Java, C#, or other languages is useful, but it is the hard way
     • Several domain-specific, higher-level languages and frameworks exist
       – They let you express complicated tasks much more simply and concisely than in Java
       – These languages transparently translate everything into MapReduce jobs under the hood
     • We’ll see two examples: Hive and Pig
  12. Hive and Pig
     • Hive (http://hive.apache.org/)
       – Provides an SQL-like approach
       – Works best if the input data conforms to some schema, so it can be consumed (SQL needs tables with columns)
       – Good for someone coming from an SQL background
     • Pig (http://pig.apache.org/)
       – The syntax is closer to a programming language
       – It defines a series of transformations, projecting one schema into another
       – Also works best if the data is not totally free-form and has some kind of schema
  13. Hadoop Ecosystem / Architecture
     [Diagram; labels in the order they appear:]
     • Data sources: Log Data, RDBMS
     • Data Integration Layer: Flume, Sqoop
     • Storage Layer (HDFS)
     • Computing Layer (MapReduce)
     • Advanced Query Engine (Hive, Pig)
     • Data Mining (Pegasus, Mahout)
     • Index, Searches (Lucene)
     • DB drivers (Hive driver)
     • Presentation Layer: Web Browser (JS)
  14. Simple recommendation engine
     • Sites such as Amazon.com and Netflix.com use complex algorithms
     • But the underlying concept is simple: finding correlations in the data
     • Pearson coefficient (Excel's CORREL function)
  15. Pearson coefficient

     | Pearson product-moment correlation coefficient value | Comments |
     |---|---|
     | -1 | Perfectly correlated data, but as one rises the other decreases |
     | 0 | Uncorrelated data |
     | +1 | Perfectly correlated data |

     • DEMO
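
For reference, the standard textbook definition of the coefficient for n paired observations is given here. The symbols are notation chosen for this note rather than taken from the slides: x_i and y_i are the paired values, and x̄ and ȳ are their means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two series tend to deviate from their means in the same direction and negative when they deviate in opposite directions; the denominator scales the result into the range from -1 to +1.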
  16. Ratings data

     | Name | The Lord of the Rings | The Chronicles of Narnia |
     |---|---|---|
     | Jack | 2 | 3 |
     | Mark | 4 | 4.5 |
     | Albert | 4 | 3.5 |
     | John | 5 | 5 |

     • Pearson Correlation Coefficient: 0.8705715
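
The coefficient for this small table can be checked with a short Python sketch. It is a plain, non-parallel computation written for this note (the talk's own demo was in C# and is not reproduced here); the function and variable names are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / sqrt(var_x * var_y)

# Ratings from the table: Jack, Mark, Albert, John.
lord_of_the_rings = [2, 4, 4, 5]
chronicles_of_narnia = [3, 4.5, 3.5, 5]

r = pearson(lord_of_the_rings, chronicles_of_narnia)
print(round(r, 7))
```

The result agrees with the value on the slide to the digits shown, which suggests the slide used this same definition.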
  17. Ratings data

     | Name of movie critic | Name of movie | Rating |
     |---|---|---|
     | Lisa Rose | Lady in the Water | 2.5 |
     | Lisa Rose | Snakes on a Plane | 3.5 |
     | Lisa Rose | Just My Luck | 3 |
     | Lisa Rose | Superman Returns | 3.5 |
     | Lisa Rose | You Me and Dupree | 2.5 |
     | Lisa Rose | The Night Listener | 3 |
     | Gene Seymour | Lady in the Water | 3 |
     | Gene Seymour | Snakes on a Plane | 3.5 |
     | Gene Seymour | Just My Luck | 1.5 |
     | Gene Seymour | Superman Returns | 5 |
     | Gene Seymour | The Night Listener | 3 |
  18. Simple recommendation
     • DEMO
       – C# implementation
       – Simple naïve algorithm
       – Non-parallel, not Hadoop
  19. Pig
     • A data-flow language
     • Expresses the processing as a series of transformations
     • Steps are translated into MapReduce jobs
     • It can be thought of as similar to LINQ
     • We’ll learn it by example
       – Pig’s command line shell: grunt
       – Pig’s language: Pig Latin
  20. Pig
     • Load and store:
       – Can load/store data from/to HDFS
     • Relations
       – The transformations are performed on ‘relations’
       – This is what Pig calls its collections; do not confuse it with traditional relational DB terminology!
       – Think of a relation as a table with rows and columns of data
       – When grouped, relations can contain associative key-value pairs
  21. Pig
     • Joins
       – Can accomplish joins in a conceptually intuitive manner (using a common key)
     • Filter
       – Can apply filters to data. A predicate must be specified
     • Projection
       – Can project from an existing collection, that is, form a new collection much as an SQL SELECT does. This is what Pig’s “GENERATE” does (used together with FOREACH)
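
The semantics of these three operations can be mimicked in plain Python on lists of tuples. This is only an analogy for what the Pig operators do, not Pig Latin; the helper names are made up, and the `genres` relation is an illustrative second data set that does not appear in the slides.

```python
from collections import defaultdict

# A 'relation' modeled as a list of tuples: (critic, movie, rating).
ratings = [
    ("Lisa Rose", "Snakes on a Plane", 3.5),
    ("Lisa Rose", "Just My Luck", 3.0),
    ("Gene Seymour", "Just My Luck", 1.5),
    ("Gene Seymour", "Superman Returns", 5.0),
]

# Illustrative second relation (not from the slides): (movie, genre).
genres = [
    ("Snakes on a Plane", "action"),
    ("Just My Luck", "comedy"),
    ("Superman Returns", "action"),
]

def filter_rows(relation, predicate):
    """Filter: keep only the rows for which the predicate holds."""
    return [row for row in relation if predicate(row)]

def project(relation, *indexes):
    """Projection: build a new relation from the selected columns."""
    return [tuple(row[i] for i in indexes) for row in relation]

def join(left, left_key, right, right_key):
    """Join: pair rows whose key columns match, concatenating their fields."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]

liked = filter_rows(ratings, lambda row: row[2] >= 3.0)
critic_movie = project(liked, 0, 1)
liked_with_genre = join(liked, 1, genres, 0)
print(liked_with_genre)
```

Each step takes a relation and returns a new one, which is the data-flow style the slides describe.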
  22. Pig
     • Grouping
       – Can group data by one or more keys. Once grouped, you can maintain the hierarchical structure in the relation throughout the transformations. Projections can be made, or sometimes you can flatten some of the hierarchy.
     • Dump
       – The DUMP statement outputs the contents of a relation to the console. Useful when experimenting in the Pig shell
     • Extensible
       – UDFs: User Defined Functions
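
Grouping and flattening can be pictured with a similar Python analogy: a grouped relation holds one row per key with the matching rows nested inside, and flattening un-nests them again. The helper names are illustrative and the code is not Pig Latin.

```python
from collections import defaultdict

# (critic, movie, rating) rows taken from the ratings slide.
ratings = [
    ("Lisa Rose", "Lady in the Water", 2.5),
    ("Lisa Rose", "Snakes on a Plane", 3.5),
    ("Gene Seymour", "Lady in the Water", 3.0),
    ("Gene Seymour", "Snakes on a Plane", 3.5),
]

def group_by(relation, key_index):
    """Group: one (key, nested_rows) entry per distinct key value."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return list(groups.items())

def flatten(grouped):
    """Flatten: un-nest the grouped rows, prefixing each with its group key."""
    return [(key,) + row for key, rows in grouped for row in rows]

grouped = group_by(ratings, 0)

# A projection over the grouped relation: average rating per critic.
averages = [(critic, sum(r[2] for r in rows) / len(rows))
            for critic, rows in grouped]

flat = flatten(grouped)
print(averages)
```

Keeping the nested rows around until the projection is what lets an aggregate such as the average be computed per group.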
  23. Simple recommendation
     • DEMO
       – Pig implementation
       – Local pseudo cluster (HDInsight on-premise)
  24. Hive DEMO
     • Analyzing the previous meeting’s wordcount (warpeace) results
     • Analyzing the recommendation engine results
  25. HDInsight
     • Microsoft’s Hadoop PaaS solution in the cloud
     • Based on the Hortonworks Hadoop implementation
     • A new Azure portal is coming, currently in preview: http://azure.microsoft.com/en-us/services/preview/
     • Build conference: 40 new functions in Azure
     • Latest addition: ISS, Azure Intelligent Systems Service (https://connect.microsoft.com/site1132/)
     • Analytics Platform System (APS): an evolutionary combination of SQL Server PDW and Hadoop
  26. References
     • Daniel Jebaraj: Ignore HDInsight at Your Own Peril: Everything You Need to Know
     • Tom White: Hadoop: The Definitive Guide, 3rd Edition, Yahoo Press
     • Lynn Langit’s various presentations and YouTube videos
     • Dattatrey Sindol: Big Data Basics - Part 1 - Introduction to Big Data
     • Bruno Terkaly’s presentations (for example Hadoop on Azure: Introduction)
  27. Thanks for your attention!
  28. Hadoop vs RDBMS

     | | Hadoop / MapReduce | RDBMS |
     |---|---|---|
     | Size of data | Petabytes | Gigabytes |
     | Integrity of data | Low | High (referential, typed) |
     | Data schema | Dynamic | Static |
     | Access method | Batch | Interactive and batch |
     | Scaling | Linear | Nonlinear (worse than linear) |
     | Data structure | Unstructured | Structured |
     | Normalization of data | Not required | Required |
     | Query response time | Has latency (due to batch processing) | Can be near immediate |
