Big Data Warehousing: Pig vs. Hive Comparison

At a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. The presentation included a comparison of Hive and Pig. For more information on our services or this presentation, please visit www.casertaconcepts.com or contact us at info (at) casertaconcepts.com.

http://www.casertaconcepts.com

Published in: Technology

Transcript of "Big Data Warehousing: Pig vs. Hive Comparison"

  1. Big Data Warehousing Meetup
     Today’s Topic: Exploring Big Data Analytics Techniques with Datameer
     Sponsored By:
  2. WELCOME!
     Joe Caserta, Founder & President, Caserta Concepts
  3. Agenda
     7:00  Networking: Grab a slice of pizza and a drink...
     7:15  Welcome: Joe Caserta, President, Caserta Concepts; Author, Data Warehouse ETL Toolkit. About the Meetup and about Caserta Concepts
     7:30  Pig and Hive: Elliott Cordo, Principal Consultant, Caserta Concepts. Walkthrough of these powerful native Hadoop tools
     7:50  Datameer: Adam Gugliciello, Solutions Engineer, Datameer
     8:10  More Networking
     9:00  Tell us what you’re up to…
  4. About BDW Meetup
     • Big Data is a complex, rapidly changing landscape
     • We want to share our stories and hear about yours
     • Great networking opportunity for like-minded data nerds
     • Opportunities to collaborate on exciting projects
     • Next BDW Meetup: April 22
     • Topic: Intro to NoSQL Databases
  5. About Caserta Concepts
     Focused Expertise
     • Big Data Analytics
     • Data Warehousing
     • Business Intelligence
     • Strategic Data Ecosystems
     Industries Served
     • Financial Services
     • Healthcare / Insurance
     • Retail / eCommerce
     • Digital Media / Marketing
     • K-12 / Higher Education
     Founded in 2001
     • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
  6. Client Portfolio
     • Finance & Insurance
     • Retail/eCommerce & Manufacturing
     • Education & Services
  7. Expertise & Offerings
     • Strategic Roadmap/Assessment/Consulting
     • Big Data Analytics
     • Data Warehousing/ETL/Data Integration
     • BI/Visualization/Analytics
     • Master Data Management
  8. Opportunities
     Does this word cloud excite you?
     Speak with us about our open positions: jobs@casertaconcepts.com
  9. Contacts
     • Joe Caserta, President & Founder, Caserta Concepts. P: (855) 755-2246 x227. E: joe@casertaconcepts.com
     • Erik Laurence, VP Marketing, Caserta Concepts. P: (855) 755-2246 x528. E: erik@casertaconcepts.com
     • Elliott Cordo, Principal Consultant, Caserta Concepts. P: (855) 755-2246 x267. E: elliott@casertaconcepts.com
     General: info@casertaconcepts.com, 1(855) 755-2246, www.casertaconcepts.com
  10. ANALYZING DATA: PIG AND HIVE
      Elliott Cordo, Principal Consultant, Caserta Concepts
  11. Big Data Analysis
      • Let’s review some tools for analyzing and processing Big Data
      • We will go over some simple use cases and point out what is interesting about them
      • Develop a point of view of what each one is well suited for
  12. Big Data Analysis – Map Reduce?
      Distributed programming framework – Divide and Conquer!
      • Master divides work into digestible chunks and distributes them to worker nodes -> MAP
      • Work from the nodes is then collected by the master and combined to form an answer -> REDUCE
      A powerful tool for solving interesting computational problems at scale
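
The divide-and-conquer idea on slide 12 can be sketched in a few lines of plain Python. This is only an illustration of the model, not Hadoop code: the function names (`split_into_chunks`, `map_chunk`, `reduce_counts`) and the sample data are assumptions made up for this sketch, and the "workers" run sequentially in a single process.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Master step: divide the work into roughly equal chunks."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """MAP: each worker counts the occupations in its own chunk."""
    return Counter(chunk)


def reduce_counts(left, right):
    """REDUCE: combine two partial results into one."""
    return left + right


# Hypothetical sample data, standing in for a large input file.
occupations = ["student", "engineer", "student", "writer", "engineer", "student"]

chunks = split_into_chunks(occupations, 3)
partials = [map_chunk(c) for c in chunks]  # on a cluster, these run in parallel
totals = reduce(reduce_counts, partials, Counter())
print(dict(totals))
```

On a real cluster the chunks live on different nodes and the framework handles distribution, shuffling and failure recovery; the sketch only shows the shape of the computation.
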
  13. HELP
      • We are doing low-level language coding to perform low-level operations
      • For productivity we need higher level tools!
      • We will get help from a few animals!
      [Diagram: nodes N1, N2, N3, N4, N5 on the Hadoop Distributed File System (HDFS)]
  14. HIVE
      • The Hadoop “Data Warehouse”
      • HiveQL is a SQL-like interface that allows you to abstract a “relational-db like” structure on top of non-relational or unstructured data
        • Flat Files, JSON, Web logs
        • HBase, Cassandra, other NoSQL stores like MongoDB
      • Thanks to ODBC/JDBC drivers some conventional BI tools can interact with Hive
      • Ability to integrate custom programming, mappers, reducers
  15. HIVE
      But don’t get too excited!
      • Hive is not a database, especially in terms of optimizations.
      • SQL is interpreted into MapReduce jobs; expect even simple queries to take around a minute or more. Start query, go get coffee
      • But now that expectations have been set, it’s still a very useful tool
  16. HIVE DDL – Create and load a table
      hive> create table user_movie_ratings(
          > user_id int,
          > movie_id int,
          > rating int,
          > time_unix_ts string)
          > row format delimited
          > fields terminated by '\t'
          > stored as textfile;
      OK
      Time taken: 0.395 seconds
      hive> load data inpath '/user/hive/staging/data/u.data' overwrite into table user_movie_ratings;
      Loading data to table default.user_movie_ratings
      Deleted hdfs://localhost:54310/user/hive/warehouse/user_movie_ratings
      Table default.user_movie_ratings stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 1979173, raw_data_size: 0]
      OK
      Time taken: 0.474 seconds
      Note: Looks like a typical table declaration, except we are specifying the ingested file format
  17. HIVE DDL – Create an external table
      hive> create external table user (
          > user_id int,
          > age int,
          > gender string,
          > occupation string,
          > postal_code int )
          > row format delimited fields terminated by '|'
          > location '/user/hive/staging/user';
      OK
      Time taken: 0.096 seconds
      Note: This time we don’t want Hive to own this data’s lifecycle
  18. HIVE – YAY SQL!
      hive> select occupation, count(1)
          > from user_movie_ratings m
          > join user u on u.user_id=m.user_id
          > group by occupation;
      Total MapReduce jobs = 2
      Launching Job 1 out of 2...
      Total MapReduce CPU Time Spent: 47 seconds 170 msec
      OK
      administrator  7479
      artist         2308
      doctor         540
      educator       9442
      engineer       8175
      entertainment  2095
      ….
      retired        1609
      salesman       856
      scientist      2058
      student        21957
      technician     3506
      writer         5536
      Time taken: 110.331 seconds
      Note: Hmmm..
  19. PIG
      • Powerful high-level programming language
      • SQL-ish, small learning curve for SQL and procedural programmers
      • Excellent for data transformation, ETL
      • Not meant to be an ad-hoc query tool, happy with doing grunt work
      • Plenty of supported file formats and databases, plus the ability to create custom UDFs
  20. PIG Example
      grunt> lens_users = load '/user/movie_lens/u.user' using PigStorage('|') as (user_id:int, age:int, gender:chararray, occupation:chararray, postal_code:int);
      grunt> lens_data = load '/user/movie_lens/u.data' using PigStorage('\t') as (user_id:int, movie_id:int, rating:int, time_unix_ts:chararray);
      grunt> joined = join lens_users by user_id, lens_data by user_id;
      grunt> grouped = group joined by (occupation);
      grunt> results = FOREACH grouped GENERATE COUNT_STAR(joined), *;
      grunt> store results into '/user/movie_lens_user_summary';
      Note: Interesting, we are doing our aggregate functions after grouping
  21. PIG - Results
      Grouping in PIG is a fair deviation from SQL -> original elements are preserved in a bag
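
To make the point on slide 21 concrete, here is a rough Python analogy of what Pig's GROUP followed by FOREACH ... GENERATE does. It is a sketch under stated assumptions, not Pig itself: the `group_by` helper and the sample rows are hypothetical, and a Python list stands in for a Pig bag.

```python
from collections import defaultdict

# Hypothetical joined rows: (user_id, occupation, movie_id, rating)
joined = [
    (1, "student", 10, 4),
    (1, "student", 11, 5),
    (2, "engineer", 10, 3),
]


def group_by(rows, key_index):
    """Mimic Pig's GROUP: map each key to a 'bag' holding the full original rows."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return dict(bags)


grouped = group_by(joined, key_index=1)

# Mimic FOREACH grouped GENERATE COUNT_STAR(joined), *:
# the aggregate is applied to each bag after grouping, and the bag is kept.
results = [(len(bag), key, bag) for key, bag in grouped.items()]

for count, key, bag in results:
    print(count, key, bag)
```

In SQL the rows inside each group are only visible through aggregate functions; here, as in Pig, every group still carries its original rows, which is why the count is computed as a separate step.
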
  22. Summary
      Hive:
      • Helpful for ETL
      • Very good for ad-hoc analysis. Not necessarily suited for front-end users, but definitely helpful for data analysts
      • Directly leverages SQL expertise!!
      PIG:
      • Great for ETL
      • Powerful transformation and processing capabilities
      • SQL-like, but different in many ways; will take some time to master
  23. Big Data Warehousing - Meetup
