Big Data Warehousing: Pig vs. Hive Comparison


At a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. The presentation included a comparison of Hive and Pig. For more information on our services or this presentation, please visit or contact us at info (at)

Published in: Technology

  1. 1. Big Data Warehousing Meetup. Today’s Topic: Exploring Big Data Analytics Techniques with Datameer. Sponsored By:
  2. 2. WELCOME! Joe Caserta Founder & President, Caserta Concepts
  3. 3. Agenda
     7:00 Networking: Grab a slice of pizza and a drink...
     7:15 Welcome: Joe Caserta, President, Caserta Concepts; Author, Data Warehouse ETL Toolkit. About the Meetup and about Caserta Concepts
     7:30 Pig and Hive: Elliott Cordo, Principal Consultant, Caserta Concepts. Walkthrough of these powerful native Hadoop tools
     7:50 Datameer: Adam Gugliciello, Solutions Engineer, Datameer
     8:10 - More Networking
     9:00 Tell us what you’re up to…
  4. 4. About BDW Meetup
     • Big Data is a complex, rapidly changing landscape
     • We want to share our stories and hear about yours
     • Great networking opportunity for like-minded data nerds
     • Opportunities to collaborate on exciting projects
     • Next BDW Meetup: April 22
     • Topic: Intro to NoSQL Databases
  5. 5. About Caserta Concepts
     Industries Served: • Financial Services • Healthcare / Insurance • Retail / eCommerce • Digital Media / Marketing • K-12 / Higher Education
     Focused Expertise: • Big Data Analytics • Data Warehousing • Business Intelligence • Strategic Data Ecosystems
     Founded in 2001
     • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
  6. 6. Client Portfolio: Finance & Insurance; Retail/eCommerce & Manufacturing; Education & Services
  7. 7. Expertise & Offerings: Strategic Roadmap/Assessment/Consulting; Big Data Analytics; Data Warehousing/ETL/Data Integration; BI/Visualization/Analytics; Master Data Management
  8. 8. Opportunities. Does this word cloud excite you? Speak with us about our open positions:
  9. 9. Contacts
     Joe Caserta, President & Founder, Caserta Concepts. P: (855) 755-2246 x227 E:
     Erik Laurence, VP Marketing, Caserta Concepts. P: (855) 755-2246 x528 E:
     Elliott Cordo, Principal Consultant, Caserta Concepts. P: (855) 755-2246 x267 E:
     1 (855) 755-2246
  10. 10. ANALYZING DATA: PIG AND HIVE Elliott Cordo Principal Consultant, Caserta Concepts
  11. 11. Big Data Analysis
      • Let’s review some tools for analyzing and processing Big Data
      • We will go over some simple use cases and point out what is interesting about them
      • Develop a point of view on what each one is well suited for
  12. 12. Big Data Analysis – Map Reduce?
      Distributed programming framework – Divide and Conquer!
      • Master divides work into digestible chunks and distributes them to worker nodes -> MAP
      • Work from the nodes is then collected by the master and combined to form an answer -> REDUCE
      Powerful tool for solving interesting computational problems at scale
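The divide-and-conquer flow on this slide can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce phases on one machine, not how Hadoop itself is implemented; the sample records and all function names are made up for the example.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sample records (user_id, occupation); not taken from the slides.
RECORDS = [
    (1, "student"), (2, "engineer"), (3, "student"),
    (4, "writer"), (5, "engineer"), (6, "student"),
]


def split_into_chunks(records, n_chunks):
    """Master step: divide the input into chunks, one per worker."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """MAP: a worker emits one (key, 1) pair for every record it was given."""
    return [(occupation, 1) for _user_id, occupation in chunk]


def shuffle(mapped_chunks):
    """Collect the emitted values by key before they are reduced."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped_chunks):
        grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """REDUCE: combine all values for one key into a single answer."""
    return key, sum(values)


def count_by_occupation(records, n_workers=3):
    chunks = split_into_chunks(records, n_workers)
    # On a cluster these map calls would run in parallel on different nodes.
    mapped = [map_chunk(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_by_occupation(RECORDS))
```

The answer does not depend on how many chunks the work is split into, which is what lets the framework scale the map phase across nodes.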
  13. 13. HELP
      • We are doing low-level coding to perform low-level operations
      • For productivity we need higher-level tools!
      • We will get help from a few animals!
      Diagram: nodes N1 to N5 and the Hadoop Distributed File System (HDFS)
  14. 14. HIVE
      • The Hadoop “Data Warehouse”
      • HiveQL is a SQL-like interface that lets you layer a “relational-db like” structure on top of non-relational or unstructured data
        • Flat files, JSON, web logs
        • HBase, Cassandra, other NoSQL stores like MongoDB
      • Thanks to ODBC/JDBC drivers, some conventional BI tools can interact with Hive
      • Ability to integrate custom programming, mappers, reducers
  15. 15. HIVE
      But don’t get too excited!
      • Hive is not a database, especially in terms of optimizations.
      • SQL is translated into MapReduce jobs, so expect even simple queries to take around a minute or more. Start query, go get coffee.
      • But now that expectations have been set, it’s still a very useful tool.
  16. 16. HIVE DDL – Create and load a table
      Looks like a typical table declaration, except we are specifying the ingested file format:
      hive> create table user_movie_ratings(
          > user_id int,
          > movie_id int,
          > rating int,
          > time_unix_ts string)
          > row format delimited
          > fields terminated by '\t'
          > stored as textfile;
      OK
      Time taken: 0.395 seconds
      hive> load data inpath '/user/hive/staging/data/' overwrite into table user_movie_ratings;
      Loading data to table default.user_movie_ratings
      Deleted hdfs://localhost:54310/user/hive/warehouse/user_movie_ratings
      Table default.user_movie_ratings stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 1979173, raw_data_size: 0]
      OK
      Time taken: 0.474 seconds
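What the declaration on this slide amounts to is a schema applied to a tab-delimited text file at read time. A rough Python analogue, assuming the same four columns, is sketched below; the sample rows and the `read_ratings` helper are invented for illustration, and this is not how Hive reads files internally.

```python
import csv
import io
from typing import NamedTuple


class UserMovieRating(NamedTuple):
    """Mirrors the columns declared for user_movie_ratings on the slide."""
    user_id: int
    movie_id: int
    rating: int
    time_unix_ts: str


# Hypothetical tab-delimited rows; the real staging file is not shown in the deck.
SAMPLE = "1\t10\t4\t881250949\n2\t11\t5\t881250950\n"


def read_ratings(text):
    """Split each line on tabs and cast the fields to the declared types."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [
        UserMovieRating(int(row[0]), int(row[1]), int(row[2]), row[3])
        for row in reader
        if row  # skip blank lines
    ]


if __name__ == "__main__":
    for rating in read_ratings(SAMPLE):
        print(rating)
```

The file itself stays plain text; only the reader knows the column names and types, which is the point of declaring the row format in the table definition.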
  17. 17. HIVE DDL – Create an external table
      This time we don’t want Hive to own this data’s lifecycle:
      hive> create external table user (
          > user_id int,
          > age int,
          > gender string,
          > occupation string,
          > postal_code int )
          > row format delimited fields terminated by '|'
          > location '/user/hive/staging/user';
      OK
      Time taken: 0.096 seconds
  18. 18. HIVE – YAY SQL!
      hive> select occupation, count(1)
          > from user_movie_ratings m
          > join user u on u.user_id=m.user_id
          > group by occupation;
      Total MapReduce jobs = 2
      Launching Job 1 out of 2
      ...
      Total MapReduce CPU Time Spent: 47 seconds 170 msec
      OK
      administrator 7479
      artist 2308
      doctor 540
      educator 9442
      engineer 8175
      entertainment 2095
      ….
      retired 1609
      salesman 856
      scientist 2058
      student 21957
      technician 3506
      writer 5536
      Time taken: 110.331 seconds
      Hmmm..
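Because this query is ordinary SQL, its logic can be checked on a tiny sample before paying for a MapReduce run. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive, which is an assumption of this example and not something the presentation does. The table and column names follow the slide; the sample rows are made up, and an `order by` is added only to make the output order predictable.

```python
import sqlite3

# Hypothetical sample data; the real tables hold the ratings shown on the slide.
USERS = [(1, "student"), (2, "engineer"), (3, "writer")]
RATINGS = [(1, 10, 4), (1, 11, 5), (2, 10, 3), (3, 12, 2), (1, 12, 1)]


def occupation_counts(users, ratings):
    """Run the slide's join and group-by against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table user (user_id int, occupation text)")
        conn.execute(
            "create table user_movie_ratings (user_id int, movie_id int, rating int)"
        )
        conn.executemany("insert into user values (?, ?)", users)
        conn.executemany("insert into user_movie_ratings values (?, ?, ?)", ratings)
        return conn.execute(
            """
            select occupation, count(1)
            from user_movie_ratings m
            join user u on u.user_id = m.user_id
            group by occupation
            order by occupation
            """
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for occupation, n_ratings in occupation_counts(USERS, RATINGS):
        print(occupation, n_ratings)
```

Each rating row is counted once under the occupation of the user who made it, which is the same shape of result as the occupation totals on the slide.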
  19. 19. PIG
      • Powerful high-level programming language
      • SQL-ish, small learning curve for SQL and procedural programmers
      • Excellent for data transformation, ETL
      • Not meant to be an ad-hoc query tool, happy with doing grunt work
      • Plenty of supported file formats and databases, plus the ability to create custom UDFs
  20. 20. PIG Example
      grunt> lens_users = load '/user/movie_lens/u.user' using PigStorage('|') as (user_id:int, age:int, gender:chararray, occupation:chararray, postal_code:int);
      grunt> lens_data = load '/user/movie_lens/' using PigStorage('\t') as (user_id:int, movie_id:int, rating:int, time_unix_ts:chararray);
      grunt> joined = join lens_users by user_id, lens_data by user_id;
      grunt> grouped = group joined by (occupation);
      grunt> results = FOREACH grouped GENERATE COUNT_STAR(joined), *;
      grunt> store results into '/user/movie_lens_user_summary';
      Interesting: we are doing our aggregate functions after grouping
  21. 21. PIG - Results
      Grouping in Pig is a fair deviation from SQL -> original elements are preserved in a bag
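The difference from SQL is easier to see in ordinary data structures. The sketch below imitates the join, group and count steps in plain Python under simplifying assumptions: the sample rows and helper names are invented, `user_id` is assumed unique among users, and the joined row keeps a single `user_id` field where Pig would keep one from each input.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the two MovieLens inputs.
USERS = [
    {"user_id": 1, "occupation": "student"},
    {"user_id": 2, "occupation": "engineer"},
]
RATINGS = [
    {"user_id": 1, "movie_id": 10, "rating": 4},
    {"user_id": 1, "movie_id": 11, "rating": 5},
    {"user_id": 2, "movie_id": 10, "rating": 3},
]


def join_by_user_id(users, ratings):
    """Inner join of ratings to users on user_id."""
    users_by_id = {user["user_id"]: user for user in users}
    return [
        {**users_by_id[rating["user_id"]], **rating}
        for rating in ratings
        if rating["user_id"] in users_by_id
    ]


def group_by(rows, key):
    """Pig-style grouping: each group key maps to a bag of complete rows."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key]].append(row)
    return dict(bags)


def count_per_group(grouped):
    """Aggregate after grouping: emit (count, group key, bag) per group."""
    return [(len(bag), group, bag) for group, bag in grouped.items()]


if __name__ == "__main__":
    grouped = group_by(join_by_user_id(USERS, RATINGS), "occupation")
    for count, group, bag in count_per_group(grouped):
        print(count, group, bag)
```

In SQL the `group by` and the aggregate happen in one statement and the detail rows disappear; here, as in Pig, grouping is its own step and the bag still holds every original row until a later step decides what to compute from it.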
  22. 22. Summary
      Hive:
      • Helpful for ETL
      • Very good for ad-hoc analysis: not necessarily suited for front-end users, but definitely helpful for data analysts
      • Directly leverages SQL expertise!!
      Pig:
      • Great for ETL
      • Powerful transformation and processing capabilities
      • SQL-like, but different in many ways; it will take some time to master
  23. 23. Big Data Warehousing - Meetup