Big Data Warehousing: Pig vs. Hive Comparison

At a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. The presentation included a comparison of Hive and Pig. For more information on our services or this presentation, please visit www.casertaconcepts.com or contact us at info (at) casertaconcepts.com.

http://www.casertaconcepts.com

Transcript

  • 1. Big Data Warehousing Meetup. Today’s Topic: Exploring Big Data Analytics Techniques with Datameer. Sponsored By:
  • 2. WELCOME! Joe Caserta, Founder & President, Caserta Concepts
  • 3. Agenda
      • 7:00 Networking: Grab a slice of pizza and a drink...
      • 7:15 Welcome: Joe Caserta, President, Caserta Concepts; Author, Data Warehouse ETL Toolkit. About the Meetup and about Caserta Concepts
      • 7:30 Pig and Hive: Elliott Cordo, Principal Consultant, Caserta Concepts. Walkthrough of these powerful native Hadoop tools
      • 7:50 Datameer: Adam Gugliciello, Solutions Engineer, Datameer
      • 8:10 More Networking
      • 9:00 Tell us what you’re up to…
  • 4. About BDW Meetup
      • Big Data is a complex, rapidly changing landscape
      • We want to share our stories and hear about yours
      • Great networking opportunity for like-minded data nerds
      • Opportunities to collaborate on exciting projects
      • Next BDW Meetup: April 22. Topic: Intro to NoSQL Databases
  • 5. About Caserta Concepts
      Industries Served:
      • Financial Services
      • Healthcare / Insurance
      • Retail / eCommerce
      • Digital Media / Marketing
      • K-12 / Higher Education
      Focused Expertise:
      • Big Data Analytics
      • Data Warehousing
      • Business Intelligence
      • Strategic Data Ecosystems
      Founded in 2001
      • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
  • 6. Client Portfolio: Finance & Insurance; Retail/eCommerce & Manufacturing; Education & Services
  • 7. Expertise & Offerings
      • Strategic Roadmap/Assessment/Consulting
      • Big Data Analytics
      • Data Warehousing/ETL/Data Integration
      • BI/Visualization/Analytics
      • Master Data Management
  • 8. Opportunities: Does this word cloud excite you? Speak with us about our open positions: jobs@casertaconcepts.com
  • 9. Contacts
      • Joe Caserta, President & Founder, Caserta Concepts. P: (855) 755-2246 x227. E: joe@casertaconcepts.com
      • Erik Laurence, VP Marketing, Caserta Concepts. P: (855) 755-2246 x528. E: erik@casertaconcepts.com
      • Elliott Cordo, Principal Consultant, Caserta Concepts. P: (855) 755-2246 x267. E: elliott@casertaconcepts.com
      • General: info@casertaconcepts.com, 1(855) 755-2246, www.casertaconcepts.com
  • 10. ANALYZING DATA: PIG AND HIVE. Elliott Cordo, Principal Consultant, Caserta Concepts
  • 11. Big Data Analysis
      • Let’s review some tools for analyzing and processing Big Data
      • We will go over some simple use cases and point out what is interesting about them
      • Develop a point of view on what each one is well suited for
  • 12. Big Data Analysis – Map Reduce?
      Distributed programming framework: Divide and Conquer!
      • Master divides work into digestible chunks and distributes them to worker nodes -> MAP
      • Work from the nodes is then collected by the master and combined to form an answer -> REDUCE
      Powerful tool for solving interesting computational problems at scale
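The map and reduce steps described on slide 12 can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are hypothetical, and a real cluster would run each chunk on a separate worker node.

```python
from collections import defaultdict

def map_phase(chunk):
    """MAP: emit one (key, 1) pair per record in a chunk of input."""
    return [(occupation, 1) for occupation in chunk]

def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """REDUCE: combine all values for one key into a single result."""
    return key, sum(values)

def run_job(chunks):
    """Run map over every chunk, then reduce the grouped output."""
    emitted = []
    for chunk in chunks:  # on a cluster, each chunk would go to a worker node
        emitted.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

# Hypothetical input, already divided into two chunks.
chunks = [["student", "engineer", "student"], ["writer", "student"]]
counts = run_job(chunks)
print(counts)
```

Writing even a simple count this way takes three functions and a driver, which is the productivity gap the higher-level tools on the following slides are meant to close.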
  • 13. HELP
      • We are doing low-level language coding to perform low-level operations
      • For productivity we need higher-level tools!
      • We will get help from a few animals!
      [Diagram: nodes N1 to N5; Hadoop Distributed File System (HDFS)]
  • 14. HIVE
      • The Hadoop “Data Warehouse”
      • HiveQL is a SQL-like interface that lets you layer a relational-database-like structure on top of non-relational or unstructured data
          • Flat files, JSON, web logs
          • HBase, Cassandra, other NoSQL stores like MongoDB
      • Thanks to ODBC/JDBC drivers, some conventional BI tools can interact with Hive
      • Ability to integrate custom programming, mappers, reducers
  • 15. HIVE: But don’t get too excited!
      • Hive is not a database, especially in terms of optimizations.
      • SQL is translated into MapReduce jobs; expect even simple queries to take around a minute or more. Start query, go get coffee.
      • But now that expectations have been set, it’s still a very useful tool.
  • 16. HIVE DDL – Create and load a table
      hive> create table user_movie_ratings(
          >   user_id int,
          >   movie_id int,
          >   rating int,
          >   time_unix_ts string)
          > row format delimited
          > fields terminated by '\t'
          > stored as textfile;
      OK
      Time taken: 0.395 seconds
      hive> load data inpath '/user/hive/staging/data/u.data' overwrite into table user_movie_ratings;
      Loading data to table default.user_movie_ratings
      Deleted hdfs://localhost:54310/user/hive/warehouse/user_movie_ratings
      Table default.user_movie_ratings stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 1979173, raw_data_size: 0]
      OK
      Time taken: 0.474 seconds
      Callout: Looks like a typical table declaration, except we are specifying the ingested file format
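The table declaration on slide 16 only describes how to read a delimited text file; the file itself stays plain text. A rough Python sketch of that idea follows, applying the slide's column names and types to raw tab-delimited lines at read time. The `SCHEMA` list, the `parse_line` helper, and the sample lines are made up for illustration and are not how Hive is implemented internally.

```python
# Column names and types copied from the CREATE TABLE statement on slide 16.
SCHEMA = [
    ("user_id", int),
    ("movie_id", int),
    ("rating", int),
    ("time_unix_ts", str),
]

def parse_line(line, delimiter="\t"):
    """Apply the schema to one raw text line at read time."""
    fields = line.rstrip("\n").split(delimiter)
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}

# Made-up sample lines, not taken from the real u.data file.
raw_lines = ["1\t10\t5\t1000000000\n", "2\t10\t3\t1000000060\n"]
rows = [parse_line(line) for line in raw_lines]
print(rows[0])
```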
  • 17. HIVE DDL – Create an external table
      hive> create external table user (
          >   user_id int,
          >   age int,
          >   gender string,
          >   occupation string,
          >   postal_code int )
          > row format delimited fields terminated by '|'
          > location '/user/hive/staging/user';
      OK
      Time taken: 0.096 seconds
      Callout: This time we don’t want Hive to own this data’s lifecycle
  • 18. HIVE – YAY SQL!
      hive> select occupation, count(1)
          > from user_movie_ratings m
          > join user u on u.user_id=m.user_id
          > group by occupation;
      Total MapReduce jobs = 2
      Launching Job 1 out of 2
      ...
      Total MapReduce CPU Time Spent: 47 seconds 170 msec
      OK
      administrator 7479
      artist 2308
      doctor 540
      educator 9442
      engineer 8175
      entertainment 2095
      ….
      retired 1609
      salesman 856
      scientist 2058
      student 21957
      technician 3506
      writer 5536
      Time taken: 110.331 seconds
      Callout on the time taken: Hmmm..
  • 19. PIG
      • Powerful high-level programming language
      • SQL-ish, small learning curve for SQL and procedural programmers
      • Excellent for data transformation, ETL
      • Not meant to be an ad-hoc query tool; happy with doing grunt work
      • Plenty of supported file formats and databases, plus the ability to create custom UDFs
  • 20. PIG Example
      grunt> lens_users = load '/user/movie_lens/u.user' using PigStorage('|') as (user_id:int, age:int, gender:chararray, occupation:chararray, postal_code:int);
      grunt> lens_data = load '/user/movie_lens/u.data' using PigStorage('\t') as (user_id:int, movie_id:int, rating:int, time_unix_ts:chararray);
      grunt> joined = join lens_users by user_id, lens_data by user_id;
      grunt> grouped = group joined by (occupation);
      grunt> results = FOREACH grouped GENERATE COUNT_STAR(joined),*;
      grunt> store results into '/user/movie_lens_user_summary';
      Callout: Interesting, we are doing our aggregate functions after grouping
  • 21. PIG - Results. Grouping in Pig is a fair deviation from SQL -> original elements are preserved in a bag
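The point on slide 21, that a Pig group keeps the original rows in a bag instead of collapsing them as SQL's GROUP BY does, can be imitated in Python. This is a loose analogy only: `pig_group` is a hypothetical helper, the rows are made up, and a Python list stands in for a Pig bag.

```python
from collections import defaultdict

def pig_group(rows, key):
    """Mimic Pig's GROUP: return (group_key, bag) pairs where each bag
    still holds every original row for that key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key]].append(row)
    return list(bags.items())

# Made-up joined rows standing in for the `joined` relation on slide 20.
joined = [
    {"user_id": 1, "occupation": "student", "movie_id": 10, "rating": 5},
    {"user_id": 2, "occupation": "writer", "movie_id": 10, "rating": 3},
    {"user_id": 1, "occupation": "student", "movie_id": 11, "rating": 4},
]

grouped = pig_group(joined, "occupation")

# Rough equivalent of: FOREACH grouped GENERATE group, COUNT_STAR(joined);
results = [(occupation, len(bag)) for occupation, bag in grouped]
print(results)
```

Because each bag still contains full rows, the aggregate is applied as a separate step after grouping, which matches the observation on slide 20.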
  • 22. Summary
      Hive:
      • Helpful for ETL
      • Very good for ad-hoc analysis: not necessarily suited for front-end users, but definitely helpful for data analysts
      • Directly leverages SQL expertise!!
      Pig:
      • Great for ETL
      • Powerful transformation and processing capabilities
      • SQL-like, but different in many ways; will take some time to master
  • 23. Big Data Warehousing - Meetup