Your SlideShare is downloading. ×
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Facebook hadoop-summit
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Facebook hadoop-summit

690

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
690
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
6
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. HIVEdata warehousing using HadoopFacebook Data Team
  • 2. Motivation Structured log and dimension data – Well known schemas, different serialization formats (binary/text) – Rich data structures – nesting/maps/lists Query language over structured data – SQL helps in easier adoption by business analysts + reduced learning curve for everyone – Developers love streaming and direct access to map-reduce – Query Language brings together SQL and Streaming Data Management – Tables/Partitions for easy data addressability – Abstractions allow optimizations: Organize data for large joins/sampling Add indices/manage compression/replication transparently
  • 3. What is HIVE? Mgmt. Web UI Map Reduce HDFS Hive CLI Browsing Queries DDL Thrift API Parser Execution Planner Hive QL SerDe Thrift Jute JSON.. MetaStore
  • 4. Dealing with Structured Data Type system – Primitive types – Recursively build up using Composition/Maps/Lists Generic (De)Serialization Interface (SerDe) – To recursively list schema – To recursively access fields within a row object Serialization families implement interface – Thrift (Binary and Delimited Text), RecordIO, JSON/PADS(?) XPath like field expressions – profiles.network[@is_primary=1].id Inbuilt DDL – Define schema over delimited text files – Leverages Thrift DDL
  • 5. Data Model #Partitions=32 Schema Sort-key=uid uid Library Hash clicks Partitioning views IPLogical Partitioning userId … AdId/hive/clicks/hive/clicks/ds=2008-03-25 Tables Dimensions/hive/clicks/ds=2008-03-25/0 HDFS MetaStore
  • 6. MetaStore Stores Table/Partition properties: – Table schema and SerDe library – Table Location on HDFS – Logical Partitioning keys and types – Sort column – Mapping from columns to well known Dimensions Thrift API – Current clients in Php (Web Interface), Python (CLI), Java (Query Engine), Perl (Tests) Stores all properties in text files
  • 7. Hive CLI Implemented in Python – uses MetaStore Thrift API DDL: – create table/drop table/rename table – alter table add column etc. Browsing: – show tables – describe table – cat table Loading Data – load data inpath <path1, …> into table <tablename/partition-spec>] [bucketed <N> ways by <dimension>] Queries – Issue queries in Hive QL.
  • 8. Hive Query Language Philosophy – SQL like constructs + Hadoop Streaming Query Operators in initial version – Projections – Equijoins and Cogroups – Group by – Sampling Output of these operators can be: – passed to Streaming mappers/reducers – can be stored in another Hive Table – can be output to HDFS files
  • 9. Hive Query Language Package these capabilities into a more formal SQL like query language in next version Introduce other important constructs: – Views – Multi table inserts – Order bys – Select distincts – SQL like column expressions – A bunch of other builtin functions Still work in progress
  • 10. Query Language - Examples Multi table inserts FROM ad_impressions_stg imps INSERT INTO ad_legals/ds=2008-03-08 select imps.* where imps.legal = 1 INSERT INTO ad_non_legals/ds=2008-03-08 select imps.* where imps.legal = 0 Joins FROM ad_impressions imps, ad_dimensions ads INSERT INTO ad_legals_joined select imps.*, ads.campaignid JOIN ON(imps.adid, ads.adid) WHERE imps.legal = 1
  • 11. Query Language - Examples Group By FROM ad_legals_joined imps INSERT INTO hdfs://hadoop001:9000/user/ads/adid_uu_summary select imps.adid, count_distinct(imps.uid) group by(imps.adid) INSERT INTO hdfs://hadoop001:9000/user/ads/campaignid_uu_summary select imps.campaign_id, count_distinct(imps.uid) group by(imps.campaignid)
  • 12. Query Language – HadoopStreaming APPLY ON TABLE CREATE OPERATOR filter_legal using ‘exec://filter_legal.py’ (ts date, adid long, uid long) FROM (APPLY filter_legal ON TABLE ad_impression) INSERT INTO ad_legals where ts >= ‘2008-03-11’ and ts < ‘2008-03-12’ APPLY can also be applied after JOIN as reducer script FROM ad_impressions imps, ad_dimensions ads INSERT INTO ad_legals_joined select imps.*, ads.campaignid JOIN ON(imps.adid, ads.adid) APPLY filter_legal BEFORE OUTPUT
  • 13. Query Language – Views Used for expressing – Union alls – APPLY operators Example CREATE VIEW actions SELECT photo_views.* UNION ALL SELECT video_views.* UNION ALL SELECT profile_views.* …
  • 14. Hive Usage in Facebook Applications: – Summarization Eg: Daily/Weekly aggregations of impression/click counts – Ad hoc Analysis Eg: how many group admins broken down by state/country – Data Mining (Assembling training data) Eg: User Engagement as a function of user attributes Usage statistics: – Total Users: ~40 (about 25% of engineering !) – Hive Data (compressed): 22 TB total, ~200GB incoming per day – Jobs over last 7 days: Total Jobs: 3514, Projections:821, Joins: 152, Aggregates: 800, Loaders: 600 * Aggregates biased because of multi-stage map-reduce
  • 15. Conclusion Release to Open Source in 3-4 months People: – Suresh Anthony (suresh@facebook.com) – Jeff Hammerbacher (jeffh@) – Joydeep Sarma (jssarma@) – Ashish Thusoo (athusoo@) – Pete Wyckoff (pwyckoff@)

×