HIVEdata warehousing using HadoopFacebook Data Team
Motivation Structured log and dimension data  – Well known schemas, different serialization formats (binary/text)  – Rich ...
What is HIVE? Mgmt. Web UI                                                               Map Reduce      HDFS             ...
Dealing with Structured Data Type system  – Primitive types  – Recursively build up using Composition/Maps/Lists Generic (...
Data Model                                                     #Partitions=32                                        Schem...
MetaStore Stores Table/Partition properties:  –   Table schema and SerDe library  –   Table Location on HDFS  –   Logical ...
Hive CLI Implemented in Python  – uses MetaStore Thrift API DDL:  – create table/drop table/rename table  – alter table ad...
Hive Query Language Philosophy  – SQL like constructs + Hadoop Streaming Query Operators in initial version  –   Projectio...
Hive Query Language Package these capabilities into a more formal SQL like query language in next version Introduce other ...
Query Language - Examples  Multi table inserts  FROM ad_impressions_stg imps   INSERT INTO ad_legals/ds=2008-03-08 select ...
Query Language - Examples Group By FROM ad_legals_joined imps   INSERT INTO hdfs://hadoop001:9000/user/ads/adid_uu_summary...
Query Language – HadoopStreaming APPLY ON TABLE CREATE OPERATOR filter_legal using ‘exec://filter_legal.py’        (ts dat...
Query Language – Views Used for expressing  – Union alls  – APPLY operators Example CREATE VIEW actions SELECT photo_views...
Hive Usage in Facebook Applications:  – Summarization       Eg: Daily/Weekly aggregations of impression/click counts  – Ad...
Conclusion Release to Open Source in 3-4 months People:  –   Suresh Anthony (suresh@facebook.com)  –   Jeff Hammerbacher (...
Upcoming SlideShare
Loading in...5
×

Facebook hadoop-summit

706

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
706
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
7
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Facebook hadoop-summit

  1. 1. HIVEdata warehousing using HadoopFacebook Data Team
  2. 2. Motivation Structured log and dimension data – Well known schemas, different serialization formats (binary/text) – Rich data structures – nesting/maps/lists Query language over structured data – SQL helps in easier adoption by business analysts + reduced learning curve for everyone – Developers love streaming and direct access to map-reduce – Query Language brings together SQL and Streaming Data Management – Tables/Partitions for easy data addressability – Abstractions allow optimizations: Organize data for large joins/sampling Add indices/manage compression/replication transparently
  3. 3. What is HIVE? Mgmt. Web UI Map Reduce HDFS Hive CLI Browsing Queries DDL Thrift API Parser Execution Planner Hive QL SerDe Thrift Jute JSON.. MetaStore
  4. 4. Dealing with Structured Data Type system – Primitive types – Recursively build up using Composition/Maps/Lists Generic (De)Serialization Interface (SerDe) – To recursively list schema – To recursively access fields within a row object Serialization families implement interface – Thrift (Binary and Delimited Text), RecordIO, JSON/PADS(?) XPath like field expressions – profiles.network[@is_primary=1].id Inbuilt DDL – Define schema over delimited text files – Leverages Thrift DDL
  5. 5. Data Model #Partitions=32 Schema Sort-key=uid uid Library Hash clicks Partitioning views IPLogical Partitioning userId … AdId/hive/clicks/hive/clicks/ds=2008-03-25 Tables Dimensions/hive/clicks/ds=2008-03-25/0 HDFS MetaStore
  6. 6. MetaStore Stores Table/Partition properties: – Table schema and SerDe library – Table Location on HDFS – Logical Partitioning keys and types – Sort column – Mapping from columns to well known Dimensions Thrift API – Current clients in Php (Web Interface), Python (CLI), Java (Query Engine), Perl (Tests) Stores all properties in text files
  7. 7. Hive CLI Implemented in Python – uses MetaStore Thrift API DDL: – create table/drop table/rename table – alter table add column etc. Browsing: – show tables – describe table – cat table Loading Data – load data inpath <path1, …> into table <tablename/partition-spec>] [bucketed <N> ways by <dimension>] Queries – Issue queries in Hive QL.
  8. 8. Hive Query Language Philosophy – SQL like constructs + Hadoop Streaming Query Operators in initial version – Projections – Equijoins and Cogroups – Group by – Sampling Output of these operators can be: – passed to Streaming mappers/reducers – can be stored in another Hive Table – can be output to HDFS files
  9. 9. Hive Query Language Package these capabilities into a more formal SQL like query language in next version Introduce other important constructs: – Views – Multi table inserts – Order bys – Select distincts – SQL like column expressions – A bunch of other builtin functions Still work in progress
  10. 10. Query Language - Examples Multi table inserts FROM ad_impressions_stg imps INSERT INTO ad_legals/ds=2008-03-08 select imps.* where imps.legal = 1 INSERT INTO ad_non_legals/ds=2008-03-08 select imps.* where imps.legal = 0 Joins FROM ad_impressions imps, ad_dimensions ads INSERT INTO ad_legals_joined select imps.*, ads.campaignid JOIN ON(imps.adid, ads.adid) WHERE imps.legal = 1
  11. 11. Query Language - Examples Group By FROM ad_legals_joined imps INSERT INTO hdfs://hadoop001:9000/user/ads/adid_uu_summary select imps.adid, count_distinct(imps.uid) group by(imps.adid) INSERT INTO hdfs://hadoop001:9000/user/ads/campaignid_uu_summary select imps.campaign_id, count_distinct(imps.uid) group by(imps.campaignid)
  12. 12. Query Language – HadoopStreaming APPLY ON TABLE CREATE OPERATOR filter_legal using ‘exec://filter_legal.py’ (ts date, adid long, uid long) FROM (APPLY filter_legal ON TABLE ad_impression) INSERT INTO ad_legals where ts >= ‘2008-03-11’ and ts < ‘2008-03-12’ APPLY can also be applied after JOIN as reducer script FROM ad_impressions imps, ad_dimensions ads INSERT INTO ad_legals_joined select imps.*, ads.campaignid JOIN ON(imps.adid, ads.adid) APPLY filter_legal BEFORE OUTPUT
  13. 13. Query Language – Views Used for expressing – Union alls – APPLY operators Example CREATE VIEW actions SELECT photo_views.* UNION ALL SELECT video_views.* UNION ALL SELECT profile_views.* …
  14. 14. Hive Usage in Facebook Applications: – Summarization Eg: Daily/Weekly aggregations of impression/click counts – Ad hoc Analysis Eg: how many group admins broken down by state/country – Data Mining (Assembling training data) Eg: User Engagement as a function of user attributes Usage statistics: – Total Users: ~40 (about 25% of engineering !) – Hive Data (compressed): 22 TB total, ~200GB incoming per day – Jobs over last 7 days: Total Jobs: 3514, Projections:821, Joins: 152, Aggregates: 800, Loaders: 600 * Aggregates biased because of multi-stage map-reduce
  15. 15. Conclusion Release to Open Source in 3-4 months People: – Suresh Anthony (suresh@facebook.com) – Jeff Hammerbacher (jeffh@) – Joydeep Sarma (jssarma@) – Ashish Thusoo (athusoo@) – Pete Wyckoff (pwyckoff@)
  1. ¿Le ha llamado la atención una diapositiva en particular?

    Recortar diapositivas es una manera útil de recopilar información importante para consultarla más tarde.

×