Your SlideShare is downloading. ×
0
Toronto Hadoop User GroupFebruary 20, 2013Apache HCatalog Overview& Next-gen HiveAdam Muise© Hortonworks Inc. 2012     Pag...
HCatalog© Hortonworks Inc. 2012   Page 2
Apache HCatalog• Incubator Project at Apache.org• Good adoption• Will likely merge with Hive project as it adds  important...
Great Tooling Options                             MapReduce                             •  Early adopters                 ...
HCatalog Changes the GameApache HCatalog provides flexible metadataservices across tools and external access •  Consistenc...
Options == Complexity Feature                        MapReduce         Pig                   Hive Record format           ...
Hcatalog == Simple, Consistent Feature                         MapReduce +            Pig + HCatalog        Hive          ...
Hadoop Ecosystem                                   Hive                       Pig                 MapReduce               ...
Opening up Metadata to MR & Pig                                   Hive             Pig                 MapReduce          ...
Pig ExampleAssume you want to count how many time each of your users went to each of your URLsraw     = load /data/raweven...
Working with HCatalog in MapReduceSetting input:             table to read fromHCatInputFormat.setInput(job,  InputJobInfo...
Managing Metadata•  If you are a Hive user, you can use your Hive metastore with no modifications•  If not, you can use th...
Templeton - REST API•  REST endpoints: databases, tables, partitions, columns, table properties•  PUT to create/update, GE...
HCatalog is in Hortonworks Data Platform…  OPERATIONAL	  SERVICES	                                                DATA	   ...
Key 2013 “Enterprise Hadoop” Initiatives                                                                                  ...
Hive/Stinger: Interactive QueryNear-realtime queries in good old Hive…© Hortonworks Inc. 2012                   Page 16
Top BI Vendors Support Hive Today   © Hortonworks Inc. 2012          Page 17
Goal: Enhance Hive for BI Use CasesEnterprise Reports                                                       Parameterized ...
Differing Needs For Scale / Interaction       Interactivity               Scalability and Reliability          is key     ...
Stinger: Make Hive Best for All Needs                                             Non-                  Interactive       ...
Analytic Function Use Cases• OVER  – Rankings, top 10, bottom 10  – Running balances  – Statistics within time windows (e....
Stinger 2013 Roadmap Summary• HDP 1.x (aka Hadoop 1.x …)  – Additional SQL Types  – SQL Analytic Functions (OVER, Subqueri...
Questions?© Hortonworks Inc. 2012   Page 23
Upcoming SlideShare
Loading in...5
×

2013 feb 20_thug_h_catalog

554

Published on

Toronto Hadoop User Group
THUG
February 20 2013
HCatalog
Hive - Stinger Initiative

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
554
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
29
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Transcript of "2013 feb 20_thug_h_catalog"

  1. 1. Toronto Hadoop User GroupFebruary 20, 2013Apache HCatalog Overview& Next-gen HiveAdam Muise© Hortonworks Inc. 2012 Page 1
  2. 2. HCatalog© Hortonworks Inc. 2012 Page 2
  3. 3. Apache HCatalog• Incubator Project at Apache.org• Good adoption• Will likely merge with Hive project as it adds important functionality to metastore• Allows for a “schema-on-read” approach to Big Data in HDFS• Seat of a lot of innovation in Data Management• Used by platform partners to enhance Hadoop integration• Will likely be used to enhance existing Data Management products in the Enterprise & create new products © Hortonworks Inc. 2012 Page 3
  4. 4. Great Tooling Options MapReduce •  Early adopters •  Non-relational algorithms •  Performance sensitive applications Pig •  ETL •  Data modeling •  Iterative algorithms Hive •  Analysis •  Connectors to BI tools Strength: Right tool for right application Weakness: Hard to share their data © Hortonworks Inc. 2012 Page 4
  5. 5. HCatalog Changes the GameApache HCatalog provides flexible metadataservices across tools and external access •  Consistency of metadata and data models across the Enterprise (MapReduce, Pig, Hbase, Hive, External Systems) •  Accessibility: share data as tables in and out of HDFS •  Availability: enables flexible, thin-client access via REST API HCatalog Shared table and schema management •  Raw Hadoop data Table access opens the •  Inconsistent, unknown Aligned metadata platform •  Tool specific access REST API © Hortonworks Inc. 2012 Page 5
  6. 6. Options == Complexity Feature MapReduce Pig Hive Record format Key value pairs Tuple Record Data model User defined int, float, string, int, float, string, bytes, maps, maps, structs, lists tuples, bags Schema Encoded in app Declared in script Read from or read by loader metadata Data location Encoded in app Declared in script Read from metadata Data format Encoded in app Declared in script Read from metadata•  Pig and MR users need to know a lot to write their apps•  When data schema, location, or format change Pig and MR apps must be rewritten, retested, and redeployed•  Hive users have to load data from Pig/MR users to have access to it © Hortonworks 2012 Page 6
  7. 7. Hcatalog == Simple, Consistent Feature MapReduce + Pig + HCatalog Hive HCatalog Record format Record Tuple Record Data model int, float, string, int, float, string, int, float, string, maps, structs, lists bytes, maps, maps, structs, lists tuples, bags Schema Read from Read from Read from metadata metadata metadata Data location Read from Read from Read from metadata metadata metadata Data format Read from Read from Read from metadata metadata metadata•  Pig/MR users can read schema from metadata•  Pig/MR users are insulated from schema, location, and format changes•  All users have access to other users’ data as soon as it is committed © Hortonworks 2012 Page 7
  8. 8. Hadoop Ecosystem Hive Pig MapReduce (SQL) (scripting) (Java) Interface: Interface: Interface: SQL SerDe Load/Store DML Input/ Input/ Input/ OutputFormat OutputFormat OutputFormat DDL metastore dn1 dn2 dn3 . . . - tables - partitions - files . . . . . . - types . . . . dnN HDFS © Hortonworks Inc. 2012
  9. 9. Opening up Metadata to MR & Pig Hive Pig MapReduce (SQL) (scripting) (Java) HCat Metadata layer Interface: Interface: SQL HCatLoad/Store Interface: SerDe HCatInput/OutputFormat metastore dn1 dn2 dn3 . . . - tables - partitions - files . . . . . . - types . . . . dnN HDFS © Hortonworks Inc. 2012
  10. 10. Pig ExampleAssume you want to count how many time each of your users went to each of your URLsraw = load /data/rawevents/20120530 as (url, user);botless = filter raw by myudfs.NotABot(user);grpd = group botless by (url, user);cntd = foreach grpd generate flatten(url, user), COUNT(botless); No need to know No need tostore cntd into /data/counted/20120530; file location declare schemaUsing HCatalog:raw = load rawevents using HCatLoader();botless = filter raw by myudfs.NotABot(user) and ds == 20120530;grpd = group botless by (url, user);cntd = foreach grpd generate flatten(url, user), COUNT(botless);store cntd into counted using HCatStorer(); Partition filter © Hortonworks 2012 Page 10
  11. 11. Working with HCatalog in MapReduceSetting input: table to read fromHCatInputFormat.setInput(job, InputJobInfo.create(dbname, tableName, filter)); database to read specify whichSetting output: from partitions to readHCatOutputFormat.setOutput(job, OutputJobInfo.create(dbname, tableName, partitionSpec)); specify whichObtaining schema: partition to writeschema = HCatInputFormat.getOutputSchema(); access fields byKey is unused, Value is HCatRecord: nameString url = value.get("url", schema);output.set("cnt", schema, cnt); © Hortonworks 2012 Page 11
  12. 12. Managing Metadata•  If you are a Hive user, you can use your Hive metastore with no modifications•  If not, you can use the HCatalog command line tool to issue Hive DDL (Data Definition Language) commands:> /usr/bin/hcat -e ”create table rawevents (url string,user string) partitioned by (ds string);";•  Starting in Pig 0.11, you will be able to issue DDL commands from Pig © Hortonworks 2012 Page 12
  13. 13. Templeton - REST API•  REST endpoints: databases, tables, partitions, columns, table properties•  PUT to create/update, GET to list or describe, DELETE to drop Get a list of all tables in the default database: Create new table “rawevents” Hadoop/ HCatalog Describe table “rawevents” © Hortonworks 2012 Page 13
  14. 14. HCatalog is in Hortonworks Data Platform… OPERATIONAL  SERVICES   DATA   SERVICES   AMBARI   FLUME   PIG   HIVE   HBASE   SQOOP   HCATALOG   OOZIE   WEBHDFS   MAP  REDUCE   HADOOP  CORE   HDFS   YARN  (in  2.0)   Enterprise Readiness PLATFORM  SERVICES   High Availability, Disaster Recovery, Snapshots, Security, etc… HORTONWORKS     DATA  PLATFORM  (HDP)   OS   Cloud   VM   Appliance   © Hortonworks Inc. 2012 Page 14
  15. 15. Key 2013 “Enterprise Hadoop” Initiatives Invest In: Hive / “Stinger” Interactive Query – Platform Services Ambari HBase – DR, Snapshot, …Manage & Operate Online Data OPERATIONAL   DATA   SERVICES   SERVICES   HADOOP  CORE   – Data Services PLATFORM  SERVICES   – In support of Refine, “Knox” HORTONWORKS     “Herd” Explore, Enrich Secure Access DATA  PLATFORM  (HDP)   Data Integration – Operational Services “Continuum” – Manageability, Biz Continuity Security, … © Hortonworks Inc. 2012 Page 15
  16. 16. Hive/Stinger: Interactive QueryNear-realtime queries in good old Hive…© Hortonworks Inc. 2012 Page 16
  17. 17. Top BI Vendors Support Hive Today © Hortonworks Inc. 2012 Page 17
  18. 18. Goal: Enhance Hive for BI Use CasesEnterprise Reports Parameterized Reports Dashboard / Scorecard Data Mining Visualization More SQL & Better Performance Batch Interactive © Hortonworks Inc. 2012 Page 18
  19. 19. Differing Needs For Scale / Interaction Interactivity Scalability and Reliability is key are key Non- Interactive Batch Interactive •  Parameterized •  Data preparation •  Operational batch Reports •  Incremental batch processing •  Drilldown processing •  Enterprise •  Visualization •  Dashboards / Reports •  Exploration Scorecards •  Data Mining 5s – 1m 1m – 1h 1h+ Data Size © Hortonworks Inc. 2012 Page 19
  20. 20. Stinger: Make Hive Best for All Needs Non- Interactive Batch Interactive •  Parameterized •  Data preparation •  Operational batch Reports •  Incremental batch processing •  Drilldown processing •  Enterprise •  Visualization •  Dashboards / Reports •  Exploration Scorecards •  Data Mining 5s – 1m 1m – 1h 1h+ Data SizeImprove Latency & Throughput Extend Deep Analytical Ability•  Query engine improvements •  Analytics functions•  New “Optimized RCFile” column store •  Improved SQL coverage•  Next-gen runtime (elim’s M/R latency) •  Continued focus on core Hive use cases © Hortonworks Inc. 2012 Page 20
  21. 21. Analytic Function Use Cases• OVER – Rankings, top 10, bottom 10 – Running balances – Statistics within time windows (e.g. last 3 months, last 6 months)• LEAD / LAG – Trend identification – Sessionization – Forecasting / prediction• Distributions – Histograms and bucketing• Good for Enterprise Reports, Dashboards, Data Mining and Business Processing. © Hortonworks Inc. 2012 Page 21
  22. 22. Stinger 2013 Roadmap Summary• HDP 1.x (aka Hadoop 1.x …) – Additional SQL Types – SQL Analytic Functions (OVER, Subqueries in WHERE, etc.) – Modern Optimized Column Store (ORC file) – Hive Query Enhancements – Startup time, star joins, optimize M/R DAGs, vectorization, etc.• HDP 2.x (aka Hadoop 2.x …) – Features in HDP 1.3 & 1.4 – Next-gen runtime that eliminates startup time – Persistent function registry – Other features © Hortonworks Inc. 2012 Page 22
  23. 23. Questions?© Hortonworks Inc. 2012 Page 23
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×