2013 feb 20_thug_h_catalog

  • 477 views
Uploaded on

Toronto Hadoop User Group …

Toronto Hadoop User Group
THUG
February 20 2013
HCatalog
Hive - Stinger Initiative

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
477
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
28
Comments
0
Likes
1

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Toronto Hadoop User GroupFebruary 20, 2013Apache HCatalog Overview& Next-gen HiveAdam Muise© Hortonworks Inc. 2012 Page 1
  • 2. HCatalog© Hortonworks Inc. 2012 Page 2
  • 3. Apache HCatalog• Incubator Project at Apache.org• Good adoption• Will likely merge with Hive project as it adds important functionality to metastore• Allows for a “schema-on-read” approach to Big Data in HDFS• Seat of a lot of innovation in Data Management• Used by platform partners to enhance Hadoop integration• Will likely be used to enhance existing Data Management products in the Enterprise & create new products © Hortonworks Inc. 2012 Page 3
  • 4. Great Tooling Options MapReduce •  Early adopters •  Non-relational algorithms •  Performance sensitive applications Pig •  ETL •  Data modeling •  Iterative algorithms Hive •  Analysis •  Connectors to BI tools Strength: Right tool for right application Weakness: Hard to share their data © Hortonworks Inc. 2012 Page 4
  • 5. HCatalog Changes the GameApache HCatalog provides flexible metadataservices across tools and external access •  Consistency of metadata and data models across the Enterprise (MapReduce, Pig, Hbase, Hive, External Systems) •  Accessibility: share data as tables in and out of HDFS •  Availability: enables flexible, thin-client access via REST API HCatalog Shared table and schema management •  Raw Hadoop data Table access opens the •  Inconsistent, unknown Aligned metadata platform •  Tool specific access REST API © Hortonworks Inc. 2012 Page 5
  • 6. Options == Complexity Feature MapReduce Pig Hive Record format Key value pairs Tuple Record Data model User defined int, float, string, int, float, string, bytes, maps, maps, structs, lists tuples, bags Schema Encoded in app Declared in script Read from or read by loader metadata Data location Encoded in app Declared in script Read from metadata Data format Encoded in app Declared in script Read from metadata•  Pig and MR users need to know a lot to write their apps•  When data schema, location, or format change Pig and MR apps must be rewritten, retested, and redeployed•  Hive users have to load data from Pig/MR users to have access to it © Hortonworks 2012 Page 6
  • 7. Hcatalog == Simple, Consistent Feature MapReduce + Pig + HCatalog Hive HCatalog Record format Record Tuple Record Data model int, float, string, int, float, string, int, float, string, maps, structs, lists bytes, maps, maps, structs, lists tuples, bags Schema Read from Read from Read from metadata metadata metadata Data location Read from Read from Read from metadata metadata metadata Data format Read from Read from Read from metadata metadata metadata•  Pig/MR users can read schema from metadata•  Pig/MR users are insulated from schema, location, and format changes•  All users have access to other users’ data as soon as it is committed © Hortonworks 2012 Page 7
  • 8. Hadoop Ecosystem Hive Pig MapReduce (SQL) (scripting) (Java) Interface: Interface: Interface: SQL SerDe Load/Store DML Input/ Input/ Input/ OutputFormat OutputFormat OutputFormat DDL metastore dn1 dn2 dn3 . . . - tables - partitions - files . . . . . . - types . . . . dnN HDFS © Hortonworks Inc. 2012
  • 9. Opening up Metadata to MR & Pig Hive Pig MapReduce (SQL) (scripting) (Java) HCat Metadata layer Interface: Interface: SQL HCatLoad/Store Interface: SerDe HCatInput/OutputFormat metastore dn1 dn2 dn3 . . . - tables - partitions - files . . . . . . - types . . . . dnN HDFS © Hortonworks Inc. 2012
  • 10. Pig ExampleAssume you want to count how many time each of your users went to each of your URLsraw = load /data/rawevents/20120530 as (url, user);botless = filter raw by myudfs.NotABot(user);grpd = group botless by (url, user);cntd = foreach grpd generate flatten(url, user), COUNT(botless); No need to know No need tostore cntd into /data/counted/20120530; file location declare schemaUsing HCatalog:raw = load rawevents using HCatLoader();botless = filter raw by myudfs.NotABot(user) and ds == 20120530;grpd = group botless by (url, user);cntd = foreach grpd generate flatten(url, user), COUNT(botless);store cntd into counted using HCatStorer(); Partition filter © Hortonworks 2012 Page 10
  • 11. Working with HCatalog in MapReduceSetting input: table to read fromHCatInputFormat.setInput(job, InputJobInfo.create(dbname, tableName, filter)); database to read specify whichSetting output: from partitions to readHCatOutputFormat.setOutput(job, OutputJobInfo.create(dbname, tableName, partitionSpec)); specify whichObtaining schema: partition to writeschema = HCatInputFormat.getOutputSchema(); access fields byKey is unused, Value is HCatRecord: nameString url = value.get("url", schema);output.set("cnt", schema, cnt); © Hortonworks 2012 Page 11
  • 12. Managing Metadata•  If you are a Hive user, you can use your Hive metastore with no modifications•  If not, you can use the HCatalog command line tool to issue Hive DDL (Data Definition Language) commands:> /usr/bin/hcat -e ”create table rawevents (url string,user string) partitioned by (ds string);";•  Starting in Pig 0.11, you will be able to issue DDL commands from Pig © Hortonworks 2012 Page 12
  • 13. Templeton - REST API•  REST endpoints: databases, tables, partitions, columns, table properties•  PUT to create/update, GET to list or describe, DELETE to drop Get a list of all tables in the default database: Create new table “rawevents” Hadoop/ HCatalog Describe table “rawevents” © Hortonworks 2012 Page 13
  • 14. HCatalog is in Hortonworks Data Platform… OPERATIONAL  SERVICES   DATA   SERVICES   AMBARI   FLUME   PIG   HIVE   HBASE   SQOOP   HCATALOG   OOZIE   WEBHDFS   MAP  REDUCE   HADOOP  CORE   HDFS   YARN  (in  2.0)   Enterprise Readiness PLATFORM  SERVICES   High Availability, Disaster Recovery, Snapshots, Security, etc… HORTONWORKS     DATA  PLATFORM  (HDP)   OS   Cloud   VM   Appliance   © Hortonworks Inc. 2012 Page 14
  • 15. Key 2013 “Enterprise Hadoop” Initiatives Invest In: Hive / “Stinger” Interactive Query – Platform Services Ambari HBase – DR, Snapshot, …Manage & Operate Online Data OPERATIONAL   DATA   SERVICES   SERVICES   HADOOP  CORE   – Data Services PLATFORM  SERVICES   – In support of Refine, “Knox” HORTONWORKS     “Herd” Explore, Enrich Secure Access DATA  PLATFORM  (HDP)   Data Integration – Operational Services “Continuum” – Manageability, Biz Continuity Security, … © Hortonworks Inc. 2012 Page 15
  • 16. Hive/Stinger: Interactive QueryNear-realtime queries in good old Hive…© Hortonworks Inc. 2012 Page 16
  • 17. Top BI Vendors Support Hive Today © Hortonworks Inc. 2012 Page 17
  • 18. Goal: Enhance Hive for BI Use CasesEnterprise Reports Parameterized Reports Dashboard / Scorecard Data Mining Visualization More SQL & Better Performance Batch Interactive © Hortonworks Inc. 2012 Page 18
  • 19. Differing Needs For Scale / Interaction Interactivity Scalability and Reliability is key are key Non- Interactive Batch Interactive •  Parameterized •  Data preparation •  Operational batch Reports •  Incremental batch processing •  Drilldown processing •  Enterprise •  Visualization •  Dashboards / Reports •  Exploration Scorecards •  Data Mining 5s – 1m 1m – 1h 1h+ Data Size © Hortonworks Inc. 2012 Page 19
  • 20. Stinger: Make Hive Best for All Needs Non- Interactive Batch Interactive •  Parameterized •  Data preparation •  Operational batch Reports •  Incremental batch processing •  Drilldown processing •  Enterprise •  Visualization •  Dashboards / Reports •  Exploration Scorecards •  Data Mining 5s – 1m 1m – 1h 1h+ Data SizeImprove Latency & Throughput Extend Deep Analytical Ability•  Query engine improvements •  Analytics functions•  New “Optimized RCFile” column store •  Improved SQL coverage•  Next-gen runtime (elim’s M/R latency) •  Continued focus on core Hive use cases © Hortonworks Inc. 2012 Page 20
  • 21. Analytic Function Use Cases• OVER – Rankings, top 10, bottom 10 – Running balances – Statistics within time windows (e.g. last 3 months, last 6 months)• LEAD / LAG – Trend identification – Sessionization – Forecasting / prediction• Distributions – Histograms and bucketing• Good for Enterprise Reports, Dashboards, Data Mining and Business Processing. © Hortonworks Inc. 2012 Page 21
  • 22. Stinger 2013 Roadmap Summary• HDP 1.x (aka Hadoop 1.x …) – Additional SQL Types – SQL Analytic Functions (OVER, Subqueries in WHERE, etc.) – Modern Optimized Column Store (ORC file) – Hive Query Enhancements – Startup time, star joins, optimize M/R DAGs, vectorization, etc.• HDP 2.x (aka Hadoop 2.x …) – Features in HDP 1.3 & 1.4 – Next-gen runtime that eliminates startup time – Persistent function registry – Other features © Hortonworks Inc. 2012 Page 22
  • 23. Questions?© Hortonworks Inc. 2012 Page 23