Toronto Hadoop User Group
February 20, 2013

Apache HCatalog Overview
& Next-gen Hive
Adam Muise




© Hortonworks Inc. 2012     Page 1
HCatalog




Apache HCatalog
• Incubator Project at Apache.org
• Good adoption
• Will likely merge with Hive project as it adds
  important functionality to metastore
• Allows for a “schema-on-read” approach to Big Data
  in HDFS
• Seat of a lot of innovation in Data Management
• Used by platform partners to enhance Hadoop
  integration
• Will likely be used to enhance existing Data
  Management products in the Enterprise & create new
  products

Great Tooling Options
MapReduce
•  Early adopters
•  Non-relational algorithms
•  Performance sensitive applications

Pig
•  ETL
•  Data modeling
•  Iterative algorithms

Hive
•  Analysis
•  Connectors to BI tools

Strength: Right tool for right application
Weakness: Hard to share their data
HCatalog Changes the Game
Apache HCatalog provides flexible metadata
services across tools and external access
 •  Consistency of metadata and data models across the
    Enterprise (MapReduce, Pig, HBase, Hive, External Systems)
 •  Accessibility: share data as tables in and out of HDFS
 •  Availability: enables flexible, thin-client access via REST API




[Diagram: raw Hadoop data (inconsistent, unknown, tool-specific access) passes through HCatalog (table access, aligned metadata, REST API). Shared table and schema management opens the platform.]



Options == Complexity
 Feature         MapReduce         Pig                                             Hive
 Record format   Key value pairs   Tuple                                           Record
 Data model      User defined      int, float, string, bytes, maps, tuples, bags   int, float, string, maps, structs, lists
 Schema          Encoded in app    Declared in script or read by loader            Read from metadata
 Data location   Encoded in app    Declared in script                              Read from metadata
 Data format     Encoded in app    Declared in script                              Read from metadata


•    Pig and MR users need to know a lot to write their apps
•    When the data schema, location, or format changes, Pig and MR apps must be
     rewritten, retested, and redeployed
•    Hive users have to load data from Pig/MR users to have access to it
HCatalog == Simple, Consistent
 Feature         MapReduce + HCatalog                       Pig + HCatalog                                  Hive
 Record format   Record                                     Tuple                                           Record
 Data model      int, float, string, maps, structs, lists   int, float, string, bytes, maps, tuples, bags   int, float, string, maps, structs, lists
 Schema          Read from metadata                         Read from metadata                              Read from metadata
 Data location   Read from metadata                         Read from metadata                              Read from metadata
 Data format     Read from metadata                         Read from metadata                              Read from metadata

•    Pig/MR users can read schema from metadata
•    Pig/MR users are insulated from schema, location, and format changes
•    All users have access to other users’ data as soon as it is committed
Hadoop Ecosystem

[Diagram: Hive (SQL), Pig (scripting), MapReduce (Java). Interfaces: SQL, SerDe, Load/Store. Each tool reaches HDFS (datanodes dn1 ... dnN) through its own Input/OutputFormat. DDL goes to the metastore (tables, partitions, files, types); DML goes to the data.]



Opening up Metadata to MR & Pig

[Diagram: Hive (SQL), Pig (scripting), MapReduce (Java). An HCat metadata layer with the interfaces SQL, HCatLoad/Store, SerDe and HCatInput/OutputFormat sits above the metastore (tables, partitions, files, types) and HDFS (datanodes dn1 ... dnN).]



Pig Example
Assume you want to count how many times each of your users went to each of your URLs:

raw     = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd    = group botless by (url, user);
-- after a group, the (url, user) key is exposed as the field "group"
cntd    = foreach grpd generate flatten(group), COUNT(botless);
store cntd into '/data/counted/20120530';

Using HCatalog:

-- no need to know the file location, no need to declare the schema
raw     = load 'rawevents' using HCatLoader();
-- ds == '20120530' is a partition filter
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
grpd    = group botless by (url, user);
cntd    = foreach grpd generate flatten(group), COUNT(botless);
store cntd into 'counted' using HCatStorer();




Working with HCatalog in MapReduce
Setting input:
// dbname: database to read from; tableName: table to read from;
// filter: specify which partitions to read
HCatInputFormat.setInput(job,
  InputJobInfo.create(dbname, tableName, filter));

Setting output:
// partitionSpec: specify which partition to write
HCatOutputFormat.setOutput(job,
  OutputJobInfo.create(dbname, tableName, partitionSpec));

Obtaining schema:
schema = HCatInputFormat.getOutputSchema();

Key is unused, Value is HCatRecord (access fields by name):
// get() returns Object, so cast to the column's type
String url = (String) value.get("url", schema);
output.set("cnt", schema, cnt);



Managing Metadata
•  If you are a Hive user, you can use your Hive metastore with no modifications

•  If not, you can use the HCatalog command line tool to issue Hive DDL (Data
   Definition Language) commands:
> /usr/bin/hcat -e "create table rawevents (url string, user string) partitioned by (ds string);"

•  Starting in Pig 0.11, you will be able to issue DDL commands from Pig




Templeton - REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop




[Diagram: example REST calls against Hadoop/HCatalog]
•  Get a list of all tables in the default database
•  Create new table “rawevents”
•  Describe table “rawevents”
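
The three calls named above can be sketched as plain request lines. This is a minimal illustration, not taken from the slides: it assumes the DDL resources live under /templeton/v1/ddl, that the caller is identified with a user.name query parameter, and that the create-table body uses "columns" and "partitionedBy" keys. The host, port, user name and the class TempletonRequests are hypothetical. The sketch only builds and prints the requests; it does not open a connection.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TempletonRequests {
    // Assumption: Templeton's DDL resources live under this path.
    static final String DDL_ROOT = "/templeton/v1/ddl";

    // Encode one URL component (form-style encoding; this overload needs Java 10+).
    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // GET this URI to list the tables in a database.
    static URI tables(String base, String db, String user) {
        return URI.create(base + DDL_ROOT + "/database/" + enc(db)
                + "/table?user.name=" + enc(user));
    }

    // PUT to create, GET to describe, DELETE to drop a single table.
    static URI table(String base, String db, String table, String user) {
        return URI.create(base + DDL_ROOT + "/database/" + enc(db)
                + "/table/" + enc(table) + "?user.name=" + enc(user));
    }

    // Assumed JSON body for the PUT that creates rawevents
    // (columns url and user, partitioned by ds).
    static String rawEventsBody() {
        return "{\"columns\":[{\"name\":\"url\",\"type\":\"string\"},"
                + "{\"name\":\"user\",\"type\":\"string\"}],"
                + "\"partitionedBy\":[{\"name\":\"ds\",\"type\":\"string\"}]}";
    }

    public static void main(String[] args) {
        String base = "http://localhost:50111"; // hypothetical host and port
        String user = "hcat";                   // hypothetical user name
        System.out.println("GET " + tables(base, "default", user));
        System.out.println("PUT " + table(base, "default", "rawevents", user));
        System.out.println("    " + rawEventsBody());
        System.out.println("GET " + table(base, "default", "rawevents", user));
    }
}
```

Any HTTP client can send these requests; building them separately keeps the URL layout easy to check against the server's documentation before anything is executed.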




HCatalog is in Hortonworks Data Platform…

[Diagram: Hortonworks Data Platform (HDP) stack]
•  Layers: Operational Services, Data Services, Hadoop Core, Platform Services
•  Components: Ambari, Flume, Pig, Hive, HBase, Sqoop, HCatalog, Oozie, WebHDFS, MapReduce, HDFS, YARN (in 2.0)
•  Platform Services (Enterprise Readiness): High Availability, Disaster Recovery, Snapshots, Security, etc…
•  Runs on: OS, Cloud, VM, Appliance
  


Key 2013 “Enterprise Hadoop” Initiatives

[Diagram: initiatives arranged around the Hortonworks Data Platform (HDP) stack of Operational Services, Data Services, Hadoop Core and Platform Services]
•  Hive / “Stinger”: Interactive Query
•  Ambari: Manage & Operate
•  HBase: Online Data
•  “Knox”: Secure Access
•  “Herd”: Data Integration
•  “Continuum”: Biz Continuity

Invest In:
  – Platform Services
       – DR, Snapshot, …
  – Data Services
       – In support of Refine, Explore, Enrich
  – Operational Services
       – Manageability, Security, …




Hive/Stinger: Interactive Query
Near-realtime queries in good old Hive…




Top BI Vendors Support Hive Today




Goal: Enhance Hive for BI Use Cases
[Diagram: Enterprise Reports, Parameterized Reports, Dashboard / Scorecard, Data Mining, Visualization; axis from Batch to Interactive]
More SQL & Better Performance

Differing Needs For Scale / Interaction
Interactive (5s – 1m)
•  Parameterized Reports
•  Drilldown
•  Visualization
•  Exploration

Non-Interactive (1m – 1h)
•  Data preparation
•  Incremental batch processing
•  Dashboards / Scorecards

Batch (1h+)
•  Operational batch processing
•  Enterprise Reports
•  Data Mining

[Diagram: the three classes are laid out along a Data Size axis, under the labels “Interactivity is key” and “Scalability and Reliability are key”]


Stinger: Make Hive Best for All Needs

[Diagram: the same Interactive (5s – 1m) / Non-Interactive (1m – 1h) / Batch (1h+) breakdown as on the previous slide, along a Data Size axis]

Improve Latency & Throughput
•  Query engine improvements
•  New “Optimized RCFile” column store
•  Next-gen runtime (eliminates M/R latency)

Extend Deep Analytical Ability
•  Analytics functions
•  Improved SQL coverage
•  Continued focus on core Hive use cases

Analytic Function Use Cases
• OVER
  – Rankings, top 10, bottom 10
  – Running balances
  – Statistics within time windows (e.g. last 3 months, last 6 months)
• LEAD / LAG
  – Trend identification
  – Sessionization
  – Forecasting / prediction
• Distributions
  – Histograms and bucketing
• Good for Enterprise Reports, Dashboards, Data
  Mining and Business Processing.
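
To make two of these use cases concrete, here is a small plain-Java emulation of what a running balance (a SUM evaluated with OVER on ordered rows) and a one-row LAG return. It is only a sketch of the semantics, not Hive code: the class WindowSketch and the sample amounts are hypothetical, and the rows are assumed to be already sorted.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSketch {
    // Running balance: each output row is the sum of all rows up to and including it.
    static List<Long> runningTotal(List<Long> amounts) {
        List<Long> out = new ArrayList<>();
        long sum = 0;
        for (long amount : amounts) {
            sum += amount;
            out.add(sum);
        }
        return out;
    }

    // LAG by one row: each output row is the previous row's value, null on the first row.
    static List<Long> lag(List<Long> values) {
        List<Long> out = new ArrayList<>();
        Long previous = null;
        for (Long value : values) {
            out.add(previous);
            previous = value;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> amounts = List.of(10L, -3L, 5L); // hypothetical, already ordered
        System.out.println(runningTotal(amounts));  // [10, 7, 12]
        System.out.println(lag(amounts));           // [null, 10, -3]
    }
}
```

Comparing each value with its LAG is the basic step behind trend identification and sessionization: the difference between neighbouring rows shows direction, or the gap that starts a new session.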


Stinger 2013 Roadmap Summary
• HDP 1.x (aka Hadoop 1.x …)
  – Additional SQL Types
  – SQL Analytic Functions (OVER, Subqueries in WHERE, etc.)
  – Modern Optimized Column Store (ORC file)
  – Hive Query Enhancements
       – Startup time, star joins, optimize M/R DAGs, vectorization, etc.


• HDP 2.x (aka Hadoop 2.x …)
  – Features in HDP 1.3 & 1.4
  – Next-gen runtime that eliminates startup time
  – Persistent function registry
  – Other features



Questions?




