SlideShare a Scribd company logo
Toronto Hadoop User Group
February 20, 2013

Apache HCatalog Overview
& Next-gen Hive
Adam Muise




© Hortonworks Inc. 2012     Page 1
HCatalog




© Hortonworks Inc. 2012   Page 2
Apache HCatalog
• Incubator Project at Apache.org
• Good adoption
• Will likely merge with Hive project as it adds
  important functionality to metastore
• Allows for a “schema-on-read” approach to Big Data
  in HDFS
• Seat of a lot of innovation in Data Management
• Used by platform partners to enhance Hadoop
  integration
• Will likely be used to enhance existing Data
  Management products in the Enterprise & create new
  products

     © Hortonworks Inc. 2012                       Page 3
Great Tooling Options
                             MapReduce
                             •  Early adopters
                             •  Non-relational algorithms
                             •  Performance sensitive applications

                             Pig
                             •  ETL
                             •  Data modeling
                             •  Iterative algorithms

                             Hive
                             •  Analysis
                             •  Connectors to BI tools


              Strength: Right tool for right application
               Weakness: Hard to share their data
   © Hortonworks Inc. 2012                                           Page 4
HCatalog Changes the Game
Apache HCatalog provides flexible metadata
services across tools and external access
 •  Consistency of metadata and data models across the
    Enterprise (MapReduce, Pig, Hbase, Hive, External Systems)
 •  Accessibility: share data as tables in and out of HDFS
 •  Availability: enables flexible, thin-client access via REST API




                                  HCatalog                        Shared table
                                                                  and schema
                                                                  management
   •  Raw Hadoop data                        Table access         opens the
   •  Inconsistent, unknown                  Aligned metadata     platform
   •  Tool specific access                   REST API



        © Hortonworks Inc. 2012                                             Page 5
Options == Complexity
 Feature                        MapReduce         Pig                   Hive
 Record format                  Key value pairs   Tuple                 Record
 Data model                     User defined      int, float, string,   int, float, string,
                                                  bytes, maps,          maps, structs, lists
                                                  tuples, bags
 Schema                         Encoded in app    Declared in script    Read from
                                                  or read by loader     metadata
 Data location                  Encoded in app    Declared in script    Read from
                                                                        metadata
 Data format                    Encoded in app    Declared in script    Read from
                                                                        metadata


•    Pig and MR users need to know a lot to write their apps
•    When data schema, location, or format change Pig and MR apps must be
     rewritten, retested, and redeployed
•    Hive users have to load data from Pig/MR users to have access to it
           © Hortonworks 2012
                                                                                       Page 6
Hcatalog == Simple, Consistent
 Feature                         MapReduce +            Pig + HCatalog        Hive
                                 HCatalog
 Record format                   Record                 Tuple                 Record
 Data model                      int, float, string,    int, float, string,   int, float, string,
                                 maps, structs, lists   bytes, maps,          maps, structs, lists
                                                        tuples, bags
 Schema                          Read from              Read from             Read from
                                 metadata               metadata              metadata
 Data location                   Read from              Read from             Read from
                                 metadata               metadata              metadata
 Data format                     Read from              Read from             Read from
                                 metadata               metadata              metadata

•    Pig/MR users can read schema from metadata
•    Pig/MR users are insulated from schema, location, and format changes
•    All users have access to other users’ data as soon as it is committed
            © Hortonworks 2012
                                                                                            Page 7
Hadoop Ecosystem

                                   Hive                       Pig                 MapReduce
                                  (SQL)                   (scripting)               (Java)




                    Interface:             Interface:      Interface:
                       SQL                   SerDe        Load/Store

                                           DML

                                             Input/         Input/                   Input/
                                          OutputFormat   OutputFormat             OutputFormat
                   DDL




                      metastore                          dn1   dn2      dn3   .       .    .
                 - tables
                 - partitions
                 - files                                   .     .        .    .       .    .
                 - types

                                                          .              .    .       .   dnN


                                                                         HDFS



   © Hortonworks Inc. 2012
Opening up Metadata to MR & Pig

                                   Hive             Pig                 MapReduce
                                  (SQL)         (scripting)               (Java)




                                                       HCat Metadata layer
                                Interface:      Interface:
                                   SQL        HCatLoad/Store


                                                      Interface:
                                                        SerDe

                                                     HCatInput/OutputFormat




                                  metastore    dn1   dn2      dn3   .     .    .
                             - tables
                             - partitions
                             - files             .      .       .    .     .    .
                             - types

                                                .              .    .     .   dnN


                                                               HDFS



   © Hortonworks Inc. 2012
Pig Example
Assume you want to count how many time each of your users went to each of your URLs


raw     = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd    = group botless by (url, user);
cntd    = foreach grpd generate flatten(url, user), COUNT(botless);
               No need to know            No need to
store cntd into '/data/counted/20120530';
                                                    file location                     declare schema
Using HCatalog:


raw     = load 'rawevents' using HCatLoader();
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
grpd    = group botless by (url, user);
cntd    = foreach grpd generate flatten(url, user), COUNT(botless);
store cntd into 'counted' using HCatStorer();
                                                                                                       Partition filter




                       © Hortonworks 2012
                                                                                                                 Page 10
Working with HCatalog in MapReduce
Setting input:             table to read from
HCatInputFormat.setInput(job,
  InputJobInfo.create(dbname, tableName, filter));

                    database to read      specify which
Setting output:
                         from            partitions to read
HCatOutputFormat.setOutput(job,
  OutputJobInfo.create(dbname, tableName, partitionSpec));

                                                 specify which
Obtaining schema:                               partition to write
schema = HCatInputFormat.getOutputSchema();
                                    access fields by
Key is unused, Value is HCatRecord:     name
String url = value.get("url", schema);
output.set("cnt", schema, cnt);



       © Hortonworks 2012
                                                                     Page 11
Managing Metadata
•  If you are a Hive user, you can use your Hive metastore with no modifications

•  If not, you can use the HCatalog command line tool to issue Hive DDL (Data
   Definition Language) commands:
> /usr/bin/hcat -e ”create table rawevents (url string,
user string) partitioned by (ds string);";

•  Starting in Pig 0.11, you will be able to issue DDL commands from Pig




        © Hortonworks 2012
                                                                           Page 12
Templeton - REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop




                                Get a list of all tables in the default database:

                                        Create new table “rawevents”
                                                                                    Hadoop/
                                                                                    HCatalog
                                       Describe table “rawevents”




           © Hortonworks 2012
                                                                                               Page 13
HCatalog is in Hortonworks Data Platform…

  OPERATIONAL	
  SERVICES	
                                                DATA	
  
                                                                         SERVICES	
  

                  AMBARI	
                  FLUME	
                  PIG	
                HIVE	
  
                                                                                                                    HBASE	
  
                                            SQOOP	
                            HCATALOG	
  
                   OOZIE	
  


                                                   WEBHDFS	
                                  MAP	
  REDUCE	
  
   HADOOP	
  CORE	
  
                                                         HDFS	
                               YARN	
  (in	
  2.0)	
  


                                                        Enterprise Readiness
   PLATFORM	
  SERVICES	
                               High Availability, Disaster Recovery, Snapshots, Security, etc…




                                                                                                HORTONWORKS	
  	
  
                                                                                       DATA	
  PLATFORM	
  (HDP)	
  

         OS	
                   Cloud	
                                  VM	
                                      Appliance	
  


      © Hortonworks Inc. 2012                                                                                                      Page 14
Key 2013 “Enterprise Hadoop” Initiatives

                                                                                         Invest In:
                            Hive / “Stinger”
                                Interactive Query
                                                                                      – Platform Services
   Ambari                                                             HBase               – DR, Snapshot, …
Manage & Operate                                                     Online Data
                           OPERATIONAL	
              DATA	
  
                             SERVICES	
             SERVICES	
  



                                         HADOOP	
  CORE	
  
                                                                                      – Data Services
                                      PLATFORM	
  SERVICES	
                              – In support of Refine,
   “Knox”                             HORTONWORKS	
  	
               “Herd”                Explore, Enrich
 Secure Access
                                   DATA	
  PLATFORM	
  (HDP)	
     Data Integration


                                                                                      – Operational Services
                               “Continuum”                                                – Manageability,
                                   Biz Continuity
                                                                                            Security, …




            © Hortonworks Inc. 2012                                                                       Page 15
Hive/Stinger: Interactive Query
Near-realtime queries in good old Hive…




© Hortonworks Inc. 2012                   Page 16
Top BI Vendors Support Hive Today




   © Hortonworks Inc. 2012          Page 17
Goal: Enhance Hive for BI Use Cases
Enterprise Reports                                                       Parameterized Reports



                                         Dashboard / Scorecard




   Data Mining                                                               Visualization




                                             More SQL
                                                 &
                                        Better Performance

                                Batch                      Interactive

      © Hortonworks Inc. 2012                                                                Page 18
Differing Needs For Scale / Interaction
       Interactivity               Scalability and Reliability
          is key                             are key

                                    Non-
         Interactive                                       Batch
                                 Interactive
   •  Parameterized           •  Data preparation    •  Operational batch
      Reports                 •  Incremental batch      processing
   •  Drilldown                  processing          •  Enterprise
   •  Visualization           •  Dashboards /           Reports
   •  Exploration                Scorecards          •  Data Mining




              5s – 1m              1m – 1h                   1h+

                                   Data Size


    © Hortonworks Inc. 2012                                                 Page 19
Stinger: Make Hive Best for All Needs

                                             Non-
                  Interactive                                       Batch
                                          Interactive
             •  Parameterized          •  Data preparation    •  Operational batch
                Reports                •  Incremental batch      processing
             •  Drilldown                 processing          •  Enterprise
             •  Visualization          •  Dashboards /           Reports
             •  Exploration               Scorecards          •  Data Mining




                       5s – 1m              1m – 1h                   1h+

                                            Data Size

Improve Latency & Throughput                     Extend Deep Analytical Ability
•  Query engine improvements                     •  Analytics functions
•  New “Optimized RCFile” column store           •  Improved SQL coverage
•  Next-gen runtime (elim’s M/R latency)         •  Continued focus on core Hive use cases

             © Hortonworks Inc. 2012                                                    Page 20
Analytic Function Use Cases
• OVER
  – Rankings, top 10, bottom 10
  – Running balances
  – Statistics within time windows (e.g. last 3 months, last 6 months)
• LEAD / LAG
  – Trend identification
  – Sessionization
  – Forecasting / prediction
• Distributions
  – Histograms and bucketing
• Good for Enterprise Reports, Dashboards, Data
  Mining and Business Processing.


     © Hortonworks Inc. 2012                                        Page 21
Stinger 2013 Roadmap Summary
• HDP 1.x (aka Hadoop 1.x …)
  – Additional SQL Types
  – SQL Analytic Functions (OVER, Subqueries in WHERE, etc.)
  – Modern Optimized Column Store (ORC file)
  – Hive Query Enhancements
       – Startup time, star joins, optimize M/R DAGs, vectorization, etc.


• HDP 2.x (aka Hadoop 2.x …)
  – Features in HDP 1.3 & 1.4
  – Next-gen runtime that eliminates startup time
  – Persistent function registry
  – Other features



     © Hortonworks Inc. 2012                                                Page 22
Questions?




© Hortonworks Inc. 2012   Page 23

More Related Content

What's hot

2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
Adam Muise
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
Edureka!
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuning
Anil Reddy
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
sunera pathan
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
rightsize
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Ran Ziv
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
rebeccatho
 
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at ContinuentKeynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Continuent
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Stanley Wang
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
David Kaiser
 
Hadoop
HadoopHadoop
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hari Shankar Sreekumar
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
Milind Bhandarkar
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
Phil Young
 
Hadoop
Hadoop Hadoop
Hadoop
ABHIJEET RAJ
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw
 

What's hot (20)

2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuning
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at ContinuentKeynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at Continuent
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Hadoop
Hadoop Hadoop
Hadoop
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 

Viewers also liked

Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert Bifet
J On The Beach
 
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
Adam Muise
 
Next Generation Hadoop Introduction
Next Generation Hadoop IntroductionNext Generation Hadoop Introduction
Next Generation Hadoop Introduction
Adam Muise
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
Albert Bifet
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
Adam Muise
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015
Adam Muise
 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
Adam Muise
 

Viewers also liked (9)

Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert Bifet
 
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
 
Next Generation Hadoop Introduction
Next Generation Hadoop IntroductionNext Generation Hadoop Introduction
Next Generation Hadoop Introduction
 
Good touch bad touch ppt
Good touch bad touch pptGood touch bad touch ppt
Good touch bad touch ppt
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015
 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
 
Good touch bad touch ppt
Good touch bad touch pptGood touch bad touch ppt
Good touch bad touch ppt
 

Similar to 2013 feb 20_thug_h_catalog

HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917Chicago Hadoop Users Group
 
H cat berlinbuzzwords2012
H cat berlinbuzzwords2012H cat berlinbuzzwords2012
H cat berlinbuzzwords2012
Hortonworks
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Richard McDougall
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Big Data Spain
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
t3rmin4t0r
 
Nov 2011 HUG: HParser
Nov 2011 HUG: HParserNov 2011 HUG: HParser
Nov 2011 HUG: HParser
Yahoo Developer Network
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
James Chen
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
Ovidiu Dimulescu
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
lucenerevolution
 
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
Cloudera, Inc.
 
SQL-H a new way to enable SQL analytics
SQL-H a new way to enable SQL analyticsSQL-H a new way to enable SQL analytics
SQL-H a new way to enable SQL analyticsDataWorks Summit
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
Richard McDougall
 
Jan 2012 HUG: HCatalog
Jan 2012 HUG: HCatalogJan 2012 HUG: HCatalog
Jan 2012 HUG: HCatalog
Yahoo Developer Network
 
PhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond BatchPhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond Batch
boorad
 
Strata feb2013
Strata feb2013Strata feb2013
Strata feb2013
alanfgates
 
Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Hortonworks
 
Enterprise linked data clouds
Enterprise linked data cloudsEnterprise linked data clouds
Enterprise linked data clouds
damienjoyce
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Future of HCatalog
Future of HCatalogFuture of HCatalog
Future of HCatalog
DataWorks Summit
 

Similar to 2013 feb 20_thug_h_catalog (20)

HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917
 
H cat berlinbuzzwords2012
H cat berlinbuzzwords2012H cat berlinbuzzwords2012
H cat berlinbuzzwords2012
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Nov 2011 HUG: HParser
Nov 2011 HUG: HParserNov 2011 HUG: HParser
Nov 2011 HUG: HParser
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
 
SQL-H a new way to enable SQL analytics
SQL-H a new way to enable SQL analyticsSQL-H a new way to enable SQL analytics
SQL-H a new way to enable SQL analytics
 
Dancing with the Elephant
Dancing with the ElephantDancing with the Elephant
Dancing with the Elephant
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Jan 2012 HUG: HCatalog
Jan 2012 HUG: HCatalogJan 2012 HUG: HCatalog
Jan 2012 HUG: HCatalog
 
PhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond BatchPhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond Batch
 
Strata feb2013
Strata feb2013Strata feb2013
Strata feb2013
 
Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012
 
Enterprise linked data clouds
Enterprise linked data cloudsEnterprise linked data clouds
Enterprise linked data clouds
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Future of HCatalog
Future of HCatalogFuture of HCatalog
Future of HCatalog
 

More from Adam Muise

Hadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopHadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of Hadoop
Adam Muise
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
Adam Muise
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_security
Adam Muise
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
Adam Muise
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
Adam Muise
 
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop1012014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
Adam Muise
 
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
Adam Muise
 
2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda
Adam Muise
 
2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop
Adam Muise
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
Adam Muise
 
What is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACWhat is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMAC
Adam Muise
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013
Adam Muise
 
2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points
Adam Muise
 
KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012Adam Muise
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotech
Adam Muise
 
hadoop 101 aug 21 2012 tohug
 hadoop 101 aug 21 2012 tohug hadoop 101 aug 21 2012 tohug
hadoop 101 aug 21 2012 tohug
Adam Muise
 

More from Adam Muise (16)

Hadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopHadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of Hadoop
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_security
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
 
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop1012014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
 
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
 
2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda
 
2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
What is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACWhat is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMAC
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013
 
2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points
 
KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotech
 
hadoop 101 aug 21 2012 tohug
 hadoop 101 aug 21 2012 tohug hadoop 101 aug 21 2012 tohug
hadoop 101 aug 21 2012 tohug
 

Recently uploaded

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 

Recently uploaded (20)

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 

2013 feb 20_thug_h_catalog

  • 1. Toronto Hadoop User Group February 20, 2013 Apache HCatalog Overview & Next-gen Hive Adam Muise © Hortonworks Inc. 2012 Page 1
  • 3. Apache HCatalog • Incubator Project at Apache.org • Good adoption • Will likely merge with Hive project as it adds important functionality to metastore • Allows for a “schema-on-read” approach to Big Data in HDFS • Seat of a lot of innovation in Data Management • Used by platform partners to enhance Hadoop integration • Will likely be used to enhance existing Data Management products in the Enterprise & create new products © Hortonworks Inc. 2012 Page 3
  • 4. Great Tooling Options MapReduce •  Early adopters •  Non-relational algorithms •  Performance sensitive applications Pig •  ETL •  Data modeling •  Iterative algorithms Hive •  Analysis •  Connectors to BI tools Strength: Right tool for right application Weakness: Hard to share their data © Hortonworks Inc. 2012 Page 4
  • 5. HCatalog Changes the Game Apache HCatalog provides flexible metadata services across tools and external access •  Consistency of metadata and data models across the Enterprise (MapReduce, Pig, Hbase, Hive, External Systems) •  Accessibility: share data as tables in and out of HDFS •  Availability: enables flexible, thin-client access via REST API HCatalog Shared table and schema management •  Raw Hadoop data Table access opens the •  Inconsistent, unknown Aligned metadata platform •  Tool specific access REST API © Hortonworks Inc. 2012 Page 5
  • 6. Options == Complexity Feature MapReduce Pig Hive Record format Key value pairs Tuple Record Data model User defined int, float, string, int, float, string, bytes, maps, maps, structs, lists tuples, bags Schema Encoded in app Declared in script Read from or read by loader metadata Data location Encoded in app Declared in script Read from metadata Data format Encoded in app Declared in script Read from metadata •  Pig and MR users need to know a lot to write their apps •  When data schema, location, or format change Pig and MR apps must be rewritten, retested, and redeployed •  Hive users have to load data from Pig/MR users to have access to it © Hortonworks 2012 Page 6
  • 7. Hcatalog == Simple, Consistent Feature MapReduce + Pig + HCatalog Hive HCatalog Record format Record Tuple Record Data model int, float, string, int, float, string, int, float, string, maps, structs, lists bytes, maps, maps, structs, lists tuples, bags Schema Read from Read from Read from metadata metadata metadata Data location Read from Read from Read from metadata metadata metadata Data format Read from Read from Read from metadata metadata metadata •  Pig/MR users can read schema from metadata •  Pig/MR users are insulated from schema, location, and format changes •  All users have access to other users’ data as soon as it is committed © Hortonworks 2012 Page 7
  • 8. Hadoop Ecosystem Hive Pig MapReduce (SQL) (scripting) (Java) Interface: Interface: Interface: SQL SerDe Load/Store DML Input/ Input/ Input/ OutputFormat OutputFormat OutputFormat DDL metastore dn1 dn2 dn3 . . . - tables - partitions - files . . . . . . - types . . . . dnN HDFS © Hortonworks Inc. 2012
  • 9. Opening up Metadata to MR & Pig Hive Pig MapReduce (SQL) (scripting) (Java) HCat Metadata layer Interface: Interface: SQL HCatLoad/Store Interface: SerDe HCatInput/OutputFormat metastore dn1 dn2 dn3 . . . - tables - partitions - files . . . . . . - types . . . . dnN HDFS © Hortonworks Inc. 2012
  • 10. Pig Example Assume you want to count how many time each of your users went to each of your URLs raw = load '/data/rawevents/20120530' as (url, user); botless = filter raw by myudfs.NotABot(user); grpd = group botless by (url, user); cntd = foreach grpd generate flatten(url, user), COUNT(botless); No need to know No need to store cntd into '/data/counted/20120530'; file location declare schema Using HCatalog: raw = load 'rawevents' using HCatLoader(); botless = filter raw by myudfs.NotABot(user) and ds == '20120530'; grpd = group botless by (url, user); cntd = foreach grpd generate flatten(url, user), COUNT(botless); store cntd into 'counted' using HCatStorer(); Partition filter © Hortonworks 2012 Page 10
  • 11. Working with HCatalog in MapReduce Setting input: table to read from HCatInputFormat.setInput(job, InputJobInfo.create(dbname, tableName, filter)); database to read specify which Setting output: from partitions to read HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbname, tableName, partitionSpec)); specify which Obtaining schema: partition to write schema = HCatInputFormat.getOutputSchema(); access fields by Key is unused, Value is HCatRecord: name String url = value.get("url", schema); output.set("cnt", schema, cnt); © Hortonworks 2012 Page 11
  • 12. Managing Metadata •  If you are a Hive user, you can use your Hive metastore with no modifications •  If not, you can use the HCatalog command line tool to issue Hive DDL (Data Definition Language) commands: > /usr/bin/hcat -e ”create table rawevents (url string, user string) partitioned by (ds string);"; •  Starting in Pig 0.11, you will be able to issue DDL commands from Pig © Hortonworks 2012 Page 12
  • 13. Templeton - REST API •  REST endpoints: databases, tables, partitions, columns, table properties •  PUT to create/update, GET to list or describe, DELETE to drop Get a list of all tables in the default database: Create new table “rawevents” Hadoop/ HCatalog Describe table “rawevents” © Hortonworks 2012 Page 13
  • 14. HCatalog is in Hortonworks Data Platform… OPERATIONAL  SERVICES   DATA   SERVICES   AMBARI   FLUME   PIG   HIVE   HBASE   SQOOP   HCATALOG   OOZIE   WEBHDFS   MAP  REDUCE   HADOOP  CORE   HDFS   YARN  (in  2.0)   Enterprise Readiness PLATFORM  SERVICES   High Availability, Disaster Recovery, Snapshots, Security, etc… HORTONWORKS     DATA  PLATFORM  (HDP)   OS   Cloud   VM   Appliance   © Hortonworks Inc. 2012 Page 14
  • 15. Key 2013 “Enterprise Hadoop” Initiatives Invest In: Hive / “Stinger” Interactive Query – Platform Services Ambari HBase – DR, Snapshot, … Manage & Operate Online Data OPERATIONAL   DATA   SERVICES   SERVICES   HADOOP  CORE   – Data Services PLATFORM  SERVICES   – In support of Refine, “Knox” HORTONWORKS     “Herd” Explore, Enrich Secure Access DATA  PLATFORM  (HDP)   Data Integration – Operational Services “Continuum” – Manageability, Biz Continuity Security, … © Hortonworks Inc. 2012 Page 15
  • 16. Hive/Stinger: Interactive Query Near-realtime queries in good old Hive… © Hortonworks Inc. 2012 Page 16
  • 17. Top BI Vendors Support Hive Today © Hortonworks Inc. 2012 Page 17
  • 18. Goal: Enhance Hive for BI Use Cases Enterprise Reports Parameterized Reports Dashboard / Scorecard Data Mining Visualization More SQL & Better Performance Batch Interactive © Hortonworks Inc. 2012 Page 18
  • 19. Differing Needs For Scale / Interaction Interactivity Scalability and Reliability is key are key Non- Interactive Batch Interactive •  Parameterized •  Data preparation •  Operational batch Reports •  Incremental batch processing •  Drilldown processing •  Enterprise •  Visualization •  Dashboards / Reports •  Exploration Scorecards •  Data Mining 5s – 1m 1m – 1h 1h+ Data Size © Hortonworks Inc. 2012 Page 19
  • 20. Stinger: Make Hive Best for All Needs Non- Interactive Batch Interactive •  Parameterized •  Data preparation •  Operational batch Reports •  Incremental batch processing •  Drilldown processing •  Enterprise •  Visualization •  Dashboards / Reports •  Exploration Scorecards •  Data Mining 5s – 1m 1m – 1h 1h+ Data Size Improve Latency & Throughput Extend Deep Analytical Ability •  Query engine improvements •  Analytics functions •  New “Optimized RCFile” column store •  Improved SQL coverage •  Next-gen runtime (elim’s M/R latency) •  Continued focus on core Hive use cases © Hortonworks Inc. 2012 Page 20
  • 21. Analytic Function Use Cases • OVER – Rankings, top 10, bottom 10 – Running balances – Statistics within time windows (e.g. last 3 months, last 6 months) • LEAD / LAG – Trend identification – Sessionization – Forecasting / prediction • Distributions – Histograms and bucketing • Good for Enterprise Reports, Dashboards, Data Mining and Business Processing. © Hortonworks Inc. 2012 Page 21
  • 22. Stinger 2013 Roadmap Summary • HDP 1.x (aka Hadoop 1.x …) – Additional SQL Types – SQL Analytic Functions (OVER, Subqueries in WHERE, etc.) – Modern Optimized Column Store (ORC file) – Hive Query Enhancements – Startup time, star joins, optimize M/R DAGs, vectorization, etc. • HDP 2.x (aka Hadoop 2.x …) – Features in HDP 1.3 & 1.4 – Next-gen runtime that eliminates startup time – Persistent function registry – Other features © Hortonworks Inc. 2012 Page 22