Jan 2012 HUG: HCatalog

HCatalog is a table abstraction and a storage abstraction system that makes it easy for multiple tools to interact with the same underlying data. A common buzzword in the NoSQL world today is polyglot persistence, which basically comes down to picking the right tool for the job. In the Hadoop ecosystem, you have many tools that might be used for data processing: you might use Pig or Hive, your own custom MapReduce program, or that shiny new GUI-based tool that's just come out. Which one to use might depend on the user, on the type of query you're interested in, or on the type of job you want to run. From another perspective, you might want to store your data in columnar storage for efficient storage and retrieval for particular query types, or in text so that users can write data producers in scripting languages like Perl or Python, or you may want to hook up that HBase table as a data source. As an end-user, I want to use whatever data processing tool is available to me. As a data designer, I want to optimize how data is stored. As a cluster manager / data architect, I want the ability to share pieces of information across the board and move data back and forth fluidly. HCatalog's hopes and promises are the realization of all of the above.

Transcript of "Jan 2012 HUG: HCatalog"

  1. HCatalog (and friends). Sushanth Sowmyan, Committer, Apache HCatalog. sush@hortonworks.com, @khorgath. © Hortonworks Inc. 2011
  2. Let's think about data for a bit... From Wikipedia: Data ( /ˈdeɪtə/ day-tə, /ˈdætə/ da-tə, or /ˈdɑːtə/ dah-tə): qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived. Raw data, i.e. unprocessed data, refers to a collection of numbers, characters, images or other outputs from devices that collect information to convert physical quantities into symbols.
  3. So what is needed to make Data useful?
  •  Arguably, tools to convert data into information.
  •  Arguably also, knowledge about the data, so that the tools can then make use of the data in a meaningful sense, to extract information from it.
  4. So what are the characteristics of a Data Warehouse?
  •  Data is present, organized, recorded, and catalogued.
  •  Tools exist that are able to operate on the data.
  So what do tools need to be able to operate on data?
  5. Finding it (Photo credit: © dkeats on flickr)
  6. Finding it
  •  Knowing where data is.
  •  Evolve: knowing which data is where – “naming” data.
  •  Evolve: organization to support various data modeling concepts (table, partitions, columns, records).
  •  Evolve: “done” semantics, existence semantics.
  7. Reading it (Photo credit: © kylesteed on flickr)
  8. Reading it
  •  Each tool having its own “storage space”, its own private world.
  •  Evolve: abstracting away the storage mechanism and having tools sit on top of file formats and mechanisms, so that, suddenly, tools have interoperability.
  •  Evolve: having a storage abstraction that adapts to existing storage mechanisms in an easy-to-develop manner.
  9. Who are the various actors in a data ecosystem?
  •  Analyst – uses SQL (Hive) and/or JDBC-based tools.
  •  Programmer – cares about data transformation; uses Pig or M/R.
  •  Project owner – cares about the amount of resources used, data portability, data connectors.
  •  Ops – needs to manage data storage and the cluster, and to control data expiry, replication, import and export.
  10. (stealing a slide from Alan's TriHUG talk)
  11. Also: people who help the aforementioned people:
  •  Tool writer – wants abstractions to deal with variances, and wants to be able to store and retrieve relevant metadata and data, so they can focus on their users.
  •  Storage subsystem writer – wants standardization so that they can be used by the other actors.
  12. What do they all want?
  •  Need it Working – Correctness
  •  Speed, Efficiency
  •  Interoperability, Convenience
  13. Did somebody say Interoperability?
  14. Making Your Structured Data Available to the MapReduce Engine
  [Diagram: MapReduce, Pig, and Hive sitting on top of HCatalog, which sits over HDFS, HBase, and an MPP store]
  •  Users can query data with Pig, Hive, or custom MapReduce jobs
  •  Standard HDFS formats available Q1 2012
  •  HBase data by early Q2 2012
  15. HCatalog underlying architecture
  [Architecture diagram: HCatLoader/HCatStorer and HCatInputFormat/HCatOutputFormat, along with the CLI and notifications, go through the Hive MetaStore client (a generated Thrift client) to the Hive MetaStore, which is backed by an RDBMS]
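  The components on this slide map directly onto the client code a MapReduce job would use. As a rough sketch only (written against the pre-graduation org.apache.hcatalog packages; the exact setInput/setOutput signatures shifted between early releases, and the "default"/"projectA"/"projectA_cleaned" names are placeholders, not anything from the deck), wiring a job to read one registered table and write another looked roughly like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class HCatJobSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "hcat-read-write-sketch");
    job.setJarByClass(HCatJobSketch.class);

    // Read side: name a database and table instead of an HDFS path; the
    // Hive MetaStore (via the Thrift client shown on the slide) resolves
    // where and how the data is actually stored.
    HCatInputFormat.setInput(job,
        InputJobInfo.create("default", "projectA", null /* partition filter */));
    job.setInputFormatClass(HCatInputFormat.class);

    // Write side: again by table name; null means no static partition values.
    HCatOutputFormat.setOutput(job,
        OutputJobInfo.create("default", "projectA_cleaned", null));
    // Reuse the output table's declared schema for the records we emit.
    HCatSchema outputSchema = HCatOutputFormat.getTableSchema(job);
    HCatOutputFormat.setSchema(job, outputSchema);
    job.setOutputFormatClass(HCatOutputFormat.class);

    job.setOutputKeyClass(WritableComparable.class);
    job.setOutputValueClass(DefaultHCatRecord.class);
    // job.setMapperClass(...) / job.setReducerClass(...) as in any MR job.

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

  The point of the diagram is that none of this code names a path or a file format; those are resolved underneath through the MetaStore client.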
  16. Problem: Need to Know Where Data Is
  [Diagram: Pig, Hive, and MapReduce each pointing directly at raw storage locations, e.g. hdfs:///user/wilber/projectA and paths under the Hive warehouse]
  17. Solution: Register Through HCatalog
  [Diagram: Pig, Hive, and MapReduce all referring to registered tables (ProjectA, Foul1) through HCatalog, which sits between them and storage]
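  To make the before/after concrete, here is a minimal, hypothetical contrast in Java: the same job input wired first to a physical HDFS path (as on the previous slide) and then to a table registered in HCatalog. Only the path is taken from the deck; the database and table names are placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class WhereIsTheData {

  // Before: every job (and every tool) has to know the physical layout,
  // and they all have to agree on it.
  static void wirePathDirectly(Job job) throws Exception {
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path("hdfs:///user/wilber/projectA"));
  }

  // After: the job names a registered table; HCatalog resolves the location.
  static void wireThroughHCatalog(Job job) throws Exception {
    HCatInputFormat.setInput(job,
        InputJobInfo.create("default", "projectA", null /* partition filter */));
    job.setInputFormatClass(HCatInputFormat.class);
  }
}
```

  If the data is later moved or reorganized, only the registration changes; jobs written the second way keep running unmodified.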
  18. Problem: Data in a variety of formats
  •  Data files may be organized in different formats
  •  Data files may contain different formats in different partitions
  [Diagram: Storage (HDFS, HBase, etc.)]
  19. Solution: HCat provides common abstraction
  [Diagram: a Hadoop application reading through HCatalog, which sits over storage]
  •  Registered Data w/ Schema
  •  HCat normalizes data to application
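  What "normalizes data to application" means in practice is that a mapper receives typed HCatRecord objects rather than raw bytes, so the same code runs whether the partition behind it is delimited text, RCFile, or something else. A minimal sketch, with an invented column layout (column 0 assumed to be a string):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hcatalog.data.HCatRecord;

// HCatInputFormat hands the mapper typed HCatRecords; the underlying file
// format of each partition (text, RCFile, ...) is resolved by HCatalog,
// so this mapper never sees raw bytes or a SerDe.
public class CountByFirstColumn
    extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(WritableComparable key, HCatRecord value, Context ctx)
      throws IOException, InterruptedException {
    // Positional access to the record; with the table's HCatSchema in hand,
    // fields can also be looked up by name. Column 0 is assumed to be a string.
    String group = (String) value.get(0);
    ctx.write(new Text(group), ONE);
  }
}
```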
  20. Getting Involved
  Incubator site: http://incubator.apache.org/hcatalog
  User list: hcatalog-user@incubator.apache.org
  Dev list: hcatalog-dev@incubator.apache.org
  21. TODO
  •  HCATALOG-8: HCatalog needs a logo
  •  HBase integration, trying to nail down a better table metaphor
  •  Hive integration – interoperability between the notion of StorageDriver and StorageHandler, project dependency management
  •  HCATALOG-182: Improve the “and friends” bit.
  22. Waitaminnit... what was that about “friends”?
  23. Templeton: A Webservices API for Hadoop (Photo credit: © PKMousie on flickr)
  24. Templeton: ISV Front-door for Hadoop
  •  Insulation from interface changes release to release
  •  Opens the door to languages other than Java
  •  Thin clients through webservices vs forced fat-clients in gateway
  •  Still prototyping! But see a common need.
  25. Templeton Specific Support
  •  Move data directly into/out of HDFS through WebHDFS
  •  Webservice calls to HCatalog
     –  Register table relationships for data (e.g., createTable, createDatabase)
     –  Adjust tables (e.g., AlterTable)
     –  Look at statistics (e.g., ShowTable)
  •  Webservice calls to start work
     –  MapReduce, Pig, Hive
     –  Poll for job status
     –  Notification URL when job completes (optional)
  •  Stateless server
     –  Horizontally scale for load
     –  Configurable for HA
     –  Currently requires ZooKeeper to track job status info
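  Since Templeton was explicitly still a prototype at this point, the snippet below is only an illustration of the shape of such a webservice call, not its finalized API: a plain-Java HTTP GET against an assumed status endpoint (host, port, and URL path are placeholders).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class TempletonStatusCheck {
  public static void main(String[] args) throws Exception {
    // Hypothetical endpoint: a Templeton-style server exposing a status
    // resource over HTTP; host, port, and path are placeholders.
    URL url = new URL("http://templeton-host:50111/templeton/v1/status");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("Accept", "application/json");

    System.out.println("HTTP " + conn.getResponseCode());
    BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // expected to be a small JSON document
      }
    } finally {
      in.close();
      conn.disconnect();
    }
  }
}
```

  Any language that can speak HTTP can make the same call, which is the thin-client argument from the previous slide.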
  26. ANY QUESTIONS?
