Jan 2012 HUG: HCatalog

HCatalog is a table abstraction and a storage abstraction system that makes it easy for multiple tools to interact with the same underlying data. A common buzzword in the NoSQL world today is polyglot persistence, which comes down to picking the right tool for the job. In the Hadoop ecosystem, you have many tools that might be used for data processing: you might use Pig or Hive, or your own custom MapReduce program, or that shiny new GUI-based tool that has just come out. Which one to use might depend on the user, on the type of query you're interested in, or on the type of job you want to run. From another perspective, you might want to store your data in columnar storage for efficient storage and retrieval for particular query types, or in text so that users can write data producers in scripting languages like Perl or Python, or you may want to hook up that HBase table as a data source.

As an end-user, I want to use whatever data processing tool is available to me. As a data designer, I want to optimize how data is stored. As a cluster manager / data architect, I want the ability to share pieces of information across the board, and to move data back and forth fluidly. HCatalog's hopes and promises are the realization of all of the above.

Presentation Transcript

  • HCatalog (and friends)
    Sushanth Sowmyan, Committer, Apache HCatalog
    sush@hortonworks.com / @khorgath
    © Hortonworks Inc. 2011
  • Let's think about data for a bit...
    From Wikipedia: Data (/ˈdeɪtə/ day-tə, /ˈdætə/ da-tə, or /ˈdɑːtə/ dah-tə): qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived. Raw data, i.e. unprocessed data, refers to a collection of numbers, characters, images or other outputs from devices that collect information to convert physical quantities into symbols.
  • So what is needed to make Data useful?
    - Arguably, tools to convert data into information.
    - Arguably also, knowledge about the data, so that the tools can then make use of the data in a meaningful sense, to extract information from it.
  • So what are the characteristics of a Data Warehouse?
    - Data is present, organized, recorded, and catalogued.
    - Tools exist that are able to operate on the data.
    So what do tools need to be able to operate on data?
  • Finding it (Photo credit: © dkeats on flickr)
  • Finding it
    - Knowing where data is.
    - Evolve: knowing which data is where – "naming" data
    - Evolve: organization to support various data modeling concepts (table, partitions, columns, records)
    - Evolve: "done" semantics, existence semantics
  • Reading it (Photo credit: © kylesteed on flickr)
  • Reading it
    - Each tool having its own "storage space", its own private world
    - Evolve: abstracting away the storage mechanism and having tools sit on top of file formats and mechanisms, so now, suddenly, tools have interoperability.
    - Evolve: having a storage abstraction that adapts to existing storage mechanisms in an easy-to-develop manner
  • Who are the various actors in a data ecosystem?
    - Analyst – uses SQL (Hive) and/or JDBC-based tools
    - Programmer – cares about data transformation; uses Pig or M/R
    - Project owner – cares about the amount of resources used, data portability, data connectors
    - Ops – needs to manage data storage, cluster management; needs to control data expiry, replication, import and export.
  • (stealing slide from Alan's TriHUG talk)
  • Also: people who help aforementioned people:
    - Tool writer – wants abstractions to deal with variances, wants to be able to store and retrieve relevant metadata and data, so they can focus on their user
    - Storage subsystem writer – wants standardization so that they can be used by other actors.
  • What do they all want?
    - Need it working – correctness
    - Speed, efficiency
    - Interoperability, convenience.
  • Did somebody say Interoperability?
  • Making Your Structured Data Available to the MapReduce Engine
    [Diagram: MapReduce, Pig, and Hive sit on top of HCatalog, which sits on top of storage (HDFS, HBase, MPP stores)]
    - Users can query data with Pig, Hive, or custom MapReduce jobs
    - Standard HDFS formats available Q1 2012
    - HBase data by early Q2 2012
  • HCatalog underlying architecture
    [Diagram: HCatLoader/HCatStorer, HCatInputFormat/HCatOutputFormat, the CLI, and notifications all go through the Hive MetaStore client (a generated Thrift client) to the Hive MetaStore, which is backed by an RDBMS]
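The slide above names the MapReduce-facing pieces of that stack. As a rough illustration (not taken from the deck), here is a minimal Java job that reads a table through HCatInputFormat. The database and table names are made up, and the package names and setInput/InputJobInfo signatures shifted across early HCatalog releases, so treat the exact calls as assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class HCatReadExample {

  // The mapper sees HCatRecord objects; it never deals with paths or file formats.
  public static class FirstColumnMapper
      extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(WritableComparable key, HCatRecord rec, Context ctx)
        throws java.io.IOException, InterruptedException {
      Object first = rec.get(0); // assumes column 0 holds a usable grouping key
      if (first != null) {
        ctx.write(new Text(first.toString()), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "hcat-read-example");
    job.setJarByClass(HCatReadExample.class);

    // Name the table; HCatalog's metastore supplies its location, schema,
    // and storage format. "default" and "projectA" are made-up names.
    HCatInputFormat.setInput(job,
        InputJobInfo.create("default", "projectA", null));
    job.setInputFormatClass(HCatInputFormat.class);

    job.setMapperClass(FirstColumnMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```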
  • Problem: Need to Know Where Data Is
    [Diagram: Pig, Hive, and MapReduce each locate data by raw storage paths (e.g., hdfs:///user/wilber/projectA, hive/warehouse/..., hdfs:///user/...) directly on top of storage]
  • Solution: Register Through HCatalog
    [Diagram: Pig, Hive, and MapReduce refer to registered datasets by name (e.g., ProjectA, Foul1) through HCatalog, which maps the names onto storage]
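The same by-name registration applies on the write side. The helper below is a hedged sketch of the driver-side configuration for writing to a registered table partition instead of a hard-coded HDFS path: the table name and the "ds" partition column are invented, and HCatOutputFormat method signatures (getTableSchema in particular) have varied between releases, so the exact calls are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class HCatWriteConfig {

  /** Point a job's output at a named table partition rather than a path. */
  public static void writeToPartition(Job job, String db, String table,
                                      String dateKey) throws Exception {
    // Hypothetical partition spec: a single "ds" (datestamp) partition column.
    Map<String, String> partitionSpec = new HashMap<String, String>();
    partitionSpec.put("ds", dateKey);

    // The metastore, not the job, decides where and in what format the
    // partition is stored.
    HCatOutputFormat.setOutput(job,
        OutputJobInfo.create(db, table, partitionSpec));

    // Reuse the registered table schema for the records this job emits.
    // (Some releases expose getTableSchema on a JobContext, others on a
    // Configuration; adjust to the HCatalog version in use.)
    HCatSchema schema = HCatOutputFormat.getTableSchema(job);
    HCatOutputFormat.setSchema(job, schema);

    job.setOutputFormatClass(HCatOutputFormat.class);
  }
}
```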
  • Problem: Data in a variety of formats
    - Data files may be organized in different formats
    - Data files may contain different formats in different partitions
    [Diagram: Storage (HDFS, HBase, etc.)]
  • Solution: HCat provides common abstraction
    [Diagram: a Hadoop application reads and writes through HCatalog, which sits on top of storage]
    - Registered data w/ schema
    - HCat normalizes data to application
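Inside a mapper, that normalization shows up as schema-driven field access: columns are pulled by name through the registered HCatSchema rather than by file offset or format-specific parsing. The sketch below is an illustration, not code from the talk; the column names ("user", "clicks") are hypothetical, and the schema accessor has moved between HCatalog releases, so treat it as an assumption.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;

public class SchemaAwareMapper
    extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {

  private HCatSchema schema;

  @Override
  protected void setup(Context ctx) throws IOException {
    // Fetch the table schema that HCatalog attached to the job.
    // (Older releases take a JobContext here instead of a Configuration.)
    schema = HCatInputFormat.getTableSchema(ctx.getConfiguration());
  }

  @Override
  protected void map(WritableComparable key, HCatRecord rec, Context ctx)
      throws IOException, InterruptedException {
    // Fields are addressed by name, regardless of the underlying storage format.
    String user = (String) rec.get("user", schema);
    Integer clicks = (Integer) rec.get("clicks", schema);
    if (user != null && clicks != null) {
      ctx.write(new Text(user), new IntWritable(clicks));
    }
  }
}
```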
  • Getting Involved
    Incubator site: http://incubator.apache.org/hcatalog
    User list: hcatalog-user@incubator.apache.org
    Dev list: hcatalog-dev@incubator.apache.org
  • TODO
    - HCATALOG-8: HCatalog needs a logo
    - HBase integration, trying to nail down a better table metaphor
    - Hive integration – interoperability between the notion of StorageDriver and StorageHandler, project dependency management
    - HCATALOG-182: Improve the "and friends" bit.
  • Waitaminnit... what was that about "friends"?
  • Templeton: A Webservices API for Hadoop (Photo credit: © PKMousie on flickr)
  • Templeton: ISV Front-door for Hadoop
    - Insulation from interface changes release to release
    - Opens the door to languages other than Java
    - Thin clients through webservices vs. forced fat-clients in gateway
    - Still prototyping! But see a common need.
  • Templeton Specific Support
    - Move data directly into/out of HDFS through WebHDFS
    - Webservice calls to HCatalog
        - Register table relationships for data (e.g., createTable, createDatabase)
        - Adjust tables (e.g., AlterTable)
        - Look at statistics (e.g., ShowTable)
    - Webservice calls to start work
        - MapReduce, Pig, Hive
        - Poll for job status
        - Notification URL when job completes (optional)
    - Stateless server
        - Horizontally scale for load
        - Configurable for HA
        - Currently requires ZooKeeper to track job status info
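As a concrete illustration of the job-status polling mentioned on the slide above, the snippet below issues a plain HTTP GET from Java. The slide stresses that Templeton was still being prototyped at the time, so the host name, the default port (50111), and the /templeton/v1/queue/<jobid> path follow the layout Templeton later shipped with (WebHCat) and should be read as assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class TempletonStatusPoll {
  public static void main(String[] args) throws Exception {
    // Hypothetical job id; pass a real one as the first argument.
    String jobId = args.length > 0 ? args[0] : "job_201201010000_0001";
    URL url = new URL("http://templeton-host:50111/templeton/v1/queue/"
        + jobId + "?user.name=sush");

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    // The response is a JSON document describing the job's state; here we
    // simply print it rather than parsing it.
    BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
    conn.disconnect();
  }
}
```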
  • ANY QUESTIONS?