
January 2011 HUG: Howl Presentation



  1. Howl: Table Management Service for Hadoop
     Devaraj Das
  2. Introductions
     Who I am:
     - Apache Hadoop committer and PMC member
     - Principal Engineer at Yahoo!
     - Past: MapReduce & Hadoop Security developer
     Howl team:
     - Architecture & Development: Ashutosh Chauhan, Devaraj Das, Alan Gates, Sushanth Sowmyan, Mac Yang
     - QE: Egil Sørensen
  3. Howl Motivation
     Provide a table management layer for Hadoop. This includes:
     - a shared schema and data type system across tools (collaboration)
     - a table abstraction, so users need not worry about where or in what format their data is stored (operability)
     - the ability for users of different data processing tools (MapReduce, Pig, Hive) to share data (interoperability)
     - a way to define new data storage formats, codecs, etc. (evolvability)
     [Stack diagram: Pig, Hive, MapReduce, and Streaming sit on top of Howl, which sits on top of RCFile, SequenceFile, and text file storage.]
  4. Logical Architecture
     [Diagram: HowlLoader and HowlStorage (Pig), HowlInputFormat and HowlOutputFormat (MapReduce), the CLI, and the Notification service sit on top of the HiveMetaStore client and the generated Thrift client, which talk to the Hive MetaStore backed by an RDBMS. Legend: added by Howl; taken from Hive; taken from Hive and modified by Howl.]
  5. Data Model
     - Users are presented with a relational view of the data in HDFS
     - Data is stored in tables
     - Tables can optionally be partitioned (for example, by datestamp)
     - Partitions contain records
     - Records are divided into (named & typed) columns
     - Howl supports the same datatypes as Hive
  6. Example: Data Flow at XYZ Corp
     - Robert, in client management, uses Hive to analyze his clients' results
     - Sally, in data processing, uses Pig to cleanse and prepare data
     - Joe, in data acquisition, uses distcp to get data onto the grid
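The data model above (tables, optionally partitioned, holding records with named & typed columns) can be sketched as follows. This is an illustrative model only; the class names are hypothetical and not the actual Howl Java API.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    type: str  # Howl reuses Hive's datatypes, e.g. "int", "string"

@dataclass
class Partition:
    spec: dict       # e.g. {"date": "20100819"}
    location: str    # HDFS directory holding this partition's files

@dataclass
class Table:
    name: str
    columns: list                                  # shared by every record
    partitions: list = field(default_factory=list)

# A table partitioned by datestamp, as in the XYZ Corp example:
rawevents = Table(
    name="rawevents",
    columns=[Column("alpha", "int"), Column("beta", "string")],
)
rawevents.partitions.append(
    Partition({"date": "20100819"}, "hdfs://data/rawevents/20100819/data")
)
```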
  7. Data Collection
       hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
       howl "alter table rawevents add partition 20100819 hdfs://data/rawevents/20100819/data"
  8. Data Processing
     Without Howl:
       A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, ...);
       B = filter A by bot_finder(zeta) == 0;
       ...
       store Z into '/data/processedevents/20100819/data';
     Sally must be manually informed by Joe that the data is available, or use Oozie and poll HDFS.
     With Howl:
       A = load 'rawevents' using HowlLoader();
       B = filter A by date == '20100819' and bot_finder(zeta) == 0;
       ...
       store Z into 'processedevents' using HowlStorage("date=20100819");
     Oozie will be notified by Howl that the data is available and can then start the Pig job.
  9. Data Analysis
     Without Howl:
       alter table processedevents add partition 20100819 hdfs://data/processedevents/20100819/data
       select advertiser_id, count(clicks)
       from processedevents
       where date = '20100819'
       group by advertiser_id;
     With Howl:
       select advertiser_id, count(clicks)
       from processedevents
       where date = '20100819'
       group by advertiser_id;
 10. In Summary
     - Data pipeline use case:
       - written in some combination of Pig and MapReduce (writes data stored in a fact/dimension model)
       - read by Hive
       - no need to export data from Pig/MR into Hive
     - Tools such as Oozie can operate on data based on notifications provided by Howl
     - Collaboration & interoperability at work!
 11. Data Evolvability at XYZ Corp
     Let's say XYZ Corp decides to move from text files to RCFile for its processed data.
     Without Howl:
     - Pig scripts have to be changed to store in RCFile
     - The Hive table has to be altered to use RCFile
     - All existing data must be restated to RCFile
     With Howl:
     - The Howl table must be altered to use RCFile for new partitions
     - Existing data need NOT be restated
     - Operations can decide when to compact the data
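Why does existing data need no restating? Because each partition carries its own storage format in the metadata, and readers dispatch per partition. A minimal sketch of that idea (hypothetical names; Howl's real mechanism is its storage-driver layer):

```python
# Stand-ins for format-specific readers.
def read_text(path):
    return f"text records from {path}"

def read_rcfile(path):
    return f"rcfile records from {path}"

READERS = {"TextFile": read_text, "RCFile": read_rcfile}

# Each partition stores (location, storage format) in the metastore,
# so old partitions keep their original format after the table changes.
partitions = [
    ("hdfs://data/processedevents/20100818/data", "TextFile"),  # old, not restated
    ("hdfs://data/processedevents/20100819/data", "RCFile"),    # after the change
]

def scan(partitions):
    # Dispatch on each partition's own format; mixed-format tables just work.
    return [READERS[fmt](loc) for loc, fmt in partitions]
```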
 12. Interfaces
     - HowlInputFormat and HowlOutputFormat for MapReduce
     - HowlLoader and HowlStorage for Pig
     - HowlSerDe for Hive (future)
     - Command line interface that provides DDL (matches Hive DDL)
     - Notification service (format TBD) (future)
     - Java API for tools that need to do bulk operations (future)
 13. Data Flow Diagram – Reading in Pig
     [Diagram: Pig uses HowlLoader, which wraps HowlInputFormat. Metadata flows through the Thrift client to the Thrift server and metastore; data is read from HDFS via HowlInputStorageDriver X wrapping InputFormat X.]
 14. Roadmap
     Initial release, Q1 2011:
     - Table abstraction for tools processing data on Hadoop
     - The ability to read and write data in Pig & MapReduce
     - The ability to read data in Hive
     - Partition pruning: when a user asks for partitions in a table, he can provide a selection predicate that determines which partitions are returned
     - Integration with Hadoop security, including Howl authenticating and authorizing users
     - JMX-based monitoring
     - Oozie workflow integration (users can submit workflows that talk to Howl)
     - Support for writing data in RCFile; reading data from PigStorage, RCFile, and Jute ULT (a Yahoo! format) [Growl tool]
     The Hive 0.7 release will contain the Hive MetaStore related changes.
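Partition pruning, as described above, amounts to evaluating a selection predicate against partition keys in the metastore so that a job never touches non-matching directories. A minimal sketch (illustrative only, not Howl's actual API):

```python
# Partition metadata as the metastore might hold it.
partitions = [
    {"date": "20100817", "location": "hdfs://data/rawevents/20100817/data"},
    {"date": "20100818", "location": "hdfs://data/rawevents/20100818/data"},
    {"date": "20100819", "location": "hdfs://data/rawevents/20100819/data"},
]

def prune(partitions, predicate):
    # Return only the partitions matching the user's selection predicate;
    # everything else is skipped without reading any data.
    return [p for p in partitions if predicate(p)]

# e.g. "give me partitions from the 18th onward"
selected = prune(partitions, lambda p: p["date"] >= "20100818")
```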
 15. Roadmap, contd.
     V2 and beyond:
     - Notification (for tools like Oozie)
     - Dynamic partitioning
     - Non-partition filter pushdowns
     - Howl import/export tool (under development)
     - Schema evolution
     - Utilities API (so tools, e.g., a grid replication service, can use Howl easily)
     - Authorization enhancements
     Howl project in the Apache Incubator: starting the process.
 16. Some Links
     - About Howl
     - Security in Howl
     - Sources
     - Roadmap
     - Mailing list
 17. Backup slides
 18. Data Flow Diagram – Reading in Hive
     [Diagram: Hive reads metadata through the Thrift client to the Thrift server and metastore; data is read from HDFS via InputFormat X.]
 19. Howl InputFormat & InputStorageDriver
     HowlInputFormat:
     - Fundamentally not a data format
     - A generic input format that users can use to write data-format-agnostic code
     - Provides database-table-like semantics
     - Allows specifying projections and predicates
     - Uses a HowlInputStorageDriver underneath
     HowlInputStorageDriver:
     - A wrapper over the underlying input format
     - Converts the underlying record to a generic HowlRecord
     HowlRecord:
     - Implemented as a list of objects
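The storage-driver pattern above can be sketched as follows: a driver wraps a format-specific reader and converts each underlying record into a generic record (a plain list of objects), so callers stay format-agnostic. Names here are illustrative stand-ins, not the real Java classes.

```python
def csv_reader(lines):
    # Stand-in for an underlying "InputFormat X": yields format-specific records.
    for line in lines:
        yield line.split(",")

class ListStorageDriver:
    """Converts underlying records into generic records (lists of objects)."""
    def __init__(self, underlying, schema):
        self.underlying = underlying
        # Derive a per-column cast from the (name, type) schema.
        self.casts = [int if t == "int" else str for _, t in schema]

    def records(self):
        for raw in self.underlying:
            # Apply the declared column types so every consumer sees typed values.
            yield [cast(v) for cast, v in zip(self.casts, raw)]

driver = ListStorageDriver(csv_reader(["7,click", "9,view"]),
                           schema=[("alpha", "int"), ("beta", "string")])
rows = list(driver.records())
```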
 20. Security in Howl
     User (CLI) – Howl server:
     - Authentication using Kerberos
     - HDFS operations are done as the authenticated user
     Map/Reduce task – Howl server:
     - Authentication using Howl delegation tokens (based on Hadoop's delegation tokens)
     Authorization:
     - Users can control permissions & group ownership on the table
     - Uses HDFS permissions to authorize metadata operations
     - New partitions inherit the table's permissions and group ownership
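The inheritance rule in the last bullet can be modeled simply: a new partition copies the table's permissions and group ownership at creation time. This is an illustrative model only; Howl enforces this through HDFS permissions rather than application code like this.

```python
def add_partition(table, spec, location):
    # A new partition inherits perms and group from its table (assumption:
    # inheritance happens once, at partition-creation time).
    return {
        "spec": spec,
        "location": location,
        "perms": table["perms"],    # inherited from the table
        "group": table["group"],    # inherited from the table
    }

table = {"name": "processedevents", "perms": 0o750, "group": "analytics"}
part = add_partition(table, {"date": "20100819"},
                     "hdfs://data/processedevents/20100819/data")
```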