January 2011 HUG: Howl Presentation
Transcript

  • 1. Howl: Table Management Service for Hadoop
    Devaraj Das (ddas@apache.org)
  • 2. Introductions
    Who I am:
    - Apache Hadoop committer and PMC member
    - Principal Engineer at Yahoo!
    - Past: MapReduce & Hadoop Security developer
    Howl Team
    - Architecture & Development: Ashutosh Chauhan, Devaraj Das, Alan Gates, Sushanth Sowmyan, Mac Yang
    - QE: Egil Sørensen
  • 3. Howl Motivation
    Provide a table management layer for Hadoop. This includes:
    - providing a shared schema and data type system across tools (collaboration)
    - providing a table abstraction so users need not worry about where or in what format their data is stored (operability)
    - providing users that have different data processing tools (MR, Pig, Hive) the ability to share data (interoperability)
    - providing a way to define new data storage formats, codecs, etc. (evolvability)
    [Slide diagram: Pig, Hive, Map Reduce, and Streaming sit on top of Howl, which sits on top of RCFile, Sequence File, and Text File]
  • 4. Logical Architecture
    [Slide diagram: HowlLoader, HowlStorage, HowlInputFormat, HowlOutputFormat, the CLI, and Notification (added by Howl) sit on top of the HiveMetaStore client and a generated Thrift client; these talk to the Hive MetaStore (taken from Hive and modified by Howl), which is backed by an RDBMS]
  • 5. Data Model
    - Users are presented with a relational view of the data in HDFS
    - Data is stored in tables
    - Tables are optionally partitioned (for example, partitioned by datestamp)
    - Partitions contain records
    - Records are divided into (named & typed) columns
    - Howl supports the same datatypes as Hive
  • 6. Example: Data Flow at XYZ Corp
    - Joe, in data acquisition, uses distcp to get data onto the grid
    - Sally, in data processing, uses Pig to cleanse and prepare the data
    - Robert, in client management, uses Hive to analyze his clients' results
  • 7. Data Collection
    hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
    howl "alter table rawevents add partition 20100819 hdfs://data/rawevents/20100819/data"
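The two commands above amount to one data copy plus one metastore update. Below is a minimal, self-contained sketch of the bookkeeping the `add partition` step implies: a mapping from a table's partition values to their HDFS locations. All class and method names here are illustrative, not the real Howl/Hive metastore API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of "alter table <t> add partition <value> <location>":
// the metastore maps each partition value of a table to an HDFS location.
public class MetastoreSketch {
    // table name -> (partition value -> HDFS location)
    private final Map<String, Map<String, String>> partitions = new LinkedHashMap<>();

    public void addPartition(String table, String value, String location) {
        partitions.computeIfAbsent(table, t -> new LinkedHashMap<>())
                  .put(value, location);
    }

    public String locationOf(String table, String value) {
        Map<String, String> p = partitions.get(table);
        return p == null ? null : p.get(value);
    }

    public static void main(String[] args) {
        MetastoreSketch ms = new MetastoreSketch();
        // Joe's CLI call from the slide, modeled as a metastore update:
        ms.addPartition("rawevents", "20100819", "hdfs://data/rawevents/20100819/data");
        System.out.println(ms.locationOf("rawevents", "20100819"));
    }
}
```

The point of keeping this mapping out of the consumers' scripts is that downstream tools ask for a table and partition value, never a raw path.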
  • 8. Data Processing
    Without Howl:
      A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
      B = filter A by bot_finder(zeta) = 0;
      …
      store Z into 'data/processedevents/20100819/data';
    Sally must be manually informed by Joe that the data is available, or use Oozie and poll HDFS.
    With Howl:
      A = load 'rawevents' using HowlLoader();
      B = filter A by date = '20100819' and bot_finder(zeta) = 0;
      …
      store Z into 'processedevents' using HowlStorage("date=20100819");
    Oozie will be notified by Howl that the data is available and can then start the Pig job.
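Because `rawevents` is a table with datestamp partitions, the `date = '20100819'` part of Sally's filter can be answered from metadata alone: only the matching partition directories are ever opened. Here is a hedged sketch of that pruning step, with invented names rather than Howl's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative sketch of the work HowlLoader can push down to metadata:
// the partition part of a filter selects which directories are read at all,
// before any records are scanned.
public class PartitionPruning {
    // partitions: partition value (e.g. a datestamp) -> HDFS location
    static List<String> prune(Map<String, String> partitions,
                              Predicate<String> partitionPredicate) {
        List<String> locations = new ArrayList<>();
        for (Map.Entry<String, String> e : partitions.entrySet()) {
            if (partitionPredicate.test(e.getKey())) {
                locations.add(e.getValue());
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        Map<String, String> rawevents = new LinkedHashMap<>();
        rawevents.put("20100818", "hdfs://data/rawevents/20100818/data");
        rawevents.put("20100819", "hdfs://data/rawevents/20100819/data");
        // "filter A by date = '20100819'" only ever opens one directory:
        System.out.println(prune(rawevents, d -> d.equals("20100819")));
    }
}
```

The row-level part of the filter (`bot_finder(zeta) = 0`) still runs per record; only the partition predicate is resolved against the metastore.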
  • 9. Data Analysis
    Without Howl, Robert must first register the new data himself:
      alter table processedevents add partition 20100819 hdfs://data/processedevents/20100819/data
      select advertiser_id, count(clicks)
      from processedevents
      where date = '20100819'
      group by advertiser_id;
    With Howl, the partition is already registered, so the query alone suffices:
      select advertiser_id, count(clicks)
      from processedevents
      where date = '20100819'
      group by advertiser_id;
  • 10. In summary…
    Data pipeline use case:
    - written in some combination of Pig and MR (writes data that is stored in a fact/dimension model)
    - read by Hive
    - no need to export data from Pig/MR into Hive
    - tools such as Oozie are able to operate on data based on notifications provided by Howl
    Collaboration & interoperability at work!
  • 11. Data Evolvability at XYZ Corp
    Let's say that XYZ Corp decides to move from text files to RCFile to store its processed data.
    Without Howl:
    - Pig scripts have to be changed to store in RCFile
    - The Hive table has to be altered to use RCFile
    - All existing data must be restated to RCFile
    With Howl:
    - The Howl table must be altered to use RCFile for new partitions
    - Existing data need NOT be restated
    - Operations can decide whether to compact the data
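The "existing data need not be restated" point rests on each partition recording the format it was written with, while the table-level alteration only changes the default used for partitions created afterwards. A toy model of that rule (all names invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of per-partition storage formats: altering the table changes the
// default for NEW partitions only; existing partitions keep their format.
public class FormatEvolution {
    static class Partition {
        final String value;
        final String format; // frozen at creation time
        Partition(String value, String format) {
            this.value = value;
            this.format = format;
        }
    }

    static class Table {
        String defaultFormat;
        final List<Partition> partitions = new ArrayList<>();
        Table(String defaultFormat) { this.defaultFormat = defaultFormat; }
        Partition addPartition(String value) {
            Partition p = new Partition(value, defaultFormat);
            partitions.add(p);
            return p;
        }
    }

    public static void main(String[] args) {
        Table processedevents = new Table("TextFile");
        Partition old = processedevents.addPartition("20100819");
        processedevents.defaultFormat = "RCFile";   // the one alteration needed
        Partition fresh = processedevents.addPartition("20100820");
        System.out.println(old.format + " " + fresh.format); // TextFile RCFile
    }
}
```

Readers then pick the right decoder per partition, which is why a later compaction is an operational choice rather than a migration requirement.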
  • 12. Interfaces
    - HowlInputFormat and HowlOutputFormat for MR
    - HowlLoader and HowlStorage for Pig
    - HowlSerDe for Hive (future)
    - Command-line interface that provides DDL (matches Hive DDL)
    - Notification service (format TBD) (future)
    - Java API for tools that need to do bulk operations (future)
  • 13. Data Flow Diagram – Reading in Pig
    [Slide diagram: Pig uses HowlLoader, which uses HowlInputFormat; metadata flows through a Thrift client to the Thrift server and metastore, while data is read from HDFS through HowlInputStorageDriver X wrapping InputFormat X]
  • 14. Roadmap
    Initial release Q1 2011:
    - Table abstraction for tools processing data on Hadoop
    - The ability to read and write data in Pig & MapReduce
    - The ability to read data in Hive
    - Partition pruning: when a user asks for partitions in a table, he can provide a selection predicate that determines which partitions are returned
    - Integration with Hadoop security, including Howl authenticating and authorizing users
    - JMX-based monitoring
    - Oozie workflow integration (users can submit workflows that talk to Howl)
    - Support for writing data in RCFile; reading data from PigStorage, RCFile, and Jute ULT (a Yahoo! format) [Growl tool]
    - The Hive 0.7 release will contain the Hive MetaStore related changes
  • 15. Roadmap, contd.
    V2 and beyond:
    - Notification (for tools like Oozie)
    - Dynamic partitioning
    - Non-partition filter pushdowns
    - Howl import/export tool (under development)
    - Schema evolution
    - Utilities API (so tools such as a grid replication service can use Howl easily)
    - Authorization enhancements
    Details at http://wiki.apache.org/pig/HowlJournal
    Howl as a project in the Apache Incubator: we are starting the process.
  • 16. Some Links
    - About Howl: http://wiki.apache.org/pig/Howl
    - Security in Howl: http://wiki.apache.org/pig/Howl/HowlAuthentication and http://wiki.apache.org/pig/Howl/HowlAuthorizationProposal
    - Sources: https://github.com/yahoo/howl
    - Roadmap: http://wiki.apache.org/pig/HowlJournal
    - Mailing list: howldev@yahoogroups.com
    - Contact: ddas@apache.org
  • 17. Backup slides
  • 18. Data Flow Diagram – Reading in Hive
    [Slide diagram: Hive reads metadata through a Thrift client to the Thrift server and metastore, and reads data from HDFS through InputFormat X]
  • 19. Howl InputFormat & InputStorageDriver
    HowlInputFormat:
    - Fundamentally, not a data format
    - A generic input format that users can use to write data-format-agnostic code
    - Provides database-table-like semantics
    - Allows for specifying projections and predicates
    - Uses a HowlInputStorageDriver underneath
    HowlInputStorageDriver:
    - A wrapper over the underlying input format
    - Converts the underlying record to a generic HowlRecord
    HowlRecord:
    - Implemented as a List of objects
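The conversion role described above can be sketched as a tiny interface: a driver wraps one concrete format and emits the same generic record type, a list of column values. This is an illustrative sketch under that description, not Howl's real class signatures.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the storage-driver idea: each driver adapts one underlying
// record type to the generic HowlRecord (here, just a List of objects),
// so tools written against HowlRecord never see the on-disk format.
public class StorageDriverSketch {
    interface HowlInputStorageDriver<R> {
        List<Object> toHowlRecord(R underlyingRecord);
    }

    // A driver for tab-separated text lines, with a fixed (int, string) schema.
    static class TextDriver implements HowlInputStorageDriver<String> {
        public List<Object> toHowlRecord(String line) {
            String[] fields = line.split("\t");
            return Arrays.asList(Integer.parseInt(fields[0]), fields[1]);
        }
    }

    public static void main(String[] args) {
        HowlInputStorageDriver<String> driver = new TextDriver();
        List<Object> record = driver.toHowlRecord("7\tclick");
        System.out.println(record); // [7, click]
    }
}
```

A driver for RCFile or SequenceFile would implement the same interface over its own record type, which is what lets HowlInputFormat stay format-agnostic.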
  • 20. Security in Howl
    User (CLI) – Howl server:
    - Authentication using Kerberos
    - HDFS operations are done as the authenticating user
    Map/Reduce task – Howl server:
    - Authentication using Howl delegation tokens (based on Hadoop's delegation tokens)
    Authorization:
    - Users can control permissions & group ownership on the table
    - Uses HDFS permissions to authorize metadata operations
    - New partitions inherit the table's permissions and group ownership
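The partition-inheritance rule in the last bullet amounts to a copy at creation time: a new partition takes the table's permissions and group as they stand when it is added. A minimal sketch, with invented names rather than Howl's implementation:

```java
// Toy model of partition security inheritance: the partition snapshots the
// table's HDFS-style permissions and group when it is created.
public class PartitionPermissions {
    static class Table {
        String permissions; // e.g. "rwxr-x---"
        String group;
        Table(String permissions, String group) {
            this.permissions = permissions;
            this.group = group;
        }
    }

    static class Partition {
        final String permissions;
        final String group;
        Partition(Table parent) {
            this.permissions = parent.permissions; // inherited from the table
            this.group = parent.group;
        }
    }

    public static void main(String[] args) {
        Table t = new Table("rwxr-x---", "analytics");
        Partition p = new Partition(t);
        System.out.println(p.permissions + " " + p.group); // rwxr-x--- analytics
    }
}
```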
