January 2011 HUG: Howl Presentation


Transcript

  • 1. Howl: Table Management Service for Hadoop
    Devaraj Das
    (ddas@apache.org)
  • 2. Introductions
    • Who I am
      • Apache Hadoop committer and PMC member
      • Principal Engineer at Yahoo!
      • Past: MapReduce & Hadoop Security developer
    Howl Team
    Architecture & Development
    Ashutosh Chauhan
    Devaraj Das
    Alan Gates
    Sushanth Sowmyan
    Mac Yang
    QE
    Egil Sørensen
  • 6. Howl Motivation
    Provide a table management layer for Hadoop. This includes:
    providing a shared schema and data type system across tools (collaboration)
    providing a table abstraction so users need not worry about where or in what format their data is stored (operability)
    providing users that have different data processing tools (MR, Pig, Hive), the ability to share data (interoperability)
    providing a way to define new data storage formats / codecs, etc. (evolvability)
    [Diagram: Pig, Hive, MapReduce, and Streaming sit on top of Howl, which in turn sits on top of storage formats such as RCFile, SequenceFile, and text files.]
  • 7. Logical Architecture
    [Diagram: HowlLoader / HowlStorage, HowlInputFormat / HowlOutputFormat, a CLI, and a notification service (added by Howl) sit on top of the HiveMetaStore client and its generated Thrift client, which talk to the Hive MetaStore backed by an RDBMS (components taken from Hive, some modified by Howl).]
  • 8. Data Model
    • Users are presented with a relational view of the data in HDFS
    • Data is stored in tables
    • Tables may optionally be partitioned (for example, by datestamp)
    • Partitions contain records
    • Records are divided into (named & typed) columns
    • Howl supports the same datatypes as Hive
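    The hierarchy above can be sketched as a toy Python structure. This is purely illustrative (the class and field names are not Howl's API): a table carries a shared schema of named, typed columns, and every record in every partition must match it.

    ```python
    # Illustrative model of Howl's data hierarchy, NOT the Howl API:
    # a table holds partitions, a partition holds records, and each
    # record is divided into named, typed columns.
    from dataclasses import dataclass, field

    @dataclass
    class Table:
        name: str
        columns: dict                                   # column name -> type name (shared schema)
        partitions: dict = field(default_factory=dict)  # partition key -> list of records

        def add_partition(self, key, records):
            # Every record must match the table's (named & typed) columns.
            for rec in records:
                assert set(rec) == set(self.columns), "record/schema mismatch"
            self.partitions[key] = records

    events = Table("rawevents", {"alpha": "int", "beta": "string"})
    events.add_partition("20100819", [{"alpha": 1, "beta": "click"}])
    ```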
  • Example: Data Flow at XYZ Corp
    Robert, in client management, uses Hive to analyze his clients’ results
    Sally, in data processing, uses Pig to cleanse and
    prepare data
    Joe, in data acquisition,
    uses distcp to get data
    onto grid
  • 14. Data Collection
    hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
    howl “alter table rawevents add partition 20100819 hdfs://data/rawevents/20100819/data”
  • 15. Data Processing
    A = load ‘/data/rawevents/20100819/data’ as (alpha:int, beta:chararray, …);
    B = filter A by bot_finder(zeta) == 0;

    store Z into ‘data/processedevents/20100819/data’;
    Sally must be manually informed by Joe that the data is available, or use Oozie to poll HDFS
    A = load ‘rawevents’ using HowlLoader();
    B = filter A by date == ‘20100819’ and bot_finder(zeta) == 0;

    store Z into ‘processedevents’ using HowlStorage(“date=20100819”);
    Oozie will be notified by Howl that the data is available and can then start the Pig job
  • 16. Data Analysis
    alter table processedevents add partition 20100819 hdfs://data/processedevents/20100819/data
    select advertiser_id, count(clicks)
    from processedevents
    where date = ‘20100819’
    group by advertiser_id;
    select advertiser_id, count(clicks)
    from processedevents
    where date = ‘20100819’
    group by advertiser_id;
  • 17. In summary
    Data pipeline use case
    written in some combination of Pig and MR (writes data that is stored in a fact/dimension model)
    read by Hive
    no need to export data from Pig/MR into Hive
    Tools such as Oozie are able to operate on data based on notifications provided by Howl
    Collaboration & Interoperability at work!
  • 18. Data evolvability at XYZ Corp
    Let’s say that XYZ Corp decides to move from text files to RCFile to store its processed data
    Without Howl
    Pig scripts have to be changed to store in RCFile
    Hive table has to be altered to use RCFile
    All existing data must be restated to RCFile
    With Howl
    Howl table must be altered to use RCFile for new partitions
    Existing data need NOT be restated
    Operations can decide to compact the data
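    A minimal sketch of why existing data need not be restated, assuming (as the slide implies) that each partition records the storage format it was written with. This is toy Python, not Howl code:

    ```python
    # Toy model, NOT Howl code: each partition remembers the format it
    # was written with, so a table-level switch to RCFile applies only
    # to partitions added afterwards; old partitions stay readable as-is.
    class Table:
        def __init__(self, default_format):
            self.default_format = default_format
            self.partition_formats = {}   # partition key -> format at write time

        def add_partition(self, key):
            self.partition_formats[key] = self.default_format

    t = Table("TextFile")
    t.add_partition("20100818")       # written as text
    t.default_format = "RCFile"       # table altered to use RCFile
    t.add_partition("20100819")       # new partitions use RCFile

    # Readers pick the right driver per partition:
    assert t.partition_formats["20100818"] == "TextFile"
    assert t.partition_formats["20100819"] == "RCFile"
    ```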
  • 19. Interfaces
    HowlInputFormat and HowlOutputFormat for MR
    HowlLoader and HowlStorage for Pig
    HowlSerDe for Hive (future)
    Command line interface that provides DDL (matches Hive DDL)
    Notification service (format TBD) (future)
    Java API for tools that need to do bulk operations (future)
  • 20. Pig
    [Data Flow Diagram – Reading in Pig: Pig's HowlLoader uses HowlInputFormat, which obtains metadata from the metastore via a Thrift client talking to the Thrift server, and reads data from HDFS through HowlInputStorageDriver X wrapping InputFormat X.]
  • 21. Roadmap
    Initial release Q1 2011
    Table abstraction to tools processing data on Hadoop.
    The ability to read and write data in Pig & Map Reduce.
    The ability to read data in Hive.
    Partition pruning: when a user asks for partitions in a table, a selection predicate determines which partitions are returned.
    Integration with Hadoop security, including Howl authenticating and authorizing users.
    JMX-based monitoring
    Oozie workflow integration (users can submit workflows that talk to Howl)
    Support for writing data in RCFile, and reading data from PigStorage, RCFile, and Jute ULT (a Yahoo! format) [Growl tool]
    Hive 0.7 release will contain the Hive MetaStore related changes
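    The partition-pruning item above can be illustrated with a short Python sketch (names and paths are hypothetical, not Howl's interface): a selection predicate over partition keys filters the partition list, so only the matching partitions' files need be read.

    ```python
    # Illustrative sketch of partition pruning, NOT Howl's actual API:
    # a predicate over the partition key selects which partitions
    # (and hence which HDFS directories) a job must read.
    partitions = {
        "20100817": "/data/rawevents/20100817/data",
        "20100818": "/data/rawevents/20100818/data",
        "20100819": "/data/rawevents/20100819/data",
    }

    def prune(partitions, predicate):
        # Keep only partitions whose key satisfies the predicate.
        return {k: v for k, v in partitions.items() if predicate(k)}

    selected = prune(partitions, lambda date: date >= "20100819")
    # Only the 20100819 partition survives pruning.
    ```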
  • 22. Roadmap (contd.)
    V2 and beyond..
    Notification (for tools like Oozie)
    Dynamic partitioning
    Non-partition filter pushdowns
    Howl Import/Export tool (under dev)
    Schema evolution
    Utilities API (for tools, e.g., Grid replication service, to use Howl easily)
    Authorization enhancements
    Details at http://wiki.apache.org/pig/HowlJournal
    Howl Project in the Apache Incubator
    Starting the process
  • 23. Some Links
    About Howl
    http://wiki.apache.org/pig/Howl
    Security in Howl
    http://wiki.apache.org/pig/Howl/HowlAuthentication
    http://wiki.apache.org/pig/Howl/HowlAuthorizationProposal
    Sources
    https://github.com/yahoo/howl
    Roadmap
    http://wiki.apache.org/pig/HowlJournal
    Mailing list
    howldev@yahoogroups.com
    ddas@apache.org
  • 24. Backup slides
  • 25. Hive
    [Data Flow Diagram – Reading in Hive: Hive obtains metadata from the metastore via a Thrift client talking to the Thrift server, and reads data from HDFS through InputFormat X.]
  • 26. Howl InputFormat & InputStorageDriver
    HowlInputFormat
    Fundamentally, not a data format
    A generic input format that users can use to write data format agnostic code
    Provides database table like semantics
    Allows for specifying projections, predicates
    Uses HowlInputStorageDriver underneath
    HowlInputStorageDriver
    A wrapper over the underlying input format
    Converts the underlying record to a generic HowlRecord
    HowlRecord
    Implemented as a List of objects
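    The storage-driver idea above can be sketched in a few lines of Python (class and method names are illustrative, not the Howl API): a driver wraps an underlying format and converts each native record into a generic record, implemented as a plain list of objects, so callers stay format-agnostic.

    ```python
    # Sketch of the storage-driver pattern, NOT Howl code: the driver
    # hides the native on-disk representation and hands the caller a
    # generic record (a list of objects), mirroring HowlRecord.
    class CsvDriver:
        """Wraps a hypothetical 'native' CSV line format."""
        def to_generic(self, native_line):
            alpha, beta = native_line.split(",")
            return [int(alpha), beta]      # generic record: a list of objects

    def read(driver, native_records):
        # Format-agnostic caller: it only ever sees generic records,
        # whatever driver (CSV, RCFile, ...) is plugged in.
        return [driver.to_generic(r) for r in native_records]

    rows = read(CsvDriver(), ["1,click", "2,view"])
    ```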
  • 27. Security in Howl
    User(CLI) – Howl Server
    Authentication using Kerberos
    HDFS operations are done as the authenticating user
    Map/Reduce task – Howl Server
    Authentication using Howl Delegation Tokens (based on Hadoop’s Delegation Token)
    Authorization
    Users can control permissions & group ownership on the Table
    Uses HDFS permissions to authorize metadata operations
    New Partitions inherit the table’s permissions and group ownership
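    The inheritance rule above can be modeled in a few lines of toy Python (not Howl code): a new partition copies the table's permissions and group ownership at creation time.

    ```python
    # Toy model, NOT Howl code: new partitions inherit the table's
    # permissions and group ownership at the moment they are created.
    class Table:
        def __init__(self, group, perms):
            self.group, self.perms = group, perms
            self.partitions = {}

        def add_partition(self, key):
            # Copy the table's current group and permission bits.
            self.partitions[key] = {"group": self.group, "perms": self.perms}

    t = Table(group="analytics", perms=0o750)
    t.add_partition("20100819")
    ```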