Falcon - Data Management Platform on Hadoop (Beyond ETL)


Hadoop and its ecosystem of products have made storing and processing massive amounts of data commonplace. This has enabled numerous businesses to gain valuable insights that they never could have in the past. While it is easy to leverage Hadoop for crunching large volumes of data, organizing data, managing its life cycle and orchestrating its processing are fairly involved. Traditional data platforms built on data warehouses and standard ETL (extract-transform-load) tools solve this adequately, but on Hadoop the problem remains largely unsolved. Besides data processing complexities, Hadoop presents a new set of challenges relating to the management of data. Data management on Hadoop encompasses data motion (import/export), process orchestration (data pipelines, late data handling and reprocessing, scheduling), lifecycle management (retention, replication, DR, anonymization, archival) and data discovery (data classification, lineage), among other concerns that are beyond ETL. The presentation focuses on Falcon, a new data processing and management platform for Hadoop that attempts to solve this problem by leveraging existing stacks in the Hadoop ecosystem. Falcon has been in production for nearly a year at InMobi and has been managing hundreds of feeds and processes.

Published in: Technology, Business
  • In a typical big data environment involving Hadoop, the use cases tend to be around processing very large volumes of data for either machine or human consumption. Some of the data that reaches the Hadoop platform can contain critical business and financial information. The data processing team in such an environment is often distracted by a multitude of data management and process orchestration challenges. To name a few:
      • Ingesting large volumes of events/streams
      • Ingesting slowly changing data, typically available in a traditional database
      • Creating a pipeline / sequence of processing logic to extract the desired piece of insight / information
      • Handling processing complexities relating to change of data / failures
      • Managing eviction of older data elements
      • Backing up the data in an alternate location, or archiving it in cheaper storage, for DR/BCP and compliance requirements
      • Shipping data out of the Hadoop environment periodically for machine or human consumption
    These tend to be standard challenges that are better handled in a platform, which allows the data processing team to focus on their core business application. A platform approach also allows best practices for each of these to be adopted once and leveraged by subsequent users of the platform.
    ========================
    What do we mean by DM
      • Platform should provide these as services to users so users worry about business processing
      • Captures common themes and follows best practices
      • Frees users from such
  • As just noted, numerous data and process management services, when made available to the data processing team, can significantly reduce their day-to-day complexities and allow them to focus on their business application. This is an enumeration of such services, which we intend to cover in adequate detail as we go along.
  • More often than not, pipelines are a sequence of data processing or data movement tasks that need to happen before raw data can be transformed into a meaningfully consumable form. Normally the end stage of the pipeline, where the final sets of data are produced, is in the critical path and may be subject to tight SLA bounds. If any step in the sequence is delayed or fails, the pipeline could stall. It is important that each step in the pipeline hands off to the next step promptly, to avoid buffering time between steps and to allow seamless progression of the pipeline. People who are familiar with Apache Oozie might appreciate this feature, which is provided through the Coordinator. As pipelines get more and more time sensitive, this becomes critical, and it ought to be available off the shelf for application developers. It is also important for this feature to be scalable enough to support the needs of concurrent pipelines.
  • From our experience there are typically two reasons why large volumes of data are processed, namely:
      • SLA-critical machine-consumable data (with some tolerance to error)
      • Factual reporting with a “Close of Books” notion for human consumption (not always, but frequently enough)
    The first class of application is not affected much if some small percentage of data arrives late. Examples of this class of application include forecasting, predictions and risk management. The second class of application, however, is used for factual reporting, the results of which may be subject to audit. For these use cases, it is not acceptable to ignore data that arrived out of order or late. The platform in such cases needs to give the application author the ability to detect the arrival of late data and to enable re-processing. This might also require a cascading reprocess flow of all downstream apps. Having this service available off the shelf would relieve application developers of the pain of having to manage this themselves.
  • Large and ever-increasing data volumes are the reason one adopts a big data platform like Hadoop, and that automatically means we would run out of space pretty soon if we did not take care of evicting and purging older instances of data. A few problems to consider for retention are:
      • Should avoid using a general-purpose super user with world-writable privileges to delete old data (for obvious reasons)
      • Different types of data may require different criteria for aging and hence purging
      • Other life cycle functions, like archival of old data, if defined ought to be scheduled before eviction kicks in
  • Hadoop is becoming increasingly critical for many businesses. For some users the raw data volumes are too large to be shipped to one place for processing; for others, data needs to be redundantly available for business continuity reasons. In either scenario, replication of data from one cluster to another plays a vital role. Having this available as a service would again free the application developer from these responsibilities. The key challenges to consider while offering this as a service are:
      • Bandwidth consumption and management
      • Chunking/bulking strategy
      • Correctness guarantees
      • HDFS version compatibility issues
    =========================
    2 Dimensions:
      • BCP/DR
      • Local/Global Agg – ship local aggs as part of a pipeline
  • An integrated view of what is currently happening in the system, based on holistic information about all the elements in the system (data, associated management functions, processing logic and location), provides a compelling view of the “state of the system” at any time. This is a much-needed platform feature for the larger goal of “allowing the data application developer to focus on the business or processing logic”. Adding alerting and notifications to this completes the operability story.
    ===============================
      • Dashboard
      • Alerts
      • Notifications
  • Some of the things we have spoken about so far can be done with a siloed approach. For instance, it is possible to process a few data sets and produce a few more through a scheduler. However, if there are two other consumers of the data produced by the first workflow, then the same data will be defined again by each of those consumers, and so on. This leads to serious duplication of metadata about what data is ingested, processed or produced, where it is processed and how it is produced. A single system that captures all of this would be able to provide a fairly complete picture of what is happening, compared to a collection of independently scheduled applications. Otherwise, both the production support and application development teams on the Hadoop platform have to scramble to write custom scripts and monitoring systems to get a broader, holistic view of what is happening. An approach where this information is systematically collected and used for seamless management can alleviate much of the pain of those operating or developing data processing applications on Hadoop.
  • The entity graph at the core is what makes Falcon what it is; in a way, it enables all the unique features that Falcon has to offer or can potentially make available in the future. At the core:
      • Dependency between data processing logic and cluster end points
      • Rules governing data management, processing management and metadata management
  • System accepts entities using DSL
    Infrastructure, Datasets, Pipeline/Processing logic
    Transforms the input into automated and scheduled workflows
    System orchestrates workflows
    Instruments execution of configured policies
    Handles retry logic and late data processing
    Records audit, lineage
    Seamless integration with metastore/catalog (WIP)
    Provides notifications based on availability
    Integrated seamless experience to users
    Automates processing and tracks the end-to-end progress
    Data set management (replication, retention, etc.) offered as a service
    Users can cherry-pick; no coupling between primitives
    Provides hooks for monitoring, metrics collection
  • Ad Request, Click, Impression, Conversion feed
    Minutely (with identical location, retention configuration, but with many data centers)
    Summary data
    Hourly (with multiple partitions – one per dc, each configured as source and one target which is global datacenter)
    Click, Impression, Conversion enrichment & Summarizer processes
    Single definition with multiple data centers
    Identical periodicity and scheduling configuration
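The entity dependency idea in the notes above can be made concrete with a small sketch. The Python below is illustrative only and is not Falcon's API or entity format: the `Process` class, the `downstream_of` helper and the feed and process names (loosely modeled on the click/impression/conversion pipeline in the notes) are all assumptions. It shows how, once every process declares its input and output feeds in one place, a platform could work out which processes need a cascading re-run when late data arrives for a feed.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """Hypothetical stand-in for a declared processing entity."""
    name: str
    inputs: tuple   # names of feeds this process consumes
    outputs: tuple  # names of feeds this process produces


# Illustrative declarations; names are assumptions, not real Falcon entities.
PROCESSES = [
    Process("enrichment", ("click", "impression", "conversion"), ("enriched-events",)),
    Process("summarizer", ("enriched-events",), ("hourly-summary",)),
]


def downstream_of(feed_name, processes):
    """List processes to re-run, breadth-first, when late data lands in feed_name."""
    # Index: feed name -> processes that read that feed.
    consumers = defaultdict(list)
    for proc in processes:
        for name in proc.inputs:
            consumers[name].append(proc)

    order, seen = [], set()
    queue = deque([feed_name])
    while queue:
        current = queue.popleft()
        for proc in consumers[current]:
            if proc.name in seen:
                continue
            seen.add(proc.name)
            order.append(proc.name)
            # Feeds produced by this process are now stale as well.
            queue.extend(proc.outputs)
    return order
```

With these declarations, `downstream_of("click", PROCESSES)` yields the enrichment step followed by the summarizer. Breadth-first order is enough for this linear pipeline; a graph in which a process joins several derived feeds would need a proper topological sort.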

    1. Data Management Platform on Hadoop. Srikanth Sundarrajan, Venkatesh Seetharam. (Incubating)
    2. whoami. Srikanth Sundarrajan: Principal Architect, InMobi; Apache Hadoop Contributor; Hadoop Team @ Yahoo!. Venkatesh Seetharam: Architect/Developer, Hortonworks; Apache Hadoop Contributor; Data Management @ Yahoo!
    3. Agenda: 1 Motivation, 2 Falcon Overview, 3 Case Studies, 4 Questions & Answers
    4. MOTIVATION
    5. Data Processing Landscape: External data source, Acquire (Import), Data Processing (Transform/Pipeline), Eviction, Archive, Replicate (Copy), Export
    6. Core Services. Process: Late data management, Relays. Data management: Acquisition, Replication, Retention. Operability: SLA, Lineage
    7. Process Management – Relays (picture courtesy: http://istockphoto.com/)
    8. Late Data Management (picture courtesy: http://iwebask.com)
    9. Data Retention As Service (picture courtesy: http://vimeo.com/)
    10. Data Replication As Service (picture courtesy: http://boylesmedia.com)
    11. Data Acquisition As Service (picture courtesy: http://wmpu.org)
    12. Operability – Dashboard (picture courtesy: http://www.opentrack.ch/)
    14. Holistic Declaration of Intent (picture courtesy: http://bigboxdetox.com)
    15. Entity Dependency Graph: Hadoop / HBase …, Cluster, External data source, feed, Process, depends, depends
    16. High Level Architecture: Apache Falcon, Oozie, Messaging, HCatalog, Hadoop, Entity, Entity status, Process status / notification, CLI/REST, JMS, Config store
    17. Feed Schedule: Cluster xml, Feed xml, Falcon, Falcon config store / Graph, Retention / Replication workflow, Oozie Scheduler, HDFS, JMS Notification per action, Catalog service, Instance Management
    18. Process Schedule: Cluster/feed xml, Process xml, Falcon, Falcon config store / Graph, Process workflow, Oozie Scheduler, HDFS, JMS Notification per available feed, Catalog service, Instance Management
    19. Physical Architecture: Falcon Colo 1, Falcon Colo 2, Falcon Colo 3, Scheduler, Scheduler, Scheduler, Falcon – Prism, Global view
    20. CASE STUDY: Multi Cluster Failover
    21. CASE STUDY: Distributed Processing. Example: Digital Advertising @ InMobi
    22. Hadoop @ InMobi. About InMobi: World's leading independent mobile advertising company. Hadoop usage at InMobi: ~ 6 clusters; > 1PB of storage; > 5TB new data ingested each day; > 20TB data crunched each day; > 200 nodes in HDFS/MR clusters & > 40 nodes in HBase; > 175K Hadoop jobs / day; > 60K Oozie workflows / day; 300+ Falcon feed definitions; 100+ Falcon process definitions
    23. Processing – Single Data Center: Ad Request data, Impression render event, Click event, Conversion event; Continuous Streaming (minutely); Hourly summary; Enrichment (minutely/5 minutely); Summarizer
    24. Global Aggregation. DataCenter1 …….. DataCenterN, each with: Ad Request data, Impression render event, Click event, Conversion event; Continuous Streaming (minutely); Hourly summary; Enrichment (minutely/5 minutely); Summarizer. Consumable global aggregate
    25. HIGHLIGHTS
    26. Future: Security, Embed Pig/Hive scripts, Data Acquisition – file-based, Monitoring/Management Dashboard
    27. Summary
    28. Questions? Apache Falcon: http://falcon.incubator.apache.org, mailto: dev@falcon.incubator.apache.org. Srikanth Sundarrajan: sriksun@apache.org, #sriksun. Venkatesh Seetharam: venkatesh@apache.org, #innerzeal
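As a closing illustration of the retention service (slide 9 and its notes), the sketch below shows per-feed aging criteria and the rule that archival, where defined, must happen before eviction. It is a minimal Python sketch under stated assumptions, not Falcon's implementation: the `RETENTION` table, its feed names and limits, and the `instances_to_evict` helper are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical per-feed retention limits; names and values are illustrative only.
RETENTION = {
    "click-minutely": timedelta(days=2),
    "hourly-summary": timedelta(days=90),
}


def instances_to_evict(feed_name, instance_times, now, archived=None):
    """Return the feed instances that are safe to delete, oldest first.

    instance_times: datetimes identifying the feed's data instances.
    archived: None when the feed has no archival policy; otherwise the set of
    instances already archived. Expired instances that are not yet archived
    are held back, so archival always precedes eviction.
    """
    cutoff = now - RETENTION[feed_name]
    expired = [t for t in instance_times if t < cutoff]
    if archived is not None:
        expired = [t for t in expired if t in archived]
    return sorted(expired)
```

Different feeds get different aging criteria simply by having different entries in the table. The helper only decides what may be deleted; running the actual deletion as the feed's owner rather than as a general-purpose super user, as the notes recommend, is outside the scope of this sketch.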