Hadoop first ETL on Apache Falcon


Published in: Technology, Business
  • We look at general applications and use cases of ETL, and at the specific challenges of ETL over big data.
    We then see how Apache Falcon attempts to address these in an upcoming feature.
    Pipeline Designer is a new feature being added to Falcon to support ETL authoring; we look into the specifics of this feature and the designer internals.
    Finally, we look at some mocks of this feature to get a sense of how it might shape up.
  • As data is refined, curated and processed into meaningful information and insights/intelligence, higher-order value is derived from it. ETL plays a pivotal role in this derivation process. Decades ago, data resided in just one or very few systems, and data integration / ETL weren't dominant problems; as systems were broken down into numerous subsystems, ETL has assumed much greater significance. With the explosion of, and focus on, data, the needs and complexity are only set to increase further.
  • Data warehousing is probably one of the most common use cases one comes across in the context of ETL, but there are other use cases besides data warehousing and business intelligence.
    Data Migration – When migrating from one data model to another, or from one system to another
    Data Consolidation – Mergers and acquisitions often create a need to consolidate data
    Data Archiving – Moving data to low-cost storage, mostly to support compliance requirements
    Master Data Management – To support a single source of truth for master data across all systems within an organization
    Data Synchronization – To keep data in sync across data centers for disaster recovery (DR) and business continuity planning (BCP) purposes
  • ETL has long been authored through hand-coded scripts, through in-house tools catering specifically to the context of a business, or through general-purpose off-the-shelf tools, possibly with a wide variety of connectors and plugins.
  • When it comes to large-scale or big data, the challenges are further compounded.
    Volume – Scale and size
    Variety – Diverse sources, dynamic schema / unstructured data
    Velocity – Freshness, cycle turnaround time
  • Hadoop first ETL on Apache Falcon

    1. Hadoop First ETL On Apache Falcon: Srikanth Sundarrajan, Naresh Agarwal
    2. About Authors
        • Srikanth Sundarrajan: Principal Architect, InMobi Technology Services
        • Naresh Agarwal: Director – Engineering, InMobi Technology Services
    3. Agenda
        • ETL & Challenges with Big Data
        • Apache Falcon – Background
        • Pipeline Designer – Overview
        • Pipeline Designer – Internals
    4. Agenda
        • ETL & Challenges with Big Data
        • Apache Falcon – Background
        • Pipeline Designer – Overview
        • Pipeline Designer – Internals
    5. ETL (Extract Transform Load) [diagram labels: Intelligence, Information, Data, Value]
    6. ETL Use cases
        • Data Warehouse
        • Data Migration
        • Data Consolidation
        • Master Data Management
        • Data Synchronization
        • Data Archiving
    7. ETL Authoring
        • Hand coded
        • In-house tools
        • Off-the-shelf tools
    8. ETL & Big Data – Challenges
        • Volume
        • Variety
        • Velocity
    9. Big Data ETL
        • Mostly hand coded (high cost: implementation + maintenance)
            • MapReduce
            • Hive (i.e. SQL)
            • Pig
            • Crunch / Cascading
            • Spark
        • Off-the-shelf tools (scale/performance)
            • Mostly retrofitted
    10. Agenda
        • ETL & Challenges with Big Data
        • Apache Falcon – Background
        • Pipeline Designer – Overview
        • Pipeline Designer – Internals
    11. Apache Falcon
        • Off the shelf, Falcon provides standard data management functions through declarative constructs
            • Data movement recipes
                • Cross data center replication
                • Cross cluster data synchronization
            • Data retention recipes
                • Eviction
                • Archival
    12. Apache Falcon
        • However, ETL-related functions are still largely left to the developer to implement. Falcon today manages only
            • Orchestration
            • Late data handling / Change data capture
            • Retries
            • Monitoring
    13. Agenda
        • ETL & Challenges with Big Data
        • Apache Falcon – Background
        • Pipeline Designer – Overview
        • Pipeline Designer – Internals
    14. Pipeline Designer – Basics
    15. Pipeline Designer – Basics
        • Feed
            • A data entity that Falcon manages and that is physically present in a cluster
            • Data in a feed conforms to a schema, and its partitions are registered with HCatalog
            • Data management functions such as eviction and archival are declaratively specified through Falcon Feed definitions
    16. Pipeline Designer – Basics
    17. Pipeline Designer – Basics
        • Process
            • A workflow that defines the actions that need to be performed, along with the control flow
            • Executes at a specified frequency on one or more clusters
        • Pipelines
            • A logical grouping of Falcon processes that are owned and operated together
    18. Pipeline Designer – Basics
    19. Pipeline Designer – Basics
        • Actions
            • Actions in the designer are the building blocks for process workflows
            • Actions have access to output variables emitted earlier in the flow and can emit output variables of their own
            • Actions can transition to other actions
                • Default / Success Transition
                • Failure Transition
                • Conditional Transition
            • A transformation action is a special action that is itself a collection of transforms
    20. Pipeline Designer – Basics
    21. Pipeline Designer – Basics
        • Transforms
            • A transform is a data manipulation function that accepts one or more inputs with a well-defined schema and produces one or more outputs
            • Multiple transform elements can be stitched together to compose a single transformation action, which can in turn be used to build a flow
        • Composite Transformations
            • Transforms that are built through a combination of multiple primitive transforms
            • It is possible to add more transforms and extend the system
    22. Pipeline Designer – Basics
        • Deployment & Monitoring
            • Once a process and its pipeline are composed, they are deployed in Falcon as a standard process
    23. Agenda
        • ETL & Challenges with Big Data
        • Apache Falcon – Background
        • Pipeline Designer – Overview
        • Pipeline Designer – Internals
    24. Pipeline Designer Service [architecture diagram labels: Pipeline Designer, Pipeline Designer Service, REST API, Versioned Storage, Flow / Action / Transforms, Compiler + Optimizer, Falcon Server, HCatalog Service, Designer UI, Falcon Dashboard, Process, Feed, Schema]
    25. Pipeline Designer – Internals
        • Transformation actions are compiled into Pig scripts
        • Actions and Flows are compiled into Falcon Process definitions
    26. Mocks
    27. Q & A
    28. Thanks: sriksun@apache.org, naresh.agarwal@inmobi.com
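
The slides state that transformation actions are compiled into Pig scripts, but do not show how. The following is a minimal Python sketch of that idea, not the Pipeline Designer's actual compiler: the `Filter` and `Project` classes, the `compile_transformation` function and the relation names `r0`, `r1`, ... are all hypothetical, and the sketch loads from a plain path rather than through HCatalog. Only the Pig Latin statements it emits (`LOAD`, `FILTER`, `FOREACH ... GENERATE`, `STORE`) are real Pig syntax.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Filter:
    """Hypothetical primitive transform: keep rows matching a Pig boolean expression."""
    condition: str

    def to_pig(self, src: str, dst: str) -> str:
        return f"{dst} = FILTER {src} BY {self.condition};"


@dataclass
class Project:
    """Hypothetical primitive transform: keep only the named columns."""
    columns: List[str]

    def to_pig(self, src: str, dst: str) -> str:
        return f"{dst} = FOREACH {src} GENERATE {', '.join(self.columns)};"


def compile_transformation(input_path, schema, transforms, output_path):
    """Stitch a chain of transforms into one Pig script (illustrative only).

    Each transform reads the relation produced by the previous one, so the
    chain r0 -> r1 -> ... mirrors how transforms are composed into a single
    transformation action.
    """
    lines = [f"r0 = LOAD '{input_path}' AS ({schema});"]
    for i, transform in enumerate(transforms, start=1):
        lines.append(transform.to_pig(f"r{i - 1}", f"r{i}"))
    lines.append(f"STORE r{len(transforms)} INTO '{output_path}';")
    return "\n".join(lines)
```

Calling `compile_transformation("/data/clicks", "user:chararray, clicks:int", [Filter("clicks > 0"), Project(["user"])], "/data/out")` would yield a four-statement script: one `LOAD`, one statement per transform, and one `STORE`. A real compiler would also need the optimizer step shown in the architecture slide, which this sketch omits.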
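
The action model from the Pipeline Designer basics (actions that read and emit output variables, with success, failure and conditional transitions) can be illustrated with a small Python sketch. This is an assumption-laden toy, not Falcon code: the `Action` class, its field names and the `run_flow` function are hypothetical, a failure is modelled as a raised exception, and conditional transitions are checked before the default success transition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Vars = Dict[str, object]


@dataclass
class Action:
    """Hypothetical workflow action with the three transition kinds."""
    name: str
    run: Callable[[Vars], Vars]          # reads earlier output variables, returns new ones
    on_success: Optional[str] = None     # default / success transition
    on_failure: Optional[str] = None     # failure transition
    conditional: List[Tuple[Callable[[Vars], bool], str]] = field(default_factory=list)


def run_flow(actions: Dict[str, Action], start: str) -> Tuple[List[str], Vars]:
    """Walk the flow from `start` until an action has no next transition.

    Returns the visited action names and the accumulated output variables.
    The sketch assumes an acyclic flow; a cycle would loop forever.
    """
    variables: Vars = {}
    visited: List[str] = []
    current: Optional[str] = start
    while current is not None:
        action = actions[current]
        visited.append(current)
        try:
            variables.update(action.run(variables))
        except Exception:
            current = action.on_failure
            continue
        # First matching conditional transition wins; otherwise take the default.
        current = next(
            (target for cond, target in action.conditional if cond(variables)),
            action.on_success,
        )
    return visited, variables
```

For example, a `count` action that emits `{"rows": 0}` and carries a conditional transition on `rows == 0` to a `skip` action would bypass its default `load` transition. In the designer itself, the slides say flows like this are compiled into Falcon Process definitions rather than interpreted directly.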