Hadoop Summit 2014 – Apache Falcon: Hadoop First ETL Pipeline Designer

The pipeline designer allows users to author their processes and provision them on Falcon, which should make building applications on Falcon over Hadoop fairly trivial. Falcon can operate on HCatalog tables natively, meaning there is a one-to-one correspondence between a Falcon feed and an HCatalog table. Between the feed definition in Falcon and the underlying table definition in HCatalog, there is adequate metadata about the stored data. These datasets can then be operated on by a collection of transformations to derive more refined datasets/feeds. This transformation logic (currently expressed via Oozie workflows, Pig scripts, or MapReduce jobs) is typically represented as a Falcon process. In this talk we walk through the details of the pipeline designer and the current state of this feature.



  1. Hadoop First ETL on Apache Falcon
     Srikanth Sundarrajan, Naresh Agarwal
  2. About the Authors
     - Srikanth Sundarrajan, Principal Architect, InMobi Technology Services
     - Naresh Agarwal, Director – Engineering, InMobi Technology Services
  3. Agenda
     - ETL & Challenges with Big Data
     - Apache Falcon – Background
     - Pipeline Designer – Overview
     - Pipeline Designer – Internals
  4. Agenda
     - ETL & Challenges with Big Data
     - Apache Falcon – Background
     - Pipeline Designer – Overview
     - Pipeline Designer – Internals
  5. ETL (Extract, Transform, Load): Data, Information, Intelligence, Value
  6. ETL Use Cases
     - Data Warehouse
     - Data Migration
     - Data Consolidation
     - Master Data Management
     - Data Synchronization
     - Data Archiving
  7. ETL Authoring
     - Hand coded
     - In-house tools
     - Off-the-shelf tools
  8. ETL & Big Data – Challenges
     - Volume
     - Velocity
     - Variety
  9. Big Data ETL
     - Mostly hand coded (high cost: implementation + maintenance)
       - MapReduce
       - Hive (i.e. SQL)
       - Pig
       - Crunch / Cascading
       - Spark
     - Off-the-shelf tools (scale/performance)
       - Mostly retrofitted
  10. Agenda
     - ETL & Challenges with Big Data
     - Apache Falcon – Background
     - Pipeline Designer – Overview
     - Pipeline Designer – Internals
  11. Apache Falcon
     - Off the shelf, Falcon provides standard data management functions through declarative constructs
     - Data movement recipes
       - Cross data center replication
       - Cross cluster data synchronization
     - Data retention recipes
       - Eviction
       - Archival
  12. Apache Falcon
     - However, ETL-related functions are still largely left to the developer to implement. Falcon today manages only:
       - Orchestration
       - Late data handling / change data capture
       - Retries
       - Monitoring
  13. Agenda
     - ETL & Challenges with Big Data
     - Apache Falcon – Background
     - Pipeline Designer – Overview
     - Pipeline Designer – Internals
  14. Pipeline Designer – Basics
  15. Pipeline Designer – Basics
     - Feed
       - A data entity that Falcon manages, physically present in a cluster
       - Data in a feed conforms to a schema, and its partitions are registered with HCatalog
       - Data management functions such as eviction and archival are declaratively specified through Falcon feed definitions
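The talk does not show a feed definition, but as a minimal sketch, a Falcon feed that maps one-to-one onto an HCatalog table and declares retention might look like the following. The feed name, cluster name, database, table, and partition scheme here are all hypothetical placeholders, not taken from the talk:

```xml
<feed name="clicks" description="Hourly raw click events" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <cluster name="primary-cluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- Declarative retention: evict partitions older than 30 days -->
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>
  <!-- One-to-one correspondence with an HCatalog table; the partition
       expression is materialized per feed instance -->
  <table uri="catalog:default:clicks#ds=${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  <ACL owner="etl-user" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Because the schema and partitions live in HCatalog, the feed definition itself stays small: Falcon reads the table metadata rather than duplicating it.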
  16. Pipeline Designer – Basics
  17. Pipeline Designer – Basics
     - Process
       - A workflow that defines the various actions to be performed, along with their control flow
       - Executes at a specified frequency on one or more clusters
     - Pipelines
       - A logical grouping of Falcon processes owned and operated together
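Again as an illustrative sketch (feed names, cluster name, and workflow path are hypothetical), a Falcon process definition ties input and output feeds to a workflow and a schedule:

```xml
<process name="click-summarizer" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primary-cluster">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <!-- Executes once per hour on the cluster(s) above -->
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <inputs>
    <input name="input" feed="clicks" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="click-summary" instance="now(0,0)"/>
  </outputs>
  <!-- The workflow holds the actual transformation logic -->
  <workflow engine="pig" path="/apps/clicks/summarize.pig"/>
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
</process>
```

Orchestration, retries, and late data handling (the functions Falcon manages today) are all expressed declaratively in this entity rather than in the workflow code.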
  18. Pipeline Designer – Basics
  19. Pipeline Designer – Basics
     - Actions
       - Actions in the designer are the building blocks of process workflows
       - Actions have access to output variables emitted earlier in the flow and can emit output variables of their own
       - Actions can transition to other actions:
         - Default / success transition
         - Failure transition
         - Conditional transition
       - The transformation action is a special action that is itself a collection of transforms
  20. Pipeline Designer – Basics
  21. Pipeline Designer – Basics
     - Transforms
       - A data manipulation function that accepts one or more inputs with well-defined schemas and produces one or more outputs
       - Multiple transform elements can be stitched together to compose a single transformation action, which can in turn be used to build a flow
     - Composite transformations
       - Transforms built through a combination of multiple primitive transforms
       - It is possible to add more transforms and extend the system
  22. Pipeline Designer – Basics
     - Deployment & Monitoring
       - Once a process and its pipeline are composed, they are deployed to Falcon as a standard process
  23. Agenda
     - ETL & Challenges with Big Data
     - Apache Falcon – Background
     - Pipeline Designer – Overview
     - Pipeline Designer – Internals
  24. Pipeline Designer Service (architecture diagram) – components: Designer UI, Falcon Dashboard, Pipeline Designer Service with a REST API, versioned storage of flows / actions / transforms, compiler + optimizer, Falcon server (process, feed), HCatalog service (schema)
  25. Pipeline Designer – Internals
     - Transformation actions are compiled into Pig scripts
     - Actions and flows are compiled into Falcon process definitions
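A compiled transformation action might resemble the following Pig script. This is only a sketch of the general shape (HCatalog loader/storer wrapped around relational transforms); the table names, column names, and the $DS parameter are invented for illustration, not output of the actual compiler:

```pig
-- Load the input feed's HCatalog table
clicks = LOAD 'default.clicks' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Restrict to the partition for this process instance ($DS bound at run time)
hourly = FILTER clicks BY ds == '$DS';

-- A simple aggregation transform
grouped = GROUP hourly BY campaign_id;
summary = FOREACH grouped GENERATE
            group AS campaign_id,
            COUNT(hourly) AS click_count;

-- Write the output feed's HCatalog table partition
STORE summary INTO 'default.click_summary'
  USING org.apache.hive.hcatalog.pig.HCatStorer('ds=$DS');
```

Because both the input and output are HCatalog tables backing Falcon feeds, the compiler can check transform schemas against the registered table schemas before the script ever runs.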
  26. Text
  27. Q & A
  28. Thanks
      sriksun@apache.org
      naresh.agarwal@inmobi.com
