Hadoop Summit 2014: Apache Falcon, Hadoop First ETL Pipeline Designer

Pipeline Designer allows users to author their processes and provision them on Falcon, which should make building applications on Falcon over Hadoop considerably simpler. Falcon can operate on HCatalog tables natively, so there is a one-to-one correspondence between a Falcon feed and an HCatalog table. Together, the feed definition in Falcon and the underlying table definition in HCatalog provide adequate metadata about the data stored underneath. One or more of these datasets can then be run through a collection of transformations to produce a more refined dataset or feed. This logic, currently written as Oozie workflows, Pig scripts, or MapReduce jobs, is typically represented as a Falcon process. In this talk we walk through the details of the Pipeline Designer and the current state of this feature.

Transcript

  • 1. Hadoop First ETL on Apache Falcon
    • Srikanth Sundarrajan
    • Naresh Agarwal
  • 2. About Authors
    • Srikanth Sundarrajan
      • Principal Architect, InMobi Technology Services
    • Naresh Agarwal
      • Director – Engineering, InMobi Technology Services
  • 3. Agenda
    • ETL & Challenges with Big Data
    • Apache Falcon – Background
    • Pipeline Designer – Overview
    • Pipeline Designer – Internals
  • 4. Agenda
    • ETL & Challenges with Big Data
    • Apache Falcon – Background
    • Pipeline Designer – Overview
    • Pipeline Designer – Internals
  • 5. ETL (Extract Transform Load) Intelligence Information Data Value
  • 6. ETL Use Cases
    • Data Warehouse
    • Data Migration
    • Data Consolidation
    • Master Data Management
    • Data Synchronization
    • Data Archiving
  • 7. ETL Authoring
    • Hand coded
    • In-house tools
    • Off-the-shelf tools
  • 8. ETL & Big Data – Challenges
    • Volume
    • Variety
    • Velocity
  • 9. Big Data ETL
    • Mostly hand coded (high cost: implementation + maintenance)
      • MapReduce
      • Hive (i.e. SQL)
      • Pig
      • Crunch / Cascading
      • Spark
    • Off-the-shelf tools (scale / performance)
      • Mostly retrofitted
  • 10. Agenda
    • ETL & Challenges with Big Data
    • Apache Falcon – Background
    • Pipeline Designer – Overview
    • Pipeline Designer – Internals
  • 11. Apache Falcon
    • Off the shelf, Falcon provides standard data management functions through declarative constructs
      • Data movement recipes
        • Cross data center replication
        • Cross cluster data synchronization
      • Data retention recipes
        • Eviction
        • Archival
  • 12. Apache Falcon
    • However, ETL-related functions are still largely left to the developer to implement. Falcon today manages only:
      • Orchestration
      • Late data handling / Change data capture
      • Retries
      • Monitoring
  • 13. Agenda
    • ETL & Challenges with Big Data
    • Apache Falcon – Background
    • Pipeline Designer – Overview
    • Pipeline Designer – Internals
  • 14. Pipeline Designer – Basics
  • 15. Pipeline Designer – Basics
    • Feed
      • A data entity that Falcon manages and that is physically present in a cluster
      • Data in a feed conforms to a schema, and its partitions are registered with HCatalog
      • Data management functions such as eviction and archival are specified declaratively through Falcon feed definitions
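The feed description above can be illustrated with a minimal Python sketch. It is not Falcon's entity format or API: the `Feed` class, its fields, and `partitions_to_evict` are hypothetical names chosen only to show a feed that carries a schema, points at an HCatalog table, and declares a retention period from which eviction candidates follow.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Feed:
    """Hypothetical, simplified stand-in for a Falcon feed definition."""
    name: str
    table: str            # HCatalog table backing the feed, as "db.table"
    schema: dict          # column name -> type, as registered in HCatalog
    retention: timedelta  # how long partitions are kept before eviction


def partitions_to_evict(feed, partitions, now):
    """Return the partition timestamps that fall outside the retention window."""
    cutoff = now - feed.retention
    return [p for p in partitions if p < cutoff]


# Illustrative feed: daily partitions, kept for two days.
clicks = Feed(
    name="clicks",
    table="default.clicks",
    schema={"user_id": "string", "url": "string", "ts": "bigint"},
    retention=timedelta(days=2),
)
now = datetime(2014, 6, 4)
partitions = [datetime(2014, 6, day) for day in (1, 2, 3)]
evicted = partitions_to_evict(clicks, partitions, now)
print(evicted)  # only the 2014-06-01 partition is older than the cutoff
```

The point of the sketch is that retention is data on the feed rather than code in a job, which is what "declaratively specified" means here.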
  • 16. Pipeline Designer – Basics
  • 17. Pipeline Designer – Basics
    • Process
      • A workflow that defines the actions that need to be performed, along with their control flow
      • Executes at a specified frequency on one or more clusters
    • Pipelines
      • A logical grouping of Falcon processes that are owned and operated together
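As a rough illustration of the two ideas on this slide, the Python sketch below models a process that runs at a fixed frequency and groups processes into pipelines. The `Process` class, `instance_times`, and `group_by_pipeline` are hypothetical names for this example only and do not reflect Falcon's actual scheduler or entity model.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Process:
    """Hypothetical, simplified view of a Falcon process."""
    name: str
    pipeline: str         # logical group the process belongs to
    frequency: timedelta  # how often an instance is scheduled
    clusters: tuple       # names of the clusters the process runs on


def instance_times(process, start, end):
    """Nominal run times in [start, end), spaced by the process frequency."""
    times = []
    current = start
    while current < end:
        times.append(current)
        current += process.frequency
    return times


def group_by_pipeline(processes):
    """Map each pipeline name to the names of the processes it contains."""
    grouped = defaultdict(list)
    for process in processes:
        grouped[process.pipeline].append(process.name)
    return dict(grouped)


processes = [
    Process("sessionize", "clickstream", timedelta(hours=1), ("primary",)),
    Process("aggregate", "clickstream", timedelta(days=1), ("primary", "backup")),
    Process("billing-rollup", "billing", timedelta(days=1), ("primary",)),
]
runs = instance_times(processes[0], datetime(2014, 6, 3, 0), datetime(2014, 6, 3, 3))
pipelines = group_by_pipeline(processes)
print(runs)
print(pipelines)
```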
  • 18. Pipeline Designer – Basics
  • 19. Pipeline Designer – Basics
    • Actions
      • Actions in the designer are the building blocks for process workflows
      • Actions have access to output variables emitted earlier in the flow and can emit output variables of their own
      • Actions can transition to other actions
        • Default / Success Transition
        • Failure Transition
        • Conditional Transition
      • A transformation action is a special action that is itself a collection of transforms
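The action model above (output variables plus success, failure, and conditional transitions) can be sketched as a small interpreter in Python. This is an illustration under assumptions, not the designer's implementation: `Action`, `run_flow`, and the rule that conditional transitions are checked before the default one are all choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Action:
    """Hypothetical action: runs a function and names its possible successors."""
    name: str
    run: Callable[[Dict], Dict]       # reads earlier variables, returns new ones
    on_success: Optional[str] = None  # default / success transition
    on_failure: Optional[str] = None  # failure transition
    # (predicate, target) pairs; assumed to be checked before on_success
    conditional: List[Tuple[Callable[[Dict], bool], str]] = field(default_factory=list)


def run_flow(actions, start):
    """Execute actions from `start`, following transitions until none remains."""
    by_name = {action.name: action for action in actions}
    variables = {}
    visited = []
    current = start
    while current is not None:
        action = by_name[current]
        visited.append(current)
        try:
            variables.update(action.run(variables))
        except Exception:
            current = action.on_failure
            continue
        current = next(
            (target for predicate, target in action.conditional if predicate(variables)),
            action.on_success,
        )
    return visited, variables


flow = [
    Action("count", lambda v: {"rows": 0}, on_success="transform", on_failure="alert",
           conditional=[(lambda v: v["rows"] == 0, "notify_empty")]),
    Action("transform", lambda v: {"status": "transformed"}),
    Action("notify_empty", lambda v: {"status": "skipped: %d rows" % v["rows"]}),
    Action("alert", lambda v: {"status": "failed"}),
]
visited, variables = run_flow(flow, "count")
print(visited, variables)
```

Here `notify_empty` reads the `rows` variable emitted by `count`, which is the "access to output variables earlier in the flow" behaviour the slide describes.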
  • 20. Pipeline Designer – Basics
  • 21. Pipeline Designer – Basics
    • Transforms
      • A data manipulation function that accepts one or more inputs with a well-defined schema and produces one or more outputs
      • Multiple transform elements can be stitched together to compose a single transformation action, which can in turn be used to build a flow
    • Composite Transformations
      • Transforms that are built through a combination of multiple primitive transforms
      • Possible to add more transforms and extend the system
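A minimal Python sketch of stitching primitive transforms into a composite one follows. It simplifies the slide's model to a single input and a single output per transform, and the `Transform` class and `compose` function are hypothetical names for this example rather than the designer's own types.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Transform:
    """Hypothetical single-input, single-output transform with declared schemas."""
    name: str
    input_schema: Dict[str, str]
    output_schema: Dict[str, str]
    apply: Callable[[List[Row]], List[Row]]


def compose(name, transforms):
    """Stitch transforms into one composite, checking that adjacent schemas match."""
    for upstream, downstream in zip(transforms, transforms[1:]):
        if upstream.output_schema != downstream.input_schema:
            raise ValueError(f"{upstream.name} -> {downstream.name}: schema mismatch")

    def apply_all(rows):
        for transform in transforms:
            rows = transform.apply(rows)
        return rows

    return Transform(name, transforms[0].input_schema,
                     transforms[-1].output_schema, apply_all)


raw = {"user_id": "string", "hits": "int"}
filter_heavy = Transform("filter_heavy", raw, raw,
                         lambda rows: [r for r in rows if r["hits"] > 100])
project_user = Transform("project_user", raw, {"user_id": "string"},
                         lambda rows: [{"user_id": r["user_id"]} for r in rows])

heavy_users = compose("heavy_users", [filter_heavy, project_user])
result = heavy_users.apply([{"user_id": "a", "hits": 50}, {"user_id": "b", "hits": 500}])
print(result)
```

Because the composite is itself a `Transform`, it can be passed to `compose` again, which is one way to read the slide's point about building composite transformations from primitives.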
  • 22. Pipeline Designer – Basics
    • Deployment & Monitoring
      • Once a process and the pipeline are composed, they are deployed in Falcon as a standard process
  • 23. Agenda
    • ETL & Challenges with Big Data
    • Apache Falcon – Background
    • Pipeline Designer – Overview
    • Pipeline Designer – Internals
  • 24. Pipeline Designer Service (architecture diagram; labels: Designer UI, Falcon Dashboard, Pipeline Designer Service, REST API, Versioned Storage, Flow / Action / Transforms, Compiler + Optimizer, Falcon Server, HCatalog Service, Process, Feed, Schema)
  • 25. Pipeline Designer – Internals
    • Transformation actions are compiled into Pig scripts
    • Actions and flows are compiled into Falcon process definitions
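To make the first point concrete, here is a toy Python sketch that turns a linear chain of transforms into Pig Latin text. It is only an assumption about what such a compiler could look like: `compile_to_pig`, the `("filter", ...)` / `("project", ...)` step encoding, and the paths are invented for this example, and plain `LOAD` / `STORE` statements stand in for whatever loaders the real designer generates.

```python
def compile_to_pig(source, schema, steps, target):
    """Compile a linear chain of (kind, argument) steps into a Pig Latin script.

    Hypothetical mini-compiler: supports only "filter" (a Pig boolean
    expression) and "project" (a list of field names).
    """
    fields = ", ".join(f"{name}:{pig_type}" for name, pig_type in schema)
    lines = [f"r0 = LOAD '{source}' AS ({fields});"]
    for i, (kind, arg) in enumerate(steps, start=1):
        prev, cur = f"r{i - 1}", f"r{i}"
        if kind == "filter":
            lines.append(f"{cur} = FILTER {prev} BY {arg};")
        elif kind == "project":
            lines.append(f"{cur} = FOREACH {prev} GENERATE {', '.join(arg)};")
        else:
            raise ValueError(f"unsupported transform: {kind}")
    lines.append(f"STORE r{len(steps)} INTO '{target}';")
    return "\n".join(lines)


script = compile_to_pig(
    source="/data/clicks",
    schema=[("user_id", "chararray"), ("hits", "int")],
    steps=[("filter", "hits > 100"), ("project", ["user_id"])],
    target="/data/heavy_users",
)
print(script)
```

Each transform becomes one relation in the script, so the generated Pig stays readable and an optimizer stage could merge or reorder steps before the text is emitted.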
  • 26. Text
  • 27. Q & A
  • 28. Thanks
    • sriksun@apache.org
    • naresh.agarwal@inmobi.com