10. Agenda
ETL & Challenges with Big Data
Apache Falcon – Background
Pipeline Designer – Overview
Pipeline Designer – Internals
11. Apache Falcon
Out of the box, Falcon provides standard data
management functions through declarative constructs
Data movement recipes
Cross data center replication
Cross cluster data synchronization
Data retention recipes
Eviction
Archival
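These replication and retention recipes are declared directly in a Falcon feed entity. A minimal sketch, assuming hypothetical cluster names, paths, and retention periods:

```xml
<feed name="clicksFeed" description="hourly click stream" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- source cluster: instances are evicted after 90 days -->
    <cluster name="primaryCluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- target cluster: Falcon replicates data here and retains it longer -->
    <cluster name="backupCluster" type="target">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(36)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
  </locations>
  <ACL owner="etl-user" group="etl" permission="0755"/>
  <schema location="/schemas/clicks.avsc" provider="avro"/>
</feed>
```

Declaring a second cluster of type `target` is what triggers cross-cluster replication; eviction falls out of the per-cluster `retention` element.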
12. Apache Falcon
However, ETL-related functions are still largely left to
the developer to implement. Today, Falcon manages
only
Orchestration
Late data handling / Change data capture
Retries
Monitoring
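These orchestration concerns are likewise declarative. A hedged sketch of the relevant parts of a process entity, with hypothetical entity names and workflow paths:

```xml
<process name="clicksSummary" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="input" feed="clicksFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="clicksSummaryFeed" instance="now(0,0)"/>
  </outputs>
  <!-- orchestration: Falcon hands the workflow to Oozie on schedule -->
  <workflow engine="oozie" path="/apps/clicks-summary/workflow.xml"/>
  <!-- retries: re-run a failed instance up to 3 times, 15 minutes apart -->
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
  <!-- late data handling: reprocess when input data arrives late -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clicks-summary/late-workflow.xml"/>
  </late-process>
</process>
```

The workflow referenced here — the actual ETL logic — is still hand-built by the developer, which is the gap the Pipeline Designer aims to close.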
13. Agenda
ETL & Challenges with Big Data
Apache Falcon – Background
Pipeline Designer – Overview
Pipeline Designer – Internals
15. Pipeline Designer – Basics
Feed
A data entity that Falcon manages and that is physically
present in a cluster.
Data in a feed conforms to a schema, and its
partitions are registered with HCatalog
Data management functions such as eviction, archival etc.
are declaratively specified through Falcon feed
definitions
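Registration with HCatalog is expressed by backing the feed with a catalog table instead of raw HDFS paths. A hedged fragment, with hypothetical database, table, and partition names:

```xml
<feed name="clicksTableFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <!-- partitions of this table are registered with HCatalog;
       eviction drops whole partitions via the metastore -->
  <table uri="catalog:etl_db:clicks#ds=${YEAR}-${MONTH}-${DAY}"/>
  <ACL owner="etl-user" group="etl" permission="0755"/>
</feed>
```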
17. Pipeline Designer – Basics
Process
A workflow that defines the actions to be
performed, along with their control flow
Executes at a specified frequency on one or more
clusters
Pipelines
A logical grouping of Falcon processes that are owned and
operated together
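Recent Falcon versions let a process name the pipeline(s) it belongs to directly in its definition, so that related processes can be monitored and operated as one unit. A hedged fragment with hypothetical names:

```xml
<process name="clicksSummary" xmlns="uri:falcon:process:0.1">
  <!-- processes sharing a pipeline name are grouped in the dashboard
       and operated together; other elements (clusters, inputs,
       outputs, workflow) are omitted here for brevity -->
  <pipelines>clicks-etl</pipelines>
  <frequency>hours(1)</frequency>
</process>
```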
19. Pipeline Designer – Basics
Actions
Actions in the designer are the building blocks of process
workflows.
Actions have access to output variables emitted earlier in the
flow and can emit output variables of their own
Actions can transition to other actions
Default / Success Transition
Failure Transition
Conditional Transition
A transformation action is a special action that is itself a
collection of transforms
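The three transition types map naturally onto the constructs of Oozie, which Falcon uses for orchestration. A hedged sketch of what a compiled flow could look like — all action and node names here are hypothetical:

```xml
<workflow-app name="designer-flow" xmlns="uri:oozie:workflow:0.4">
  <start to="extract"/>
  <action name="extract">
    <fs><mkdir path="${nameNode}/tmp/staging"/></fs>
    <ok to="check-volume"/>        <!-- default / success transition -->
    <error to="notify-failure"/>   <!-- failure transition -->
  </action>
  <decision name="check-volume">   <!-- conditional transition -->
    <switch>
      <case to="transform">${fs:dirSize(wf:conf('stagingDir')) gt 0}</case>
      <default to="end"/>
    </switch>
  </decision>
  <action name="transform">
    <fs><move source="${nameNode}/tmp/staging" target="${nameNode}/data/out"/></fs>
    <ok to="end"/>
    <error to="notify-failure"/>
  </action>
  <kill name="notify-failure">
    <message>Flow failed at [${wf:lastErrorNode()}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```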
21. Pipeline Designer – Basics
Transforms
A data manipulation function that accepts one or more
inputs with well-defined schemas and produces one or
more outputs
Multiple transforms can be stitched together to
compose a single transformation action, which in turn
can be used to build a flow
Composite Transformations
Transforms that are built by combining multiple
primitive transforms
It is possible to add more transforms and extend the system
22. Pipeline Designer – Basics
Deployment & Monitoring
Once a process and its pipeline are composed, they are
deployed to Falcon as a standard process
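Since the designer's output is a standard process, deployment and monitoring follow the usual Falcon CLI flow. A sketch — entity and file names are hypothetical:

```shell
# submit the generated process definition to Falcon
falcon entity -type process -submit -file clicks-summary.xml

# schedule it so Falcon starts orchestrating instances
falcon entity -type process -schedule -name clicksSummary

# monitor instance status over a time window
falcon instance -type process -name clicksSummary -status \
  -start 2014-01-01T00:00Z -end 2014-01-02T00:00Z
```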
23. Agenda
ETL & Challenges with Big Data
Apache Falcon – Background
Pipeline Designer – Overview
Pipeline Designer – Internals
24. Pipeline Designer Service
[Architecture diagram: the Designer UI and the Falcon Dashboard talk to the
Pipeline Designer Service over a REST API. The service persists flows, actions,
and transforms in versioned storage, runs them through a compiler and optimizer,
submits the resulting process and feed definitions to the Falcon Server, and
resolves schemas via the HCatalog Service.]
25. Pipeline Designer – Internals
Transformation actions are compiled into Pig scripts
Actions and flows are compiled into Falcon process
definitions
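For illustration, a transformation action composed of a filter transform and a projection transform might compile to a Pig script along these lines — table, path, and field names are hypothetical:

```pig
-- load the input feed instance registered in HCatalog
clicks = LOAD 'etl_db.clicks' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- filter transform: keep only valid events
valid = FILTER clicks BY status == 'OK';

-- projection transform: emit the output feed's schema
summary = FOREACH valid GENERATE user_id, url, event_time;

-- write the output feed instance
STORE summary INTO '/data/clicks-summary/${YEAR}-${MONTH}-${DAY}' USING PigStorage(',');
```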
We are first going to look at general applications and use cases of ETL, and at the specific challenges of ETL over big data
Then we see how Apache Falcon attempts to address these in an upcoming feature
Pipeline Designer is a new feature being added to Falcon to support ETL authoring capabilities; we look into the specifics of this feature and the designer internals
Finally, we look at some mocks of this feature to get a sense of how it will shape up.
As data is further refined, curated and processed into meaningful information and insights/intelligence, higher-order value is derived from it. ETL plays a pivotal role in this derivation process. Decades ago, data used to reside in just one or very few systems and data integration / ETL weren't dominant problems, but as systems were broken down into numerous subsystems, this has assumed a lot of significance. With the explosion of, and focus on, data, the needs and complexity will only increase further.
Data warehousing is probably one of the most common use cases one might have come across in the context of ETL, but there are other use cases besides data warehousing and business intelligence.
Data Migration – When migrating one data model to another or migrating from one system to another
Data Consolidation – During mergers and acquisitions one often ends up needing to consolidate data
Data Archiving – Moving data to low cost storage mostly to support compliance requirements
Master Data Management – To provide a single source of truth for master data across all systems within an organization
Data Synchronization – To support cross-data-center synchronization for DR and BCP purposes
ETL has for the longest period in history been authored through hand-coded scripts, in-house tools catering specifically to the context of a business, or general-purpose off-the-shelf tools with a possibly wide variety of connectors and plugins.
When it comes to large scale or big data the challenges are further compounded.
Volume – Scale & Size
Variety – Diverse sources, dynamic schema / unstructured
Velocity – Freshness, cycle turnaround time