10. Agenda
ETL & Challenges with Big Data
Apache Falcon – Background
Pipeline Designer – Overview
Pipeline Designer – Internals
11. Apache Falcon
Out of the box, Falcon provides standard data
management functions through declarative constructs
Data movement recipes
Cross data center replication
Cross cluster data synchronization
Data retention recipes
Eviction
Archival
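These replication and retention recipes are declared directly in a Falcon feed entity. A minimal sketch, assuming hypothetical cluster names, paths, and retention periods:

```xml
<feed name="clicksFeed" description="hourly click stream" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- source cluster: instances are evicted after 90 days -->
    <cluster name="primaryCluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- target cluster: Falcon replicates data here and retains it longer -->
    <cluster name="backupCluster" type="target">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(36)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
  </locations>
  <ACL owner="etl-user" group="etl" permission="0755"/>
  <schema location="/schemas/clicks.avsc" provider="avro"/>
</feed>
```

Declaring a second cluster of type `target` is what triggers cross-cluster replication; eviction falls out of the per-cluster `retention` element.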
12. Apache Falcon
However, ETL-related functions are still largely left to
the developer to implement. Today, Falcon manages
only
Orchestration
Late data handling / Change data capture
Retries
Monitoring
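These orchestration concerns are likewise declarative. A hedged sketch of the relevant parts of a process entity, with hypothetical entity names and workflow paths:

```xml
<process name="clicksSummary" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="input" feed="clicksFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="clicksSummaryFeed" instance="now(0,0)"/>
  </outputs>
  <!-- orchestration: Falcon hands the workflow to Oozie on schedule -->
  <workflow engine="oozie" path="/apps/clicks-summary/workflow.xml"/>
  <!-- retries: re-run a failed instance up to 3 times, 15 minutes apart -->
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
  <!-- late data handling: reprocess when input data arrives late -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clicks-summary/late-workflow.xml"/>
  </late-process>
</process>
```

The workflow referenced here — the actual ETL logic — is still hand-built by the developer, which is the gap the Pipeline Designer aims to close.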
13. Agenda
ETL & Challenges with Big Data
Apache Falcon – Background
Pipeline Designer – Overview
Pipeline Designer – Internals
15. Pipeline Designer – Basics
Feed
A data entity that Falcon manages and that is physically
present in a cluster.
Data in a feed conforms to a schema, and its
partitions are registered with HCatalog
Data management functions such as eviction, archival etc.
are declaratively specified through Falcon feed
definitions
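Registration with HCatalog is expressed by backing the feed with a catalog table instead of raw HDFS paths. A hedged fragment, with hypothetical database, table, and partition names:

```xml
<feed name="clicksTableFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <!-- partitions of this table are registered with HCatalog;
       eviction drops whole partitions via the metastore -->
  <table uri="catalog:etl_db:clicks#ds=${YEAR}-${MONTH}-${DAY}"/>
  <ACL owner="etl-user" group="etl" permission="0755"/>
</feed>
```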
17. Pipeline Designer – Basics
Process
A workflow that defines the actions to be
performed, along with their control flow
Executes at a specified frequency on one or more
clusters
Pipelines
A logical grouping of Falcon processes that are owned and
operated together
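Recent Falcon versions let a process name the pipeline(s) it belongs to directly in its definition, so that related processes can be monitored and operated as one unit. A hedged fragment with hypothetical names:

```xml
<process name="clicksSummary" xmlns="uri:falcon:process:0.1">
  <!-- processes sharing a pipeline name are grouped in the dashboard
       and operated together; other elements (clusters, inputs,
       outputs, workflow) are omitted here for brevity -->
  <pipelines>clicks-etl</pipelines>
  <frequency>hours(1)</frequency>
</process>
```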
19. Pipeline Designer – Basics
Actions
Actions in the designer are the building blocks of process
workflows.
Actions have access to output variables emitted earlier in the
flow and can emit output variables of their own
Actions can transition to other actions
Default / Success Transition
Failure Transition
Conditional Transition
A transformation action is a special action that is itself a
collection of transforms
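The three transition types map naturally onto the constructs of Oozie, which Falcon uses for orchestration. A hedged sketch of what a compiled flow could look like — all action and node names here are hypothetical:

```xml
<workflow-app name="designer-flow" xmlns="uri:oozie:workflow:0.4">
  <start to="extract"/>
  <action name="extract">
    <fs><mkdir path="${nameNode}/tmp/staging"/></fs>
    <ok to="check-volume"/>        <!-- default / success transition -->
    <error to="notify-failure"/>   <!-- failure transition -->
  </action>
  <decision name="check-volume">   <!-- conditional transition -->
    <switch>
      <case to="transform">${fs:dirSize(wf:conf('stagingDir')) gt 0}</case>
      <default to="end"/>
    </switch>
  </decision>
  <action name="transform">
    <fs><move source="${nameNode}/tmp/staging" target="${nameNode}/data/out"/></fs>
    <ok to="end"/>
    <error to="notify-failure"/>
  </action>
  <kill name="notify-failure">
    <message>Flow failed at [${wf:lastErrorNode()}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```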
21. Pipeline Designer – Basics
Transforms
A data manipulation function that accepts one or more
inputs with well-defined schemas and produces one or
more outputs
Multiple transforms can be stitched together to
compose a single transformation action, which in turn
can be used to build a flow
Composite Transformations
Transforms that are built by combining multiple
primitive transforms
It is possible to add more transforms and extend the system
22. Pipeline Designer – Basics
Deployment & Monitoring
Once a process and its pipeline are composed, they are
deployed to Falcon as a standard process
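Since the designer's output is a standard process, deployment and monitoring follow the usual Falcon CLI flow. A sketch — entity and file names are hypothetical:

```shell
# submit the generated process definition to Falcon
falcon entity -type process -submit -file clicks-summary.xml

# schedule it so Falcon starts orchestrating instances
falcon entity -type process -schedule -name clicksSummary

# monitor instance status over a time window
falcon instance -type process -name clicksSummary -status \
  -start 2014-01-01T00:00Z -end 2014-01-02T00:00Z
```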
23. Agenda
ETL & Challenges with Big Data
Apache Falcon – Background
Pipeline Designer – Overview
Pipeline Designer – Internals
24. Pipeline Designer Service
[Architecture diagram: the Designer UI and the Falcon Dashboard talk to the
Pipeline Designer Service over a REST API. The service persists flows, actions,
and transforms in versioned storage, runs them through a compiler and optimizer,
submits the resulting process and feed definitions to the Falcon Server, and
resolves schemas via the HCatalog Service.]
25. Pipeline Designer – Internals
Transformation actions are compiled into Pig scripts
Actions and flows are compiled into Falcon process
definitions
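For illustration, a transformation action composed of a filter transform and a projection transform might compile to a Pig script along these lines — table, path, and field names are hypothetical:

```pig
-- load the input feed instance registered in HCatalog
clicks = LOAD 'etl_db.clicks' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- filter transform: keep only valid events
valid = FILTER clicks BY status == 'OK';

-- projection transform: emit the output feed's schema
summary = FOREACH valid GENERATE user_id, url, event_time;

-- write the output feed instance
STORE summary INTO '/data/clicks-summary/${YEAR}-${MONTH}-${DAY}' USING PigStorage(',');
```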
We are first going to look at general applications and use cases of ETL, and at the specific challenges of ETL over big data
Then we see how Apache Falcon attempts to address these in an upcoming feature
Pipeline Designer is a new feature being added to Falcon to support ETL authoring capabilities; we look into the specifics of this feature and the designer internals
Finally, we look at some mocks of this feature to get a sense of how it will shape up.
As data is further refined, curated and processed into meaningful information and insights/intelligence, higher-order value is derived from it. ETL plays a pivotal role in this derivation process. Decades ago, data used to reside in just one or very few systems and data integration / ETL weren't dominant problems, but as systems were broken down into numerous subsystems, this has assumed a lot of significance. With the explosion of, and focus on, data, the needs and complexity will only increase further.
Data warehousing is probably one of the most common use cases one might have come across in the context of ETL, but there are other use cases besides data warehousing and business intelligence.
Data Migration – When migrating one data model to another or migrating from one system to another
Data Consolidation – During mergers and acquisitions one often ends up needing to consolidate data
Data Archiving – Moving data to low cost storage mostly to support compliance requirements
Master Data Management – To provide a single source of truth for master data across all systems within an organization
Data Synchronization – To support cross-data-center synchronization for DR and BCP purposes
ETL has for the longest period in history been authored through hand-coded scripts, in-house tools catering specifically to the context of a business, or general-purpose off-the-shelf tools with a possibly wide variety of connectors and plugins.
When it comes to large scale or big data the challenges are further compounded.
Volume – Scale & Size
Variety – Diverse sources, dynamic schema / unstructured
Velocity – Freshness, cycle turnaround time