Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Falcon
Hadoop Data Governance
Hortonworks. We do Hadoop.
Page 2
Venkatesh Seetharam
Architect, Data Management
Hortonworks Inc.
PMC, Apache Falcon
PMC, Apache Knox
Proposed Apache Atlas
Page 3
Agenda
Overview Components Features Governance
Page 4
Motivation for Apache Falcon
Page 5
Simple Data Pipeline…
HDFS
YARN
Landing Materialized Views
Oozie Workflow
source_db.raw_input_table
Partition
2014-01-01-10
Partition
2014-01-01-12
Partition
2014-01-01-12
Partition
N
Pig Job / Hive Job
source_db.input_table
Partition
2014-01-01-10
Partition
2014-01-01-12
Partition
N
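For orientation, the Oozie workflow driving the Pig and Hive jobs above might be expressed roughly as the following sketch (the workflow, script, and node names are hypothetical, not from the deck):

```xml
<!-- Hypothetical Oozie workflow: a Pig cleansing step feeding a Hive load step -->
<workflow-app name="pipeline-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="pig-cleanse"/>
  <action name="pig-cleanse">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>cleanse.pig</script>
    </pig>
    <ok to="hive-load"/>
    <error to="fail"/>
  </action>
  <action name="hive-load">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>load_views.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pipeline failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```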
Page 6
Add Data Management Capability to the Pipeline
HDFS
YARN
Landing Materialized Views
Oozie Workflow
source_db.raw_input_table
Partition
2014-01-01-10
Partition
2014-01-01-12
Partition
2014-01-01-12
Partition
N
Pig Job / Hive Job
source_db.input_table
Partition
2014-01-01-10
Partition
2014-01-01-12
Partition
N
Frequent
Feeds
Late Data
Arrival
Replication
Retention
Archival
Exception
Handling
Lineage
Audit
Monitoring
Page 7
Pipeline Becomes Considerably More Complex
Oozie Workflow
Pig Job / Hive Job
Results in Many Complex Oozie
Workflows
Frequent
Feeds
Late Data
Arrival
Replication / Retention / Archival
Exception
Handling
Lineage / Audit / Monitoring
Data Management Requirements
Page 8
Introduction to Apache Falcon
Page 9
Falcon Overview
Centrally Manage Data Lifecycle
– Centralized definition & management of pipelines for data ingest, process &
export
Business Continuity & Disaster Recovery
– Out of the box policies for data replication & retention
– End to end monitoring of data pipelines
Address audit & compliance requirements
– Visualize data pipeline lineage
– Track data pipeline audit logs
– Tag data with business metadata
The data traffic cop
Page 10
Complicated Pipeline Simplified with Apache Falcon
Falcon Generates and Instruments
Oozie Workflows
Falcon Engine
Lineage / Audit / Monitoring
Frequent
Feeds
Late Data
Arrival
Replication / Retention / Archival
Exception
Handling
Frequent
Feeds
Submit & Schedule Falcon Entities
Cluster
Cluster
Feed
Feed Feed
Process
Page 11
Falcon Architecture
Centralized Falcon Orchestration Framework
Hadoop ecosystem tools
Falcon Server JMS
API
&
UI
AMBARI
HDFS / Hive
Oozie
Entity
Specs Scheduled Jobs
Process
Status
MapReduce / Pig / Hive / Sqoop /
Flume / DistCp
Data
stewards
+
Hadoop
admins
Page 12
Falcon Basic Concepts
• Cluster: Represents the “interfaces” to a Hadoop cluster
• Feed: Defines a “dataset” File, Hive Table or Stream
• Process: Consumes feeds, invokes processing logic & produces feeds
All these put together represent ‘Data Pipelines’ in Hadoop
CLUSTER
FEED
aka
DATASET
PROCESS
INPUT TO
CREATES
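A cluster entity, for instance, is a short XML file naming the cluster's interface endpoints and staging locations. A minimal sketch is below; the endpoints, hostnames, and paths are hypothetical, not taken from the deck:

```xml
<!-- Hypothetical Falcon cluster entity: the "interfaces" to one Hadoop cluster -->
<cluster name="primary" colo="east-datacenter" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn.example.com:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://nn.example.com:8020" version="2.2.0"/>
    <interface type="execute" endpoint="rm.example.com:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://oozie.example.com:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://mq.example.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/working"/>
  </locations>
</cluster>
```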
Page 13
Data Pipeline: Definition
• Flexible pipeline specification
–JAXB / JSON / Java / XML
–Modular - Clusters, feeds & processes defined separately and then linked together
–Easy to re-use across multiple pipelines
• Out of the box policies
–Predefined policies for replication, late data handling & eviction
–Easy customization of policies
• Extensible
–Plug in external solutions at any step of the pipeline
–E.g. invoke third-party data obfuscation components
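A process entity is what links the modular pieces: it names the cluster(s) to run on, the input and output feeds, and the workflow to invoke. A hedged sketch of such a definition follows; the entity names, feed names, and script path are invented for illustration:

```xml
<!-- Hypothetical Falcon process entity: consumes one feed, produces another -->
<process name="cleanse-process" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primary">
      <validity start="2014-01-01T00:00Z" end="2016-01-01T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="input" feed="rawInput" start="today(0,0)" end="today(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="cleansedData" instance="today(0,0)"/>
  </outputs>
  <workflow engine="pig" path="/apps/pipeline/cleanse.pig"/>
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
</process>
```

Because clusters, feeds, and processes are separate files, the same feed definition can be linked into several pipelines without duplication.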
Page 14
Flexibility in Processing
Common types of processing engines can be tied to Falcon processes
Oozie workflows Pig scripts HQL scripts
Page 15
Data Pipeline: Monitoring
DATA
Primary site DR site
Centralized monitoring of data pipeline
With Falcon + Ambari
Pipeline run
alerts
Hadoop Cluster-1 Hadoop Cluster-2
Pipeline run
history
Pipeline
Scheduling
raw clean prep raw clean prep
Page 16
Replication with Falcon
Staged Data
Presented
Data
Cleansed
Data
Conformed
Data
Staged Data
Presented
Data
Replication
Failover Hadoop Cluster
Primary Hadoop Cluster
Replication
BI / Analytics
BusinessObjects BI
• Falcon manages workflow and replication
• Enables business continuity without requiring full data reprocessing
• Failover clusters can be smaller than primary clusters
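Replication is expressed declaratively: one feed definition can list a source cluster and one or more target clusters, and Falcon schedules the copy. A rough sketch, with all names, paths, and limits invented for illustration:

```xml
<!-- Hypothetical replicated feed: primary is the source, failover is the target -->
<feed name="presentedData" description="Presented data, replicated for DR"
      xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primary" type="source">
      <validity start="2014-01-01T00:00Z" end="2016-01-01T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <cluster name="failover" type="target">
      <validity start="2014-01-01T00:00Z" end="2016-01-01T00:00Z"/>
      <retention limit="months(36)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/presented/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
  </locations>
  <ACL owner="etl" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```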
Page 17
Data Retention with Falcon
Staged Data
Presented
Data
Cleansed
Data
Conformed
Data
Retain 5
Years
Retain Last
Copy Only
Retain 3
Years
Retain 3
Years
• Sophisticated retention policies expressed in one place
• Simplify data retention for audit, compliance, or for data re-processing
Retention
Policy
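Retention is likewise declarative: each cluster entry inside a feed carries its own retention element, so a "Retain 5 Years" policy like the one above might be expressed roughly as follows (a fragment of a feed definition; names and values are illustrative):

```xml
<!-- Hypothetical retention fragment inside a feed's cluster entry -->
<cluster name="primary" type="source">
  <validity start="2014-01-01T00:00Z" end="2019-01-01T00:00Z"/>
  <!-- evict instances older than five years -->
  <retention limit="years(5)" action="delete"/>
</cluster>
```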
Page 18
Late Data Handling with Falcon
Staged Data Combined Data
Online
Transaction Data
(via Sqoop)
Web Log Data
(via FTP)
Wait up to 4
hours for FTP
data to arrive
• Processing waits until all required input data is available
• Checks for late data arrivals and retriggers processing as necessary
• Eliminates writing complex data handling rules within applications
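In entity terms, late-data handling is typically split between the feed (how late data may arrive) and the process (what to do when it does). A hedged sketch with invented names and paths:

```xml
<!-- Hypothetical fragment in the web-log feed: tolerate up to 4 hours of lateness -->
<late-arrival cut-off="hours(4)"/>

<!-- Hypothetical fragment in the consuming process: rerun with exponential backoff -->
<late-process policy="exp-backoff" delay="hours(1)">
  <late-input input="weblogs" workflow-path="/apps/pipeline/late-rerun"/>
</late-process>
```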
Page 19
HCatalog
Table access
Aligned metadata
REST API
• Raw Hadoop data
• Inconsistent, unknown
• Tool specific access
Apache Falcon provides metadata services via HCatalog
Metadata Services with HCatalog
• Consistency of metadata and data models across tools (MapReduce, Pig, HBase, and Hive)
• Accessibility: share data as tables in and out of HDFS
• Availability: enables flexible, thin-client access via REST API
Shared table and
schema management
opens the platform
Page 20
Data Governance in Apache Falcon
Page 21
Data Pipeline: Tracing
Purchase
feed
Customer
feed
Product
feed
Store
feed
View dependencies
between clusters,
datasets and processes
Data pipeline
dependencies
Add arbitrary tags to
feeds & processes
Data pipeline
tagging
Coming Soon
Know who modified a
dataset when and into
what
Data pipeline
audits
Analyze how a
dataset reached a
particular state
Data pipeline
lineage
Page 22
Custom Metadata in Falcon
• Metadata on Ingest (Content)
– What is the format I expect my data to be in?
– What source systems did the data come from, owners?
– Answer: ingest descriptors + HCat schema versioning
• Metadata for Security (Access Controls)
– How is each column blinded or encrypted?
– Can I trust that I can join data across tables? What if email is encrypted differently?
– Answer: security descriptors
• Metadata for lineage (Source, History)
– How do I chase down sources of data leading to reports and data?
– Answer: lineage carried forward per workflow
• Metadata for marts (Usage Constraints, Enrichment)
– How do I materialize views and drop views as needed?
Page 23
Entity Dependency in Falcon
• Dependencies between Falcon entity definitions: cluster, feed & process
– Lineage attributes: workflows, input/output feed windows, user, input and output paths, workflow engine, input/output size
Page 24
Lineage in Falcon
Page 25
Audit, Tagging and Access Control
• Tagging
– Allows custom tags in entities
– Can decorate process entities with pipeline names
• Access Control
– Support for ACL in entities
– Authorization driven based on ACLs in entities
• Audit
– Each execution is controlled by Falcon and runs are audited
– Correlate the execution with Lineage (Design)
• Search
– Search based on Tags, Pipelines, etc.
– Full-text search
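In entity XML, tagging and access control are small fragments like the following (all names and values here are made up for illustration):

```xml
<!-- Hypothetical tag and ACL fragments from a feed or process entity -->
<tags>owner=marketing, pipeline=sales-etl, classification=secure</tags>
<ACL owner="etl-user" group="etl" permission="0755"/>
```

The key=value tags are what the search features above index, and the ACL is what entity-level authorization is driven from.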
Page 26
Technology
• Metadata Repository
– Titan Graph Database
– Pluggable backing store: BerkeleyDB JE, HBase
• Entity Metadata
– Tags, Entities are stored in the repository
• Execution Metadata
– Execution metadata is stored in the repository as well – this is unique to Falcon
– Optional inputs
• Search
– Pluggable backend – Solr or Elasticsearch
Page 27
New in Apache Falcon 0.6.0
What is coming soon?
Page 28
DR Mirroring of HDFS with Recipes
•Mirroring for disaster recovery
and business continuity use cases
•Customizable for multiple
targets and frequency of
synchronization
•Recipes: template model for
re-use of complex workflows
Recipe = Properties + Workflow Template
Sample recipes: Reduce, Cleanse, Replicate
Page 29
Replication to Cloud
•Seamlessly replicate to cloud
targets
•Replicate from Cloud as a source.
•Support for Amazon S3 and
Microsoft Azure
Azure
Amazon S3
On Prem Cluster
Pages 30–35
[Falcon UI screenshots: dashboard with summary counts and tag filters, contextual entity creation form, live-generated entity XML, entity drill-down views]
Page 36
Q & A
Page 37
Thank you!
Learn more at:
hortonworks.com/hadoop/falcon/

Driving Enterprise Data Governance for Big Data Systems through Apache Falcon


Editor's Notes

  • #4 ET: Tactical POS Added benefits – consortium
  • #5 Transition to Andrew
  • #9 Transition to Andrew
  • #10 Thanks Justin. Here are Falcon's primary features. (1) Manage the data lifecycle in one common place. (2) Facilitate quick deployment of replication for business continuity and disaster recovery use cases; this includes monitoring and a base set of policies for replication and retention. (3) Provide foundational audit and compliance features – visualization and tracking of entity lineage and collection of audit logs.
  • #12 This is the high level Falcon Architecture Falcon runs as a standalone server as part of your Hadoop cluster A user creates entity specifications and submits to Falcon using the API Falcon validates and saves entity specifications to HDFS Falcon uses Oozie as its default scheduler Dashboard for entity viewing in Falcon UI Ambari integration for management
  • #13 Feeds have location, replication schedule and retention policies. Meta info includes frequency, where data is coming from (source), where to replicate (target), and how long to retain.
  • #14 Let's take a look at the data pipeline, or workflow. ** read high level **
  • #15 Hive – HQL scripts Pig scripts Oozie workflows
  • #16 Once a pipeline is created you'll want to run it, which means you'll probably want monitoring as well. Falcon in conjunction with Ambari has centralized monitoring. ** bullets **
  • #17 OK, let's chat about replication with Falcon – which is very efficient. In this example, a primary cluster has a typical workflow, and there is a business requirement to replicate it to a failover cluster. ** bullets **
  • #18 Falcon has flexible data retention policies; it's able to model the business compliance requirements. Sophisticated retention policies expressed in one place simplify data retention for audit, compliance, or for data re-processing. In this example, different datasets in a workflow can have different retention policies.
  • #19 We realize that many types of workflows have inputs from different systems that may be in different regions. Falcon has logic built in to handle this potentially tricky situation.
  • #20 HCatalog – metadata shared across whole platform File locations become abstract (not hard-coded) Data types become shared (not redefined per tool) Partitioning and HDFS-optimized
  • #21 Transition to Andrew
  • #22 Last but not least you’ll want to Trace or track the Data Pipeline We trace:
  • #29 The first is DR mirroring with Recipes. Actually, recipes can be used in a number of different use cases, but we'll just focus on mirroring.
  • #30 Placeholder pic
  • #31 Dashboard view Summary counts In-place filters – by user-defined tags
  • #32 The entity creation interface is contextual and has field-level semantic checks to help the user along.
  • #33 As you can see on the right – we have the actual XML being generated as the UI fields are being filled out.
  • #34 This can help if you want to copy portions to skip repeating an entity from scratch.
  • #36 Lastly, the new UI allows drilling down to the detail level for each entity type.