This document provides an overview of Apache Falcon, a data management platform on Hadoop. It describes Falcon's capabilities for orchestrating data pipelines across multiple Hadoop clusters through a declarative model. Key points include:
- Falcon allows end-to-end data flows to be declared holistically as entities: feeds (datasets), processes, their dependencies, and late data handling policies (see the feed entity sketch after this list).
- Based on the declared model, Falcon generates and schedules Oozie workflows that execute the data pipelines on the target clusters (a process entity sketch also follows the list).
- Falcon provides built-in handling of cross-cluster replication, retention (data eviction), scheduling, and data governance.
- Case studies demonstrate how Falcon can orchestrate multi-cluster failover and distributed processing across data centers.
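
To make the declarative model concrete, the following is a minimal, hypothetical feed entity. All names, paths, and schedules are illustrative, and attribute details may vary between Falcon versions; it shows how a single declaration captures frequency, late-data cut-off, per-cluster retention, and replication to a target cluster.

```xml
<!-- Hypothetical feed entity: declares where the data lives, how often it
     arrives, how long late data is accepted, and per-cluster retention.
     Declaring a second cluster with type="target" asks Falcon to replicate
     the feed there. -->
<feed name="rawClickstreamFeed" description="hourly click logs" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>

  <!-- Accept data arriving up to 6 hours late -->
  <late-arrival cut-off="hours(6)"/>

  <clusters>
    <!-- Source cluster: data is retained for 90 days, then evicted -->
    <cluster name="primaryCluster" type="source">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- Target cluster: Falcon replicates the feed here and keeps it for 30 days -->
    <cluster name="backupCluster" type="target">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>

  <locations>
    <location type="data" path="/data/clickstream/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
  </locations>

  <ACL owner="etl" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```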
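A process entity then ties feeds to execution: Falcon translates it into scheduled Oozie coordinators and workflows on the named cluster. The sketch below uses hypothetical names and workflow paths and assumes the feed declared above plus an analogous output feed.

```xml
<!-- Hypothetical process entity: consumes the feed above each hour, runs an
     Oozie workflow stored on HDFS, retries on failure, and reprocesses the
     window if late data arrives within the feed's cut-off. -->
<process name="cleanseClickstreamProcess" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>

  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>

  <inputs>
    <!-- Each run consumes the matching hourly instance of the input feed -->
    <input name="input" feed="rawClickstreamFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="cleansedClickstreamFeed" instance="now(0,0)"/>
  </outputs>

  <!-- The processing logic itself lives in an Oozie workflow on HDFS -->
  <workflow engine="oozie" path="/apps/clickstream/cleanse-workflow"/>

  <retry policy="periodic" delay="minutes(10)" attempts="3"/>

  <!-- Late data handling: re-run the affected window with backoff -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/cleanse-workflow"/>
  </late-process>
</process>
```

Because dependencies, schedules, and late-data policies live in these declarations rather than in pipeline code, Falcon can rewire the same definitions onto different clusters, which is what enables the multi-cluster failover scenarios mentioned above.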