Data Management Platform on Hadoop
Srikanth Sundarrajan
Venkatesh Seetharam
Apache Falcon (Incubating)
© Hortonworks Inc. 2011
whoami
• Srikanth Sundarrajan
– Principal Architect, InMobi
– PMC/Committer, Apache Falcon
– Apache Hadoop Contributor
– Hadoop Team @ Yahoo!
• Venkatesh Seetharam
– Architect/Developer, Hortonworks Inc.
– Apache Falcon Committer, IPMC
– Apache Knox Committer
– Apache Hadoop, Sqoop, Oozie Contributor
– Hadoop team at Yahoo! since 2007
– Built 2 generations of Data Management at Yahoo!
Agenda
1 Motivation
2 Falcon Overview
3 Falcon Architecture
4 Case Studies
MOTIVATION
Data Processing Landscape
• External data source
• Acquire (Import)
• Data Processing (Transform/Pipeline)
• Eviction
• Archive
• Replicate (Copy)
• Export
Core Services
Process Management
• Relays
• Late data handling
• Retries
Data Management
• Import/Export
• Replication
• Retention
Data Governance
• Lineage
• Audit
• SLA
FALCON OVERVIEW
Holistic Declaration of Intent
Picture courtesy: http://bigboxdetox.com
Entity Dependency Graph
[Figure: dependency graph linking the Cluster (Hadoop / HBase …), External data source, Feed, and Process entities with "depends" edges]
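The dependency idea can be sketched in a few lines of Python. This is only an illustration, not Falcon code: the entity names are borrowed from the example specifications in this deck, the edges follow the references those specifications make (a feed names its clusters, a process names its cluster and feeds), and the helper names `submission_order` and `dependents_of` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each entity maps to the set of entities it refers to.
# Names come from the example specs in this deck; the graph is illustrative.
depends_on = {
    "cluster-primary": set(),
    "feed-clicks-raw": {"cluster-primary"},
    "feed-clicks-clean": {"cluster-primary"},
    "process-test": {"cluster-primary", "feed-clicks-raw", "feed-clicks-clean"},
}


def submission_order(graph):
    """Return entities ordered so that dependencies come before dependents."""
    return list(TopologicalSorter(graph).static_order())


def dependents_of(graph, entity):
    """Return the entities that refer to `entity` directly, sorted by name."""
    return sorted(name for name, deps in graph.items() if entity in deps)


order = submission_order(depends_on)
print(order)
print(dependents_of(depends_on, "cluster-primary"))
```

Under the assumption that an entity can only be registered once everything it refers to exists, a topological order like this gives a valid registration sequence, and `dependents_of` shows what would be affected by changing a cluster.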
Cluster Specification
<?xml version="1.0"?>
<cluster colo="NJ-datacenter" description="" name="prod-cluster">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
    <interface type="execute" endpoint="rm:8050" version="2.2.0" />
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
    <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0" />
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging" />
    <location name="temp" path="/tmp" />
    <location name="working" path="/apps/falcon/prod-cluster/working" />
  </locations>
</cluster>
Annotations:
• readonly interface: needed by distcp for replications
• write interface: writing to HDFS
• execute interface: used to submit processes as MR
• workflow interface: submit Oozie jobs
• registry interface: Hive metastore to register/deregister partitions and get events on partition availability
• messaging interface: used for alerts
• locations: HDFS directories used by the Falcon server
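As a rough illustration of how a cluster specification like this one can be read, the following Python sketch parses the XML with the standard library and collects the interface endpoints and locations. It is not part of Falcon; the function names `interface_endpoints` and `cluster_locations` are hypothetical, and the XML declaration is left out of the string for simplicity.

```python
import xml.etree.ElementTree as ET

# The cluster entity from the slide, with straight quotes.
CLUSTER_XML = """
<cluster colo="NJ-datacenter" description="" name="prod-cluster">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
    <interface type="execute" endpoint="rm:8050" version="2.2.0" />
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
    <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0" />
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging" />
    <location name="temp" path="/tmp" />
    <location name="working" path="/apps/falcon/prod-cluster/working" />
  </locations>
</cluster>
"""


def interface_endpoints(xml_text):
    """Map each interface type (readonly, write, ...) to its endpoint."""
    root = ET.fromstring(xml_text)
    return {i.get("type"): i.get("endpoint") for i in root.iter("interface")}


def cluster_locations(xml_text):
    """Map each location name (staging, temp, working) to its HDFS path."""
    root = ET.fromstring(xml_text)
    return {loc.get("name"): loc.get("path") for loc in root.iter("location")}


endpoints = interface_endpoints(CLUSTER_XML)
locations = cluster_locations(CLUSTER_XML)
print(endpoints["write"], locations["staging"])
```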
Feed Specification
<?xml version="1.0"?>
<feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <late-arrival cut-off="hours(6)"/>
  <groups>churnAnalysisFeeds</groups>
  <tags>externalSource=TeradataEDW-1,externalTarget=Marketing</tags>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
Annotations:
• frequency: feed run frequency in mins/hrs/days/mths
• late-arrival: late arrival cutoff
• locations: global location across clusters, as HDFS paths or Hive tables
• groups: feeds can belong to multiple groups
• clusters: one or more source & target clusters for retention & replication
• ACL: access permissions
• tags: metadata tagging
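The feed location is a path template with `${YEAR}`, `${MONTH}`, `${DAY}`, and `${HOUR}` placeholders. A minimal Python sketch of how such a template might resolve to a concrete path for one feed instance is shown below. The function name `expand_path` is hypothetical, and zero-padding the fields to two digits (four for the year) is an assumption of this sketch rather than something the slides state.

```python
from datetime import datetime


def expand_path(template, instance_time):
    """Substitute ${YEAR}, ${MONTH}, ${DAY}, ${HOUR} for one feed instance.

    Zero-padded values are an assumption made for this illustration.
    """
    values = {
        "YEAR": f"{instance_time.year:04d}",
        "MONTH": f"{instance_time.month:02d}",
        "DAY": f"{instance_time.day:02d}",
        "HOUR": f"{instance_time.hour:02d}",
    }
    path = template
    for name, value in values.items():
        path = path.replace("${" + name + "}", value)
    return path


template = "/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"
resolved = expand_path(template, datetime(2012, 1, 1, 5))
print(resolved)  # /weblogs/2012-01-01-05
```

With an hourly frequency, each hour of the validity window would map to one such directory.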
Process Specification
<process name="process-test" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="cluster-primary">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z" />
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <inputs>
    <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
  </inputs>
  <outputs>
    <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
  </outputs>
  <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late" />
  </late-process>
</process>
Annotations:
• parallel, order, frequency: how frequently the process runs, how many instances can run in parallel, and in what order
• clusters: which cluster the process should run on and when
• workflow: the processing logic
• retry: retry policy on failure
• late-process: handling late input feeds
• inputs, outputs: input & output feeds for the process
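To make the interplay of `frequency` and `validity` concrete, here is a small Python sketch that enumerates the nominal instance times of a process. It is an illustration under stated assumptions, not Falcon's scheduler: the helper names are hypothetical, only `minutes(n)`, `hours(n)`, and `days(n)` are handled (months have variable length and are left out), and treating the validity end as exclusive is an assumption.

```python
import re
from datetime import datetime, timedelta


def parse_frequency(expr):
    """Turn 'minutes(n)', 'hours(n)' or 'days(n)' into a timedelta."""
    match = re.fullmatch(r"(minutes|hours|days)\((\d+)\)", expr)
    if match is None:
        raise ValueError(f"unsupported frequency: {expr!r}")
    unit, count = match.groups()
    return timedelta(**{unit: int(count)})


def parse_time(text):
    """Parse the timestamp format used in the specs, e.g. 2011-11-02T00:00Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%MZ")


def instance_times(start, end, frequency):
    """Yield nominal instance times in [start, end) at the given frequency."""
    step = parse_frequency(frequency)
    current, stop = parse_time(start), parse_time(end)
    while current < stop:
        yield current
        current += step


runs = list(instance_times("2011-11-02T00:00Z", "2011-11-05T00:00Z", "days(1)"))
print(runs)
```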
Late Data Handling
• Defines how the late (out of band) data is handled
• Each Feed can define a late cut-off value
<late-arrival cut-off="hours(4)"/>
• Each Process can define how this late data is handled
<late-process policy="exp-backoff" delay="hours(1)">
  <late-input input="input" workflow-path="/apps/clickstream/late" />
</late-process>
• Policies include:
– backoff
– exp-backoff
– final
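One way to picture the cut-off and the late-process policies is the Python sketch below. The slides do not spell out the exact semantics, so everything here is an assumption made for illustration: the cut-off is measured from the feed instance time, `backoff` re-checks at a constant gap, `exp-backoff` doubles the gap after each check, and `final` checks once at the cut-off. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def within_cutoff(instance_time, arrival_time, cutoff):
    """True if data arriving at `arrival_time` is still inside the late window."""
    return arrival_time <= instance_time + cutoff


def late_check_offsets(policy, delay, cutoff):
    """Offsets from the instance time at which late data would be re-checked.

    The policy semantics are assumptions of this sketch, not Falcon's definition.
    """
    if delay <= timedelta(0):
        raise ValueError("delay must be positive")
    if policy == "final":
        return [cutoff]
    if policy not in ("backoff", "exp-backoff"):
        raise ValueError(f"unknown policy: {policy!r}")
    offsets = []
    gap = delay
    offset = delay
    while offset <= cutoff:
        offsets.append(offset)
        if policy == "exp-backoff":
            gap *= 2  # the wait doubles after every check
        offset += gap
    return offsets


hour = timedelta(hours=1)
print(late_check_offsets("exp-backoff", hour, 4 * hour))
print(late_check_offsets("backoff", hour, 4 * hour))
```

Under these assumptions, a one-hour delay with a four-hour cut-off gives checks at 1h and 3h for `exp-backoff`, and at 1h, 2h, 3h, and 4h for `backoff`.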
Retry Policies
• Each Process can define a retry policy
<process name="[process name]">
...
<retry policy=[retry policy] delay=[retry delay] attempts=[attempts]/>
<retry policy="backoff" delay="minutes(10)" attempts="3"/>
...
</process>
• Policies include:
– backoff
– exp-backoff
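The difference between the two retry policies can be sketched as a list of waits before each retry. This is a hedged illustration only: the slides name the policies but not their timing, so a constant wait for `backoff` and a doubling wait for `exp-backoff` are assumptions, and `retry_delays` is a hypothetical helper.

```python
from datetime import timedelta


def retry_delays(policy, delay, attempts):
    """Return the wait before each retry attempt.

    Assumed semantics: 'backoff' waits the same delay every time,
    'exp-backoff' doubles the wait on each successive attempt.
    """
    if attempts < 0:
        raise ValueError("attempts must not be negative")
    if policy == "backoff":
        return [delay] * attempts
    if policy == "exp-backoff":
        return [delay * (2 ** n) for n in range(attempts)]
    raise ValueError(f"unknown policy: {policy!r}")


ten_minutes = timedelta(minutes=10)
fixed = retry_delays("backoff", ten_minutes, 3)
doubling = retry_delays("exp-backoff", ten_minutes, 3)
print(fixed)
print(doubling)
```

For the example `<retry policy="backoff" delay="minutes(10)" attempts="3"/>`, this sketch yields three waits of 10 minutes; with `exp-backoff` the waits would be 10, 20, and 40 minutes.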
Lineage
Falcon: One-stop Shop for Data Management
Apache Falcon provides (data management needs):
• Multi Cluster Management
• Replication
• Scheduling
• Data Reprocessing
• Dependency Management
• Eviction
• Governance
Apache Falcon orchestrates (tools):
• Oozie
• Sqoop
• Distcp
• Flume
• Map / Reduce
• Hive and Pig Jobs
Falcon provides a single interface to orchestrate data lifecycle.
Sophisticated DLM easily added to Hadoop applications.
FALCON ARCHITECTURE
High Level Architecture
[Figure: Apache Falcon alongside Oozie, Messaging, HCatalog, and HDFS]
• Entity and entity status (CLI/REST)
• Process status / notification (JMS)
• Config store
Feed Schedule
[Figure: feed scheduling flow]
• Cluster xml, Feed xml
• Falcon
• Falcon config store / Graph
• Retention / Replication workflow
• Oozie Scheduler
• HDFS
• JMS Notification per action
• Catalog service
• Instance Management
Process Schedule
[Figure: process scheduling flow]
• Cluster/feed xml, Process xml
• Falcon
• Falcon config store / Graph
• Process workflow
• Oozie Scheduler
• HDFS
• JMS Notification per available feed
• Catalog service
• Instance Management
Physical Architecture
• STANDALONE
– Single Data Center
– Single Falcon Server
– Hadoop jobs and relevant processing involve only one cluster
• DISTRIBUTED
– Multiple Data Centers
– Falcon Server per DC
– Multiple instances of Hadoop clusters and workflow schedulers
[Figure, standalone: one Falcon Server (standalone) at Site 1 with Hadoop clusters (Store & Process) and replication]
[Figure, distributed: a Falcon Server (standalone) at each of Site 1 and Site 2, Hadoop clusters (Store & Process) with replication, and a Falcon Prism Server (distributed)]
CASE STUDY
Multi Cluster Failover
Multi Cluster – Failover
> Falcon manages workflow, replication or both.
> Enables business continuity without requiring full data reprocessing.
> Failover clusters require less storage and CPU.
[Figure: Primary Hadoop Cluster holding Staged, Cleansed, Conformed, and Presented Data; Failover Hadoop Cluster holding Staged and Presented Data; Replication between the clusters; BI and Analytics]
Retention Policies
• Staged Data: Retain 5 Years
• Cleansed Data: Retain 3 Years
• Conformed Data: Retain 3 Years
• Presented Data: Retain Last Copy Only
> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or for data re-processing.
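A retention rule of the kind shown here ("retain N years", "retain last copy only") boils down to selecting which dated instances are old enough to evict. The Python sketch below illustrates that selection; it is not Falcon's eviction code, the function names are hypothetical, and the instance dates are made up for the example.

```python
from datetime import datetime, timedelta


def evictable(instances, limit, now):
    """Return instance times older than the retention limit, oldest first."""
    cutoff = now - limit
    return sorted(t for t in instances if t < cutoff)


def evictable_keep_last(instances):
    """'Retain last copy only': everything except the newest instance."""
    if not instances:
        return []
    newest = max(instances)
    return sorted(t for t in instances if t != newest)


# Made-up daily instances, checked against a two-day limit like days(2).
daily = [datetime(2012, 1, day) for day in range(1, 6)]
now = datetime(2012, 1, 5, 12)
to_delete = evictable(daily, timedelta(days=2), now)
print(to_delete)
print(evictable_keep_last(daily))
```

Keeping the rule as data (a limit per feed) and the selection logic in one place is what lets the policies be expressed once rather than scripted per dataset.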
CASE STUDY
Distributed Processing
Example: Digital Advertising @ InMobi
Processing – Single Data Center
[Figure: single data center pipeline]
• Inputs: Ad Request data, Impression render event, Click event, Conversion event
• Stages: Continuous Streaming (minutely), Enrichment (minutely/5 minutely), Summarizer, Hourly summary
Global Aggregation
[Figure: the same pipeline repeated in DataCenter1 … DataCenterN, feeding a consumable global aggregate]
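The global aggregation step can be pictured as merging per-data-center hourly summaries into one total. The Python sketch below is purely illustrative: the counts are made up, the event names mirror the inputs listed on the slide, and `global_aggregate` is a hypothetical helper, not InMobi's pipeline.

```python
from collections import Counter


def global_aggregate(per_datacenter):
    """Sum event counts from every data center into one global summary."""
    total = Counter()
    for summary in per_datacenter.values():
        total.update(summary)  # Counter.update adds counts key by key
    return dict(total)


# Made-up hourly summaries for two data centers.
hourly = {
    "DataCenter1": {"ad_request": 1000, "impression": 400, "click": 20, "conversion": 2},
    "DataCenter2": {"ad_request": 500, "impression": 250, "click": 10, "conversion": 1},
}
combined = global_aggregate(hourly)
print(combined)
```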
HIGHLIGHTS
Future
Data Governance
Data Pipeline Designer
Authorization
Monitoring/Management Dashboard
Summary
Questions?
• Apache Falcon
– http://falcon.incubator.apache.org
– mailto: dev@falcon.incubator.apache.org
• Srikanth Sundarrajan
– sriksun@apache.org
– #sriksun
• Venkatesh Seetharam
– venkatesh@apache.org
– #innerzeal

More Related Content

What's hot

Hortonworks Technical Workshop - build a yarn ready application with apache ...
Hortonworks Technical Workshop -  build a yarn ready application with apache ...Hortonworks Technical Workshop -  build a yarn ready application with apache ...
Hortonworks Technical Workshop - build a yarn ready application with apache ...Hortonworks
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Seetharam Venkatesh
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks
 
Best Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop EnvironmentBest Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop EnvironmentDataWorks Summit/Hadoop Summit
 
Hive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHortonworks
 
Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016alanfgates
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNDataWorks Summit
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019alanfgates
 
Running Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudRunning Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudDataWorks Summit
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizonThejas Nair
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course WorkshopDataWorks Summit
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasDataWorks Summit/Hadoop Summit
 
Enabling ABAC with Accumulo and Ranger integration
Enabling ABAC with Accumulo and Ranger integrationEnabling ABAC with Accumulo and Ranger integration
Enabling ABAC with Accumulo and Ranger integrationDataWorks Summit
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016alanfgates
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in EnterpriseDataWorks Summit
 

What's hot (20)

Hortonworks Technical Workshop - build a yarn ready application with apache ...
Hortonworks Technical Workshop -  build a yarn ready application with apache ...Hortonworks Technical Workshop -  build a yarn ready application with apache ...
Hortonworks Technical Workshop - build a yarn ready application with apache ...
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
 
Best Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop EnvironmentBest Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop Environment
 
Hive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHive - 1455: Cloud Storage
Hive - 1455: Cloud Storage
 
Apache Falcon DevOps
Apache Falcon DevOpsApache Falcon DevOps
Apache Falcon DevOps
 
Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
 
Curb your insecurity with HDP
Curb your insecurity with HDPCurb your insecurity with HDP
Curb your insecurity with HDP
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
 
Running Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudRunning Enterprise Workloads in the Cloud
Running Enterprise Workloads in the Cloud
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course Workshop
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 
SQL On Hadoop
SQL On HadoopSQL On Hadoop
SQL On Hadoop
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache Atlas
 
Enabling ABAC with Accumulo and Ranger integration
Enabling ABAC with Accumulo and Ranger integrationEnabling ABAC with Accumulo and Ranger integration
Enabling ABAC with Accumulo and Ranger integration
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in Enterprise
 

Viewers also liked

Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceHortonworks
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopHortonworks
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...DataWorks Summit/Hadoop Summit
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies DataWorks Summit/Hadoop Summit
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for HadoopJoe Crobak
 
Why is My Hadoop Job Slow?
Why is My Hadoop Job Slow?Why is My Hadoop Job Slow?
Why is My Hadoop Job Slow?Bikas Saha
 
Trackurhealth an application for assisting person with thalassaemia condition.
Trackurhealth an application for assisting person with thalassaemia condition.Trackurhealth an application for assisting person with thalassaemia condition.
Trackurhealth an application for assisting person with thalassaemia condition.Trackurhealth Global Limited
 
Baseball powerpoint and games edited (2)
Baseball powerpoint and games edited (2)Baseball powerpoint and games edited (2)
Baseball powerpoint and games edited (2)Odenah Rutas
 
Transparent Encryption in HDFS
Transparent Encryption in HDFSTransparent Encryption in HDFS
Transparent Encryption in HDFSDataWorks Summit
 
27 thalassemias
27 thalassemias27 thalassemias
27 thalassemiasDr UAK
 
Falcon.io - The Future of Brand Experiences is Virtual Reality
Falcon.io - The Future of Brand Experiences is Virtual RealityFalcon.io - The Future of Brand Experiences is Virtual Reality
Falcon.io - The Future of Brand Experiences is Virtual RealityFalcon.io
 
Introdução ao Marketing Online com as Ferramentas do Google
Introdução ao Marketing Online com as Ferramentas do GoogleIntrodução ao Marketing Online com as Ferramentas do Google
Introdução ao Marketing Online com as Ferramentas do GoogleMário Valney
 
Big Data, Performance, Posix, RTB no mercado de publicidade online
Big Data, Performance, Posix, RTB no mercado de publicidade onlineBig Data, Performance, Posix, RTB no mercado de publicidade online
Big Data, Performance, Posix, RTB no mercado de publicidade onlineTiago Peczenyj
 
Uma Estratégia Rentável de Pay-per-Click (Ecommarketing Show 2010)
Uma Estratégia Rentável de Pay-per-Click (Ecommarketing Show 2010)Uma Estratégia Rentável de Pay-per-Click (Ecommarketing Show 2010)
Uma Estratégia Rentável de Pay-per-Click (Ecommarketing Show 2010)Inesting
 
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...Eran Chinthaka Withana
 
Adnetworks, Adexchanges e a Mídia Programática.
Adnetworks, Adexchanges e a Mídia Programática.Adnetworks, Adexchanges e a Mídia Programática.
Adnetworks, Adexchanges e a Mídia Programática.Alexandre Borges
 
Hadoop - Introduzione all’architettura ed approcci applicativi
Hadoop - Introduzione all’architettura ed approcci applicativiHadoop - Introduzione all’architettura ed approcci applicativi
Hadoop - Introduzione all’architettura ed approcci applicativilostrettodigitale
 

Viewers also liked (20)

Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
 
Falcon Meetup
Falcon Meetup Falcon Meetup
Falcon Meetup
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Why is My Hadoop Job Slow?
Why is My Hadoop Job Slow?Why is My Hadoop Job Slow?
Why is My Hadoop Job Slow?
 
Trackurhealth an application for assisting person with thalassaemia condition.
Trackurhealth an application for assisting person with thalassaemia condition.Trackurhealth an application for assisting person with thalassaemia condition.
Trackurhealth an application for assisting person with thalassaemia condition.
 
Thalassemia Treatment Guidelines
Thalassemia Treatment GuidelinesThalassemia Treatment Guidelines
Thalassemia Treatment Guidelines
 
Baseball powerpoint and games edited (2)
Baseball powerpoint and games edited (2)Baseball powerpoint and games edited (2)
Baseball powerpoint and games edited (2)
 
Transparent Encryption in HDFS
Transparent Encryption in HDFSTransparent Encryption in HDFS
Transparent Encryption in HDFS
 
27 thalassemias
27 thalassemias27 thalassemias
27 thalassemias
 
Falcon.io - The Future of Brand Experiences is Virtual Reality
Falcon.io - The Future of Brand Experiences is Virtual RealityFalcon.io - The Future of Brand Experiences is Virtual Reality
Falcon.io - The Future of Brand Experiences is Virtual Reality
 
Introdução ao Marketing Online com as Ferramentas do Google
Introdução ao Marketing Online com as Ferramentas do GoogleIntrodução ao Marketing Online com as Ferramentas do Google
Introdução ao Marketing Online com as Ferramentas do Google
 
Big Data, Performance, Posix, RTB no mercado de publicidade online
Big Data, Performance, Posix, RTB no mercado de publicidade onlineBig Data, Performance, Posix, RTB no mercado de publicidade online
Big Data, Performance, Posix, RTB no mercado de publicidade online
 
Uma Estratégia Rentável de Pay-per-Click (Ecommarketing Show 2010)
Uma Estratégia Rentável de Pay-per-Click (Ecommarketing Show 2010)Uma Estratégia Rentável de Pay-per-Click (Ecommarketing Show 2010)
Uma Estratégia Rentável de Pay-per-Click (Ecommarketing Show 2010)
 
October 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.xOctober 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.x
 
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Maki...
 
Adnetworks, Adexchanges e a Mídia Programática.
Adnetworks, Adexchanges e a Mídia Programática.Adnetworks, Adexchanges e a Mídia Programática.
Adnetworks, Adexchanges e a Mídia Programática.
 
Hadoop - Introduzione all’architettura ed approcci applicativi
Hadoop - Introduzione all’architettura ed approcci applicativiHadoop - Introduzione all’architettura ed approcci applicativi
Hadoop - Introduzione all’architettura ed approcci applicativi
 

Similar to Apache Falcon - Simplifying Managing Data Jobs on Hadoop

Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)
Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)
Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)Cedric CARBONE
 
Apache Falcon _ Hadoop User Group France 22-sept-2014
Apache Falcon _ Hadoop User Group France 22-sept-2014Apache Falcon _ Hadoop User Group France 22-sept-2014
Apache Falcon _ Hadoop User Group France 22-sept-2014Modern Data Stack France
 
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache HiveDiscover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache HiveHortonworks
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDataWorks Summit
 
Apache Falcon - Sanjeev Tripurari
Apache Falcon - Sanjeev TripurariApache Falcon - Sanjeev Tripurari
Apache Falcon - Sanjeev TripurariDevOpsBangalore
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceChris Nauroth
 
HDP Search Overview (APACHE SOLR & HADOOP)
HDP Search Overview (APACHE SOLR & HADOOP)HDP Search Overview (APACHE SOLR & HADOOP)
HDP Search Overview (APACHE SOLR & HADOOP)Chris Casano
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiManish Gupta
 
Big data on Azure for Architects
Big data on Azure for ArchitectsBig data on Azure for Architects
Big data on Azure for ArchitectsTomasz Kopacz
 
The Practice of Alluxio in JD.com
The Practice of Alluxio in JD.comThe Practice of Alluxio in JD.com
The Practice of Alluxio in JD.comAlluxio, Inc.
 
Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...Continuent
 
Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics
Real-time Data Loading from Oracle and MySQL to Data Warehouses, AnalyticsReal-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics
Real-time Data Loading from Oracle and MySQL to Data Warehouses, AnalyticsContinuent
 
Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...Continuent
 
Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?
Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?
Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?M. Fevzi Korkutata
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/HudiVinoth Chandar
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
SQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopSQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopMukund Babbar
 
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015NoSQLmatters
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 

Similar to Apache Falcon - Simplifying Managing Data Jobs on Hadoop (20)

Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)
Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)
Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)
 
Apache Falcon _ Hadoop User Group France 22-sept-2014
Apache Falcon _ Hadoop User Group France 22-sept-2014Apache Falcon _ Hadoop User Group France 22-sept-2014
Apache Falcon _ Hadoop User Group France 22-sept-2014
 
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache HiveDiscover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
 
Apache Falcon - Sanjeev Tripurari
Apache Falcon - Sanjeev TripurariApache Falcon - Sanjeev Tripurari
Apache Falcon - Sanjeev Tripurari
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
 
HDP Search Overview (APACHE SOLR & HADOOP)
HDP Search Overview (APACHE SOLR & HADOOP)HDP Search Overview (APACHE SOLR & HADOOP)
HDP Search Overview (APACHE SOLR & HADOOP)
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Big data on Azure for Architects
Big data on Azure for ArchitectsBig data on Azure for Architects
Big data on Azure for Architects
 
The Practice of Alluxio in JD.com

Apache Falcon - Simplifying Managing Data Jobs on Hadoop

  • 1. Data Management Platform on Hadoop Srikanth Sundarrajan Venkatesh Seetharam (Incubating)
  • 2. whoami. Srikanth Sundarrajan: Principal Architect, InMobi; PMC/Committer, Apache Falcon; Apache Hadoop Contributor; Hadoop Team @ Yahoo!. Venkatesh Seetharam: Architect/Developer, Hortonworks Inc.; Apache Falcon Committer, IPMC; Apache Knox Committer; Apache Hadoop, Sqoop, Oozie Contributor; Hadoop team at Yahoo! since 2007; Built 2 generations of Data Management at Yahoo!. (© Hortonworks Inc. 2011, Architecting the Future of Big Data)
  • 3. Agenda: 1 Motivation, 2 Falcon Overview, 3 Falcon Architecture, 4 Case Studies
  • 5. Data Processing Landscape: External data source, Acquire (Import), Data Processing (Transform/Pipeline), Eviction, Archive, Replicate (Copy), Export
  • 6. Core Services. Process Management: relays, late data handling, retries. Data Management: import/export, replication, retention. Data Governance: lineage, audit, SLA.
  • 8. Holistic Declaration of Intent (picture courtesy: http://bigboxdetox.com)
  • 9. Entity Dependency Graph (diagram labels): Cluster (Hadoop / HBase …), feed (external data source), Process, joined by two "depends" edges
  • 10. Cluster Specification
        <?xml version="1.0"?>
        <cluster colo="NJ-datacenter" description="" name="prod-cluster">
          <interfaces>
            <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
            <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
            <interface type="execute" endpoint="rm:8050" version="2.2.0" />
            <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
            <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0" />
            <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
          </interfaces>
          <locations>
            <location name="staging" path="/apps/falcon/prod-cluster/staging" />
            <location name="temp" path="/tmp" />
            <location name="working" path="/apps/falcon/prod-cluster/working" />
          </locations>
        </cluster>
        Slide callouts, in order:
        - readonly: Needed by distcp for replications
        - write: Writing to HDFS
        - execute: Used to submit processes as MR
        - workflow: Submit Oozie jobs
        - registry: Hive metastore to register/deregister partitions and get events on partition availability
        - messaging: Used for alerts
        - locations: HDFS directories used by Falcon server
  • 11. Feed Specification
        <?xml version="1.0"?>
        <feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
          <frequency>hours(1)</frequency>
          <late-arrival cut-off="hours(6)"/>
          <groups>churnAnalysisFeeds</groups>
          <tags>externalSource=TeradataEDW-1, externalTarget=Marketing</tags>
          <clusters>
            <cluster name="cluster-primary" type="source">
              <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
              <retention limit="days(2)" action="delete"/>
            </cluster>
            <cluster name="cluster-secondary" type="target">
              <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
              <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
              <retention limit="days(7)" action="delete"/>
            </cluster>
          </clusters>
          <locations>
            <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
          </locations>
          <ACL owner="hdfs" group="users" permission="0755"/>
          <schema location="/none" provider="none"/>
        </feed>
        Slide callouts:
        - frequency: Feed run frequency in mins/hrs/days/mths
        - late-arrival: Late arrival cutoff
        - groups: Feeds can belong to multiple groups
        - tags: Metadata tagging
        - clusters: One or more source & target clusters for retention & replication
        - locations: Global location across clusters - HDFS paths or Hive tables
        - ACL: Access Permissions
  • 12. Process Specification
        <process name="process-test" xmlns="uri:falcon:process:0.1">
          <clusters>
            <cluster name="cluster-primary">
              <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z" />
            </cluster>
          </clusters>
          <parallel>1</parallel>
          <order>FIFO</order>
          <frequency>days(1)</frequency>
          <inputs>
            <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
          </inputs>
          <outputs>
            <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
          </outputs>
          <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
          <retry policy="periodic" delay="minutes(10)" attempts="3"/>
          <late-process policy="exp-backoff" delay="hours(1)">
            <late-input input="input" workflow-path="/apps/clickstream/late" />
          </late-process>
        </process>
        Slide callouts:
        - clusters: Which cluster should the process run on and when
        - parallel, order, frequency: How frequently does the process run, how many instances can be run in parallel and in what order
        - inputs, outputs: Input & output feeds for process
        - workflow: The processing logic
        - retry: Retry policy on failure
        - late-process: Handling late input feeds
  • 13. Late Data Handling
        - Defines how the late (out of band) data is handled
        - Each Feed can define a late cut-off value:
          <late-arrival cut-off="hours(4)"/>
        - Each Process can define how this late data is handled:
          <late-process policy="exp-backoff" delay="hours(1)">
            <late-input input="input" workflow-path="/apps/clickstream/late" />
          </late-process>
        - Policies include: backoff, exp-backoff, final
  • 14. Retry Policies
        - Each Process can define a retry policy:
          <process name="[process name]">
            ...
            <retry policy=[retry policy] delay=[retry delay] attempts=[attempts]/>
            <retry policy="backoff" delay="minutes(10)" attempts="3"/>
            ...
          </process>
        - Policies include: backoff, exp-backoff
  • 16. Falcon: One-stop Shop for Data Management
        - Apache Falcon provides (data management needs): Multi Cluster Management, Replication, Scheduling, Data Reprocessing, Dependency Management, Eviction, Governance
        - Apache Falcon orchestrates (tools): Oozie, Sqoop, Distcp, Flume, Map / Reduce, Hive and Pig Jobs
        Falcon provides a single interface to orchestrate data lifecycle. Sophisticated DLM easily added to Hadoop applications.
  • 19. Feed Schedule Cluster xml Feed xml Falcon Falcon config store / Graph Retention / Replication workflow Oozie Scheduler HDFS JMS Notification per action Catalog service Instance Management
  • 20. Process Schedule Cluster/feed xml Process xml Falcon Falcon config store / Graph Process workflow Oozie Scheduler HDFS JMS Notification per available feed Catalog service Instance Management
  • 21. Physical Architecture
        - STANDALONE: Single Data Center; Single Falcon Server; Hadoop jobs and relevant processing involve only one cluster
        - DISTRIBUTED: Multiple Data Centers; Falcon Server per DC; Multiple instances of Hadoop clusters and workflow schedulers
        Diagram labels: Site 1, Site 2, HADOOP Store & Process, replication, Falcon Server (standalone), Falcon Prism Server (distributed)
  • 23. Multi Cluster – Failover > Falcon manages workflow, replication or both. > Enables business continuity without requiring full data reprocessing. > Failover clusters require less storage and CPU. Staged Data Cleansed Data Conformed Data Presented Data Staged Data Presented Data BI and Analytics Primary Hadoop Cluster Failover Hadoop Cluster Replication
  • 24. Retention Policies Staged Data Retain 5 Years Cleansed Data Retain 3 Years Conformed Data Retain 3 Years Presented Data Retain Last Copy Only > Sophisticated retention policies expressed in one place. > Simplify data retention for audit, compliance, or for data re-processing.
  • 25. CASE STUDY Distributed Processing Example: Digital Advertising @ InMobi
  • 26. Processing – Single Data Center: Ad Request data, Impression render event, Click event, Conversion event; Continuous Streaming (minutely); Enrichment (minutely/5 minutely); Summarizer; Hourly summary
  • 27. Global Aggregation: DataCenter1 …….. DataCenterN, each with Ad Request data, Impression render event, Click event, Conversion event; Continuous Streaming (minutely); Enrichment (minutely/5 minutely); Summarizer; Hourly summary. Consumable global aggregate.
  • 29. Future: Data Governance, Data Pipeline Designer, Authorization, Monitoring/Management Dashboard
  • 31. Questions? Apache Falcon: http://falcon.incubator.apache.org, mailto: dev@falcon.incubator.apache.org. Srikanth Sundarrajan: sriksun@apache.org, #sriksun. Venkatesh Seetharam: venkatesh@apache.org, #innerzeal
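
The retry and late-data settings on slides 12 to 14 come down to simple time arithmetic. The Python sketch below is illustrative only and is not Falcon's implementation. The helper names `retry_delays` and `within_late_cutoff` are hypothetical. It assumes that `periodic` waits a fixed delay before every attempt and that `exp-backoff` doubles the delay on each attempt; the slides do not spell out either schedule, and the other policy names they list (`backoff`, `final`) are not modeled.

```python
from datetime import datetime, timedelta, timezone


def retry_delays(policy: str, delay: timedelta, attempts: int) -> list:
    """Return the wait before each retry attempt.

    Assumption (not taken from the slides): "periodic" reuses the same
    delay for every attempt, "exp-backoff" doubles it each time.
    """
    if policy == "periodic":
        return [delay for _ in range(attempts)]
    if policy == "exp-backoff":
        return [delay * (2 ** i) for i in range(attempts)]
    raise ValueError(f"policy not modeled in this sketch: {policy}")


def within_late_cutoff(instance_time: datetime,
                       arrival_time: datetime,
                       cut_off: timedelta) -> bool:
    """True if data for the feed instance nominally due at instance_time
    arrives no later than cut_off after that time (assumed semantics of
    the feed's late-arrival cut-off)."""
    return arrival_time - instance_time <= cut_off


if __name__ == "__main__":
    # Values mirror slide 12: delay="minutes(10)", attempts="3".
    for policy in ("periodic", "exp-backoff"):
        waits = retry_delays(policy, timedelta(minutes=10), 3)
        print(policy, [int(w.total_seconds() // 60) for w in waits])

    # Values mirror slide 13: cut-off="hours(4)".
    due = datetime(2012, 1, 1, 0, 0, tzinfo=timezone.utc)
    print(within_late_cutoff(due, due + timedelta(hours=3), timedelta(hours=4)))
```

Under these assumptions, the slide 12 settings give three 10-minute waits for `periodic`, while `exp-backoff` would wait 10, 20 and 40 minutes.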

Editor's Notes

  1. In a typical big data environment involving Hadoop, the use cases tend to revolve around processing very large volumes of data for either machine or human consumption. Some of the data that reaches the Hadoop platform can contain critical business and financial information. The data processing team in such an environment is often distracted by a multitude of data management and process orchestration challenges. To name a few:
       • Ingesting large volumes of events/streams
       • Ingesting slowly changing data, typically available in a traditional database
       • Creating a pipeline / sequence of processing logic to extract the desired piece of insight / information
       • Handling processing complexities relating to change of data / failures
       • Managing eviction of older data elements
       • Backing up the data in an alternate location, or archiving it in cheaper storage, for DR/BCP and compliance requirements
       • Shipping data out of the Hadoop environment periodically for machine or human consumption, etc.
     These tend to be standard challenges that are better handled in a platform, which allows the data processing team to focus on their core business application. A platform approach also allows us to adopt best practices in solving each of these, for subsequent users of the platform to leverage.
  2. As we just noted that there are numerous data and process management services when made available to the data processing team, can reduce their day-to-day complexities significantly and allow them to focus on their business application. This is an enumeration of such services, which we intend to cover in adequate detail as we go along. More often than not pipelines are sequence of data processing or data movement tasks that need to happen before raw data can be transformed into a meaningfully consumable form. Normally the end stage of the pipeline where the final sets of data are produced is in the critical path and may be subject to tight SLA bounds. Any step in the sequence/pipeline if either delayed or failed could cause the pipeline to stall. It is important that each step in the pipeline handoff to the next step to avoid any buffering of time and to allow seamless progression of the pipeline. People who are familiar with Apache Oozie might be able to appreciate this feature provided through the Coordinator. As the pipelines gets more and more time critical and time sensitive, this becomes very very critical and this ought to be available off the shelf for application developers. It is also important for this feature to scalable to support the needs of concurrent pipelines. A fact that data volumes are large and increasing by the day is the reason one adopts a big data platform like Hadoop and that would automatically mean that we would run of space pretty soon, if we didn’t take care of evicting & purging older instances of data. 
     A few problems to consider for retention are:
     • Avoid using a general-purpose super user with world-writable privileges to delete old data (for obvious reasons)
     • Different types of data may require different criteria for aging and hence purging
     • Other life cycle functions, like archival of old data, if defined ought to be scheduled before eviction kicks in
     Hadoop is becoming increasingly critical for many businesses. For some users the raw data volumes are too large to be shipped to one place for processing; for others, data needs to be redundantly available for business continuity reasons. In either scenario, replication of data from one cluster to another plays a vital role. Offering this as a service would again free the application developer from these responsibilities. The key challenges to consider while offering this as a service are:
     • Bandwidth consumption and management
     • Chunking/bulking strategy
     • Correctness guarantees
     • HDFS version compatibility issues
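Returning to the retention concerns in this note (a different aging limit per type of data, and archival scheduled before eviction), the idea can be sketched roughly as follows. This is a minimal illustration, not Falcon's implementation: `RetentionPolicy`, `plan_lifecycle`, the feed type names and the paths are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class RetentionPolicy(NamedTuple):
    """Hypothetical per-feed-type policy: how long to keep data, and whether to archive first."""
    limit: timedelta
    archive_first: bool


def plan_lifecycle(instances, policies, now, archived=frozenset()):
    """Split expired instances into those to archive and those safe to evict.

    instances: iterable of (feed_type, path, instance_time) tuples.
    archived:  paths that have already been archived.
    """
    to_archive, to_evict = [], []
    for feed_type, path, instance_time in instances:
        policy = policies[feed_type]
        if now - instance_time <= policy.limit:
            continue  # still within its retention window
        if policy.archive_first and path not in archived:
            to_archive.append(path)  # eviction must wait for archival
        else:
            to_evict.append(path)
    return to_archive, to_evict


# Illustrative data only; the paths and limits are made up.
policies = {
    "raw-events": RetentionPolicy(timedelta(days=7), archive_first=False),
    "summary": RetentionPolicy(timedelta(days=90), archive_first=True),
}
instances = [
    ("raw-events", "/data/raw/2014-05-20", datetime(2014, 5, 20)),
    ("raw-events", "/data/raw/2014-05-30", datetime(2014, 5, 30)),
    ("summary", "/data/summary/2014-01-01", datetime(2014, 1, 1)),
]
now = datetime(2014, 6, 1)

to_archive, to_evict = plan_lifecycle(instances, policies, now)
print(to_archive)  # ['/data/summary/2014-01-01']
print(to_evict)    # ['/data/raw/2014-05-20']
```

Once the summary instance has been archived, passing its path in `archived` moves it to the eviction list on the next run. The deletion itself would still need to run with the data owner's privileges rather than a super user, which this sketch does not model.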
  3. Data lifecycle management is challenging in spite of some good Hadoop tools: a patchwork of tools complicates it. Some of the things we have spoken about so far can be done if we take a siloed approach. For instance, it is possible to process a few data sets and produce a few more through a scheduler. However, if there are two other consumers of the data produced by the first workflow, then the same data will be defined again by each of those consumers, and so on. There is serious duplication of metadata about what data is ingested, processed or produced, where it is processed and how it is produced. A single system that creates a complete view of this can provide a fairly complete picture of what is happening, compared to a collection of independently scheduled applications. Both the production support and application development teams on the Hadoop platform have to scramble and write custom scripts and monitoring systems to get a broader, holistic view of what is happening. An approach where this information is systematically collected and used for seamless management can alleviate much of the pain of those operating or developing data processing applications on Hadoop.
     There is a tendency to burn in feed locations, apps, cluster locations and cluster services. But things may change over time:
     • Where you ingest from, the feed frequency, file locations, file formats, format conversions, compressions, the app, …
     • You may end up with multiple clusters
     • A dataset location may be different in different clusters
     • Some datasets and apps may move from one cluster to another
     • Things are slightly different in the BCP cluster
  4. The entity graph at the core is what makes Falcon what it is, and in a way it enables all the unique features that Falcon has to offer or can potentially make available in the future. At the core:
     • Dependency between
       – Data
       – Processing logic
       – Cluster end points
     • Rules governing
       – Data management
       – Processing management
       – Metadata management
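To make the dependency idea concrete, here is a small sketch of an entity graph in which feeds depend on clusters and processes depend on feeds and clusters. It only illustrates the concept; the `EntityGraph` class and the entity names are hypothetical and are not taken from Falcon's code.

```python
from collections import defaultdict


class EntityGraph:
    """Toy dependency graph between cluster, feed and process entities."""

    def __init__(self):
        self._depends_on = defaultdict(set)   # entity -> entities it needs
        self._dependents = defaultdict(set)   # entity -> entities that need it

    def add_dependency(self, entity, depends_on):
        self._depends_on[entity].add(depends_on)
        self._dependents[depends_on].add(entity)

    def dependencies_of(self, entity):
        """Direct dependencies of an entity."""
        return set(self._depends_on.get(entity, ()))

    def impacted_by(self, entity):
        """Every entity that depends on `entity`, directly or transitively."""
        seen = set()
        stack = [entity]
        while stack:
            current = stack.pop()
            for dependent in self._dependents.get(current, ()):
                if dependent not in seen:
                    seen.add(dependent)
                    stack.append(dependent)
        return seen


# Hypothetical entities, for illustration only.
graph = EntityGraph()
graph.add_dependency("feed:clicks", "cluster:prod-cluster")
graph.add_dependency("process:enrich-clicks", "feed:clicks")
graph.add_dependency("process:enrich-clicks", "cluster:prod-cluster")

print(sorted(graph.impacted_by("cluster:prod-cluster")))
# ['feed:clicks', 'process:enrich-clicks']
```

With the dependencies held in one place, a question such as "what is affected if this cluster end point changes?" becomes a graph traversal instead of a search through independently maintained job definitions.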
  5. The workflow is re-tried after 10 mins, 20 mins and 30 mins. With exponential backoff, the workflow will be re-tried after 10 mins, 20 mins and 40 mins.
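The arithmetic behind these figures can be sketched as below, assuming a base delay of 10 minutes and three attempts. The function name and the policy labels are illustrative, and the formulas are simply one reading that reproduces the numbers quoted in the note (a fixed step for the first case, a doubling delay for exponential backoff); they are not presented as Falcon's actual scheduling code.

```python
def retry_schedule(delay_minutes, attempts, policy="periodic"):
    """Return the retry times in minutes, one entry per attempt.

    "periodic":    the n-th retry fires at n * delay.
    "exponential": the n-th retry fires at delay * 2**(n - 1).
    Both labels are hypothetical names chosen for this sketch.
    """
    if policy == "periodic":
        return [delay_minutes * n for n in range(1, attempts + 1)]
    if policy == "exponential":
        return [delay_minutes * 2 ** (n - 1) for n in range(1, attempts + 1)]
    raise ValueError(f"unknown retry policy: {policy!r}")


print(retry_schedule(10, 3))                  # [10, 20, 30]
print(retry_schedule(10, 3, "exponential"))   # [10, 20, 40]
```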
  6. Falcon provides the key services data processing applications need. Complex data processing logic handled by Falcon instead of hard-coded in apps. Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.
  7. • System accepts entities using DSL: Infrastructure, Datasets, Pipeline/Processing logic
     • Transforms the input into automated and scheduled workflows
     • System orchestrates workflows
     • Instruments execution of configured policies
     • Handles retry logic and late data processing
     • Records audit, lineage
     • Seamless integration with metastore/catalog (WIP)
     • Provides notifications based on availability
     • Integrated, seamless experience to users
     • Automates processing and tracks the end-to-end progress
     • Data set management (replication, retention, etc.) offered as a service
     • Users can cherry-pick; no coupling between primitives
     • Provides hooks for monitoring, metrics collection
  8. Ad Request, Click, Impression, Conversion feeds
     • Minutely (with identical location and retention configuration, but with many data centers)
     Summary data
     • Hourly (with multiple partitions, one per DC, each configured as a source, and one target, which is the global data center)
     Click, Impression, Conversion enrichment and Summarizer processes
     • Single definition with multiple data centers
     • Identical periodicity and scheduling configuration