HUG France - 22 Sept 2014 - @Criteo 
Data Management platform for Hadoop 
Cédric Carbone, Talend CTO 
@carbone 
Jean-Baptiste Onofré, Falcon Committer 
@jbonofre 
This material is made available under the terms of the Creative Commons 
Attribution - NonCommercial - NoDerivs 2.0 France licence. 
http://creativecommons.org/licenses/by-nc-nd/2.0/fr/ 
© Talend 2014 1
Overview 
• Falcon is a Data Management solution for Hadoop 
• Falcon has been in production at InMobi since 2012 
• InMobi donated Falcon to the ASF in April 2013 
• Falcon is in the Apache Incubator 
• Falcon is included by default in HDP 
• Falcon leverages many Apache components 
- Oozie, Ambari, ActiveMQ, HCat, Sqoop… 
• Committers/PPMC/IPMC members: 
- 8 from InMobi 
- 5 from Hortonworks 
- 1 from Talend 
Why Falcon? 
What is Falcon? 
• Data Motion: import, export, CDC 
• Policy-based Lifecycle Management: retention, replication, archival, anonymization of PII data 
• Process orchestration and scheduling: late data handling, reprocessing, dependency checking, etc. 
• Multi-cluster management to support local/global aggregations, rollups, etc. 
• Data Governance: lineage, audit, SLA 
Falcon - The Solution! 
• Introduces a higher layer of abstraction: the Data Set 
- Decouples a data location and its properties from workflows 
- Understanding the lifetime of a feed allows implicit validation of the processing rules 
• Provides the key services for data processing apps 
- Common data services are simple directives; there is no need to define them verbosely in each job 
- Allows process owners to keep their processing specific to their application logic 
- Sits in the execution path and intercepts to handle OOB data, retries, etc. 
• Promotes polyglot programming 
- Does not do any heavy lifting but delegates to tools within the Hadoop ecosystem 
Falcon Basic Concepts : Data Pipelines 
• Cluster: Represents the Hadoop cluster 
• Feed: Defines a “dataset” 
• Process: Consumes feeds, invokes processing logic & produces feeds 
• Entity Actions: submit, list, dependency, schedule, suspend, resume, status, 
definition, delete, update 
Falcon Entity Relationships 
Cluster Entity 
<?xml version="1.0"?>
<cluster colo="talend-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- Needed by distcp for replications -->
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
    <!-- Writing to HDFS -->
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
    <!-- Used to submit processes as MR -->
    <interface type="execute" endpoint="rm:8050" version="2.2.0" />
    <!-- Submit Oozie jobs -->
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
    <!-- Hive metastore to register/deregister partitions and get events on partition availability -->
    <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0" />
    <!-- Used for alerts -->
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
  </interfaces>
  <!-- HDFS directories used by the Falcon server -->
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging" />
    <location name="temp" path="/tmp" />
    <location name="working" path="/apps/falcon/prod-cluster/working" />
  </locations>
</cluster>
Feed Entity 
<?xml version="1.0"?>
<feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency in mins/hrs/days/mths -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cutoff -->
  <late-arrival cut-off="hours(6)"/>
  <!-- Feeds can belong to multiple groups -->
  <groups>churnAnalysisFeeds</groups>
  <!-- One or more source and target clusters for retention and replication -->
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <locations>
        <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
  <!-- Global location across clusters: HDFS paths or Hive tables -->
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <!-- Access permissions -->
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
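The feed's global location can point to a Hive table rather than an HDFS path. The fragment below is a hedged sketch of that variant, assuming Falcon's catalog URI form (catalog:database:table#partition-key=value). The database "default", the table "weblogs" and the partition key "ds" are hypothetical names chosen for illustration. The element would take the place of the <locations> block in the feed, and Falcon would reach the metastore through the cluster's registry interface.

```xml
<!-- Hypothetical Hive-backed variant of the feed location.
     "default", "weblogs" and "ds" are illustrative names, not from the deck. -->
<table uri="catalog:default:weblogs#ds=${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
```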
Process Entity 
<process name="process-test" xmlns="uri:falcon:process:0.1">
  <!-- Which cluster should the process run on and when -->
  <clusters>
    <cluster name="cluster-primary">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z" />
    </cluster>
  </clusters>
  <!-- How frequently does the process run, how many instances can be run in parallel and in what order -->
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <!-- Input and output feeds for the process -->
  <inputs>
    <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
  </inputs>
  <outputs>
    <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
  </outputs>
  <!-- The processing logic -->
  <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
  <!-- Retry policy on failure -->
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
  <!-- Handling late input feeds -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late" />
  </late-process>
</process>
High Level Architecture 
Demo #1 : my first feed 
• Start Falcon on a single cluster 
• Submission of one simple feed 
• Check the generated job in Oozie 
Replication & Retention 
• Sophisticated retention policies expressed in one place 
• Simplify data retention for audit, compliance, or for data re-processing 
Demo #2: process notifications 
• Motivation: be notified about process activity “outside” of the cluster 
• Falcon manages the workflow and sends a JMS notification; a Camel route reacts to the notification 
• Result: an action (Camel route) is triggered when Falcon sends a notification (eviction, late-arrival, process execution, …) 
• Mixes Big Data/Hadoop and ESB technologies 
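A Camel route that reacts to a Falcon notification could look like the following. This is a minimal sketch, not the demo's actual code. It assumes the Camel Blueprint XML DSL running in an OSGi container with camel-jms and the ActiveMQ client installed. The topic name FALCON.my-process, the broker URL tcp://localhost:61616 and the route id process-listener mirror the notification shown in this deck; the bean id "jms" is an illustrative choice.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0">

  <!-- JMS component bound to the ActiveMQ broker that Falcon publishes to.
       The bean id "jms" is an assumption; it becomes the URI scheme used below. -->
  <bean id="jms" class="org.apache.camel.component.jms.JmsComponent">
    <property name="connectionFactory">
      <bean class="org.apache.activemq.ActiveMQConnectionFactory">
        <property name="brokerURL" value="tcp://localhost:61616"/>
      </bean>
    </property>
  </bean>

  <camelContext xmlns="http://camel.apache.org/schema/blueprint">
    <!-- Subscribe to the per-process topic and log each notification. -->
    <route id="process-listener">
      <from uri="jms:topic:FALCON.my-process"/>
      <to uri="log:process-listener"/>
    </route>
  </camelContext>

</blueprint>
```

Swapping the log endpoint for another Camel endpoint is what would turn the notification into a real action, such as calling a service or writing a file.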
Demo #2: workflow 
Falcon -> ActiveMQ -> Camel 
2014-03-19 11:25:43,273 | INFO | LCON.my-process] | process-listener | rg.apache.camel.util.CamelLogger 176 | 74 - org.apache.camel.camel-core - 2.13.0.SNAPSHOT | Exchange[ExchangePattern: InOnly, BodyType: java.util.HashMap, Body: {brokerUrl=tcp://localhost:61616, timeStamp=2014-03-19T10:24Z, status=SUCCEEDED, logFile=hdfs://localhost:8020/falcon/staging/falcon/workflows/process/my-process/logs/instancePaths-2013-11-15-06-05.csv, feedNames=output, runId=0, entityType=process, nominalTime=2013-11-15T06:05Z, brokerTTL=4320, workflowUser=null, entityName=my-process, feedInstancePaths=hdfs://localhost:8020/data/output, operation=GENERATE, logDir=null, workflowId=0000026-140319105443372-oozie-jbon-W, cluster=local, brokerImplClass=org.apache.activemq.ActiveMQConnectionFactory, topicName=FALCON.my-process}] 
Demo #2 : Data Notification 
• All notifications go through ActiveMQ 
• Subscribers: Camel routes 
• Add some files and see the notification system working! 
• http://blog.nanthrax.net/2014/03/hadoop-cdc-and-processes-notification-with-apache-falcon-apache-activemq-and-apache-camel/ 
Data Pipeline Tracing 
Topologies 
• STANDALONE 
– Single Data Center 
– Single Falcon Server 
– Hadoop jobs and relevant processing involves only one cluster 
• DISTRIBUTED 
– Multiple Data Centers 
– Falcon Server per DC 
– Multiple instances of Hadoop clusters and workflow schedulers 
Multi cluster failover 
• Motivation: replicate a subset of data from one cluster to another, and guarantee a different eviction policy depending on the data subset 
• Falcon manages the workflow on the primary cluster and the replication to the failover cluster 
• Result: business continuity is supported without requiring full data reprocessing 
Roadmap 
• Improvements to the late-arrival messages in FALCON.ENTITY.TOPIC (ActiveMQ) 
• Implicit feed processes (no process needed) providing “native” CDC 
• Straightforward MR usage (without Pig, Oozie workflow XML, …) 
• More data acquisition 
• Monitoring/Management/Designer dashboard 
• TLP!!! 
Questions? 

More Related Content

What's hot

Discover hdp 2.2 hdfs - final
Discover hdp 2.2   hdfs - finalDiscover hdp 2.2   hdfs - final
Discover hdp 2.2 hdfs - finalHortonworks
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Hortonworks
 
End-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentEnd-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentDataWorks Summit/Hadoop Summit
 
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopDiscover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopHortonworks
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNDataWorks Summit
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureDataWorks Summit
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleDataWorks Summit/Hadoop Summit
 
Operating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsOperating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsDataWorks Summit/Hadoop Summit
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course WorkshopDataWorks Summit
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureDataWorks Summit
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDataWorks Summit
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...DataWorks Summit
 

What's hot (20)

Discover hdp 2.2 hdfs - final
Discover hdp 2.2   hdfs - finalDiscover hdp 2.2   hdfs - final
Discover hdp 2.2 hdfs - final
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture
 
Toward Better Multi-Tenancy Support from HDFS
Toward Better Multi-Tenancy Support from HDFSToward Better Multi-Tenancy Support from HDFS
Toward Better Multi-Tenancy Support from HDFS
 
End-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentEnd-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service Deployment
 
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopDiscover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
 
Operating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsOperating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and Improvements
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course Workshop
 
IoT:what about data storage?
IoT:what about data storage?IoT:what about data storage?
IoT:what about data storage?
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
 

Viewers also liked

Big Data Paris: Etude de Cas: KPMG, l’innovation continue grâce au Data Lake ...
Big Data Paris: Etude de Cas: KPMG, l’innovation continue grâce au Data Lake ...Big Data Paris: Etude de Cas: KPMG, l’innovation continue grâce au Data Lake ...
Big Data Paris: Etude de Cas: KPMG, l’innovation continue grâce au Data Lake ...MongoDB
 
Apache Falcon - Simplifying Managing Data Jobs on Hadoop
Apache Falcon - Simplifying Managing Data Jobs on HadoopApache Falcon - Simplifying Managing Data Jobs on Hadoop
Apache Falcon - Simplifying Managing Data Jobs on HadoopDataWorks Summit
 
Big Data Paris - Air France: Stratégie BigData et Use Cases
Big Data Paris - Air France: Stratégie BigData et Use CasesBig Data Paris - Air France: Stratégie BigData et Use Cases
Big Data Paris - Air France: Stratégie BigData et Use CasesMongoDB
 
Tiered Compilation in Hotspot JVM
Tiered Compilation in Hotspot JVMTiered Compilation in Hotspot JVM
Tiered Compilation in Hotspot JVMIgor Veresov
 
Uccellini Uccellacci
Uccellini UccellacciUccellini Uccellacci
Uccellini Uccellaccifedericoneri
 
Das Prinzip Coworking - Coworking in Köln
Das Prinzip Coworking - Coworking in KölnDas Prinzip Coworking - Coworking in Köln
Das Prinzip Coworking - Coworking in KölnPeter Schreck
 
Liebe Dein Brot / Love your bread
Liebe Dein Brot / Love your breadLiebe Dein Brot / Love your bread
Liebe Dein Brot / Love your breadPeter Schreck
 
Speech videogiochi-pistoia
Speech videogiochi-pistoiaSpeech videogiochi-pistoia
Speech videogiochi-pistoiaLuca De Biase
 
Moviendo Las IDEAS
Moviendo Las IDEASMoviendo Las IDEAS
Moviendo Las IDEAScaaraya
 
Os Rrcc As Bases Do Estado Moderno
Os Rrcc As Bases Do Estado ModernoOs Rrcc As Bases Do Estado Moderno
Os Rrcc As Bases Do Estado ModernoAulagalicia Hxg
 
Il treno della vita
Il treno della vitaIl treno della vita
Il treno della vitafedericoneri
 
Five Attributes to a Successful Big Data Strategy
Five Attributes to a Successful Big Data StrategyFive Attributes to a Successful Big Data Strategy
Five Attributes to a Successful Big Data StrategyPerficient, Inc.
 
Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7mmathipra
 
2015 Find You - Fuzzy Big Data - A Case Study
2015 Find You - Fuzzy Big Data - A Case Study2015 Find You - Fuzzy Big Data - A Case Study
2015 Find You - Fuzzy Big Data - A Case StudyPhilip Topham
 
Power Big Data Analytics with Informatica Cloud Integration for Redshift, Kin...
Power Big Data Analytics with Informatica Cloud Integration for Redshift, Kin...Power Big Data Analytics with Informatica Cloud Integration for Redshift, Kin...
Power Big Data Analytics with Informatica Cloud Integration for Redshift, Kin...Amazon Web Services
 
AWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions ShowcaseAWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions ShowcaseAmazon Web Services
 

Viewers also liked (20)

Big Data Paris: Etude de Cas: KPMG, l’innovation continue grâce au Data Lake ...
Big Data Paris: Etude de Cas: KPMG, l’innovation continue grâce au Data Lake ...Big Data Paris: Etude de Cas: KPMG, l’innovation continue grâce au Data Lake ...
Big Data Paris: Etude de Cas: KPMG, l’innovation continue grâce au Data Lake ...
 
Apache Falcon - Simplifying Managing Data Jobs on Hadoop
Apache Falcon - Simplifying Managing Data Jobs on HadoopApache Falcon - Simplifying Managing Data Jobs on Hadoop
Apache Falcon - Simplifying Managing Data Jobs on Hadoop
 
Big Data Paris - Air France: Stratégie BigData et Use Cases
Big Data Paris - Air France: Stratégie BigData et Use CasesBig Data Paris - Air France: Stratégie BigData et Use Cases
Big Data Paris - Air France: Stratégie BigData et Use Cases
 
Tiered Compilation in Hotspot JVM
Tiered Compilation in Hotspot JVMTiered Compilation in Hotspot JVM
Tiered Compilation in Hotspot JVM
 
Uccellini Uccellacci
Uccellini UccellacciUccellini Uccellacci
Uccellini Uccellacci
 
Das Prinzip Coworking - Coworking in Köln
Das Prinzip Coworking - Coworking in KölnDas Prinzip Coworking - Coworking in Köln
Das Prinzip Coworking - Coworking in Köln
 
Piringundines
PiringundinesPiringundines
Piringundines
 
Liebe Dein Brot / Love your bread
Liebe Dein Brot / Love your breadLiebe Dein Brot / Love your bread
Liebe Dein Brot / Love your bread
 
Speech videogiochi-pistoia
Speech videogiochi-pistoiaSpeech videogiochi-pistoia
Speech videogiochi-pistoia
 
Coworking Germany
Coworking Germany Coworking Germany
Coworking Germany
 
Moviendo Las IDEAS
Moviendo Las IDEASMoviendo Las IDEAS
Moviendo Las IDEAS
 
Os Rrcc As Bases Do Estado Moderno
Os Rrcc As Bases Do Estado ModernoOs Rrcc As Bases Do Estado Moderno
Os Rrcc As Bases Do Estado Moderno
 
ESKibana
ESKibanaESKibana
ESKibana
 
Il treno della vita
Il treno della vitaIl treno della vita
Il treno della vita
 
Big Data - uma visão executiva
Big Data - uma visão executivaBig Data - uma visão executiva
Big Data - uma visão executiva
 
Five Attributes to a Successful Big Data Strategy
Five Attributes to a Successful Big Data StrategyFive Attributes to a Successful Big Data Strategy
Five Attributes to a Successful Big Data Strategy
 
Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7
 
2015 Find You - Fuzzy Big Data - A Case Study
2015 Find You - Fuzzy Big Data - A Case Study2015 Find You - Fuzzy Big Data - A Case Study
2015 Find You - Fuzzy Big Data - A Case Study
 
Power Big Data Analytics with Informatica Cloud Integration for Redshift, Kin...
Power Big Data Analytics with Informatica Cloud Integration for Redshift, Kin...Power Big Data Analytics with Informatica Cloud Integration for Redshift, Kin...
Power Big Data Analytics with Informatica Cloud Integration for Redshift, Kin...
 
AWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions ShowcaseAWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions Showcase
 

Similar to Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)

Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARNDataWorks Summit
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoopmarkgrover
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopHortonworks
 
How does Apache DolphinScheduler (Incubator) support scheduling 100,000-level...
How does Apache DolphinScheduler (Incubator) support scheduling 100,000-level...How does Apache DolphinScheduler (Incubator) support scheduling 100,000-level...
How does Apache DolphinScheduler (Incubator) support scheduling 100,000-level...li David
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDataWorks Summit
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lakeEMC
 
Zero to Manageability in 60 Minutes: Building a Solid Foundation for Oracle E...
Zero to Manageability in 60 Minutes: Building a Solid Foundation for Oracle E...Zero to Manageability in 60 Minutes: Building a Solid Foundation for Oracle E...
Zero to Manageability in 60 Minutes: Building a Solid Foundation for Oracle E...Courtney Llamas
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Seetharam Venkatesh
 
Cloud Roundtable | Pivoltal: Agile platform
Cloud Roundtable | Pivoltal: Agile platformCloud Roundtable | Pivoltal: Agile platform
Cloud Roundtable | Pivoltal: Agile platformCodemotion
 
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics Workbench
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics WorkbenchPivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics Workbench
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics WorkbenchEMC
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsAlluxio, Inc.
 
Oracle Coherence Strategy and Roadmap (OpenWorld, September 2014)
Oracle Coherence Strategy and Roadmap (OpenWorld, September 2014)Oracle Coherence Strategy and Roadmap (OpenWorld, September 2014)
Oracle Coherence Strategy and Roadmap (OpenWorld, September 2014)jeckels
 
DataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupDataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupThomas Weise
 
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013StampedeCon
 
Database failover from client perspective
Database failover from client perspectiveDatabase failover from client perspective
Database failover from client perspectivePriit Piipuu
 
Building Data Streaming Platforms using OpenShift and Kafka
Building Data Streaming Platforms using OpenShift and KafkaBuilding Data Streaming Platforms using OpenShift and Kafka
Building Data Streaming Platforms using OpenShift and KafkaNenad Bogojevic
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?DataWorks Summit
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019alanfgates
 

Similar to Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo) (20)

Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
 
How does Apache DolphinScheduler (Incubator) support scheduling 100,000-level...
How does Apache DolphinScheduler (Incubator) support scheduling 100,000-level...How does Apache DolphinScheduler (Incubator) support scheduling 100,000-level...
How does Apache DolphinScheduler (Incubator) support scheduling 100,000-level...
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Zero to Manageability in 60 Minutes: Building a Solid Foundation for Oracle E...
Zero to Manageability in 60 Minutes: Building a Solid Foundation for Oracle E...Zero to Manageability in 60 Minutes: Building a Solid Foundation for Oracle E...
Zero to Manageability in 60 Minutes: Building a Solid Foundation for Oracle E...
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
 
Cloud Roundtable | Pivoltal: Agile platform
Cloud Roundtable | Pivoltal: Agile platformCloud Roundtable | Pivoltal: Agile platform
Cloud Roundtable | Pivoltal: Agile platform
 
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics Workbench
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics WorkbenchPivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics Workbench
Pivotal: Operationalizing 1000 Node Hadoop Cluster - Analytics Workbench
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
 
Oracle Coherence Strategy and Roadmap (OpenWorld, September 2014)
Oracle Coherence Strategy and Roadmap (OpenWorld, September 2014)Oracle Coherence Strategy and Roadmap (OpenWorld, September 2014)
Oracle Coherence Strategy and Roadmap (OpenWorld, September 2014)
 
DataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupDataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application Meetup
 
OpenStack Murano
OpenStack MuranoOpenStack Murano
OpenStack Murano
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
 
Database failover from client perspective
Database failover from client perspectiveDatabase failover from client perspective
Database failover from client perspective
 
Building Data Streaming Platforms using OpenShift and Kafka
Building Data Streaming Platforms using OpenShift and KafkaBuilding Data Streaming Platforms using OpenShift and Kafka
Building Data Streaming Platforms using OpenShift and Kafka
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
 


Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)

  • 1. HUG France - 22 Sept 2014 - @Criteo. Data Management platform for Hadoop. Cédric Carbone, Talend CTO (@carbone); Jean-Baptiste Onofré, Falcon Committer (@jbonofre). This material is made available under the terms of the Creative Commons Attribution - NonCommercial - NoDerivs 2.0 France license: http://creativecommons.org/licenses/by-nc-nd/2.0/fr/ © Talend 2014
  • 2. Overview
      • Falcon is a data management solution for Hadoop
      • Falcon has been in production at InMobi since 2012
      • InMobi donated Falcon to the ASF in April 2013
      • Falcon is in Apache incubation
      • Falcon is embedded by default in HDP
      • Falcon leverages many Apache components: Oozie, Ambari, ActiveMQ, HCat, Sqoop…
      • Committer/PPMC/IPMC: #8 InMobi, #5 Hortonworks, #1 Talend
  • 3. Why Falcon?
  • 4. Why Falcon?
  • 5. What is Falcon?
      • Data motion: import, export, CDC
      • Policy-based lifecycle management: retention, replication, archival, anonymization of PII data
      • Process orchestration and scheduling: late data handling, reprocessing, dependency checking, etc.; multi-cluster management to support local/global aggregations, rollups, etc.
      • Data governance: lineage, audit, SLA
  • 6. Falcon - The Solution!
      • Introduces a higher layer of abstraction, the Data Set
          - Decouples a data location and its properties from workflows
          - Understanding the lifetime of a feed allows implicit validation of the processing rules
      • Provides the key services for data processing apps
          - Common data services are simple directives; there is no need to define them verbosely in each job
          - Allows process owners to keep their processing specific to their application logic
          - Sits in the execution path and intercepts to handle OOB data, retries, etc.
      • Promotes polyglot programming
          - Does not do any heavy lifting but delegates to tools within the Hadoop ecosystem
  • 7. Falcon Basic Concepts: Data Pipelines
      • Cluster: represents the Hadoop cluster
      • Feed: defines a "dataset"
      • Process: consumes feeds, invokes processing logic & produces feeds
      • Entity actions: submit, list, dependency, schedule, suspend, resume, status, definition, delete, update
  • 8. Falcon Entity Relationships
  • 9. Cluster Entity
      <?xml version="1.0"?>
      <cluster colo="talend-datacenter" description="" name="prod-cluster">
        <interfaces>
          <!-- Needed by distcp for replications -->
          <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
          <!-- Writing to HDFS -->
          <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
          <!-- Used to submit processes as MR -->
          <interface type="execute" endpoint="rm:8050" version="2.2.0" />
          <!-- Submit Oozie jobs -->
          <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
          <!-- Hive metastore to register/deregister partitions and get events on partition availability -->
          <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0" />
          <!-- Used for alerts -->
          <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
        </interfaces>
        <!-- HDFS directories used by Falcon server -->
        <locations>
          <location name="staging" path="/apps/falcon/prod-cluster/staging" />
          <location name="temp" path="/tmp" />
          <location name="working" path="/apps/falcon/prod-cluster/working" />
        </locations>
      </cluster>
  • 10. Feed Entity
      <?xml version="1.0"?>
      <feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
        <!-- Feed run frequency in mins/hrs/days/mths -->
        <frequency>hours(1)</frequency>
        <!-- Late arrival cutoff -->
        <late-arrival cut-off="hours(6)"/>
        <!-- Feeds can belong to multiple groups -->
        <groups>churnAnalysisFeeds</groups>
        <!-- One or more source & target clusters for retention & replication -->
        <clusters>
          <cluster name="cluster-primary" type="source">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
          </cluster>
          <cluster name="cluster-secondary" type="target">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            <retention limit="days(7)" action="delete"/>
          </cluster>
        </clusters>
        <!-- Global location across clusters - HDFS paths or Hive tables -->
        <locations>
          <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        </locations>
        <!-- Access permissions -->
        <ACL owner="hdfs" group="users" permission="0755"/>
        <schema location="/none" provider="none"/>
      </feed>
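To make the feed definition on slide 10 more concrete, here is a minimal Python sketch of how an hourly instance time could map to a concrete path, and how a `days(2)` retention limit could be checked. It is an illustration only, not Falcon's implementation: the helper names `instance_path` and `is_expired` are hypothetical, and the rule "older than the limit means eligible for deletion" is an assumption about what `action="delete"` implies.

```python
from datetime import datetime, timedelta, timezone

# Path template copied from the global <location> of the feed on slide 10.
TEMPLATE = "/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"


def instance_path(template: str, instant: datetime) -> str:
    """Substitute the date variables of a feed location for one instance time.

    Hypothetical helper for illustration; not part of any Falcon API.
    """
    values = {
        "${YEAR}": f"{instant.year:04d}",
        "${MONTH}": f"{instant.month:02d}",
        "${DAY}": f"{instant.day:02d}",
        "${HOUR}": f"{instant.hour:02d}",
    }
    for variable, value in values.items():
        template = template.replace(variable, value)
    return template


def is_expired(instant: datetime, now: datetime, limit: timedelta) -> bool:
    """Assumed rule: an instance older than the retention limit may be deleted."""
    return now - instant > limit


if __name__ == "__main__":
    now = datetime(2014, 9, 22, 12, 0, tzinfo=timezone.utc)
    retention = timedelta(days=2)  # <retention limit="days(2)" action="delete"/>
    for hours_back in (1, 47, 49):
        instant = now - timedelta(hours=hours_back)
        print(instance_path(TEMPLATE, instant), is_expired(instant, now, retention))
```

Keeping the template and the retention limit as data, separate from the processing code, mirrors the point of the slide: the feed describes where data lives and how long it is kept, so jobs do not have to.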
  • 11. Process Entity
      <process name="process-test" xmlns="uri:falcon:process:0.1">
        <!-- Which cluster should the process run on and when -->
        <clusters>
          <cluster name="cluster-primary">
            <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z" />
          </cluster>
        </clusters>
        <!-- How frequently does the process run, how many instances can be run in parallel and in what order -->
        <parallel>1</parallel>
        <order>FIFO</order>
        <frequency>days(1)</frequency>
        <!-- Input & output feeds for process -->
        <inputs>
          <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
        </inputs>
        <outputs>
          <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
        </outputs>
        <!-- The processing logic -->
        <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
        <!-- Retry policy on failure -->
        <retry policy="periodic" delay="minutes(10)" attempts="3"/>
        <!-- Handling late input feeds -->
        <late-process policy="exp-backoff" delay="hours(1)">
          <late-input input="input" workflow-path="/apps/clickstream/late" />
        </late-process>
      </process>
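The `retry` and `late-process` elements on slide 11 name two policies, `periodic` and `exp-backoff`. The Python sketch below shows one plausible reading of the resulting wait times. The slides do not define the exact schedule, so the doubling factor for `exp-backoff` is an assumption, and `parse_duration` and `retry_delays` are hypothetical helpers rather than Falcon functions.

```python
import re
from datetime import timedelta

# Matches duration expressions such as minutes(10), hours(1) or days(2).
# months(N) is left out because timedelta has no month unit.
_DURATION = re.compile(r"(minutes|hours|days)\((\d+)\)")


def parse_duration(expression: str) -> timedelta:
    """Turn a duration expression from an entity definition into a timedelta."""
    match = _DURATION.fullmatch(expression.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {expression!r}")
    unit, amount = match.groups()
    return timedelta(**{unit: int(amount)})


def retry_delays(policy: str, delay: timedelta, attempts: int) -> list:
    """Return the wait before each attempt under the given policy.

    Assumption: 'periodic' waits a constant delay, and 'exp-backoff'
    doubles the delay after every attempt.
    """
    if policy == "periodic":
        return [delay] * attempts
    if policy == "exp-backoff":
        return [delay * (2 ** attempt) for attempt in range(attempts)]
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    # <retry policy="periodic" delay="minutes(10)" attempts="3"/>
    print(retry_delays("periodic", parse_duration("minutes(10)"), 3))
    # Same delay as the late-process element; the attempt count is illustrative.
    print(retry_delays("exp-backoff", parse_duration("hours(1)"), 3))
```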
  • 12. High Level Architecture
  • 13. Demo #1: my first feed
      • Start Falcon on a single cluster
      • Submit one simple feed
      • Check the generated job in Oozie
  • 14. Replication & Retention
      • Sophisticated retention policies expressed in one place
      • Simplify data retention for audit, compliance, or data re-processing
  • 15. Demo #2: process notification
      • Motivation: be notified about process activity "outside" of the cluster
      • Falcon manages the workflow and sends a JMS notification; a Camel route reacts to the notification
      • Result: an action (Camel route) is triggered when Falcon sends a notification (eviction, late arrival, process execution, …)
      • Mixes Big Data/Hadoop and ESB technologies
  • 16. Demo #2: workflow (Falcon, ActiveMQ, Camel)
      2014-03-19 11:25:43,273 | INFO | LCON.my-process] | process-listener | rg.apache.camel.util.CamelLogger 176 | 74 - org.apache.camel.camel-core - 2.13.0.SNAPSHOT | Exchange[ExchangePattern: InOnly, BodyType: java.util.HashMap, Body: {brokerUrl=tcp://localhost:61616, timeStamp=2014-03-19T10:24Z, status=SUCCEEDED, logFile=hdfs://localhost:8020/falcon/staging/falcon/workflows/process/my-process/logs/instancePaths-2013-11-15-06-05.csv, feedNames=output, runId=0, entityType=process, nominalTime=2013-11-15T06:05Z, brokerTTL=4320, workflowUser=null, entityName=my-process, feedInstancePaths=hdfs://localhost:8020/data/output, operation=GENERATE, logDir=null, workflowId=0000026-140319105443372-oozie-jbon-W, cluster=local, brokerImplClass=org.apache.activemq.ActiveMQConnectionFactory, topicName=FALCON.my-process}]
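The message body logged on slide 16 is a map printed as `{key=value, key=value, ...}`. As a rough sketch, a consumer that only has this text form could split it back into fields as follows. The function name `parse_notification` is hypothetical, the sample is a subset of the fields from the log, and the parser assumes that no value contains the separator `", "`.

```python
def parse_notification(body: str) -> dict:
    """Split a '{k=v, k=v}' rendering of a map into a dict of strings.

    Assumes values never contain ', ' and that only the first '=' of each
    pair separates the key from the value.
    """
    inner = body.strip()
    if inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1]
    fields = {}
    for pair in inner.split(", "):
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields


# Subset of the body shown in the log on slide 16.
SAMPLE = (
    "{brokerUrl=tcp://localhost:61616, timeStamp=2014-03-19T10:24Z, "
    "status=SUCCEEDED, feedNames=output, entityType=process, "
    "nominalTime=2013-11-15T06:05Z, entityName=my-process, "
    "operation=GENERATE, logDir=null, cluster=local, "
    "topicName=FALCON.my-process}"
)

if __name__ == "__main__":
    message = parse_notification(SAMPLE)
    if message["status"] == "SUCCEEDED":
        print(message["entityName"], "produced", message["feedNames"])
```

In the demo itself the Camel route receives the body as a `java.util.HashMap`, so no text parsing is needed there; this sketch only applies when working from log output like the line above.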
  • 17. Demo #2: Data Notification
      • All notifications go to ActiveMQ
      • Subscribers: Camel routes
      • Add some files and see the notification system working!
      • http://blog.nanthrax.net/2014/03/hadoop-cdc-and-processes-notification-with-apache-falcon-apache-activemq-and-apache-camel/
  • 18. Data Pipeline Tracing
  • 19. Topologies
      • STANDALONE
          - Single data center
          - Single Falcon server
          - Hadoop jobs and relevant processing involve only one cluster
      • DISTRIBUTED
          - Multiple data centers
          - Falcon server per DC
          - Multiple instances of Hadoop clusters and workflow schedulers
  • 20. Multi-cluster failover
      • Motivation: replicate a subset of data from one cluster to another, and guarantee a different eviction policy depending on the data subset
      • Falcon manages the workflow on the primary cluster and replication on the failover cluster
      • Result: support business continuity without requiring full data reprocessing
  • 21. Roadmap
      • Improvements to the late-arrival messages in FALCON.ENTITY.TOPIC (ActiveMQ)
      • Implicit feed processes (no process needed) providing "native" CDC
      • Straightforward MR usage (without Pig, Oozie workflow XML, …)
      • More data acquisition
      • Monitoring/management/designer dashboard
      • TLP!
  • 22. Questions? Data Management platform for Hadoop. Cédric Carbone, Talend CTO (@carbone); Jean-Baptiste Onofré, Falcon Committer (@jbonofre)