
Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)

By Cedric Carbone (@carbone) and JB Onofre (@jbonofre)
#HUGFR


  1. HUG France - 22 Sept 2014 - @Criteo
     Data Management platform for Hadoop
     Cédric Carbone, Talend CTO, @carbone
     Jean-Baptiste Onofré, Falcon Committer, @jbonofre
     This material is made available under the terms of the Creative Commons Attribution - NonCommercial - NoDerivs 2.0 France licence. http://creativecommons.org/licenses/by-nc-nd/2.0/fr/
     © Talend 2014
  2. Overview
     • Falcon is a data management solution for Hadoop
     • Falcon has been in production at InMobi since 2012
     • InMobi donated Falcon to the ASF in April 2013
     • Falcon is in Apache incubation
     • Falcon is embedded by default in HDP
     • Falcon leverages many Apache components
       - Oozie, Ambari, ActiveMQ, HCat, Sqoop…
     • Committers/PPMC/IPMC:
       - 8 from InMobi
       - 5 from Hortonworks
       - 1 from Talend
  3. Why Falcon?
  4. Why Falcon?
  5. What is Falcon?
     • Data Motion
       - Import, export, CDC
     • Policy-based Lifecycle Management
       - Retention, replication, archival, anonymization of PII data
     • Process orchestration and scheduling
       - Late data handling, reprocessing, dependency checking, etc.
       - Multi-cluster management to support local/global aggregations, rollups, etc.
     • Data Governance
       - Lineage, audit, SLA
  6. Falcon - The Solution!
     • Introduces a higher layer of abstraction: the Data Set
       - Decouples a data location and its properties from workflows
       - Knowing the lifetime of a feed allows implicit validation of the processing rules
     • Provides the key services for data processing apps
       - Common data services are simple directives; there is no need to define them verbosely in each job
       - Lets process owners keep their processing specific to their application logic
       - Sits in the execution path and intercepts to handle OOB data, retries, etc.
     • Promotes polyglot programming
       - Does no heavy lifting itself but delegates to tools within the Hadoop ecosystem
  7. Falcon Basic Concepts: Data Pipelines
     • Cluster: represents the Hadoop cluster
     • Feed: defines a "dataset"
     • Process: consumes feeds, invokes processing logic and produces feeds
     • Entity actions: submit, list, dependency, schedule, suspend, resume, status, definition, delete, update
  8. Falcon Entity Relationships
  9. Cluster Entity

     <?xml version="1.0"?>
     <cluster colo="talend-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1">
       <interfaces>
         <!-- Needed by distcp for replications -->
         <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
         <!-- Writing to HDFS -->
         <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
         <!-- Used to submit processes as MR -->
         <interface type="execute" endpoint="rm:8050" version="2.2.0" />
         <!-- Submit Oozie jobs -->
         <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
         <!-- Hive metastore to register/deregister partitions and get events on partition availability -->
         <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0" />
         <!-- Used for alerts -->
         <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
       </interfaces>
       <!-- HDFS directories used by the Falcon server -->
       <locations>
         <location name="staging" path="/apps/falcon/prod-cluster/staging" />
         <location name="temp" path="/tmp" />
         <location name="working" path="/apps/falcon/prod-cluster/working" />
       </locations>
     </cluster>
  10. Feed Entity

      <?xml version="1.0"?>
      <feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
        <!-- Feed run frequency in mins/hrs/days/mths -->
        <frequency>hours(1)</frequency>
        <!-- Late arrival cutoff -->
        <late-arrival cut-off="hours(6)"/>
        <!-- Feeds can belong to multiple groups -->
        <groups>churnAnalysisFeeds</groups>
        <!-- One or more source & target clusters for retention & replication -->
        <clusters>
          <cluster name="cluster-primary" type="source">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
          </cluster>
          <cluster name="cluster-secondary" type="target">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            <retention limit="days(7)" action="delete"/>
          </cluster>
        </clusters>
        <!-- Global location across clusters - HDFS paths or Hive tables -->
        <locations>
          <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        </locations>
        <!-- Access permissions -->
        <ACL owner="hdfs" group="users" permission="0755"/>
        <schema location="/none" provider="none"/>
      </feed>
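      The feed on this slide points at HDFS paths, but the annotation notes that a feed location can also be a Hive table. As a rough sketch of what that might look like, assuming Falcon's `catalog:` table URI form and a cluster that declares a `registry` interface: the feed name `testTableFeed`, the database `default`, the table `weblogs` and the partition key `ds` are all hypothetical and do not come from the deck.

      ```xml
      <?xml version="1.0"?>
      <!-- Hypothetical table-backed variant of testFeed; feed, database, table and partition key names are assumptions -->
      <feed description="" name="testTableFeed" xmlns="uri:falcon:feed:0.1">
        <frequency>hours(1)</frequency>
        <late-arrival cut-off="hours(6)"/>
        <clusters>
          <cluster name="cluster-primary" type="source">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
          </cluster>
        </clusters>
        <!-- Takes the place of the HDFS locations block: catalog:database:table#partition-key=value -->
        <table uri="catalog:default:weblogs#ds=${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <ACL owner="hdfs" group="users" permission="0755"/>
        <schema location="/none" provider="none"/>
      </feed>
      ```

      In this variant a feed instance is a table partition rather than a directory, so retention would presumably act on partitions registered in the Hive metastore.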
  11. Process Entity

      <process name="process-test" xmlns="uri:falcon:process:0.1">
        <!-- Which cluster should the process run on and when -->
        <clusters>
          <cluster name="cluster-primary">
            <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z" />
          </cluster>
        </clusters>
        <!-- How frequently the process runs, how many instances can run in parallel and in what order -->
        <parallel>1</parallel>
        <order>FIFO</order>
        <frequency>days(1)</frequency>
        <!-- Input & output feeds for the process -->
        <inputs>
          <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
        </inputs>
        <outputs>
          <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
        </outputs>
        <!-- The processing logic -->
        <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
        <!-- Retry policy on failure -->
        <retry policy="periodic" delay="minutes(10)" attempts="3"/>
        <!-- Handling late input feeds -->
        <late-process policy="exp-backoff" delay="hours(1)">
          <late-input input="input" workflow-path="/apps/clickstream/late" />
        </late-process>
      </process>
  12. High Level Architecture
  13. Demo #1: my first feed
      • Start Falcon on a single cluster
      • Submit one simple feed
      • Check the generated job in Oozie
  14. Replication & Retention
      • Sophisticated retention policies expressed in one place
      • Simplifies data retention for audit, compliance, or data re-processing
  15. Demo #2: process notifications
      • Motivation: be notified of process activity "outside" the cluster
      • Falcon manages the workflow and sends JMS notifications; a Camel route reacts to each notification
      • Result: an action (Camel route) is triggered when Falcon sends a notification (eviction, late arrival, process execution, …)
      • Mixes Big Data/Hadoop and ESB technologies
  16. Demo #2: workflow
      Falcon → ActiveMQ → Camel

      2014-03-19 11:25:43,273 | INFO | LCON.my-process] | process-listener | rg.apache.camel.util.CamelLogger 176 | 74 - org.apache.camel.camel-core - 2.13.0.SNAPSHOT | Exchange[ExchangePattern: InOnly, BodyType: java.util.HashMap, Body: {brokerUrl=tcp://localhost:61616, timeStamp=2014-03-19T10:24Z, status=SUCCEEDED, logFile=hdfs://localhost:8020/falcon/staging/falcon/workflows/process/my-process/logs/instancePaths-2013-11-15-06-05.csv, feedNames=output, runId=0, entityType=process, nominalTime=2013-11-15T06:05Z, brokerTTL=4320, workflowUser=null, entityName=my-process, feedInstancePaths=hdfs://localhost:8020/data/output, operation=GENERATE, logDir=null, workflowId=0000026-140319105443372-oozie-jbon-W, cluster=local, brokerImplClass=org.apache.activemq.ActiveMQConnectionFactory, topicName=FALCON.my-process}]
  17. Demo #2: Data Notification
      • All notifications are published to ActiveMQ
      • Subscribers: Camel routes
      • Add some files and see the notification system working!
      • http://blog.nanthrax.net/2014/03/hadoop-cdc-and-processes-notification-with-apache-falcon-apache-activemq-and-apache-camel/
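      The subscriber side of Demo #2 could look roughly like the Camel route below, written in the Blueprint XML DSL. This is only a sketch, not necessarily the route used in the demo: it assumes a container (for example Apache Karaf) with `camel-jms` and the ActiveMQ client available. The broker URL, the topic name `FALCON.my-process` and the `process-listener` name are taken from the log output on the workflow slide; the bean id `jms` is an arbitrary choice.

      ```xml
      <?xml version="1.0" encoding="UTF-8"?>
      <blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0">

        <!-- JMS component bound to the ActiveMQ broker that Falcon publishes to (URL assumed from the demo log) -->
        <bean id="jms" class="org.apache.camel.component.jms.JmsComponent">
          <property name="connectionFactory">
            <bean class="org.apache.activemq.ActiveMQConnectionFactory">
              <property name="brokerURL" value="tcp://localhost:61616"/>
            </bean>
          </property>
        </bean>

        <camelContext xmlns="http://camel.apache.org/schema/blueprint">
          <!-- Subscribe to the per-process topic and log each notification -->
          <route id="process-listener">
            <from uri="jms:topic:FALCON.my-process"/>
            <to uri="log:process-listener"/>
          </route>
        </camelContext>

      </blueprint>
      ```

      Replacing the `log:` endpoint with any other Camel endpoint is what turns a notification into an action outside the cluster.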
  18. Data Pipeline Tracing
  19. Topologies
      • STANDALONE
        - Single data center
        - Single Falcon server
        - Hadoop jobs and relevant processing involve only one cluster
      • DISTRIBUTED
        - Multiple data centers
        - One Falcon server per DC
        - Multiple instances of Hadoop clusters and workflow schedulers
  20. Multi-cluster failover
      • Motivation: replicate a subset of data from one cluster to another, while guaranteeing a different eviction policy depending on the data subset
      • Falcon manages the workflow on the primary cluster and the replication to the failover cluster
      • Result: business continuity is supported without requiring full data reprocessing
  21. Roadmap
      • Improvements to the late-arrival messages in FALCON.ENTITY.TOPIC (ActiveMQ)
      • Implicit feed processes (no process definition needed) providing "native" CDC
      • Straightforward MR usage (without Pig, Oozie workflow XML, …)
      • More data acquisition
      • Monitoring/management/designer dashboard
      • TLP (Apache top-level project)!!!
  22. Questions?
      Cédric Carbone, Talend CTO, @carbone
      Jean-Baptiste Onofré, Falcon Committer, @jbonofre
