Dataiku Flow and dctc - Berlin Buzzwords
  • 1. Dataiku Flow and dctc: data pipelines made easy
    Berlin Buzzwords 2013
  • 2. About me
    Clément Stenac, @ClementStenac
    CTO @ Dataiku
    Formerly Head of Product R&D @ Exalead (search engine technology)
    (02/06/2013, Dataiku Training – Hadoop for Data Science)
  • 3. Agenda
    The hard life of a Data Scientist
    Dataiku Flow
    DCTC
    Lunch!
  • 4. Follow the Flow (Dataiku - Pig, Hive and Cascading)
    [Flow diagram: sources (tracker log, MongoDB, MySQL, syslog, Apache logs, product catalog, orders, S3, search logs, partner FTP) feed Pig and Hive transformations (session, product transformation, category affinity, category targeting, customer profile, product recommender, (external) search engine optimization, (internal) search ranking), with Sync In / Sync Out stages reading from and writing to ElasticSearch, MongoDB and MySQL.]
  • 5. Zooming more
    [Diagram: page views, orders and the catalog are filtered (bots and special users removed) into filtered page views, which feed user affinity, product popularity, user similarity (per category and per brand), a recommendation graph and recommendations; orders also feed an order summary; machine learning steps sit on top.]
  • 6. Real-life data pipelines
    Many tasks and tools
    Dozens of stages, evolving daily
    Exceptional situations are the norm
    Many pains:
      ◦ Shared schemas
      ◦ Efficient incremental synchronization and computation
      ◦ Data is bad
  • 7. An evolution similar to build systems
    1970 shell scripts
    1977 Makefile
    1980 Makedeps
    1999 SCons/CMake
    2001 Maven
    ... shell scripts
    2008 HaMake
    2009 Oozie
    ETLs, ...
    Next? Better dependencies, higher-level tasks
  • 8. Agenda
    The hard life of a Data Scientist
    Dataiku Flow
    DCTC
    Lunch!
  • 9. Introduction to Flow
    Dataiku Flow is a data-driven orchestration framework for complex data pipelines
    Manage data, not steps and tasks
    Simplify common maintenance situations:
      ◦ Data rebuilds
      ◦ Processing step updates
    Handle real day-to-day pains:
      ◦ Data validity checks
      ◦ Transfers between systems
  • 10. Concepts: Dataset
    Like a table: contains records, with a schema
    Can be partitioned:
      ◦ Time partitioning (by day, by hour, ...)
      ◦ "Value" partitioning (by country, by partner, ...)
    Various backends:
      ◦ Filesystem
      ◦ HDFS
      ◦ ElasticSearch
      ◦ SQL
      ◦ NoSQL (MongoDB, ...)
      ◦ Cloud storage
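To make the partitioning idea concrete, here is a minimal sketch of how a dataset partitioned both by day and by country might map a partition to a storage path. The function name and the Hive-style path layout are assumptions for illustration; the slides do not specify Flow's actual layout.

```python
from datetime import date

def partition_path(root, day, country):
    """Map a (time, value) partition to a storage location.

    Uses a common Hive-style key=value layout; Flow's real
    layout is not described in the slides.
    """
    return f"{root}/day={day.isoformat()}/country={country}"

# A day-and-country partitioned dataset stored on HDFS
print(partition_path("hdfs:///data/visits", date(2013, 1, 25), "DE"))
# hdfs:///data/visits/day=2013-01-25/country=DE
```

The same addressing scheme works unchanged whether `root` points at a local filesystem, HDFS, or a cloud store, which is what lets one dataset abstraction cover all the backends listed above.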
  • 11. Concepts: Task
    Has input datasets and output datasets
    Declares dependencies from input to output
    Built-in tasks with strong integration:
      ◦ Pig
      ◦ Hive
      ◦ Python Pandas & SciKit
      ◦ Data transfers
    Customizable tasks:
      ◦ Shell script, Java, ...
    [Diagram: an AggregateVisits task reads the Visits and Customers datasets and produces weekly and daily aggregations.]
  • 12. A sample Flow
    [Diagram: WebTracker logs → Shaker "cleanlogs" → clean logs → Pig "aggr_visits" → Visits; CRM customers table → Shaker "enrich_cust" (with Browsers and Referent. reference data) → clean customers; Hive "customer_visits" → customer last visits; Pig "customer_last_product" (with the CRM products table) → customer last products.]
  • 13. Flow is data-oriented
    Don't ask "run task A and then task B"
    Don't even ask "run all tasks that depend on task A"
    Ask "do what's needed so that my aggregated customer data for 2013/01/25 is up to date"
    Flow manages dependencies between datasets, through tasks
    You don't execute tasks; you compute or refresh datasets
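The "compute the dataset, not the task" idea can be sketched in a few lines: refreshing a dataset recursively refreshes its inputs first, and a producing task runs only when its output is not already fresh. All names here (`TASKS`, `refresh`) are hypothetical illustrations, not the actual Flow engine.

```python
# Hypothetical registry: output dataset -> (producing task, input datasets)
TASKS = {
    "clean_logs": ("cleantask", ["wt_logs"]),
    "aggr_customers": ("aggr_task", ["clean_logs", "crm_customers"]),
}

def refresh(dataset, fresh, ran):
    """Bring `dataset` up to date; `fresh` is the set of already-valid datasets."""
    if dataset in fresh:
        return                          # nothing to do: data-oriented short-circuit
    task, inputs = TASKS.get(dataset, (None, []))
    for inp in inputs:                  # recurse: inputs must be fresh first
        refresh(inp, fresh, ran)
    if task:
        ran.append(task)                # only now run the producing task
    fresh.add(dataset)

ran = []
refresh("aggr_customers", fresh={"wt_logs", "crm_customers"}, ran=ran)
print(ran)  # ['cleantask', 'aggr_task']
```

Asking for `aggr_customers` transitively rebuilds `clean_logs` first; asking again with everything fresh would run nothing, which is the difference from task-oriented "run A then B" scheduling.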
  • 14. Partition-level dependencies
    "wtlogs" and "cleanlog" are day-partitioned
    "weekly_aggr" needs the previous 7 days of clean logs: a "sliding days" partition-level dependency, sliding_days(7)
    "Compute weekly_aggr for 2012-01-25":
      ◦ Automatically computes the 7 required partitions
      ◦ For each partition, checks whether cleanlog is up to date wrt. the wtlogs partition
      ◦ Performs cleantask1 in parallel for all missing/stale days
      ◦ Performs aggr_visits with the 7 partitions as input
    [Diagram: wtlogs → Shaker "cleantask1" → cleanlog → Pig "aggr_visits" → weekly_aggr]
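The steps above can be sketched as follows: expand the target day into the 7 required source partitions, then keep only those whose cleaned partition is missing or older than its upstream raw partition. This is a simplified illustration with invented helper names, not Flow's real dependency engine.

```python
from datetime import date, timedelta

def sliding_days(target_day, n=7):
    """Input partitions needed to build `target_day` of the output."""
    return [target_day - timedelta(days=i) for i in range(n)]

def stale_partitions(days, cleanlog_mtime, wtlogs_mtime):
    """Days whose cleanlog partition is missing or older than wtlogs."""
    return [d for d in days
            if cleanlog_mtime.get(d, 0) < wtlogs_mtime.get(d, 0)]

days = sliding_days(date(2012, 1, 25))
# Pretend the cleanlog for the 24th was built before wtlogs was updated
wt = {d: 100 for d in days}
cl = {d: 200 for d in days}
cl[date(2012, 1, 24)] = 50
print(stale_partitions(days, cl, wt))  # [datetime.date(2012, 1, 24)]
```

Only the single stale day would be re-cleaned; the other six cleaned partitions are reused as-is before `aggr_visits` runs over all seven.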
  • 15. Automatic parallelism
    Flow computes the global DAG of required activities
    It determines which activities can run in parallel
    Previous example: 8 activities
      ◦ 7 can be parallelized
      ◦ 1 requires the other 7 first
    Manages running activities
    Starts new activities based on available resources
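The parallelism computation described above amounts to topologically layering the activity DAG into waves of independent activities. A minimal sketch of that layering (illustrative only, not the Flow scheduler):

```python
def parallel_layers(deps):
    """Group activities into waves that can run concurrently.

    `deps` maps each activity to the set of activities it waits on.
    """
    done, layers = set(), []
    while len(done) < len(deps):
        wave = {a for a, pre in deps.items()
                if a not in done and pre <= done}
        if not wave:
            raise ValueError("cycle in activity graph")
        layers.append(sorted(wave))
        done |= wave
    return layers

# The slide's example: 7 independent daily cleanings, then one aggregation
deps = {f"clean_day_{i}": set() for i in range(7)}
deps["aggr_visits"] = set(deps)  # waits on all 7 cleanings
for layer in parallel_layers(deps):
    print(layer)
```

Here the first wave holds the 7 cleaning activities (runnable in parallel, subject to available resources) and the second wave holds the single aggregation, matching the 7+1 breakdown on the slide.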
  • 16. Schema and data validity checks
    Datasets have a schema, available in all tools
    Advanced verification of computed data:
      ◦ "Check that the output is not empty"
      ◦ "Check that this custom query returns between X and Y records"
      ◦ "Check that this specific record is found in the output"
      ◦ "Check that the number of computed records for day B differs from day A by no more than 40%"
    Automatic tests for data pipelines
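The last check above (day-over-day volume drift) can be sketched as a relative-difference test. The helper name is hypothetical, not part of Flow's API:

```python
def volume_check(count_b, count_a, max_rel_diff=0.40):
    """Check that day B's record count is within 40% of day A's."""
    if count_a == 0:
        return count_b == 0
    return abs(count_b - count_a) / count_a <= max_rel_diff

print(volume_check(130, 100))  # True: 30% difference, within tolerance
print(volume_check(150, 100))  # False: 50% difference, flagged
```

Running such checks as a gate after each task is what turns them into "automatic tests for data pipelines": a failing check stops downstream datasets from being built on bad data.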
  • 17. Integrated in Hadoop, open beyond
    Native knowledge of Pig and Hive formats
    Schema-aware loaders and storages
    Hadoop is a great ecosystem, but not omnipotent:
      ◦ Not everything plays to Hadoop's strong points
    Hadoop is a first-class citizen of Flow, but not the only one:
      ◦ Native integration of SQL capabilities
      ◦ Automatic incremental synchronization to/from MongoDB, Vertica, ElasticSearch, ...
      ◦ Custom tasks
  • 18. What about Oozie and HCatalog?
  • 19. Are we there yet?
    The engine and core tasks are working
    Under active development for the first betas
    Get more info and stay informed: http://flowbeta.dataiku.com
    And while you wait, another thing: ever been annoyed by data transfers?
  • 20. Feel the pain
  • 21. Agenda
    The hard life of a Data Scientist
    Dataiku Flow
    DCTC
    Lunch!
  • 22. DCTC: cloud data manipulation
    Extracted from the core of Flow
    Manipulate files across filesystems

    # List the files and folders in an S3 bucket
    % dctc ls s3://my-bucket

    # Synchronize incrementally from GCS to a local folder
    % dctc sync gs://my-bucket/my-path target-directory

    # Copy from GCS to HDFS, compressing to .gz on the fly
    # (decompression is handled too)
    % dctc cp -R -c gs://my-bucket/my-path hdfs:///data/input

    # Dispatch the lines of a file to 8 gzip-compressed files on S3
    % dctc dispatch input s3://bucket/target -f random -nf 8 -c
  • 23. DCTC: more examples

    # cat from anywhere
    % dctc cat ftp://account@:/pub/data/data.csv

    # Edit a remote file (with $EDITOR)
    % dctc edit ssh://account@:myfile.txt

    # Transparently unzip
    % dctc

    # Head / tail from the cloud
    % dctc tail s3://bucket/huge-log.csv

    # Multi-account aware
    % dctc sync s3://account1@path s3://account2@other_path
  • 24. Try it now
    Self-contained binary for Linux, OS X, Windows
    Supported backends:
      ◦ Amazon S3
      ◦ Google Cloud Storage
      ◦ FTP
      ◦ HTTP
      ◦ SSH
      ◦ HDFS (through local install)
  • 25. Questions?
    Marc Batty, Chief Customer Officer, 6 45 65 67 04, @battymarc
    Florian Douetteau, Chief Executive Officer, 6 70 56 88 97, @fdouetteau
    Thomas Cabrol, Chief Data Officer, 7 86 42 62 81, @ThomasCabrol
    Clément Stenac, Chief Technical Officer, 6 28 06 79 04, @ClementStenac