Dataiku Flow and dctc - Berlin Buzzwords
Transcript

  • 1. Dataiku Flow and dctc: Data pipelines made easy. Berlin Buzzwords 2013
  • 2. About me: Clément Stenac <clement.stenac@dataiku.com>, @ClementStenac. CTO @ Dataiku; previously Head of Product R&D @ Exalead (search engine technology). 02/06/2013, Dataiku Training – Hadoop for Data Science
  • 3. Agenda: The hard life of a Data Scientist; Dataiku Flow; DCTC; Lunch! 02/06/2013, DIP – Introduction to Dataiku Flow
  • 4. Follow the Flow (Pig, Hive and Cascading). [Diagram: a real pipeline connecting tracker logs, syslog, Apache logs, MongoDB, MySQL, S3, a partner FTP and ElasticSearch through Pig and Hive jobs: product catalog, orders, sessions, product transformation, category affinity, category targeting, customer profile, product recommender, and internal/external search ranking, with sync-in and sync-out steps.]
  • 5. Zooming more. [Diagram: page views, orders and the catalog feed a machine-learning sub-pipeline: filtering bots and special users out of page views, then user affinity, product popularity, user similarity (per category and per brand), a recommendation graph, recommendations, and an order summary.]
  • 6. Real-life data pipelines: many tasks and tools; dozens of stages, evolving daily; exceptional situations are the norm. Many pains: shared schemas, efficient incremental synchronization and computation, and bad data.
  • 7. An evolution similar to build tools: 1970 shell scripts, 1977 Makefile, 1980 Makedeps, 1999 SCons/CMake, 2001 Maven, ... Then for data: shell scripts, 2008 HaMake, 2009 Oozie, ETLs, ... What's next? Better dependencies, higher-level tasks.
  • 8. Agenda: The hard life of a Data Scientist; Dataiku Flow; DCTC; Lunch!
  • 9. Introduction to Flow. Dataiku Flow is a data-driven orchestration framework for complex data pipelines. Manage data, not steps and tasks. Simplify common maintenance situations: data rebuilds, processing-step updates. Handle real day-to-day pains: data validity checks, transfers between systems.
  • 10. Concepts: Dataset. Like a table: contains records, with a schema. Can be partitioned: time partitioning (by day, by hour, ...) or "value" partitioning (by country, by partner, ...). Various backends: filesystem, HDFS, ElasticSearch, SQL, NoSQL (MongoDB, ...), cloud storages.
  • 11. Concepts: Task. Has input datasets and output datasets; declares dependencies from input to output. Built-in tasks with strong integration: Pig, Hive, Python Pandas & SciKit, data transfers. Customizable tasks: shell script, Java, ... [Diagram: Visits and Customers datasets feeding an "Aggregate visits" task with daily and weekly aggregation outputs.]
  • 12. A sample Flow. [Diagram: web tracker logs cleaned by a Shaker task "cleanlogs" into a Visits dataset; a CRM customers table cleaned by a Shaker task "enrich_cust" (using browsers and referrer data) into clean customers; a Pig task "aggr_visits" and a Hive task "customer_visits" producing customers' last visits; a Pig task "customer_last_product" joining a CRM products table to produce customers' last products.]
  • 13. Flow is data-oriented. Don't ask "run task A and then task B". Don't even ask "run all tasks that depend on task A". Ask "do what's needed so that my aggregated customers data for 2013/01/25 is up to date". Flow manages dependencies between datasets, through tasks: you don't execute tasks, you compute or refresh datasets.
  • 14. Partition-level dependencies. Example: "wtlogs" and "cleanlog" are day-partitioned, and "weekly_aggr" needs the previous 7 days of clean logs: a "sliding days" partition-level dependency, sliding_days(7). "Compute weekly_aggr for 2012-01-25" then: automatically computes the 7 required partitions; for each partition, checks whether cleanlog is up to date w.r.t. the wtlogs partition; performs cleantask1 in parallel for all missing or stale days; performs aggr_visits with the 7 partitions as input.
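The sliding-window dependency resolution described above can be sketched in a few lines. This is an illustrative sketch, not Flow's actual API; the helper names (sliding_days, stale_partitions) are hypothetical.

```python
from datetime import date, timedelta

def sliding_days(target: date, window: int) -> list:
    """Input partitions needed to build `target`:
    the target day plus the (window - 1) preceding days."""
    return [target - timedelta(days=i) for i in range(window)]

def stale_partitions(required, built):
    """Partitions that must be (re)computed because they are
    missing from (or stale in) the set of built partitions."""
    return [d for d in required if d not in built]

# "Compute weekly_aggr for 2012-01-25" with sliding_days(7)
needed = sliding_days(date(2012, 1, 25), 7)
print([d.isoformat() for d in needed])  # 2012-01-25 back to 2012-01-19
```

Each partition returned by stale_partitions is an independent cleantask1 run, which is what makes the per-day cleaning trivially parallelizable.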
  • 15. Automatic parallelism. Flow computes the global DAG of required activities and determines which can run in parallel. In the previous example there are 8 activities: 7 can be parallelized, and 1 requires the other 7 to finish first. Flow manages the running activities and starts new ones based on available resources.
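The "7 parallel, then 1" scheduling falls out of a topological grouping of the activity DAG. A minimal sketch (not Flow's implementation) that groups activities into waves, where each wave can run fully in parallel:

```python
def parallel_waves(deps):
    """Group activities into waves: every activity in a wave can run
    in parallel once all earlier waves are done. `deps` maps each
    activity name to the set of activities it depends on."""
    remaining = {a: set(d) for a, d in deps.items()}
    done, waves = set(), []
    while remaining:
        ready = [a for a, d in remaining.items() if d <= done]
        if not ready:
            raise ValueError("cycle in the activity graph")
        waves.append(sorted(ready))
        done.update(ready)
        for a in ready:
            del remaining[a]
    return waves

# 7 independent per-day clean tasks, then one weekly aggregation
deps = {"clean_day_%d" % i: set() for i in range(7)}
deps["weekly_aggr"] = {"clean_day_%d" % i for i in range(7)}
print(parallel_waves(deps))  # wave 1: the 7 clean tasks; wave 2: weekly_aggr
```

A real scheduler would additionally cap each wave by available resources, as the slide notes.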
  • 16. Schema and data validity checks. Datasets have a schema, available in all tools. Advanced verification of computed data: "check that the output is not empty"; "check that this custom query returns between X and Y records"; "check that this specific record is found in the output"; "check that the number of computed records for day B differs from day A by no more than 40%". In short: automatic tests for data pipelines.
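The 40% day-over-day check above is just a relative-delta comparison on record counts. A sketch under that assumption (function name and signature are illustrative, not Flow's API):

```python
def check_count_delta(count_b, count_a, max_delta=0.40):
    """True if day B's record count is within max_delta (relative)
    of day A's count, as in the 40% sanity check on the slide."""
    if count_a == 0:
        return count_b == 0  # nothing to compare against
    return abs(count_b - count_a) / count_a <= max_delta

assert check_count_delta(130, 100)      # +30%: passes
assert not check_count_delta(150, 100)  # +50%: flagged
```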
  • 17. Integrated in Hadoop, open beyond. Native knowledge of Pig and Hive formats; schema-aware loaders and storages. Hadoop is a great ecosystem, but not omnipotent: not everything plays to Hadoop's strong points. Hadoop is a first-class citizen of Flow, but not the only one: native integration of SQL capabilities; automatic incremental synchronization to/from MongoDB, Vertica, ElasticSearch, ...; custom tasks.
  • 18. What about Oozie and HCatalog?
  • 19. Are we there yet? The engine and core tasks are working, under active development for the first betas. Get more info and stay informed: http://flowbeta.dataiku.com. And while you wait, another thing: ever been annoyed by data transfers?
  • 20. Feel the pain.
  • 21. Agenda: The hard life of a Data Scientist; Dataiku Flow; DCTC; Lunch!
  • 22. DCTC: cloud data manipulation. Extracted from the core of Flow: manipulate files across filesystems.
        # List the files and folders in an S3 bucket
        % dctc ls s3://my-bucket
        # Synchronize incrementally from GCS to a local folder
        % dctc sync gs://my-bucket/my-path target-directory
        # Copy from GCS to HDFS, compressing to .gz on the fly
        # (decompression handled too)
        % dctc cp -R -c gs://my-bucket/my-path hdfs:///data/input
        # Dispatch the lines of a file to 8 gzip-compressed files on S3
        % dctc dispatch input s3://bucket/target -f random -nf 8 -c
  • 23. DCTC: more examples.
        # cat from anywhere
        % dctc cat ftp://account@:/pub/data/data.csv
        # Edit a remote file (with $EDITOR)
        % dctc edit ssh://account@:myfile.txt
        # Transparently unzip
        % dctc
        # Head / tail from the cloud
        % dctc tail s3://bucket/huge-log.csv
        # Multi-account aware
        % dctc sync s3://account1@path s3://account2@other_path
  • 24. Try it now: http://dctc.io. Self-contained binary for Linux, OS X and Windows. Supports Amazon S3, Google Cloud Storage, FTP, HTTP, SSH, and HDFS (through a local install).
  • 25. Questions?
        Florian Douetteau, Chief Executive Officer, florian.douetteau@dataiku.com, +33 6 70 56 88 97, @fdouetteau
        Marc Batty, Chief Customer Officer, marc.batty@dataiku.com, +33 6 45 65 67 04, @battymarc
        Thomas Cabrol, Chief Data Scientist, thomas.cabrol@dataiku.com, +33 7 86 42 62 81, @ThomasCabrol
        Clément Stenac, Chief Technical Officer, clement.stenac@dataiku.com, +33 6 28 06 79 04, @ClementStenac