Your SlideShare is downloading. ×
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Dataiku Flow and dctc - Berlin Buzzwords

1,450

Published on

Published in: Technology, Business
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,450
On Slideshare
0
From Embeds
0
Number of Embeds
8
Actions
Shares
0
Downloads
0
Comments
0
Likes
3
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Dataiku Flow and dctcData pipelines made easy Berlin Buzzwords 2013
  • 2. Clément Stenac <clement.stenac@dataiku.com>@ClementStenac CTO @ Dataiku Head of product R&D @ Exalead(Search Engine Technology)About me02/06/2013Dataiku Training – Hadoop for Data Science 2
  • 3.  The hard life of a Data Scientist Dataiku Flow DCTC Lunch !02/06/2013DIP – Introduction to Dataiku Flow 3
  • 4. Follow the FlowDataiku - Pig, Hive and CascadingTracker LogMongoDBMySQLMySQLSyslogProductCatalogOrderApache LogsSessionProduct TransformationCategory AffinityCategory TargetingCustomer ProfileProduct RecommenderS3Search Logs (External) Search EngineOptimization(Internal) SearchRankingMongoDBMySQLPartner FTPSync In Sync OutPigPigHiveHiveElasticSearch
  • 5. Zooming moreDataiku - Pig, Hive and CascadingPage ViewsOrdersCatalogBots, Special UsersFiltered Page ViewsUser AffinityProduct PopularityUser Similarity (Per Category)Recommendation GraphRecommendationOrder SummaryUser Similarity (Per Brand)Machine Learning
  • 6.  Many tasks and tools Dozens of stage, evolves daily Exceptional situations are the norm Many pains◦ Shared schemas◦ Efficient incremental synchronizationand computation◦ Data is badReal-life data pipelines02/06/2013DIP – Introduction to Dataiku Flow 6
  • 7.  1970 Shell scripts 1977 Makefile 1980 Makedeps 1999 SCons/CMake 2001 Maven … Shell Scripts 2008 HaMake 2009 Oozie ETLS, … Next ?Dataiku - Pig, Hive and CascadingAn evolution similar to build Better dependencies Higher-level tasks
  • 8.  The hard life of a Data Scientist Dataiku Flow DCTC Lunch !02/06/2013DIP – Introduction to Dataiku Flow 8
  • 9. Dataiku Flow is a data-drivenorchestration framework forcomplex data pipelinesIntroduction to Flow02/06/2013DIP – Introduction to Dataiku Flow 9 Manage data, not steps and taks Simplify common maintainance situations◦ Data rebuilds◦ Processing steps update Handle real day-to-day pains◦ Data validity checks◦ Transfers between systems
  • 10.  Like a table : contains records, with a schema Can be partitioned◦ Time partitioning (by day, by hour, …)◦ « Value » partitioning (by country, by partner, …) Various backends◦ Filesystem◦ HDFS◦ ElasticSearchConcepts: Dataset02/06/2013DIP – Introduction to Dataiku Flow 10◦ SQL◦ NoSQL (MongoDB, …)◦ Cloud Storages
  • 11.  Has input datasets and output datasets Declares dependencies from input to output Built-in tasks with strong integration◦ Pig◦ Hive Customizable tasks◦ Shell script, Java, …Concepts: Task02/06/2013DIP – Introduction to Dataiku Flow 11AggregateVisitsVisitsCustomersWeeklyaggregationDailyaggregation◦ Python Pandas & SciKit◦ Data transfers
  • 12. Introduction to FlowA sample Flow02/06/2013DIP – Introduction to Dataiku Flow 12VisitsShaker« cleanlogs »CleanlogsWebTrackerlogsPig« aggr_visits »CRM tablecustomersShaker« enrich_cust »Cleancusts.BrowsersReferent.Hive« customer_visits »Cust lastvisitsPig« customer_last_product »CRM tableproductsCust lastproducts
  • 13. Flow is data-oriented Don’t ask « Run task A and then task B » Don’t even ask « Run all tasks that depend from task A » Ask « Do what’s needed so that my aggregated customersdata for 2013/01/25 is up to date » Flow manages dependencies between datasets, through tasks You don’t execute tasks, you compute or refresh datasetsData-oriented02/06/2013DIP – Introduction to Dataiku Flow 13
  • 14. Partition-level dependencies02/06/2013Dataiku Training – Hadoop for Data Science 14Shakercleantask1 cleanlogwtlogsPig« aggr_visits » weekly_aggr "wtlogs" and "cleanlog" are day-partitioned "weekly_aggr" needs the previous 7 days of clean logs "sliding days" partition-level dependency "Compute weekly_aggr for 2012-01-25"◦ Automatically computes the required 7 partitions◦ For each partition, check if cleanlog is up-to-date wrt. the wtlogs partition◦ Perform cleantask1 in parallel for all missing / stale days◦ Perform aggr_visits with the 7 partitions as inputsliding_days(7)
  • 15. Automatic parallelism02/06/2013DIP – Introduction to Dataiku Flow 15 Flow computes the global DAG of required activities Compute activities that can take place in parallel Previous example: 8 activities◦ 7 can be parallelized◦ 1 requires the other 7 first Manages running activities Starts new activities based on available resources
  • 16.  Datasets have a schema, available in all tools Advanced verification of computed data◦ "Check that output is not empty"◦ "Check that this custom query returns between X and Y records"◦ "Check that this specific record is found in output"◦ "Check that number of computed records for day B is no morethan 40% different than day A" Automatic tests for data pipelinesSchema and data validity checks02/06/2013DIP – Introduction to Dataiku Flow 16
  • 17.  Native knowledge of Pig and Hive formats Schema-aware loaders and storages A great ecosystem, but not omnipotent◦ Not everything requires Hadoops strong points Hadoop = first-class citizen of Flow, but not the only one Native integration of SQL capabilities Automatic incremental synchronization to/from MongoDB,Vertica, ElasticSearch, … Custom tasksIntegrated in Hadoop,open beyond02/06/2013DIP – Introduction to Dataiku Flow 17
  • 18. What about Oozie and Hcatalog ?02/06/2013DIP – Introduction to Dataiku Flow 18
  • 19.  Engine and core tasks are working Under active development for first betas Get more info and stay informedhttp://flowbeta.dataiku.comAnd while you wait, another thingEver been annoyed by data transfers ?Are we there yet ?02/06/2013DIP – Introduction to Dataiku Flow 19
  • 20. Feel the pain02/06/2013DIP – Introduction to Dataiku Flow 20
  • 21.  The hard life of a Data Scientist Dataiku Flow DCTC Lunch !02/06/2013DIP – Introduction to Dataiku Flow 21
  • 22.  Extract from the core of Flow Manipulate files across filesystemsDCTC : Cloud data manipulation02/06/2013DIP – Introduction to Dataiku Flow 22# List the files and folders in a S3 bucket% dctc ls s3://my-bucket# Synchronize incrementally from GCS to local folder% dctc sync gs://my-bucket/my-path target-directory# Copy from GCS to HDFS, compress to .gz on the fly# (decompress handled too)% dctc cp –R –c gs://my-bucket/my-path hdfs:///data/input# Dispatch the lines of a file to 8 files on S3, gzip-compressed% dctc dispatch input s3://bucket/target –f random –nf 8 -c
  • 23. DCTC : More examples02/06/2013DIP – Introduction to Dataiku Flow 23# cat from anywhere% dctc cat ftp://account@:/pub/data/data.csv# Edit a remote file (with $EDITOR)% dctc edit ssh://account@:myfile.txt# Transparently unzip% dctc# Head / tail from the cloud% dctc tail s3://bucket/huge-log.csv# Multi-account aware% dctc sync s3://account1@path s3://account2@other_path
  • 24. http://dctc.io Self-contained binary for Linux, OS X, Windows Amazon S3 Google Cloud Storage FTPTry it now02/06/2013DIP – Introduction to Dataiku Flow 24 HTTP SSH HDFS (through local install)
  • 25. Questions ?Marc BattyChief Customer Officermarc.batty@dataiku.com+33 6 45 65 67 04@battymarcFlorian DouetteauChief Executive Officerflorian.douetteau@dataiku.com+33 6 70 56 88 97@fdouetteauThomas CabrolChief Data Scientistthomas.cabrol@dataiku.com+33 7 86 42 62 81@ThomasCabrolClément StenacChief Technical Officerclement.stenac@dataiku.com+33 6 28 06 79 04@ClementStenac

×