Dataiku Flow and dctc
Data pipelines made easy
 Berlin Buzzwords 2013
About me
Clément Stenac <clement.stenac@dataiku.com>
@ClementStenac
 CTO @ Dataiku
 Formerly Head of Product R&D @ Exalead (search engine technology)
 The hard life of a Data Scientist
 Dataiku Flow
 DCTC
 Lunch!
Follow the Flow
[Flow diagram: tracker logs (via Syslog), MongoDB, MySQL (product catalog, orders), Apache logs, search logs on S3, and a partner FTP are synced in; Pig and Hive jobs build sessions, product transformations, category affinity and targeting, customer profiles, a product recommender, search engine optimization, and internal search ranking; results are synced out to MySQL, MongoDB, and ElasticSearch.]
Zooming more
[Diagram: page views, orders, and the catalog are filtered (removing bots and special users); the filtered page views feed user affinity, product popularity, user similarity per category and per brand, and an order summary; a machine-learning step builds a recommendation graph that produces the recommendations.]
Real-life data pipelines
 Many tasks and tools
 Dozens of stages, evolving daily
 Exceptional situations are the norm
 Many pains
◦ Shared schemas
◦ Efficient incremental synchronization and computation
◦ Bad data
An evolution similar to build
 1970: shell scripts
 1977: Makefile
 1980: Makedeps
 1999: SCons / CMake
 2001: Maven
 …: and still shell scripts
 2008: HaMake
 2009: Oozie
 ETLs, …
 Next?
 Better dependencies
 Higher-level tasks
 The hard life of a Data Scientist
 Dataiku Flow
 DCTC
 Lunch!
Introduction to Flow
Dataiku Flow is a data-driven orchestration framework for complex data pipelines.
 Manage data, not steps and tasks
 Simplify common maintenance situations
◦ Data rebuilds
◦ Processing step updates
 Handle real day-to-day pains
◦ Data validity checks
◦ Transfers between systems
Concepts: Dataset
 Like a table: contains records, with a schema
 Can be partitioned
◦ Time partitioning (by day, by hour, …)
◦ "Value" partitioning (by country, by partner, …)
 Various backends
◦ Filesystem
◦ HDFS
◦ ElasticSearch
◦ SQL
◦ NoSQL (MongoDB, …)
◦ Cloud storage
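As a loose illustration of these concepts (a hypothetical declaration format invented for this writeup; the actual Flow syntax is not shown in the talk), a day-partitioned dataset on HDFS might be described like this:

# Hypothetical sketch of a Flow dataset declaration, in Python;
# the real configuration format is not shown in this deck.
cleanlog = {
    "name": "cleanlog",
    "backend": "hdfs",                # or filesystem, elasticsearch, sql, ...
    "path": "/data/cleanlog",
    "schema": [("ts", "string"), ("user_id", "string"), ("url", "string")],
    "partitioning": {"type": "time", "granularity": "day"},
}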
Concepts: Task
 Has input datasets and output datasets
 Declares dependencies from input to output
 Built-in tasks with strong integration
◦ Pig
◦ Hive
◦ Python (Pandas & SciKit)
◦ Data transfers
 Customizable tasks
◦ Shell script, Java, …
[Diagram: an "Aggregate" task taking "Visits" and "Customers" datasets as input, with daily and weekly aggregation outputs.]
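Continuing the sketch (same caveat: invented names, not the real Flow API), a task ties input datasets to output datasets and carries the script to run:

# Hypothetical sketch of a Flow task declaration, continuing the
# dataset example above; not the real Flow API.
aggr_visits = {
    "name": "aggr_visits",
    "engine": "pig",                  # pig, hive, python, shell, java, ...
    "script": "aggregate_visits.pig",
    "inputs": ["cleanlog"],           # dependencies run from input to output
    "outputs": ["weekly_aggr"],
}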
A sample Flow
[Flow diagram: web tracker logs → Shaker task "cleanlogs" → clean logs → Pig task "aggr_visits" → visits; CRM customers table → Shaker task "enrich_cust" (adding browser and referrer info) → clean customers; clean logs + clean customers → Hive task "customer_visits" → customers' last visits; last visits + CRM products table → Pig task "customer_last_product" → customers' last products.]
Data-oriented
 Don't ask "Run task A and then task B"
 Don't even ask "Run all tasks that depend on task A"
 Ask "Do what's needed so that my aggregated customer data for 2013/01/25 is up to date"
 Flow manages dependencies between datasets, through tasks
 You don't execute tasks; you compute or refresh datasets
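The contract can be summed up in one call (invented API, purely illustrative of the idea): the request names a dataset partition, and the engine derives the tasks.

# Invented API, purely illustrative of the data-oriented contract;
# not Flow's actual interface.
class FlowEngine:
    def refresh(self, dataset, partition):
        # 1. resolve the upstream partitions the target depends on,
        # 2. rebuild only the missing or stale ones,
        # 3. then build the target partition itself.
        ...

FlowEngine().refresh("weekly_aggr", partition="2013-01-25")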
Partition-level dependencies
[Diagram: "wtlogs" → Shaker task "cleantask1" → "cleanlog" → Pig task "aggr_visits", with a sliding_days(7) dependency → "weekly_aggr"]
 "wtlogs" and "cleanlog" are day-partitioned
 "weekly_aggr" needs the previous 7 days of clean logs
 This is a "sliding days" partition-level dependency
 "Compute weekly_aggr for 2012-01-25":
◦ Automatically computes the 7 required partitions
◦ For each partition, checks whether cleanlog is up to date w.r.t. the wtlogs partition
◦ Runs cleantask1 in parallel for all missing / stale days
◦ Runs aggr_visits with the 7 partitions as input
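The resolution logic behind such a dependency can be sketched in plain Python (a toy model with invented names, not Flow's implementation):

# Toy model of sliding-days dependency resolution; not Flow's code.
from datetime import date, timedelta

def sliding_days(target_day, n):
    # Input partitions needed to build the target partition:
    # the n days ending at target_day.
    return [target_day - timedelta(days=i) for i in range(n)]

def stale_partitions(required, input_mtime, output_mtime):
    # A partition must be recomputed if it is missing, or older
    # than the input partition it was derived from.
    return [d for d in required
            if output_mtime(d) is None or output_mtime(d) < input_mtime(d)]

# "Compute weekly_aggr for 2012-01-25": find the 7 cleanlog partitions
# it needs, then rebuild only the missing / stale ones, in parallel.
required = sliding_days(date(2012, 1, 25), 7)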
Automatic parallelism
 Flow computes the global DAG of required activities
 Identifies activities that can run in parallel
 Previous example: 8 activities
◦ 7 can be parallelized
◦ 1 requires the other 7 first
 Manages running activities
 Starts new activities based on available resources
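The principle fits in a toy topological scheduler (illustrative Python, not Flow's engine): each wave launches, in parallel, every activity whose dependencies are already done.

# Toy wave-based scheduler illustrating the principle; not Flow's engine.
def waves(deps):
    # deps maps each activity to the set of activities it requires.
    done, plan = set(), []
    while len(done) < len(deps):
        ready = {a for a, d in deps.items() if a not in done and d <= done}
        if not ready:
            raise ValueError("cycle in the activity graph")
        plan.append(ready)
        done |= ready
    return plan

clean = {"clean_day%d" % i: set() for i in range(7)}
deps = dict(clean, aggr_visits=set(clean))
print(waves(deps))   # wave 1: the 7 clean activities; wave 2: aggr_visits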
Schema and data validity checks
 Datasets have a schema, available in all tools
 Advanced verification of computed data
◦ "Check that the output is not empty"
◦ "Check that this custom query returns between X and Y records"
◦ "Check that this specific record is found in the output"
◦ "Check that the number of computed records for day B differs from day A by no more than 40%"
 Automatic tests for data pipelines
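The last check, for instance, boils down to a simple rule (illustrative Python, not Flow's actual check syntax):

# Illustrative version of the day-over-day volume check; not Flow's
# actual check syntax.
def check_volume_drift(count_day_a, count_day_b, max_drift=0.40):
    if count_day_a == 0:
        return count_day_b == 0
    return abs(count_day_b - count_day_a) / count_day_a <= max_drift

assert check_volume_drift(100_000, 91_000)        # 9% drift: passes
assert not check_volume_drift(100_000, 35_000)    # 65% drift: fails the build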
Integrated in Hadoop, open beyond
 Native knowledge of Pig and Hive formats
 Schema-aware loaders and storages
 A great ecosystem, but not omnipotent
◦ Not everything requires Hadoop's strong points
 Hadoop is a first-class citizen of Flow, but not the only one
 Native integration of SQL capabilities
 Automatic incremental synchronization to/from MongoDB, Vertica, ElasticSearch, …
 Custom tasks
What about Oozie and HCatalog?
Are we there yet?
 Engine and core tasks are working
 Under active development for the first betas
 Get more info and stay informed: http://flowbeta.dataiku.com

And while you wait, another thing: ever been annoyed by data transfers?
Feel the pain
 The hard life of a Data Scientist
 Dataiku Flow
 DCTC
 Lunch!
DCTC: Cloud data manipulation
 Extracted from the core of Flow
 Manipulate files across filesystems

# List the files and folders in an S3 bucket
% dctc ls s3://my-bucket

# Synchronize incrementally from GCS to a local folder
% dctc sync gs://my-bucket/my-path target-directory

# Copy from GCS to HDFS, compressing to .gz on the fly
# (decompression is handled too)
% dctc cp -R -c gs://my-bucket/my-path hdfs:///data/input

# Dispatch the lines of a file to 8 files on S3, gzip-compressed
% dctc dispatch input s3://bucket/target -f random -nf 8 -c
DCTC: More examples

# cat from anywhere
% dctc cat ftp://account@:/pub/data/data.csv

# Edit a remote file (with $EDITOR)
% dctc edit ssh://account@:myfile.txt

# Transparently unzip
% dctc

# Head / tail from the cloud
% dctc tail s3://bucket/huge-log.csv

# Multi-account aware
% dctc sync s3://account1@path s3://account2@other_path
Try it now
http://dctc.io
 Self-contained binary for Linux, OS X, and Windows
 Amazon S3
 Google Cloud Storage
 FTP
 HTTP
 SSH
 HDFS (through a local install)
Questions ?
Marc Batty
Chief Customer Officer
marc.batty@dataiku.com
+33 6 45 65 67 04
@battymarc
Florian Douetteau
Chief Executive Officer
florian.douetteau@dataiku.com
+33 6 70 56 88 97
@fdouetteau
Thomas Cabrol
Chief Data Scientist
thomas.cabrol@dataiku.com
+33 7 86 42 62 81
@ThomasCabrol
Clément Stenac
Chief Technical Officer
clement.stenac@dataiku.com
+33 6 28 06 79 04
@ClementStenac