SlideShare a Scribd company logo
1 of 26
Dataiku Flow and dctc
Data pipelines made easy
 Berlin Buzzwords 2013
Clément Stenac <clement.stenac@dataiku.com>
@ClementStenac
 CTO @ Dataiku
 Head of product R&D @ Exalead
(Search Engine Technology)
About me
02/06/2013Dataiku Training – Hadoop for Data Science 2
 The hard life of a Data Scientist
 Dataiku Flow
 DCTC
 Lunch !
02/06/2013DIP – Introduction to Dataiku Flow 3
Follow the Flow
Dataiku - Pig, Hive and Cascading
Tracker Log
MongoDB
MySQL
MySQL
Syslog
Product
Catalog
Order
Apache Logs
Session
Product Transformation
Category Affinity
Category Targeting
Customer Profile
Product Recommender
S3
Search Logs (External) Search Engine
Optimization
(Internal) Search
Ranking
MongoDB
MySQL
Partner FTP
Sync In Sync Out
Pig
Pig
Hive
Hive
ElasticSearch
Zooming more
Dataiku - Pig, Hive and Cascading
Page Views
Orders
Catalog
Bots, Special Users
Filtered Page Views
User Affinity
Product Popularity
User Similarity (Per Category)
Recommendation Graph
Recommendation
Order Summary
User Similarity (Per Brand)
Machine Learning
 Many tasks and tools
 Dozens of stage, evolves daily
 Exceptional situations are the norm
 Many pains
◦ Shared schemas
◦ Efficient incremental synchronization
and computation
◦ Data is bad
Real-life data pipelines
02/06/2013DIP – Introduction to Dataiku Flow 6
 1970 Shell scripts
 1977 Makefile
 1980 Makedeps
 1999 SCons/CMake
 2001 Maven
 … Shell Scripts
 2008 HaMake
 2009 Oozie
 ETLS, …
 Next ?
Dataiku - Pig, Hive and Cascading
An evolution similar to build
 Better dependencies
 Higher-level tasks
 The hard life of a Data Scientist
 Dataiku Flow
 DCTC
 Lunch !
02/06/2013DIP – Introduction to Dataiku Flow 8
Dataiku Flow is a data-driven
orchestration framework for
complex data pipelines
Introduction to Flow
02/06/2013DIP – Introduction to Dataiku Flow 9
 Manage data, not steps and taks
 Simplify common maintainance situations
◦ Data rebuilds
◦ Processing steps update
 Handle real day-to-day pains
◦ Data validity checks
◦ Transfers between systems
 Like a table : contains records, with a schema
 Can be partitioned
◦ Time partitioning (by day, by hour, …)
◦ « Value » partitioning (by country, by partner, …)
 Various backends
◦ Filesystem
◦ HDFS
◦ ElasticSearch
Concepts: Dataset
02/06/2013DIP – Introduction to Dataiku Flow 10
◦ SQL
◦ NoSQL (MongoDB, …)
◦ Cloud Storages
 Has input datasets and output datasets
 Declares dependencies from input to output
 Built-in tasks with strong integration
◦ Pig
◦ Hive
 Customizable tasks
◦ Shell script, Java, …
Concepts: Task
02/06/2013DIP – Introduction to Dataiku Flow 11
Aggregate
Visits
Visits
Customers
Weekly
aggregation
Daily
aggregation
◦ Python Pandas & SciKit
◦ Data transfers
Introduction to Flow
A sample Flow
02/06/2013DIP – Introduction to Dataiku Flow 12
Visits
Shaker
« cleanlogs »
Clean
logs
Web
Tracker
logs
Pig
« aggr_visits »
CRM table
customers
Shaker
« enrich_cust »
Clean
custs.
Browser
s
Referent
.
Hive
« customer_visits »
Cust last
visits
Pig
« customer_last_
product »
CRM table
products
Cust last
products
Flow is data-oriented
 Don’t ask « Run task A and then task B »
 Don’t even ask « Run all tasks that depend from task A »
 Ask « Do what’s needed so that my aggregated customers
data for 2013/01/25 is up to date »
 Flow manages dependencies between datasets, through tasks
 You don’t execute tasks, you compute or refresh datasets
Data-oriented
02/06/2013DIP – Introduction to Dataiku Flow 13
Partition-level dependencies
02/06/2013Dataiku Training – Hadoop for Data Science 14
Shaker
cleantask1 cleanlogwtlogs
Pig
« aggr_visits » weekly_aggr
 "wtlogs" and "cleanlog" are day-partitioned
 "weekly_aggr" needs the previous 7 days of clean logs
 "sliding days" partition-level dependency
 "Compute weekly_aggr for 2012-01-25"
◦ Automatically computes the required 7 partitions
◦ For each partition, check if cleanlog is up-to-date wrt. the wtlogs partition
◦ Perform cleantask1 in parallel for all missing / stale days
◦ Perform aggr_visits with the 7 partitions as input
sliding_days(7)
Automatic parallelism
02/06/2013DIP – Introduction to Dataiku Flow 15
 Flow computes the global DAG of required activities
 Compute activities that can take place in parallel
 Previous example: 8 activities
◦ 7 can be parallelized
◦ 1 requires the other 7 first
 Manages running activities
 Starts new activities based on available resources
 Datasets have a schema, available in all tools
 Advanced verification of computed data
◦ "Check that output is not empty"
◦ "Check that this custom query returns between X and Y records"
◦ "Check that this specific record is found in output"
◦ "Check that number of computed records for day B is no more
than 40% different than day A"
 Automatic tests for data pipelines
Schema and data validity checks
02/06/2013DIP – Introduction to Dataiku Flow 16
 Native knowledge of Pig and Hive formats
 Schema-aware loaders and storages
 A great ecosystem, but not omnipotent
◦ Not everything requires Hadoop's strong points
 Hadoop = first-class citizen of Flow, but not the only one
 Native integration of SQL capabilities
 Automatic incremental synchronization to/from MongoDB,
Vertica, ElasticSearch, …
 Custom tasks
Integrated in Hadoop,
open beyond
02/06/2013DIP – Introduction to Dataiku Flow 17
What about Oozie and Hcatalog ?
02/06/2013DIP – Introduction to Dataiku Flow 18
 Engine and core tasks are working
 Under active development for first betas
 Get more info and stay informed
http://flowbeta.dataiku.com
And while you wait, another thing
Ever been annoyed by data transfers ?
Are we there yet ?
02/06/2013DIP – Introduction to Dataiku Flow 19
Feel the pain
02/06/2013DIP – Introduction to Dataiku Flow 20
 The hard life of a Data Scientist
 Dataiku Flow
 DCTC
 Lunch !
02/06/2013DIP – Introduction to Dataiku Flow 21
 Extract from the core of Flow
 Manipulate files across filesystems
DCTC : Cloud data manipulation
02/06/2013DIP – Introduction to Dataiku Flow 22
# List the files and folders in a S3 bucket
% dctc ls s3://my-bucket
# Synchronize incrementally from GCS to local folder
% dctc sync gs://my-bucket/my-path target-directory
# Copy from GCS to HDFS, compress to .gz on the fly
# (decompress handled too)
% dctc cp –R –c gs://my-bucket/my-path hdfs:///data/input
# Dispatch the lines of a file to 8 files on S3, gzip-compressed
% dctc dispatch input s3://bucket/target –f random –nf 8 -c
DCTC : More examples
02/06/2013DIP – Introduction to Dataiku Flow 23
# cat from anywhere
% dctc cat ftp://account@:/pub/data/data.csv
# Edit a remote file (with $EDITOR)
% dctc edit ssh://account@:myfile.txt
# Transparently unzip
% dctc
# Head / tail from the cloud
% dctc tail s3://bucket/huge-log.csv
# Multi-account aware
% dctc sync s3://account1@path s3://account2@other_path
http://dctc.io
 Self-contained binary for Linux, OS X, Windows
 Amazon S3
 Google Cloud Storage
 FTP
Try it now
02/06/2013DIP – Introduction to Dataiku Flow 24
 HTTP
 SSH
 HDFS (through local install)
Questions ?
Marc Batty
Chief Customer Officer
marc.batty@dataiku.com
+33 6 45 65 67 04
@battymarc
Florian Douetteau
Chief Executive Officer
florian.douetteau@dataiku.com
+33 6 70 56 88 97
@fdouetteau
Thomas Cabrol
Chief Data Scientist
thomas.cabrol@dataiku.com
+33 7 86 42 62 81
@ThomasCabrol
Clément Stenac
Chief Technical Officer
clement.stenac@dataiku.com
+33 6 28 06 79 04
@ClementStenac
Dataiku Flow and dctc - Berlin Buzzwords

More Related Content

What's hot

How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?Vincent Terrasi
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages JaunesDataiku
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQueryDharmesh Vaya
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Big Data Spain
 
The DBA Is Dead (Again). Long Live the DBA !
The DBA Is Dead (Again). Long Live the DBA !The DBA Is Dead (Again). Long Live the DBA !
The DBA Is Dead (Again). Long Live the DBA !Christian Bilien
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BIPrasad Prabhu (PP)
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Anna Shymchenko
 
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...CloudxLab
 
Scaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataScaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataTreasure Data, Inc.
 
Data Science with Hadoop: A Primer
Data Science with Hadoop: A PrimerData Science with Hadoop: A Primer
Data Science with Hadoop: A PrimerDataWorks Summit
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Arohi Khandelwal
 
The of Operational Analytics Data Store
The of Operational Analytics Data StoreThe of Operational Analytics Data Store
The of Operational Analytics Data StoreRommel Garcia
 
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Jen Stirrup
 
Real Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case StudyReal Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case StudyNati Shalom
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabeDataiku
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascienceAdam Muise
 

What's hot (20)

How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
 
Big Data, Baby Steps
Big Data, Baby StepsBig Data, Baby Steps
Big Data, Baby Steps
 
The DBA Is Dead (Again). Long Live the DBA !
The DBA Is Dead (Again). Long Live the DBA !The DBA Is Dead (Again). Long Live the DBA !
The DBA Is Dead (Again). Long Live the DBA !
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BI
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
 
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
 
Big data 101
Big data 101Big data 101
Big data 101
 
Scaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataScaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big Data
 
Data Science with Hadoop: A Primer
Data Science with Hadoop: A PrimerData Science with Hadoop: A Primer
Data Science with Hadoop: A Primer
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop
 
The of Operational Analytics Data Store
The of Operational Analytics Data StoreThe of Operational Analytics Data Store
The of Operational Analytics Data Store
 
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
 
Real Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case StudyReal Time Analytics for Big Data a Twitter Case Study
Real Time Analytics for Big Data a Twitter Case Study
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 

Viewers also liked

Dataiku - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku  - for Data Geek Paris@Criteo - Close the Data CircleDataiku  - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku - for Data Geek Paris@Criteo - Close the Data CircleDataiku
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013Dataiku
 
Online Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunOnline Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunDataiku
 
Dataiku - Paris JUG 2013 - Hadoop is a batch
Dataiku - Paris JUG 2013 - Hadoop is a batch Dataiku - Paris JUG 2013 - Hadoop is a batch
Dataiku - Paris JUG 2013 - Hadoop is a batch Dataiku
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku
 
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku
 
Doctor Faustus by Micaela & Rocío
Doctor Faustus by Micaela & RocíoDoctor Faustus by Micaela & Rocío
Doctor Faustus by Micaela & RocíoAndrea Izzo
 
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...Getting Ahead of the Game: How a gaming studio gets customer insights from bi...
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...DataWorks Summit
 
Aggietarium slideshow final
Aggietarium slideshow finalAggietarium slideshow final
Aggietarium slideshow finalAna Monzon
 
Тематическое планирование 8 класс
Тематическое планирование 8 классТематическое планирование 8 класс
Тематическое планирование 8 классkoneqq
 
Dataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Dataiku
 

Viewers also liked (17)

Dataiku - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku  - for Data Geek Paris@Criteo - Close the Data CircleDataiku  - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku - for Data Geek Paris@Criteo - Close the Data Circle
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013
 
Online Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunOnline Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for Fun
 
Dataiku - Paris JUG 2013 - Hadoop is a batch
Dataiku - Paris JUG 2013 - Hadoop is a batch Dataiku - Paris JUG 2013 - Hadoop is a batch
Dataiku - Paris JUG 2013 - Hadoop is a batch
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
 
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
 
Practical Placement
Practical PlacementPractical Placement
Practical Placement
 
Doctor Faustus by Micaela & Rocío
Doctor Faustus by Micaela & RocíoDoctor Faustus by Micaela & Rocío
Doctor Faustus by Micaela & Rocío
 
Oxleas_Review_2012_website_version_1
Oxleas_Review_2012_website_version_1Oxleas_Review_2012_website_version_1
Oxleas_Review_2012_website_version_1
 
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...Getting Ahead of the Game: How a gaming studio gets customer insights from bi...
Getting Ahead of the Game: How a gaming studio gets customer insights from bi...
 
Aggietarium slideshow final
Aggietarium slideshow finalAggietarium slideshow final
Aggietarium slideshow final
 
Тематическое планирование 8 класс
Тематическое планирование 8 классТематическое планирование 8 класс
Тематическое планирование 8 класс
 
Dataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine Learning
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
 
Pay periods use show
Pay periods use showPay periods use show
Pay periods use show
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem
 

Similar to Dataiku Flow and dctc - Berlin Buzzwords

Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshopFang Mac
 
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryVoxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryMárton Kodok
 
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQueryGDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQueryMárton Kodok
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...Marcin Bielak
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization Denodo
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Denodo
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームMasayuki Matsushita
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryMárton Kodok
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopDatabricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
Game Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid MeetupGame Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid MeetupJelena Zanko
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaJeffrey T. Pollock
 
A Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes
A Tale of 2 BI Standards: One for Data Warehouses and One for Data LakesA Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes
A Tale of 2 BI Standards: One for Data Warehouses and One for Data LakesArcadia Data
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayAjay Shriwastava
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 

Similar to Dataiku Flow and dctc - Berlin Buzzwords (20)

Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshop
 
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryVoxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
 
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQueryGDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
 
Oracle GoldenGate for Oracle DBAs
Oracle GoldenGate for Oracle DBAsOracle GoldenGate for Oracle DBAs
Oracle GoldenGate for Oracle DBAs
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks Presentation
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Game Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid MeetupGame Analytics at London Apache Druid Meetup
Game Analytics at London Apache Druid Meetup
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafka
 
A Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes
A Tale of 2 BI Standards: One for Data Warehouses and One for Data LakesA Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes
A Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjay
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 

More from Dataiku

Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Dataiku
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Dataiku
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelDataiku
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...Dataiku
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare IndustryDataiku
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ? Dataiku
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku Dataiku
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Dataiku
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Dataiku
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystemDataiku
 
Data Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from thData Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from thDataiku
 

More from Dataiku (13)

Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML model
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare Industry
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ?
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystem
 
Data Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from thData Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from th
 

Recently uploaded

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - AvrilIvanti
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 

Recently uploaded (20)

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - Avril
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 

Dataiku Flow and dctc - Berlin Buzzwords

  • 1. Dataiku Flow and dctc Data pipelines made easy  Berlin Buzzwords 2013
  • 2. Clément Stenac <clement.stenac@dataiku.com> @ClementStenac  CTO @ Dataiku  Head of product R&D @ Exalead (Search Engine Technology) About me 02/06/2013Dataiku Training – Hadoop for Data Science 2
  • 3.  The hard life of a Data Scientist  Dataiku Flow  DCTC  Lunch ! 02/06/2013DIP – Introduction to Dataiku Flow 3
  • 4. Follow the Flow Dataiku - Pig, Hive and Cascading Tracker Log MongoDB MySQL MySQL Syslog Product Catalog Order Apache Logs Session Product Transformation Category Affinity Category Targeting Customer Profile Product Recommender S3 Search Logs (External) Search Engine Optimization (Internal) Search Ranking MongoDB MySQL Partner FTP Sync In Sync Out Pig Pig Hive Hive ElasticSearch
  • 5. Zooming more Dataiku - Pig, Hive and Cascading Page Views Orders Catalog Bots, Special Users Filtered Page Views User Affinity Product Popularity User Similarity (Per Category) Recommendation Graph Recommendation Order Summary User Similarity (Per Brand) Machine Learning
  • 6.  Many tasks and tools  Dozens of stage, evolves daily  Exceptional situations are the norm  Many pains ◦ Shared schemas ◦ Efficient incremental synchronization and computation ◦ Data is bad Real-life data pipelines 02/06/2013DIP – Introduction to Dataiku Flow 6
  • 7.  1970 Shell scripts  1977 Makefile  1980 Makedeps  1999 SCons/CMake  2001 Maven  … Shell Scripts  2008 HaMake  2009 Oozie  ETLS, …  Next ? Dataiku - Pig, Hive and Cascading An evolution similar to build  Better dependencies  Higher-level tasks
  • 8.  The hard life of a Data Scientist  Dataiku Flow  DCTC  Lunch ! 02/06/2013DIP – Introduction to Dataiku Flow 8
  • 9. Dataiku Flow is a data-driven orchestration framework for complex data pipelines Introduction to Flow 02/06/2013DIP – Introduction to Dataiku Flow 9  Manage data, not steps and taks  Simplify common maintainance situations ◦ Data rebuilds ◦ Processing steps update  Handle real day-to-day pains ◦ Data validity checks ◦ Transfers between systems
  • 10.  Like a table : contains records, with a schema  Can be partitioned ◦ Time partitioning (by day, by hour, …) ◦ « Value » partitioning (by country, by partner, …)  Various backends ◦ Filesystem ◦ HDFS ◦ ElasticSearch Concepts: Dataset 02/06/2013DIP – Introduction to Dataiku Flow 10 ◦ SQL ◦ NoSQL (MongoDB, …) ◦ Cloud Storages
  • 11.  Has input datasets and output datasets  Declares dependencies from input to output  Built-in tasks with strong integration ◦ Pig ◦ Hive  Customizable tasks ◦ Shell script, Java, … Concepts: Task 02/06/2013DIP – Introduction to Dataiku Flow 11 Aggregate Visits Visits Customers Weekly aggregation Daily aggregation ◦ Python Pandas & SciKit ◦ Data transfers
  • 12. Introduction to Flow A sample Flow 02/06/2013DIP – Introduction to Dataiku Flow 12 Visits Shaker « cleanlogs » Clean logs Web Tracker logs Pig « aggr_visits » CRM table customers Shaker « enrich_cust » Clean custs. Browser s Referent . Hive « customer_visits » Cust last visits Pig « customer_last_ product » CRM table products Cust last products
  • 13. Flow is data-oriented  Don’t ask « Run task A and then task B »  Don’t even ask « Run all tasks that depend from task A »  Ask « Do what’s needed so that my aggregated customers data for 2013/01/25 is up to date »  Flow manages dependencies between datasets, through tasks  You don’t execute tasks, you compute or refresh datasets Data-oriented 02/06/2013DIP – Introduction to Dataiku Flow 13
  • 14. Partition-level dependencies 02/06/2013Dataiku Training – Hadoop for Data Science 14 Shaker cleantask1 cleanlogwtlogs Pig « aggr_visits » weekly_aggr  "wtlogs" and "cleanlog" are day-partitioned  "weekly_aggr" needs the previous 7 days of clean logs  "sliding days" partition-level dependency  "Compute weekly_aggr for 2012-01-25" ◦ Automatically computes the required 7 partitions ◦ For each partition, check if cleanlog is up-to-date wrt. the wtlogs partition ◦ Perform cleantask1 in parallel for all missing / stale days ◦ Perform aggr_visits with the 7 partitions as input sliding_days(7)
  • 15. Automatic parallelism 02/06/2013DIP – Introduction to Dataiku Flow 15  Flow computes the global DAG of required activities  Compute activities that can take place in parallel  Previous example: 8 activities ◦ 7 can be parallelized ◦ 1 requires the other 7 first  Manages running activities  Starts new activities based on available resources
  • 16.  Datasets have a schema, available in all tools  Advanced verification of computed data ◦ "Check that output is not empty" ◦ "Check that this custom query returns between X and Y records" ◦ "Check that this specific record is found in output" ◦ "Check that number of computed records for day B is no more than 40% different than day A"  Automatic tests for data pipelines Schema and data validity checks 02/06/2013DIP – Introduction to Dataiku Flow 16
  • 17.  Native knowledge of Pig and Hive formats  Schema-aware loaders and storages  A great ecosystem, but not omnipotent ◦ Not everything requires Hadoop's strong points  Hadoop = first-class citizen of Flow, but not the only one  Native integration of SQL capabilities  Automatic incremental synchronization to/from MongoDB, Vertica, ElasticSearch, …  Custom tasks Integrated in Hadoop, open beyond 02/06/2013DIP – Introduction to Dataiku Flow 17
  • 18. What about Oozie and Hcatalog ? 02/06/2013DIP – Introduction to Dataiku Flow 18
  • 19.  Engine and core tasks are working  Under active development for first betas  Get more info and stay informed http://flowbeta.dataiku.com And while you wait, another thing Ever been annoyed by data transfers ? Are we there yet ? 02/06/2013DIP – Introduction to Dataiku Flow 19
  • 20. Feel the pain 02/06/2013DIP – Introduction to Dataiku Flow 20
  • 21.  The hard life of a Data Scientist  Dataiku Flow  DCTC  Lunch ! 02/06/2013DIP – Introduction to Dataiku Flow 21
  • 22.  Extract from the core of Flow  Manipulate files across filesystems DCTC : Cloud data manipulation 02/06/2013DIP – Introduction to Dataiku Flow 22 # List the files and folders in a S3 bucket % dctc ls s3://my-bucket # Synchronize incrementally from GCS to local folder % dctc sync gs://my-bucket/my-path target-directory # Copy from GCS to HDFS, compress to .gz on the fly # (decompress handled too) % dctc cp –R –c gs://my-bucket/my-path hdfs:///data/input # Dispatch the lines of a file to 8 files on S3, gzip-compressed % dctc dispatch input s3://bucket/target –f random –nf 8 -c
  • 23. DCTC : More examples 02/06/2013DIP – Introduction to Dataiku Flow 23 # cat from anywhere % dctc cat ftp://account@:/pub/data/data.csv # Edit a remote file (with $EDITOR) % dctc edit ssh://account@:myfile.txt # Transparently unzip % dctc # Head / tail from the cloud % dctc tail s3://bucket/huge-log.csv # Multi-account aware % dctc sync s3://account1@path s3://account2@other_path
  • 24. http://dctc.io  Self-contained binary for Linux, OS X, Windows  Amazon S3  Google Cloud Storage  FTP Try it now 02/06/2013DIP – Introduction to Dataiku Flow 24  HTTP  SSH  HDFS (through local install)
  • 25. Questions ? Marc Batty Chief Customer Officer marc.batty@dataiku.com +33 6 45 65 67 04 @battymarc Florian Douetteau Chief Executive Officer florian.douetteau@dataiku.com +33 6 70 56 88 97 @fdouetteau Thomas Cabrol Chief Data Scientist thomas.cabrol@dataiku.com +33 7 86 42 62 81 @ThomasCabrol Clément Stenac Chief Technical Officer clement.stenac@dataiku.com +33 6 28 06 79 04 @ClementStenac