1Dataiku6/4/2013
Hi !
Current Life:
CEO, Dataiku
Tweet about this: @dataiku @club_dsi_gun
Past Life:
Criteo
IsCool Entertainment
Exalead
Florian Douetteau
Available on Slide Share
http://www.slideshare.net/Dataiku
Goals Today:
• Concrete Feedback on Data Analytics
Projects
• Data Team in practice and Key technologies
• Motivate you to start a data science project
Slide deck allergic ? Check:
https://github.com/dataiku
Dataiku
“Dataiku: An open source platform to help you build your data lab”
Collocation
Big Apple
Big Mama
Big Data
A familiar grouping of words,
especially words that habitually appear
together and thereby convey meaning
by association.
“Big” Data in 1999
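struct Element {
    Key key;          /* lookup key; the Key type is assumed to be defined elsewhere */
    void *stat_data;  /* opaque pointer to the statistics stored for this key */
};
….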
C
Optimized Data structures
Perfect Hashing
HP-UX Servers – 4 GB RAM
100 GB data
Web Crawler – Socket reuse
HTTP 0.9
1 Month
• Hadoop
• Java / Pig / Hive / Scala / Clojure / …
• A Dozen NoSQL data stores
• MPP Databases
• Real-Time
Big Data in 2013
1 Hour
Data Analytics: The Stakes
1 TB
? $
Social Gaming, 2011
Web Search, 1999
Logistics, 2004
Online Advertising, 2012
1 TB
100M $
E-Commerce, 2013
Banking CRM, 2008
1 TB
1B $
Web Search, 2010
100 TB
? $
10 TB
10M $
1000TB
500M $
50TB
1B$
Meet Hal Alowne
6/4/2013Dataiku - Data Tuesday 9
Big Guys
• 10B$+ Revenue
• 100M+ customers
• 100+ Data Scientists
Hal Alowne
BI Manager
Dim’s Private Showroom
“Hey Hal! We need a big data platform like the big guys. Let’s just do as they do!”
European E-commerce Web site
• 100M$ Revenue
• 1 Million customers
• 1 Data Analyst (Hal Himself)
Dim Sum
CEO & Founder
Dim’s Private Showroom
Big Data
Copy Cat
Project
Technology is complex
Hadoop
Ceph
Sphere
Cassandra
Spark
Scikit-Learn
Mahout
WEKA
MLBase
RapidMiner
pandas
D3
Crossfilter
InfiniDB
LucidDB
Impala
Elasticsearch
SOLR
MongoDB
Riak
Membase
Pig
Hive
Cascading
Talend
Machine Learning Mystery Land
Scalability Central
NoSQL-Slavia
SQL Columnar Republic
Visualization County
Data Clean Wasteland
Statistician Old House
R
Statistics and Machine Learning are complex!
 Try to understand
myself
(Some Book you might want to read)
Plumbing is not complex
(but difficult)
Implicit User Data
(Views, Searches…)
Content Data
(Title, Categories, Price, …)
Explicit User Data
(Click, Buy, …)
User Information
(Location, Graph…)
500TB
50TB
1TB
200GB
Transformation
Matrix
Transformation
Predictor
Per User Stats
Per Content Stats
User Similarity
Rank Predictor
Content Similarity
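One box of the plumbing above, Content Similarity, can be sketched in a few lines. This is only an illustration under assumptions: the event format `(user, item, weight)`, the function name `content_similarity`, and the choice of cosine similarity are all hypothetical and are not taken from the slides.

```python
import math
from collections import defaultdict

def content_similarity(events):
    """Cosine similarity between items, from (user, item, weight) events."""
    # Build one sparse vector per item: {user: accumulated weight}.
    vectors = defaultdict(lambda: defaultdict(float))
    for user, item, weight in events:
        vectors[item][user] += weight
    # Euclidean norm of each item vector.
    norms = {
        item: math.sqrt(sum(w * w for w in vec.values()))
        for item, vec in vectors.items()
    }
    sims = {}
    items = sorted(vectors)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            # Only users who touched both items contribute to the dot product.
            shared = vectors[a].keys() & vectors[b].keys()
            dot = sum(vectors[a][u] * vectors[b][u] for u in shared)
            sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims

# Made-up implicit events (views) for three users and three products.
events = [
    ("u1", "dress", 1.0), ("u1", "scarf", 1.0),
    ("u2", "dress", 1.0), ("u2", "scarf", 1.0),
    ("u3", "dress", 1.0), ("u3", "boots", 1.0),
]
sims = content_similarity(events)
for pair, score in sorted(sims.items()):
    print(pair, round(score, 3))
```

At the volumes quoted on the slide this loop over all item pairs would not scale; it only shows what the box computes.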
MERIT = TIME + ROI
Targeted
Newsletter
Recommender
Systems
Adapted Product
/ Promotions
TIME : 6 MONTHS ROI : APPS
 Build a lab in 6 months
(rather than 18 months)
Find the right
people
(6 months?)
Choose the
technology
(6 months?)
Make it work
(6 months?)
Build the lab
(6 months)
 Deploy apps
that actually deliver value
2013 2014
2013
• Train People
• Reuse working patterns
The Problem
It’s utterly complex and
unreasonable
Our Goal
Our Goal:
Change his perspective
on data science projects
(sorry, we couldn’t
find a picture of Hal
Smiling)
 Why and For What ?
◦ Business Theory
◦ Concrete Projects
 How people and project ?
◦ How to start
◦ Dedicated team ?
 What technologies ?
◦ Machine Learning
◦ Architecture
Agenda
Embodiment of Knowledge
 Product Success
driven by Quality !
 Margin / Customer
Value / Traffic /
Acquisition
Example: Launching an App
on the App Store
 Margin for new
customers might
decline …
 Margin for new
features might
decline …
 Is your business
really scalable ?
you continue growing ….
 Existing Customers
Profiles
 Existing Product Assets
 Existing Specific
Business Model
 And your KNOWLEDGE
of it
Where is your core business
advantage ?
Data Driven Business
What is your value?
Number of
Customers
Customer Knowledge
Increase over time with:
- Time spent in your app
- User relationship (network effect)
- Partner / Other Apps Interactions
Your Value
Data Impact
Not all businesses are equal
Online
Advertising
Telecommunication
Insurance
Ability
to Acquire
Margin
New
Services
Overall
Subscription
Market
Infrastructure
Driver
Selling Data
Risk / Price
Optimization
Subscription
Market
Subscription
Market
From Theory To Practice
 What should be free
in the application ?
 How to optimize
conversion ?
 How to plan and
create a business
model ?
Main Pain Point:
How to plan and
optimize pricing in
the application ?
Freemium Application
Example (Freemium Application)
Freemium Model Optimization
Business
Model
User
Cluster
Simulation
 Optimized Pricing: Margin
+23%
 Business Planning
Capability
1 month  9 months
 R + Python + InfiniDB
On-Premise
1TB Dataset
5-week project
• Business Intelligence stack has scalability and maintenance issues
 Backoffice implements
business rules that are
challenged
 Existing infrastructure
cannot cope with per-
user information
Main Pain Point:
23 hours 52 minutes to
compute Business Intelligence
aggregates for one day.
Large E-Retailer
• Relieve their current DWH and
accelerate production of some
aggregates/KPIs
• Be the backbone for new
personalized user experience
on their website: more
recommendations, more
profiling, etc.,
• Train existing people around
machine learning and
segmentation experience
 1h12 to perform the
aggregate, available every
morning
 New home page
personalization deployed in a
few weeks
 Hadoop Cluster (24 cores)
Google Compute Engine
Python + R + Vertica
12 TB dataset
6-week project
Large E-Retailer : The Datalab
 BI performed directly on
production databases
• New reports required the CTO’s direct work for design and implementation
 Each photo tag manually
validated and completed
Large Photo Bank
Main pain point:
No visibility on new users’ behaviours
 Implementing a Cloud-based
data lab to :
• centralize all available data,
previously scattered between
SQL DB and file systems,
• improve web tracking
granularity to enhance
customer knowledge via
behavior modeling and
segmentation,
• create content-based
recommendation engines with
keywords clustering and
association.
Large Photo Bank : The Datalab
 R + Vertica + Hadoop
Amazon Web Services
8-week project
 Automated content filtering
and recommendation
 Large set of
manually crafted
linguistic resources
for interpreting
users’ queries
 New Brands, rare
terms .. hard to
maintain
Large Online Directory
Main Pain Point:
Ability to maintain a very large ontological knowledge set, with more than 100k concepts
 Analyze clicks,
rephrasing navigation to
detect queries that
require specific
processing
 Gather web and external
data to enrich the
existing index
• Train the team on Hadoop and Machine Learning
 Continuous Relevance
Monitoring
 Automated enrichment 
2x more productivity
 Hadoop (48 cores)
Python
On Premise
10-week project
Large Online Directory: The Data Lab
 Launch A Marketing
campaign
 After a few days
PREDICT based on
behaviours
◦  Total ARPU for users
after 3 months
◦  Efficiency of a campaign
◦ Continue or not ?
Example ( E-Application )
Marketing Campaign Prediction
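A minimal sketch of the "predict after a few days" idea, assuming a single early-behaviour feature and a straight-line relationship. The feature (average sessions in the first days), the numbers, and the helper name `fit_line` are hypothetical; a real project would use more features and a proper validation step.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up history of past campaigns:
# average sessions per user in the first days vs total ARPU after 3 months.
early_sessions = [1.0, 2.0, 3.0, 4.0]
arpu_3_months = [2.0, 4.1, 5.9, 8.0]

intercept, slope = fit_line(early_sessions, arpu_3_months)

# Score a new campaign a few days after launch.
predicted_arpu = intercept + slope * 2.5
print(f"predicted 3-month ARPU: {predicted_arpu:.2f}")
```

Comparing the predicted ARPU with the acquisition cost per user is one way to answer "Continue or not?".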
A very large community
Some mid-size
communities
Lots of small clusters (mostly 2 players)
 Correlation
◦ between community size
and engagement / virality
• Meaningful patterns
◦ 2 players / Family / Group
 What is the minimum
number of friends to
have in the application
to get additional
engagement ?
Example (Social Gaming)
Social Gaming Communities
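The community picture above (one very large community, some mid-size ones, lots of pairs) is what you get by computing connected components of the friendship graph. A rough sketch, assuming friendships arrive as pairs of player ids; the ids and the function name `community_sizes` are made up for the example.

```python
from collections import Counter, defaultdict

def community_sizes(friendships):
    """Sizes of the connected components of an undirected friend graph."""
    graph = defaultdict(set)
    for a, b in friendships:
        graph[a].add(b)
        graph[b].add(a)
    seen = set()
    sizes = []
    for start in graph:
        if start in seen:
            continue
        # Iterative depth-first search over one component.
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Hypothetical friendship pairs.
friendships = [
    ("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p1"), ("p4", "p5"),
    ("p6", "p7"),
    ("p8", "p9"),
]
sizes = community_sizes(friendships)
print(sizes)           # community sizes, largest first
print(Counter(sizes))  # how many communities of each size
```

Joining these sizes with per-player engagement is then enough to look at the correlation the slide mentions.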
 What others do ?
◦ Concrete Projects
 How people and project ?
◦ How to start
◦ Dedicated team ?
 What technologies ?
◦ Machine Learning
◦ Architecture
Agenda
 A / B Test
(or equivalent for your
business) is the first step to
get into a “data-driven”
mind set
• No advanced analytics required; some existing tools can help
• Changing a button color: +21%
(1) Be Data Driven
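Before celebrating a +21%, it helps to check whether the difference could be noise. Here is a small sketch of a two-proportion z-test using only the standard library. The visitor and conversion counts are invented for illustration, and the function name `two_proportion_z_test` is an assumption, not something from the talk.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (relative_lift, z, two_sided_p) for variant B versus control A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis "A and B convert equally".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lift = (p_b - p_a) / p_a
    return lift, z, p_value

# Hypothetical counts: 1,000 visitors per variant, 100 vs 121 conversions.
lift, z, p_value = two_proportion_z_test(100, 1000, 121, 1000)
print(f"lift={lift:+.0%}  z={z:.2f}  p={p_value:.3f}")
```

With these made-up counts the lift is +21%, yet the p-value stays above 0.05: the same lift needs more traffic before it can be trusted.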
 People  Microsoft Excel
(2) Use Excel
 Data Team  Data Tools
(3) Build a team
The Business Expert
who knows maths
The Analyst
that reveals patterns
The Coding Guy That
is enthusiastic
 data lab, (n. m): a small group
with all the expertise, including
business minded people,
machine learning knowledge and
the right technology
 A proven organization used by
successful data-driven
companies over the past few
years (eBay, LinkedIn, Walmart…)
TEAM + TOOLS = LAB
Organization
Targeted campaigns
Price optimization
Personalized
experience
Quality Assurance
Workload and yield
management
User Feedback (A/B Test)
Continuous improvement
Data
Product
Designer
Business
&
Marketing
Engineers
User
Voice
Team | Short Term Focus | Long Term Drive
Business People | Optimize Margin, … | Create new business revenue streams
Marketing People | Optimize click ratio | Brand awareness and impact
IT People | Make IT work | Clean and efficient Architecture
Data People | Get Stats Right, make predictions | Create Data Driven Features
It’s just a new team
Super Intern
What is your ability to integrate a new
smart guy and give him any
data he would need and any computing
power he would need to enhance
your product ?
 What others do ?
◦ Concrete Projects
 How people and project ?
◦ How to start
◦ Dedicated team ?
 What technologies ?
◦ Machine Learning
◦ Architecture
Agenda
An oversimplified view of big data architecture
Database Business Layer Application
(What it really looks like)
What kind of scale?
Database Business Layer Application
Or
Data Science App
Or ?
What kind of interaction ?
Database Business Layer Application
Data Science App
?
?
? ? ?
?
Classic Columnar Architecture
Some data Some Place To
Pour It In
Some Tool To
Do Some Maths And Graphs
Classic Columnar Architecture
Lots of data Some Place To
Pour It In
Some Tool To
Do Some Maths And Graphs
Web Tracking Logs
Raw Server Logs
Order / Product / Customer
Facebook Info
Open Data (Weather, Currency …)
The Corinthian Architecture
Lots of data
Some Place
To Perform
Rapid Calculations
Some Tools To
Do Some Maths
And Charts
Some Place To
Pour It In And
Clean / Prepare It
Data Storage And Preparation
Large Scale:
Hadoop Cluster
Cassandra
MPP SQL Columnar
Medium/Large Scale:
CouchBase
MongoDB
….
Selection Drivers
Volume
Scalability
Calculations
Classic Database
• PostgreSQL
• MySQL
• ….
MPP SQL Database
• Vertica, Vectorwise, InfiniDB,
GreenplumHD….
Hadoop New Databases
• Impala
…
Selection Drivers:
Speed ( Interactivity )
Expressivity
The Corinthian Architecture
Lots of data
Some Place
To Perform
Rapid Calculations
Some Tools To
Do Some Maths
And Charts
Some Place To
Pour It In And
Clean / Prepare It
Statistics
Cohorts
Regressions
Bar Charts For Marketing
Nice Infographics for your Company Board
The Corinthian Architecture
Lots of data
Some Database
To Perform
Rapid Calculations
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts
Some Place To
Pour It In And
Clean / Prepare It
Statistical Tools
Open Source:
• IPython
• RStudio
Commercial
• RapidMiner
• SAS
• RevolutionR
Selection Drivers
Existing Knowhow
Scalability
What is a statistical tool ?
 Interact and explore
data
 Some stats
capabilities
 Some Graph
Capabilities
Visualization Tools
Commercial:
• SpotFire
• Tableau
• QlikView
SAAS
• BIME
• ChartIO
• RevolutionR
HTML5 / AdHoc
• D3
• GraphViz
Selection Drivers
How Many Contributors /
Readers ?
Scalability
The “one database won’t do it all” problem
Lots of data
Some Database
To Perform
Rapid Calculations
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts
Some Place To
Pour It In And
Clean / Prepare It
JOIN / Aggregate
Rapid Group By Computations
Direct Access to the computed Results
to production etc..
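What the "rapid calculations" database is asked to do is mostly JOIN plus GROUP BY. The sketch below uses SQLite (from the Python standard library) purely as a stand-in so the query can be run anywhere; the table names, columns, and values are hypothetical, and the slides name analytical databases such as Vertica or InfiniDB for this role at real volumes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT)"
)
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "FR"), (2, "FR"), (3, "DE")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 20.0), (11, 1, 30.0), (12, 2, 50.0), (13, 3, 40.0)],
)

# JOIN the facts to the dimension, then aggregate per country.
rows = conn.execute(
    """
    SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.country
    ORDER BY c.country
    """
).fetchall()
conn.close()

for country, n_orders, revenue in rows:
    print(country, n_orders, revenue)
```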
The Roman Social Forum
Lots of data
Some Database
To Perform
Rapid Calculations
And Some Database
For Graphs
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts
Some Place To
Pour It In And
Clean / Prepare It
Graph
Databases
• Neo4J
• Titan
• OrientDB
• InfiniteGraph
Analytic / Visualization
• Gephi
Selection Drivers
Scalability
What Algorithms ?
Licensing Constraints
The Key Value Store
Lots of data
Some Database
To Perform
Rapid Calculations
And Some Database
For Graphs And
Some Distributed Key
Value Store
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts
Some Place To
Pour It In And
Clean / Prepare It
NoSQL
Search
• SOLR
• ElasticSearch
Document
• MongoDB
• CouchDB
KeyValue
• Redis
• Hbase
…
Selection Drivers
Durability / Availability …
Performance
Ease of use and API
Indexing
Action requires Prediction
Lots of data
Some Database
To Perform
Rapid Calculations
And some database
for graphs And
Some Distributed Key
Value Store
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts
Some Place To
Pour It In And
Clean / Prepare It
Draw A Line  For the future
What are my real users groups ?
Should I launch a discount offering or not ?
To everybody or to specific users only ?
The Medieval Fairy Land
Lots of data
Some Tools To
Do Some Maths
Some Other
To Do Some
Charts and some
MACHINE LEARNING
Some Place To
Pour It In And
Clean / Prepare It
Some Database
To Perform
Rapid Calculations
And Some Database
For Graphs And
Some Distributed Key
Value Store
Predictions
Java
• Mahout (Hadoop)
• WEKA
Python
• Scikit-Learn
• PyML
R
Commercial
• Kxen
• SAS
• SPSS…
Selection Drivers
Scalability
Black Box / White Box ?
Data Management Integration
Can be fun
• Exploratory Data Analysis
◦ Identifying and visualizing key patterns and correlations within the dataset
• Unsupervised Learning
◦ Create groups of similar observations sharing the same patterns (aka Clustering, Segmentation)
• Supervised Learning
◦ Modeling a variable using independent features (aka Scoring, Predictive Modeling, Classification)
• Time Series Forecasting
◦ Predict a time-dependent variable using its own history, and sometimes other covariates (variables)
• Graph Analysis
◦ Analyzing relationships between a set of “nodes”, linked by “edges”
• Associations / Sequences Mining
◦ Identifying frequently associated items within transaction / event databases, sometimes ordered over time
• And many more…
Classes of Machine Learning Problems
04/06/2013Dataiku - Innovation Services 69
Mapping ML to Business Questions
Class | Sample Business Questions
Exploratory Data Analysis | What does my dataset look like? What are the key correlations in my data?
Unsupervised Learning | Can I create groups of users who share the same purchasing behavior? The same navigation behavior?
Supervised Learning | What users are likely to click on ad X? What users are likely to convert to paying users? Who is going to leave my service? What is the profile of the users who do X?
Time Series Forecasting | What is the forecast of my revenue next month? Given the weather forecast, can I also forecast my sales? Product sale forecast (for overbooking)
Graph Analysis | Can I identify influencers in my users community? Can I recommend new friends to my users?
Association & Sequences Mining | Which products are frequently bought together? What is the typical navigation path on my website?
Machine Learning Methods Detailed
Analytical Task | ML Task | Sample Algorithms | Shape of Dataset
Exploratory Data Analysis | Univariate Analysis | Distribution, frequencies, histogram, boxplots, fit tests... | N obs. (1 row per obs.) * P features
Exploratory Data Analysis | Bivariate Analysis | Scatterplots, correlations (Pearson, Spearman), GLM, Chi Square... | N obs. (1 row per obs.) * P features
Exploratory Data Analysis | Multivariate Analysis | Principal components analysis, multi-dimensional scaling, correspondence analysis, factor analysis… | N obs. (1 row per obs.) * P features
“Oriented” Data Analysis | Unsupervised Learning | K-means, K-medoids, hierarchical clustering, Gaussian mixture models, mean shift, DBSCAN, spectral clustering... | N obs. (1 row per obs.) * P features
“Oriented” Data Analysis | Supervised Learning | Linear & logistic regression, decision trees, neural networks, SVM, naïve Bayes, K-NN, random forests… | N obs. (1 row per obs.) * P features
“Oriented” Data Analysis | Time Series Forecasting | ARMA, VARMAX, ARIMA… | Time series (rows: time period, columns: measures)
“Oriented” Data Analysis | Graph Analysis | Centrality (closeness, betweenness, PageRank, HITS), modularity (Louvain)… | Nodes and edges lists (+ attributes)
“Oriented” Data Analysis | Associations & Sequences | Frequent itemsets, Apriori, Market Basket… | (Timestamped) events or transactions
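To make the "Associations & Sequences" row concrete, here is a hedged sketch of the simplest possible market-basket count: the support of item pairs across transactions. It is not a full Apriori implementation (no pruning, pairs only), and the baskets and the name `frequent_pairs` are invented for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for pairs found in at least min_support of baskets."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) makes each pair canonical and counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Made-up transactions: one list of products per order.
transactions = [
    ["shoes", "socks", "belt"],
    ["shoes", "socks"],
    ["socks", "belt"],
    ["shoes", "socks", "hat"],
]
pairs = frequent_pairs(transactions, min_support=0.5)
for pair, support in sorted(pairs.items()):
    print(pair, support)
```

The input shape matches the last column of the table: one row per transaction, here without timestamps.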
 Cluster a dataset
into K Buckets by
choosing the
“closest”
neighbours
Unsupervised Method
K-Means
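A bare-bones sketch of K-Means (Lloyd's algorithm) in plain Python, to show the two alternating steps. The initialisation with the first K points and the toy coordinates are simplifying assumptions made for this example; libraries such as scikit-learn, listed earlier in the deck, use smarter initialisation.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on tuples of coordinates; returns (centroids, labels)."""
    # Naive initialisation (assumption): the first k points are the centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

# Toy 2-D points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```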
 Predict the color of
a point depending
on the colors of its
K closest
neighbours
Supervised
K-Nearest-Neighbours
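The K-Nearest-Neighbours idea fits in a few lines: sort the labelled points by distance to the query and take a majority vote among the first K. This is a sketch with made-up points and colours, and `knn_predict` is a hypothetical helper name; it scans every training point, which is fine for a demo but not for large datasets.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, colour); returns the majority colour of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(colour for _, colour in nearest)
    return votes.most_common(1)[0][0]

# Toy labelled points.
train = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
    ((8.0, 8.0), "blue"), ((9.0, 8.5), "blue"), ((8.5, 9.0), "blue"),
]
near_red = knn_predict(train, (2.0, 2.0))
near_blue = knn_predict(train, (7.5, 8.0))
print(near_red, near_blue)
```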
 Find the most
“significant” input
variable and split
value
 Split the dataset
recursively
Supervised
Decision Tree
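The first bullet (find the most "significant" variable and split value) can be sketched as an exhaustive search over features and thresholds. Using Gini impurity as the measure of significance is an assumption of this example, as are the toy columns and the function names; a full tree would call `best_split` again on each side, which is the recursive part.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (
                len(left) * gini(left) + len(right) * gini(right)
            ) / len(labels)
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical toy data: columns are (age, basket_value); label: did the user convert?
rows = [(22, 10.0), (25, 80.0), (31, 15.0), (35, 90.0), (41, 12.0), (47, 95.0)]
labels = ["no", "yes", "no", "yes", "no", "yes"]
feature, threshold, score = best_split(rows, labels)
print(feature, threshold, score)
```

On this toy data the search picks the basket value rather than the age, because that split separates the two classes completely.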
Several Paths to Machine Learning
[Flowchart: decision paths from an analytical dataset to a family of methods. The Yes / No branch wiring is not recoverable from the extracted text; questions and methods are listed per section.]
Entry questions: I just want to explore / I’m looking for clusters / I want to predict a variable
Exploratory Data Analysis
• Questions: I’m looking variable by variable, or at pairs; All my variables are numeric; I have a distance matrix
• Methods: Correlation Analysis, GLM, parametric and non-parametric stat. tests, PCA…, CA…, MDS..., Data Viz...
Unsupervised Learning
• Questions: I know how many groups to look for; Small Dataset (<<1K); Medium Dataset (<<100K); I can sample
• Methods: Partitioning (K-means…), GMM…, DP GMM…, K-means + Gap | Silhouette | …, HCA…, 2-steps clustering, Affinity Propagation, Mean Shift…
Supervised Learning*
• Question: I value interpretability
• Methods: Generalized Linear Model, Simple Decision Tree, MARS, Generalized Additive Model, Support Vector Machines, Neural Networks, K-Nearest Neighbors, Ensembles (Random Forest, Gradient Boosted Trees)
* Methods generally working for both classification & regression
Questions ?
 Take Away
◦ There are new ways to perform data
analytics that are within your reach and
can bring business value
 Some Additional Resources
◦ Open Source Projects
 Dataiku Cloud Transport Client
http://dctc.io
 Dataiku Web Tracker
https://github.com/dataiku/wt1
◦ Our Technical Blog
 http://www.dataiku.com/blog

More Related Content

What's hot

Dataiku Data Science Studio (datasheet)
Dataiku Data Science Studio (datasheet)Dataiku Data Science Studio (datasheet)
Dataiku Data Science Studio (datasheet)John Cann
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaKai Wähner
 
Get Savvy with Snowflake
Get Savvy with SnowflakeGet Savvy with Snowflake
Get Savvy with SnowflakeMatillion
 
Cloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera, Inc.
 
Dataiku data science studio
Dataiku data science studioDataiku data science studio
Dataiku data science studioNorman Poh
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for DummiesRodney Joyce
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Google BigQuery Best Practices
Google BigQuery Best PracticesGoogle BigQuery Best Practices
Google BigQuery Best PracticesMatillion
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
Platform Strategy to Deliver Digital Experiences on Azure
Platform Strategy to Deliver Digital Experiences on AzurePlatform Strategy to Deliver Digital Experiences on Azure
Platform Strategy to Deliver Digital Experiences on AzureWSO2
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLSpark Summit
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
A cloud readiness assessment framework
A cloud readiness assessment frameworkA cloud readiness assessment framework
A cloud readiness assessment frameworkCarlo Colicchio
 
ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?
ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?
ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?Albert Hoitingh
 
Neo4j Graph Platform Overview, Kurt Freytag, Neo4j
Neo4j Graph Platform Overview, Kurt Freytag, Neo4jNeo4j Graph Platform Overview, Kurt Freytag, Neo4j
Neo4j Graph Platform Overview, Kurt Freytag, Neo4jNeo4j
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseSnowflake Computing
 

What's hot (20)

Dataiku Data Science Studio (datasheet)
Dataiku Data Science Studio (datasheet)Dataiku Data Science Studio (datasheet)
Dataiku Data Science Studio (datasheet)
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
 
Get Savvy with Snowflake
Get Savvy with SnowflakeGet Savvy with Snowflake
Get Savvy with Snowflake
 
Cloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for Analytics
 
Big data on aws
Big data on awsBig data on aws
Big data on aws
 
Dataiku data science studio
Dataiku data science studioDataiku data science studio
Dataiku data science studio
 
Data and AI reference architecture
Data and AI reference architectureData and AI reference architecture
Data and AI reference architecture
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
Google Cloud Platform Data Storage
Google Cloud Platform Data StorageGoogle Cloud Platform Data Storage
Google Cloud Platform Data Storage
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Data Products and teams
Data Products and teamsData Products and teams
Data Products and teams
 
Google BigQuery Best Practices
Google BigQuery Best PracticesGoogle BigQuery Best Practices
Google BigQuery Best Practices
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Platform Strategy to Deliver Digital Experiences on Azure
Platform Strategy to Deliver Digital Experiences on AzurePlatform Strategy to Deliver Digital Experiences on Azure
Platform Strategy to Deliver Digital Experiences on Azure
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
A cloud readiness assessment framework
A cloud readiness assessment frameworkA cloud readiness assessment framework
A cloud readiness assessment framework
 
ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?
ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?
ExpertsLive NL 2022 - Microsoft Purview - What's in it for my organization?
 
Neo4j Graph Platform Overview, Kurt Freytag, Neo4j
Neo4j Graph Platform Overview, Kurt Freytag, Neo4jNeo4j Graph Platform Overview, Kurt Freytag, Neo4j
Neo4j Graph Platform Overview, Kurt Freytag, Neo4j
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
 

Similar to Dataiku - From Big Data To Machine Learning

Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...
Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...
Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...Johan-André Jeanville
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Betacowork
 
Online Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunOnline Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunDataiku
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceJuuso Parkkinen
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Data Architecture Strategies: Artificial Intelligence - Real-World Applicatio...
Data Architecture Strategies: Artificial Intelligence - Real-World Applicatio...Data Architecture Strategies: Artificial Intelligence - Real-World Applicatio...
Data Architecture Strategies: Artificial Intelligence - Real-World Applicatio...DATAVERSITY
 
Data Architecture Strategies Webinar: Emerging Trends in Data Architecture – ...
Data Architecture Strategies Webinar: Emerging Trends in Data Architecture – ...Data Architecture Strategies Webinar: Emerging Trends in Data Architecture – ...
Data Architecture Strategies Webinar: Emerging Trends in Data Architecture – ...DATAVERSITY
 
Data analytics course archtype
Data analytics course archtypeData analytics course archtype
Data analytics course archtypenakshatraL
 
MVP (Minimum Viable Product) Readiness | Boost Labs
MVP (Minimum Viable Product) Readiness | Boost LabsMVP (Minimum Viable Product) Readiness | Boost Labs
MVP (Minimum Viable Product) Readiness | Boost LabsBoost Labs
 
Lunch and Learn: You have the data, now what?
Lunch and Learn: You have the data, now what?Lunch and Learn: You have the data, now what?
Lunch and Learn: You have the data, now what?DiUS
 
Building successful data science teams
Building successful data science teamsBuilding successful data science teams
Building successful data science teamsVenkatesh Umaashankar
 
Big Data LA 2016: Backstage to a Data Driven Culture
Big Data LA 2016: Backstage to a Data Driven CultureBig Data LA 2016: Backstage to a Data Driven Culture
Big Data LA 2016: Backstage to a Data Driven CulturePauline Chow
 
Success Through an Actionable Data Science Stack
Success Through an Actionable Data Science StackSuccess Through an Actionable Data Science Stack
Success Through an Actionable Data Science StackDomino Data Lab
 
Drinking from the Digital Data Fire Hose
Drinking from the Digital Data Fire HoseDrinking from the Digital Data Fire Hose
Drinking from the Digital Data Fire HoseGigi Johnson
 
AI Orange Belt - Session 3
AI Orange Belt - Session 3AI Orange Belt - Session 3
AI Orange Belt - Session 3AI Black Belt
 
BIG DATA IN BUSINESS Implement and use Big Data to your organization’s advantage
BIG DATA IN BUSINESS Implement and use Big Data to your organization’s advantageBIG DATA IN BUSINESS Implement and use Big Data to your organization’s advantage
BIG DATA IN BUSINESS Implement and use Big Data to your organization’s advantageAurélie Pols
 
Your Company Cares About Open Source Sustainability, But Are You Measuring an...
Your Company Cares About Open Source Sustainability, But Are You Measuring an...Your Company Cares About Open Source Sustainability, But Are You Measuring an...
Your Company Cares About Open Source Sustainability, But Are You Measuring an...All Things Open
 

Similar to Dataiku - From Big Data To Machine Learning (20)

Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...
Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...
Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...
 
Course 8 : How to start your big data project by Eric Rodriguez
Dataiku - From Big Data To Machine Learning

  • 2. 6/4/2013Dataiku 2 Hi ! Current Life: CEO, Dataiku Tweet about this: @dataiku @club_dsi_gun Past Life: Criteo IsCool Entertainment Exalead Florian Douetteau Available on Slide Share http://www.slideshare.net/Dataiku Goals Today: • Concrete Feedback on Data Analytics Projects • Data Team in practice and Key technologies • Motivate you to start a data science project Slide deck allergic ? Check: https://github.com/dataiku
  • 3. Dataiku: “An open source platform to help you build your data lab”
  • 5. Collocation: Big Apple, Big Mama, Big Data. A familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association.
  • 6. “Big” Data in 1999: struct Element { Key key; void* stat_data; } …; C; optimized data structures; perfect hashing; HP-UX servers – 4 GB RAM; 100 GB data; web crawler – socket reuse; HTTP 0.9; 1 month
  • 7. Big Data in 2013: • Hadoop • Java / Pig / Hive / Scala / Clojure / … • A dozen NoSQL data stores • MPP databases • Real-time; 1 hour
  • 8. Data Analytics: The Stakes (chart of data volume and dollar value by sector and year). Sectors: Web Search 1999; Logistics 2004; Banking CRM 2008; Web Search 2010; Social Gaming 2011; Online Advertising 2012; E-Commerce 2013. Values shown: 1 TB, ? $; 1 TB, 100M $; 1 TB, 1B $; 100 TB, ? $; 10 TB, 10M $; 1000 TB, 500M $; 50 TB, 1B $
  • 9. Meet Hal Alowne (6/4/2013, Dataiku - Data Tuesday). Big Guys: • 10B$+ revenue • 100M+ customers • 100+ data scientists. Hal Alowne, BI Manager, Dim’s Private Showroom. “Hey Hal! We need a big data platform like the big guys. Let’s just do as they do!” European e-commerce web site: • 100M$ revenue • 1 million customers • 1 data analyst (Hal himself). Dim Sum, CEO & Founder, Dim’s Private Showroom. Big Data Copy Cat Project
  • 10. Technology is complex. Tools: Hadoop, Ceph, Sphere, Cassandra, Spark, Scikit-Learn, Mahout, WEKA, MLBase, RapidMiner, pandas, D3, Crossfilter, InfiniDB, LucidDB, Impala, Elastic Search, SOLR, MongoDB, Riak, Membase, Pig, Hive, Cascading, Talend, R. Map regions: Machine Learning Mystery Land; Scalability Central; NoSQL-Slavia; SQL Columnar Republic; Visualization County; Data Clean Wasteland; Statistician Old House
  • 11. Statistics and machine learning are complex! • Try to understand myself
  • 12. (Some books you might want to read)
  • 13. Plumbing is not complex (but difficult). Data sources: Implicit User Data (Views, Searches…); Content Data (Title, Categories, Price, …); Explicit User Data (Click, Buy, …); User Information (Location, Graph…). Volumes shown: 500 TB, 50 TB, 1 TB, 200 GB. Pipeline stages: Transformation Matrix; Transformation Predictor; Per User Stats; Per Content Stats; User Similarity; Rank Predictor; Content Similarity
  • 14. MERIT = TIME + ROI. Targeted Newsletter; Recommender Systems; Adapted Product / Promotions. TIME: 6 MONTHS. ROI: APPS. • Build a lab in 6 months (rather than 18 months): Find the right people (6 months?); Choose the technology (6 months?); Make it work (6 months?); Build the lab (6 months) • Deploy apps that actually deliver value. Timeline: 2013, 2014. • Train people • Reuse working patterns
  • 15. The Problem: It’s utterly complex and unreasonable
  • 16. Our Goal: Change his perspective on data science projects (sorry, we couldn’t find a picture of Hal smiling)
  • 17. Agenda: • Why and for what? ◦ Business theory ◦ Concrete projects • How (people and project)? ◦ How to start ◦ Dedicated team? • What technologies? ◦ Machine learning ◦ Architecture
  • 19. Example: Launching an App on the App Store • Product success driven by quality! • Margin / Customer Value / Traffic / Acquisition
  • 20. You continue growing… • Margin for new customers might decline… • Margin for new features might decline… • Is your business really scalable?
  • 21. Where is your core business advantage? • Existing customer profiles • Existing product assets • Existing specific business model • And your KNOWLEDGE of it
  • 22. Data Driven Business: What is your value? Number of Customers; Customer Knowledge. Increases over time with: - Time spent in your app - User relationships (network effect) - Partner / other app interactions. Your Value
  • 23. Data Impact: not all businesses are equal. Online Advertising; Telecommunication; Insurance. Ability to Acquire; Margin; New Services; Overall. Subscription Market; Infrastructure Driver; Selling Data; Risk / Price Optimization; Subscription Market; Subscription Market
  • 24. From Theory To Practice
  • 25. Freemium Application • What should be free in the application? • How to optimize conversion? • How to plan and create a business model? Main pain point: How to plan and optimize pricing in the application?
  • 26. Example (Freemium Application): Freemium Model Optimization. Business Model; User Cluster; Simulation. • Optimized pricing: margin +23% • Business planning capability: 1 month → 9 months • R + Python + InfiniDB; on-premise; 1 TB dataset; 5-week project
  • 27. Large E-Retailer • Business Intelligence stack has scalability and maintenance issues • Back office implements business rules that are challenged • Existing infrastructure cannot cope with per-user information. Main pain point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day.
  • 28. Large E-Retailer: The Datalab • Relieve their current DWH and accelerate production of some aggregates/KPIs • Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc. • Train existing people around machine learning and segmentation experience • 1h12 to perform the aggregate, available every morning • New home page personalization deployed in a few weeks • Hadoop cluster (24 cores); Google Compute Engine; Python + R + Vertica; 12 TB dataset; 6-week project
  • 29. Large Photo Bank • BI performed directly on production databases • New reports required the CTO’s direct work for design and implementation • Each photo tag manually validated and completed. Main pain point: No visibility on new users’ behaviours
  • 30. Large Photo Bank: The Datalab • Implementing a cloud-based data lab to: ◦ centralize all available data, previously scattered between SQL databases and file systems ◦ improve web tracking granularity to enhance customer knowledge via behavior modeling and segmentation ◦ create content-based recommendation engines with keyword clustering and association • R + Vertica + Hadoop; Amazon Web Services; 8-week project • Automated content filtering and recommendation
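
The content-based recommendation idea on slide 30 can be sketched in a few lines. This is only an illustration, not the project's actual implementation: it assumes each photo carries a set of keywords, and it ranks other photos by Jaccard similarity of those sets. The function names, photo IDs and keywords are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two keyword collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(photo_id, keywords_by_photo, top_n=3):
    """Return up to top_n other photos ranked by keyword overlap with photo_id."""
    target = keywords_by_photo[photo_id]
    scored = [
        (jaccard(target, keywords), other)
        for other, keywords in keywords_by_photo.items()
        if other != photo_id
    ]
    # Highest similarity first; ties broken by photo id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored[:top_n] if score > 0]


# Hypothetical catalogue, invented for this sketch.
photos = {
    "p1": {"beach", "sunset", "sea"},
    "p2": {"beach", "sea", "surf"},
    "p3": {"city", "night"},
    "p4": {"sunset", "mountain"},
}
print(recommend("p1", photos))
```

At the scale mentioned on the slide, the pairwise comparison would have to be replaced by an index or a clustering step; the sketch only shows the similarity measure.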
  • 31. Large Online Directory • Large set of manually crafted linguistic resources for interpreting user queries • New brands, rare terms… hard to maintain. Main pain point: Ability to maintain a very large ontological knowledge set, with more than 100k concepts
  • 32. Large Online Directory: The Data Lab • Analyze clicks, rephrasing and navigation to detect queries that require specific processing • Gather web and external data to enrich the existing index • Train the team on Hadoop and machine learning • Continuous relevance monitoring • Automated enrichment • 2x more productivity • Hadoop (48 cores); Python; on-premise; 10-week project
  • 33. Example (E-Application): Marketing Campaign Prediction • Launch a marketing campaign • After a few days, PREDICT based on behaviours: ◦ Total ARPU for users after 3 months ◦ Efficiency of a campaign ◦ Continue or not?
  • 34. Example (Social Gaming): Social Gaming Communities. A very large community; some mid-size communities; lots of small clusters (mostly 2 players). • Correlation ◦ between community size and engagement / virality • Meaningful patterns ◦ 2 players / Family / Group • What is the minimum number of friends to have in the application to get additional engagement?
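
One way to obtain the community sizes described on slide 34 is to treat friendships as an undirected graph and measure its connected components. The sketch below does this with a small union-find. It is an assumption about how such an analysis could be done, not the method used in the talk, and the player names are invented.

```python
from collections import Counter


def community_sizes(friendships):
    """Sizes of the connected components of an undirected friendship graph."""
    parent = {}

    def find(player):
        parent.setdefault(player, player)
        while parent[player] != player:
            parent[player] = parent[parent[player]]  # path halving
            player = parent[player]
        return player

    for a, b in friendships:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two communities

    sizes = Counter(find(player) for player in list(parent))
    return sorted(sizes.values(), reverse=True)


# Hypothetical friendship pairs: one group of four and two pairs of players.
edges = [("ann", "bob"), ("bob", "cid"), ("cid", "dee"), ("eve", "fay"), ("gus", "hal")]
print(community_sizes(edges))
```

Joining the resulting sizes with per-player engagement metrics would then allow the correlation mentioned on the slide to be measured.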
  • 35. Agenda: • What others do? ◦ Concrete projects • How (people and project)? ◦ How to start ◦ Dedicated team? • What technologies? ◦ Machine learning ◦ Architecture
  • 37. (1) Be Data Driven • An A/B test (or the equivalent for your business) is the first step to get into a “data-driven” mindset • No advanced analytics required; some existing tools can help • Changing a button color: +21%
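
To make the A/B test on slide 37 concrete, here is a minimal sketch of a two-proportion z-test using only the Python standard library. The visitor and conversion counts are hypothetical, chosen so that the variant shows the +21% relative lift quoted on the slide; the talk does not give the underlying numbers.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test. Returns (relative lift of B over A, z, two-sided p-value)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    lift = (rate_b - rate_a) / rate_a
    return lift, z, p_value


# Hypothetical counts: 10,000 visitors per variant, 1,000 vs 1,210 conversions.
lift, z, p_value = ab_test(conv_a=1000, n_a=10000, conv_b=1210, n_b=10000)
print(f"lift={lift:+.1%}  z={z:.2f}  p={p_value:.2g}")
```

A small p-value suggests the lift is unlikely to be noise, which is the check to make before acting on a headline number such as +21%.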
  • 38.  People  Microsoft Excel 6/4/2013Dataiku 38 (2) Use Excel
  • 39.  Data Team  Data Tools 6/4/2013Dataiku 39 (3) Build a team The Business Expert who knows maths The Analyst that reveals patterns The Coding Guy That is enthusiastic
  • 40.  data lab, (n. m): a small group with all the expertise, including business minded people, machine learning knowledge and the right technology  A proven organization used by successful data-driven companies over the past few years (eBay, LinkedIn, Walmart…) TEAM + TOOLS = LAB 6/4/2013Dataiku 40
  • 41. Organization (diagram labels): targeted campaigns; price optimization; personalized experience; quality assurance; workload and yield management; user feedback (A/B test); continuous improvement; data product designer; business & marketing; engineers; user voice.
  • 42. It’s just a new team (short-term focus vs. long-term drive). Business people: short term, optimize margin…; long term, create new business revenue streams. Marketing people: short term, optimize click ratio; long term, brand awareness and impact. IT people: short term, make IT work; long term, clean and efficient architecture. Data people: short term, get stats right and make predictions; long term, create data-driven features.
  • 43. Super Intern. What is your ability to integrate a new smart guy and give him any data and any computing power he would need to enhance your product?
  • 44. Agenda. What do others do? ◦ Concrete projects. People and projects: ◦ How to start ◦ Dedicated team? Which technologies? ◦ Machine Learning ◦ Architecture
  • 45. An oversimplified view of big data architecture
  • 47. (What it really looks like)
  • 48. What kind of scale? (Diagram: Database, Business Layer, Application, or Data Science App, or ?)
  • 49. What kind of interaction? (Diagram: Database, Business Layer, Application, Data Science App, with open questions on each link)
  • 50. Classic Columnar Architecture. Some data; some place to pour it in; some tool to do some maths and graphs.
  • 51. Classic Columnar Architecture. Lots of data (web tracking logs, raw server logs, order / product / customer data, Facebook info, open data such as weather and currency…); some place to pour it in; some tool to do some maths and graphs.
  • 52. The Corinthian Architecture. Lots of data; some place to pour it in and clean / prepare it; some place to perform rapid calculations; some tools to do some maths and charts.
  • 53. Data Storage And Preparation. Large scale: Hadoop cluster, Cassandra, MPP SQL columnar. Medium/large scale: CouchBase, MongoDB…. Selection drivers: volume, scalability.
  • 54. Calculations. Classic database: PostgreSQL, MySQL…. MPP SQL database: Vertica, Vectorwise, InfiniDB, GreenplumHD…. Hadoop new databases: Impala…. Selection drivers: speed (interactivity), expressivity.
  • 55. The Corinthian Architecture. Lots of data; some place to pour it in and clean / prepare it; some place to perform rapid calculations; some tools to do some maths and charts (statistics, cohorts, regressions, bar charts for marketing, nice infographics for your company board).
  • 56. The Corinthian Architecture. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations; some tools to do some maths; some others to do some charts.
  • 57. Statistical Tools. Open source: IPython, RStudio. Commercial: RapidMiner, SAS, RevolutionR. Selection drivers: existing know-how, scalability.
  • 58. What is a statistical tool? Interact with and explore data; some stats capabilities; some graph capabilities.
  • 59. Visualization Tools. Commercial: SpotFire, Tableau, QlikView. SaaS: BIME, ChartIO, RevolutionR. HTML5 / ad hoc: D3, GraphViz. Selection drivers: how many contributors / readers? Scalability.
  • 60. The “one database won’t do it all” problem. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations; some tools to do some maths; some others to do some charts. Needs: JOIN / aggregate; rapid GROUP BY computations; direct access to the computed results from production, etc.
  • 61. The Roman Social Forum. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations, and some database for graphs; some tools to do some maths; some others to do some charts.
  • 62. Graph. Databases: Neo4J, Titan, OrientDB, InfiniteGraph. Analytics / visualization: Gephi. Selection drivers: scalability; which algorithms? Licensing constraints.
  • 63. The Key Value Store. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations, some database for graphs, and some distributed key-value store; some tools to do some maths; some others to do some charts.
  • 64. NoSQL. Search: SOLR, ElasticSearch. Document: MongoDB, CouchDB. Key-value: Redis, HBase…. Selection drivers: durability / availability…; performance; ease of use and API; indexing.
  • 65. Action requires Prediction. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations, some database for graphs, and some distributed key-value store; some tools to do some maths; some others to do some charts. Draw a line for the future: What are my real user groups? Should I launch a discount offering or not? To everybody or to specific users only?
  • 66. The Medieval Fairy Land. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations, some database for graphs, and some distributed key-value store; some tools to do some maths; some others to do some charts and some MACHINE LEARNING.
  • 67. Predictions. Java: Mahout (Hadoop), WEKA. Python: Scikit-Learn, PyML. R. Commercial: KXEN, SAS, SPSS…. Selection drivers: scalability; black box / white box? Data management integration.
  • 69. Classes of Machine Learning Problems. Exploratory Data Analysis: identifying and visualizing key patterns and correlations within the dataset. Unsupervised Learning: creating groups of similar observations sharing the same patterns (aka clustering, segmentation). Supervised Learning: modeling a variable using independent features (aka scoring, predictive modeling, classification). Time Series Forecasting: predicting a time-dependent variable using its own history, and sometimes other covariates (variables). Graph Analysis: analyzing relationships between a set of “nodes”, linked by “edges”. Associations / Sequences Mining: identifying frequently associated items within transaction / event databases, sometimes ordered over time. And many more…
  • 70. Mapping ML to Business Questions (class: sample business questions). Exploratory Data Analysis: What does my dataset look like? What are the key correlations in my data? Unsupervised Learning: Can I create groups of users who share the same purchasing behavior? The same navigation behavior? Supervised Learning: Which users are likely to click on ad X? Which users are likely to convert to paying users? Who is going to leave my service? What is the profile of the users who do X? Time Series Forecasting: What is the forecast of my revenue next month? Given the weather forecast, can I also forecast my sales? Product sales forecast (for overbooking). Graph Analysis: Can I identify influencers in my user community? Can I recommend new friends to my users? Association & Sequences Mining: Which products are frequently bought together? What is the typical navigation path on my website?
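To make the “which products are frequently bought together?” question concrete, here is a minimal sketch that counts co-occurring product pairs across baskets and keeps those above a support threshold. The basket contents, the 0.5 threshold and the function name `frequent_pairs` are illustrative assumptions; real projects would typically use a full frequent-itemset algorithm such as Apriori rather than pair counting.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return {pair: support} for product pairs present in at least
    min_support (a fraction between 0 and 1) of the baskets."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items()
            if n / total >= min_support}

# Hypothetical transactions.
baskets = [
    ["shoes", "socks"],
    ["shoes", "socks", "belt"],
    ["belt", "hat"],
    ["shoes", "socks", "hat"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```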
  • 71. Machine Learning Methods Detailed (analytical task / ML task: sample algorithms; shape of dataset). Exploratory Data Analysis / Univariate Analysis: distributions, frequencies, histograms, boxplots, fit tests...; N obs. (1 row per obs.) * P features. Exploratory Data Analysis / Bivariate Analysis: scatterplots, correlations (Pearson, Spearman), GLM, chi-square...; N obs. (1 row per obs.) * P features. Exploratory Data Analysis / Multivariate Analysis: principal components analysis, multi-dimensional scaling, correspondence analysis, factor analysis…; N obs. (1 row per obs.) * P features. “Oriented” Data Analysis / Unsupervised Learning: K-means, K-medoids, hierarchical clustering, Gaussian mixture models, mean shift, DBSCAN, spectral clustering...; N obs. (1 row per obs.) * P features. “Oriented” Data Analysis / Supervised Learning: linear & logistic regression, decision trees, neural networks, SVM, naïve Bayes, K-NN, random forests…; N obs. (1 row per obs.) * P features. “Oriented” Data Analysis / Time Series Forecasting: ARMA, VARMAX, ARIMA…; time series (rows: time periods, columns: measures). “Oriented” Data Analysis / Graph Analysis: centrality (closeness, betweenness, PageRank, HITS), modularity (Louvain)…; node and edge lists (+ attributes). “Oriented” Data Analysis / Associations & Sequences: frequent itemsets, Apriori, market basket…; (timestamped) events or transactions.
  • 72. Unsupervised Method: K-Means. Cluster a dataset into K buckets by choosing the “closest” neighbours.
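The K-means idea on this slide can be sketched in plain Python with Lloyd's iteration: assign each point to its closest centroid, then move each centroid to the mean of its points. The sample points, the fixed iteration count and the seeded random initialisation are assumptions made for the example, not choices stated in the deck.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct points from the data.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: group points by closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [min(range(k), key=lambda i: squared_distance(p, centroids[i]))
              for p in points]
    return centroids, labels

# Hypothetical 2-D observations forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

In practice a library implementation such as the one in Scikit-Learn (listed on slide 67) would be the usual choice; the sketch only shows the mechanics.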
  • 73. Supervised: K-Nearest-Neighbours. Predict the color of a point depending on the colors of its K closest neighbours.
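A minimal sketch of the K-nearest-neighbours rule described above: find the K training points closest to the query and take a majority vote on their colors. The training points, the colors and the function name `knn_predict` are invented for illustration.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (point, label) pairs; returns the majority
    label among the k points closest to query."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points.
train = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((8, 8), "blue"), ((8, 9), "blue"), ((9, 8), "blue"),
]
near_red = knn_predict(train, (2, 2))
near_blue = knn_predict(train, (7, 7))
print(near_red, near_blue)
```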
  • 74. Supervised: Decision Tree. Find the most “significant” input variable and split value; split the dataset recursively.
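The two steps on this slide can be sketched as follows. The slide does not say how “significant” is measured; this example assumes Gini impurity as the criterion, and the tiny dataset, the `max_depth` limit and all function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted impurity, feature index, threshold) or None."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Nodes are (feature, threshold, left, right); leaves are labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels)
    if split is None:
        return majority
    _, f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth))

def predict(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
rows = [(1, 10), (2, 20), (3, 10), (4, 20)]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels)
print(tree, predict(tree, (3.5, 10)))
```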
  • 75. Several Paths to Machine Learning (decision flowchart starting from an analytical dataset; Yes / No branches not reproduced). Entry points: I just want to explore; I’m looking variable by variable, or at pairs; I’m looking for clusters; I want to predict a variable. Exploratory Data Analysis: data viz...; questions: all my variables are numeric? I have a distance matrix?; methods: PCA…, CA…, MDS.... Variable by variable, or pairs: correlation analysis, GLM, parametric and non-parametric statistical tests. Unsupervised Learning: questions: I know how many groups to look for? Small dataset (<<1K)? Medium dataset (<<100K)? I can sample?; methods: HCA…, partitioning (K-means…), GMM…, DP GMM…, K-means + Gap | Silhouette | …, 2-step clustering, affinity propagation, mean shift…. Supervised Learning*: question: I value interpretability (Yes / Not only)?; methods: generalized linear model, simple decision tree, support vector machines, neural networks, K-nearest neighbors, ensembles (random forest, gradient boosted trees), MARS, generalized additive model. * Methods generally working for both classification & regression.
  • 76. Questions? Take away: ◦ There are new ways to perform data analytics that are within your reach and can bring business value. Some additional resources: ◦ Open source projects: Dataiku Cloud Transport Client http://dctc.io; Dataiku Web Tracker https://github.com/dataiku/wt1 ◦ Our technical blog: http://www.dataiku.com/blog