BIG DATA
How do elephants
make babies

Florian Douetteau
CEO, Dataiku
Agenda
• Big Data & Hadoop Overview
• Practical Big Data Coding: Pig / Hive / Cascading
• PagesJaunes Big Data Use Case
• Machine Learning For Big Data
Motivation

Dataiku 1/8/14
Collocation

A familiar grouping of words,
especially words that habitually
appear together and thereby
convey meaning by association.

Big

Apple

Big

Mama

Big

Data
“Big” Data in 1999
struct Element {
    Key key;
    void* stat_data;
};
….

C
Optimized Data structures
Perfect Hashing
HP-UX servers – 4 GB RAM
100 GB data
Web Crawler – Socket reuse
HTTP 0.9
1 Month
Big Data in 2013







Hadoop
Java / Pig / Hive / Scala / Clojure / …
A dozen NoSQL data stores
MPP Databases
Real-Time

1 Hour
Data Analytics: The Stakes
1 TB
1B $

1 TB
?$
1 TB
100M $

Web Search
1999
Logistics
2004

10 TB
10M $
100 TB
?$

Banking
CRM
2008

50TB
1B$
1000TB
500M $
E-Commerce
2013

Social Gaming
2011
Web
Search
2010

Online
Advertising
2012
Meet Hal Alowne

Hal Alowne
BI Manager
Dim's Private Showroom
European e-commerce web site
• 100M$ revenue
• 1 million customers
• 1 data analyst (Hal himself)

Dataiku - Data Tuesday

Dim Sum
CEO & Founder
Dim's Private Showroom

"Hey Hal! We need a big data platform like the big guys. Let's just do as they do!"

Big Data
Copy Cat Project

Big Guys
• 10B$+ revenue
• 100M+ customers
• 100+ data scientists
QUESTION #1
IS IT EASY OR NOT
?
SUBTLE
PATTERNS
"MORE
BUSINESS"
BUTTONS
QUESTION #2
WHO TO HIRE
?
DATA SCIENTIST
AT NIGHT
DATA CLEANER
THE DAY
PARADOX #3
WHERE ?
MY DATA
IS WORTH
MILLIONS
I SEND IT
TO THE
MARKETING
CLOUD
QUESTION #4
IS IT BIG OR NOT
WE ALL LIVE
IN A BIG DATA
LAKE
ALL MY DATA
PROBABLY FITS
IN HERE
QUESTION #5 (at last)
HUMAN OR NOT ?
MACHINE
LEARNING
WILL SAVE
US ALL
I JUST WANT
MORE
REPORTS
MERIT = TIME + ROI
TIME : 6 MONTHS

ROI : APPS
2014

2013

Find the right
people
(6 months?)

Choose the
technology
(6 months?)

Make it work
(6 months?)

2013

Build the lab
(6 months)
• Train People
• Reuse working patterns

➔ Build a lab in 6 months (rather than 18 months)

Targeted
Newsletter
Recommender
Systems

Adapted Product / Promotions
➔ Deploy apps that actually deliver value
Statistics and machine learning are complex!
➔ Try to understand myself
(Some Book you might want to read)

CHOOSE TECHNOLOGY
NoSQL-Slavia

Hadoop

Elastic Search

Ceph

SOLR

Riak

Machine Learning
Mystery Land

Scalability Central

Cassandra

MongoDB
Membase

Scikit-Learn
GraphLAB
prediction.io jubatus
Mahout
WEKA

Sphere

Kafka Flume
Real-time island
Spark Storm

SQL Columnar Republic

MLBase

RapidMiner

Vertica

Netezza

QlikView
Kibana
SpotFire D3

Cascading

Tableau

Dataiku - Pig, Hive and Cascading

SPSS

Pandas

Pig

Visualization County

R

SAS

InfiniDB Drill
GreenPlum
Impala

LibSVM

Talend

Data Clean Wasteland

Statistician Old
House
Large E-Retailer






• Business Intelligence stack has scalability and maintenance issues
• Back office implements business rules that are challenged
• Existing infrastructure cannot cope with per-user information

Main pain point:
23 hours 52 minutes to compute Business Intelligence aggregates for one day.
Large E-Retailer: The Datalab
• Relieve their current DWH and accelerate production of some aggregates/KPIs
• Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc.
• Train existing people around machine learning and segmentation experience

1h12 to perform the aggregate, available every morning
New home page personalization deployed in a few weeks

Hadoop cluster (24 cores)
Google Compute Engine
Python + R + Vertica
12 TB dataset
6-week project
Example (Social Gaming)
Social Gaming Communities


• Correlation
  ◦ between community size and engagement / virality
• Some mid-size communities
• Meaningful patterns
  ◦ 2 players / Family / Group
• What is the minimum number of friends to have in the application to get additional engagement?
• A very large community
• Lots of small clusters (mostly 2 players)
How do I (pre)process data?
Implicit User Data
(Views, Searches…)

Online User
Information
Transformation
Predictor

500TB
Transformation
Matrix

Explicit User Data

Predictor
Runtime

(Click, Buy, …)

Per User Stats

Rank Predictor

50TB
Per Content Stats

User Information
(Location, Graph…)
User Similarity

1TB
Content Data
(Title, Categories, Price, …)

200GB

Content Similarity

A/B Test Data

Always the same
Pour Data In

Compute Something
Smart About It

Make Available
The Questions
Pour Data In

How often ?
What kind of
interaction?
How much ?

Compute Something
Smart About It

How complex ?
Do you need all
data at once ?
How incremental?

Make Available

Interaction ?
Random Access ?
At the Beginning was the
elephant
MapReduce
How to count words in many, many boxes
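The word-count idea can be sketched in plain Python. This is only an illustration of the map / shuffle / reduce steps, not Hadoop code: the function names and the sample "boxes" are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(box):
    """Mapper: emit a (word, 1) pair for every word found in one box of text."""
    return [(word.lower(), 1) for word in box.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reducer: sum the partial counts for one word."""
    return word, sum(counts)

def word_count(boxes):
    """Run the three phases in sequence over a list of text boxes."""
    emitted = [pair for box in boxes for pair in map_phase(box)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(emitted).items())

if __name__ == "__main__":
    # Hypothetical input: each string stands for one box (one input split).
    print(word_count(["big data big elephant", "elephant makes babies", "big babies"]))
```

On a real cluster each box would be handled by a separate mapper, and the shuffle would move the pairs across the network to the reducers.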

Dataiku - Innovation Services
ELEPHANTS MAKE BABIES
After Hadoop
Random Access
In Memory
MultiCore
Machine Learning

Faster in Memory
Computation

Massive Batch
Map Reduce Over HDFS

Real-Time
Distributed
Computation
Faster SQL Analytics
Queries
MapReduce
Simplicity is a complexity

Agenda









Hadoop and Context (-0:03)
Pig, Hive, Cascading, … (-0:09)
How they work (-0:15)
Comparing the tools (-0:35)
Make them work together (-0:40)
Wrap-up and questions (-Beer)
Pig History



Yahoo Research in 2006
Inspired by Sawzall, a Google paper from 2003
Apache project in 2007



Initial motivation




◦ Search log analytics: how long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …

words = LOAD '/training/hadoopwordcount/output' USING PigStorage('\t')
        AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
Hive History


Developed by Facebook in January 2007



Open source in August 2008



Initial Motivation

◦ Provide a SQL-like abstraction to perform statistics on status updates

create external table wordcounts (
    word string,
    count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';
select * from wordcounts order by count desc limit 10;
select SUM(count) from wordcounts where word like 'th%';
Cascading History


Authored by Chris Wensel in 2008



Associated Projects

◦ Cascalog: Cascading in Clojure
◦ Scalding: Cascading in Scala (Twitter in 2012)
◦ Lingual (to be released soon): SQL layer on top of Cascading
Pig & Hive
Mapping to MapReduce jobs

events          = LOAD '/events' USING PigStorage('\t') AS
                  (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user         = GROUP events_filtered BY user;
price_by_user   = FOREACH by_user GENERATE group AS user,
                  SUM(events_filtered.price) AS total_price,
                  MAX(events_filtered.timestamp) AS max_ts;
high_pbu        = FILTER price_by_user BY total_price > 1000;

Job 1: Mapper
  LOAD
  FILTER
Shuffle and sort by user
Job 1: Reducer
  GROUP
  FOREACH
  FILTER

* VAT excluded
Pig & Hive
Mapping to MapReduce jobs

events          = LOAD '/events' USING PigStorage('\t') AS
                  (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user         = GROUP events_filtered BY user;
price_by_user   = FOREACH by_user GENERATE group AS user,
                  SUM(events_filtered.price) AS total_price,
                  MAX(events_filtered.timestamp) AS max_ts;
high_pbu        = FILTER price_by_user BY total_price > 1000;
recent_high     = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';

Job 1: Mapper
  LOAD
  FILTER
Shuffle and sort by user
Job 1: Reducer
  GROUP
  FOREACH
  FILTER
Job 2: Mapper
  LOAD (from tmp)
Shuffle and sort by max_ts
Job 2: Reducer
  STORE
Pig
How does it work
Data execution plan compiled into 10 MapReduce jobs, executed in parallel (or not)
Hive Joins

How to join with MapReduce ?
[Diagram: two input tables, one with columns (uid, name) holding the names Dupont and Durand, and one with columns (uid, type) holding Type1 and Type2. The mappers output every row tagged with a tbl_idx identifying its source table. Rows are shuffled by uid and sorted by (uid, tbl_idx), then Reducer 1 and Reducer 2 each join the rows for their uids.]
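The reduce-side join described on this slide can be sketched in plain Python. This is a hedged illustration, not Hive's implementation: the function names are invented for the example, and the assignment of names and types to uids in the usage note is hypothetical.

```python
from collections import defaultdict

def map_table(rows, tbl_idx):
    """Mapper: key every row by uid and tag it with the index of its source table."""
    return [(uid, tbl_idx, value) for uid, value in rows]

def shuffle_and_sort(records):
    """Shuffle by uid, then sort each group by tbl_idx so table-1 rows arrive first."""
    groups = defaultdict(list)
    for uid, tbl_idx, value in records:
        groups[uid].append((tbl_idx, value))
    return {uid: sorted(vals, key=lambda v: v[0]) for uid, vals in groups.items()}

def reduce_join(uid, tagged_values):
    """Reducer: pair every table-1 value with every table-2 value sharing the uid."""
    left = [value for idx, value in tagged_values if idx == 1]
    right = [value for idx, value in tagged_values if idx == 2]
    return [(uid, name, typ) for name in left for typ in right]

def join(names, types):
    """Inner join of (uid, name) rows with (uid, type) rows."""
    records = map_table(names, 1) + map_table(types, 2)
    joined = []
    for uid, values in sorted(shuffle_and_sort(records).items()):
        joined.extend(reduce_join(uid, values))
    return joined
```

For example, `join([(1, "Dupont"), (2, "Durand")], [(1, "Type1"), (2, "Type1"), (2, "Type2")])` yields one output row per matching pair. Sorting by tbl_idx is what lets a real reducer buffer only the first table's rows and stream the second.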
Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment



Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration



Performance and optimization

Procedural Vs Declarative


Transformation as a sequence of operations

Users                = load 'users' as (name, age, ipaddr);
Clicks               = load 'clicks' as (user, url, value);
ValuableClicks       = filter Clicks by value > 0;
UserClicks           = join Users by name, ValuableClicks by user;
Geoinfo              = load 'geoinfo' as (ipaddr, dma);
UserGeo              = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA                = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

Transformation as a set of formulas

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;
Data type and Model
Rationale


• All three extend the basic data model with extended data types
  ◦ array-like [event1, event2, event3]
  ◦ map-like {type1:value1, type2:value2, …}
• Different approaches
  ◦ Resilient schema
  ◦ Static typing
  ◦ No static typing
Hive
Data Type and Schema
CREATE TABLE visit (
    user_name    STRING,
    user_id      INT,
    user_details STRUCT<age:INT, zipcode:INT>
);

Simple type                      Details
TINYINT, SMALLINT, INT, BIGINT   1, 2, 4 and 8 bytes
FLOAT, DOUBLE                    4 and 8 bytes
BOOLEAN
STRING                           Arbitrary-length, replaces VARCHAR
TIMESTAMP

Complex type   Details
ARRAY          Array of typed items (0-indexed)
MAP            Associative map
STRUCT         Complex class-like objects

Dataiku Training – Hadoop for Data Science
Data types and Schema
Pig
rel = LOAD '/folder/path/'
      USING PigStorage('\t')
      AS (col:type, col:type, col:type);

Simple type                Details
int, long, float, double   32 and 64 bits, signed
chararray                  A string
bytearray                  An array of … bytes
boolean                    A boolean

Complex type   Details
tuple          a tuple is an ordered fieldname:value map
bag            a bag is a set of tuples
Data Type and Schema
Cascading




• Support for any Java types, provided they can be serialized in Hadoop
• No support for typing

Simple type                Details
Int, Long, Float, Double   32 and 64 bits, signed
String                     A string
byte[]                     An array of … bytes
Boolean                    A boolean

Complex type   Details
Object         Object must be "Hadoop serializable"
Style Summary
           Style        Typing                                        Data model                              Metadata store
Pig        Procedural   Static + Dynamic                              scalar + tuple + bag (fully recursive)  No (HCatalog)
Hive       Declarative  Static + Dynamic, enforced at execution time  scalar + list + map                     Integrated
Cascading  Procedural   Weak                                          scalar + Java objects                   No
Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing, error management and environment



Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration



Performance and optimization

Headachability
Motivation

Does debugging the tool lead to bad headaches?
Headaches
Pig


• Out of memory error (reducer)
• Exceptions in built-in / extended functions (handling of null)
• Null vs “”
• Nested FOREACH and scoping
• Date management (Pig 0.10)
• Field implicit ordering
A Pig Error

Headaches
Hive


• Out of memory errors in reducers
• Few debugging options
• Null / “”
• No built-in “first”
Headaches
Cascading


• Weak typing errors (comparing Int and String …)
• Illegal operation sequence (group after group …)
• Field implicit ordering
Testing
Motivation



• How to perform unit tests?
• How to have different versions of the same script (parameters)?
Testing
Pig





• System variables
• Comment to test
• No metaprogramming
• pig -x local to execute on local files
Testing / Environment
Cascading



• JUnit tests are possible
• Ability to use code to actually comment out some variables
Checkpointing
Motivation





• Lots of iteration while developing on Hadoop
• Sometimes jobs fail
• Sometimes you need to restart from the start …

Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output
FIX and relaunch
Pig
Manual Checkpointing


• STORE command to manually store files

Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output
// COMMENT beginning of script and relaunch
Cascading
Automated Checkpointing


Ability to re-run a flow automatically from the last saved checkpoint

addCheckpoint(…)
Cascading
Topological Scheduler




• Check the timestamp of each intermediate file
• Execute only if more recent

Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output
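The timestamp rule can be illustrated with a small Python sketch. It is a plain-Python analogue of the idea, not Cascading's API: `is_stale`, `copy_step` and `run_flow` are hypothetical names, and each step simply copies its inputs to its output.

```python
import os
import tempfile

def is_stale(inputs, output):
    """A step must run if its output is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)

def copy_step(inputs, output):
    """Placeholder for real work: concatenate the inputs into the output file."""
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as src:
                out.write(src.read())

def run_flow(steps):
    """Run (name, inputs, output) steps in dependency order, skipping up-to-date ones.

    Returns the names of the steps that were actually executed.
    """
    executed = []
    for name, inputs, output in steps:
        if is_stale(inputs, output):
            copy_step(inputs, output)
            executed.append(name)
    return executed

if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    logs = os.path.join(workdir, "logs")
    parsed = os.path.join(workdir, "parsed")
    stats = os.path.join(workdir, "stats")
    with open(logs, "w") as f:
        f.write("GET /home\n")
    flow = [("Parse Logs", [logs], parsed), ("Per Page Stats", [parsed], stats)]
    print(run_flow(flow))  # first run: no outputs exist yet, so every step executes
```

Re-running the flow without touching any file executes nothing. Refreshing the logs makes every downstream step stale in turn, because each re-executed step rewrites its output with a newer timestamp.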
Productivity Summary
           Headaches                            Checkpointing / Replay          Testing / Metaprogramming
Pig        Lots                                 Manual save                     Difficult metaprogramming, easy local testing
Hive       Few, but without debugging options   None (That's SQL)               None (That's SQL)
Cascading  Weak typing complexity               Checkpointing, partial updates  Possible
Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment



Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration



Performance and optimization

Formats Integration
Motivation


• Ability to integrate different file formats
  ◦ Text delimited
  ◦ Sequence File (binary Hadoop format)
  ◦ Avro, Thrift ..
• Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)

Format impact on size and performance
Format                        Size on disk (GB)  Hive processing time (24 cores)
Text file, uncompressed       18.7               1m32s
1 text file, gzipped          3.89               6m23s (no parallelization)
JSON compressed               7.89               2m42s
Multiple text files, gzipped  4.02               43s
Sequence File, block, gzip    5.32               1m18s
Text file, LZO indexed        7.03               1m22s
Format Integration





• Hive: SerDe (Serializer-Deserializer)
• Pig: Storage
• Cascading: Tap
Partitions
Motivation




• No support for “UPDATE” patterns; any increment is performed by adding or deleting a partition
• Common partition schemas on Hadoop
  ◦ By date: /apache_logs/dt=2013-01-23
  ◦ By data center: /apache_logs/dc=redbus01/…
  ◦ By country
  ◦ …
  ◦ Or any combination of the above
Hive Partitioning
Partitioned tables

CREATE TABLE event (
user_id INT,
type STRING,
message STRING)
PARTITIONED BY (day STRING, server_id STRING);
Disk structure
/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1

INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
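The key=value directory layout shown above can be modeled in a few lines of Python. This only mirrors the path convention from the slide, not Hive's own code: `partition_path`, `parse_partition` and `prune` are hypothetical helpers.

```python
def partition_path(table_root, partition):
    """Build the directory of one partition from ordered (key, value) pairs."""
    parts = [f"{key}={value}" for key, value in partition]
    return "/".join([table_root.rstrip("/")] + parts)

def parse_partition(path, table_root):
    """Recover the (key, value) pairs encoded in a file path under table_root."""
    relative = path[len(table_root.rstrip("/")) + 1:]
    return [tuple(segment.split("=", 1)) for segment in relative.split("/") if "=" in segment]

def prune(paths, table_root, **wanted):
    """Keep only the files whose partition values match, without opening any file."""
    return [
        path for path in paths
        if all(dict(parse_partition(path, table_root)).get(key) == value
               for key, value in wanted.items())
    ]
```

Selecting `day="2013-01-28"` with `prune` touches only the matching directories, which is why adding or dropping a partition is the natural unit of incremental update.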
Cascading Partition
• No direct support for partitions
• Support for “Glob” Tap, to read from files using patterns

➔ You can code your own custom or virtual partition schemes
External Code Integration
Simple UDF
Pig

Hive

Cascading
Hive Complex UDF
(Aggregators)

Cascading
Direct Code Evaluation

Uses Janino, a very cool project:
http://docs.codehaus.org/display/JANINO

Spring Batch
Cascading Integration




• Allows calling a Cascading flow from a Spring Batch job
• No full integration with Spring MessageSource or MessageHandler yet (only for local flows)
Integration
Summary

           Partition / Incremental updates  External code                                             Format integration
Pig        No direct support                Simple                                                    Doable and rich community
Hive       Fully integrated, SQL-like       Very simple, but complex dev setup                        Doable and existing community
Cascading  With coding                      Complex UDFs but regular, and Java expression embeddable  Doable and growing community
Optimization


• Several common MapReduce optimization patterns
  ◦ Combiners
  ◦ MapJoin
  ◦ Job fusion
  ◦ Job parallelism
  ◦ Reducer parallelism
• Different support per framework
  ◦ Fully automatic
  ◦ Pragma / directives / options
  ◦ Coding style / code to write
Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
[Diagram: without a combiner, each mapper emits one record per input row (2012-02-14 4354, 2012-02-15 21we2, 2012-02-14 qa334, 2012-02-15 23aq2, …) and the reducer computes the counts: 2012-02-14 20, 2012-02-15 35, 2012-02-16 1.]
Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
[Diagram: with a combiner, each mapper pre-aggregates its rows. One mapper emits 2012-02-14 8 and 2012-02-15 12; the other emits 2012-02-14 12, 2012-02-15 23 and 2012-02-16 1. The reducer sums the partial counts: 2012-02-14 20, 2012-02-15 35, 2012-02-16 1.]

Reduced network bandwidth. Better parallelism.
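A rough Python sketch of the combiner idea for the GROUP BY date query: the reducer gets the same totals whether it receives raw (date, 1) pairs or per-mapper partial counts, but far fewer records cross the network in the second case. The function names and the sample splits are assumptions made for the example.

```python
from collections import Counter

def map_rows(rows):
    """Mapper without combiner: emit one (date, 1) pair per input row."""
    return [(date, 1) for date, _product_id in rows]

def combine(pairs):
    """Combiner: pre-aggregate on the mapper side, one (date, partial_count) per date."""
    partial = Counter()
    for date, count in pairs:
        partial[date] += count
    return sorted(partial.items())

def reduce_counts(pairs):
    """Reducer: sum whatever arrives, raw 1s and partial counts alike."""
    totals = Counter()
    for date, count in pairs:
        totals[date] += count
    return dict(totals)

if __name__ == "__main__":
    # Hypothetical input splits, one per mapper.
    split_1 = [("2012-02-14", "4354"), ("2012-02-15", "21we2"), ("2012-02-14", "qa334")]
    split_2 = [("2012-02-15", "23aq2"), ("2012-02-16", "zz001")]
    raw = map_rows(split_1) + map_rows(split_2)
    combined = combine(map_rows(split_1)) + combine(map_rows(split_2))
    print(len(raw), len(combined))
    print(reduce_counts(raw) == reduce_counts(combined))
```

This works because COUNT is associative and commutative; an aggregate without that property could not be split into partial results this way.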
Join Optimization
Map Join
Hive
set hive.auto.convert.join = true;
Pig

Cascading
(no aggregation support after HashJoin)
Number of Reducers


• Critical for performance
• Estimated from the size of the input file
  ◦ Hive: divide size by hive.exec.reducers.bytes.per.reducer (default 1GB)
  ◦ Pig: divide size by pig.exec.reducers.bytes.per.reducer (default 1GB)
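The estimate is a ceiling division of the input size by the bytes-per-reducer setting. A minimal sketch, assuming the 1 GB default quoted on the slide; `estimate_reducers` is a hypothetical helper, and any upper bound the frameworks may apply is not modeled here.

```python
import math

GB = 1024 ** 3  # the slide quotes a default of 1 GB per reducer

def estimate_reducers(input_bytes, bytes_per_reducer=GB):
    """Estimate the reducer count as ceil(input size / bytes per reducer), at least 1."""
    return max(1, math.ceil(input_bytes / bytes_per_reducer))

if __name__ == "__main__":
    # A 12 TB dataset with the default setting, then with 4 GB per reducer.
    print(estimate_reducers(12 * 1024 * GB))
    print(estimate_reducers(12 * 1024 * GB, bytes_per_reducer=4 * GB))
```

Raising the bytes-per-reducer value lowers the reducer count, which is the "DIY" lever when the estimate does not suit the job.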
Performance & Optimization
Summary

           Combiner optimization  Join optimization     Number of reducers optimization
Pig        Automatic              Option                Estimate or DIY
Cascading  DIY                    HashJoin              DIY
Hive       Partial DIY            Automatic (Map Join)  Estimate or DIY
BIG DATA AND MACHINE LEARNING USE CASE

Search quality

ERWAN PIGNEUL
TEAM LEADER – PROJECT MANAGER
PAGESJAUNES CONTEXT
CORE BUSINESS: LOCAL SEARCH FOR BUSINESSES

PAGESJAUNES USES A SPECIFIC INTERPRETATION ENGINE THAT REQUIRES MANUAL INDEXING

THIS HANDLES THE MOST FREQUENT QUERIES WELL BUT DOES NOT COVER THE LONG TAIL
HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYZING USER BEHAVIOR?

20M
1.4M
10 occurrences
queries
Analysis  corrections
200M searches
0.5M prioritized queries
automation
SOLUTION
pagesjaunes.fr
crawl
hadoop
PIG+Hive
Interpretation engine
Scikit-learn
indexing
Directory
Other reference datasets
Export
TECHNICAL LESSONS
HADOOP / PIG / HIVE:
Effective
Challenges some test/prod practices (problems appear on large volumes)
Caution, it is still young (compatibility, …)

DATAIKU STUDIO:
Big data development accelerator
Schedules processing by integrating all our jobs, and manages dependencies
Easy machine learning

ELASTICSEARCH:
Indexed volume and search speed
EFFECTIVENESS OF THE APPROACH
Evolution of the fragility of the query ‘Parc enfant’

[Chart: fragility over time, from Fragile to Not fragile, for the query ‘Parc enfant’ compared with the overall average.]
Mahout 102
Clustering
Goal for Today
• Quick Introduction To Clustering
• How does it work in Practice
• How does it work in Mahout
• Overview of Mahout Algorithms
Clustering
[Scatter plot: Revenue vs. Age, with a point c marked]
Clustering
[Scatter plot: Revenue vs. Age, with one cluster highlighted around c]
Centroid == center of the cluster
clustering applications
• Fraud: Detect Outliers
• CRM: Mine for customer segments
• Image Processing: Similar Images
• Search: Similar documents
• Search: Allocate Topics
K-Means
• Guess an initial placement for centroids
• Assign each point to the closest center (MAP)
• Reposition each center (REDUCE)
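The three steps can be sketched in plain Python, with the assignment written as a map step and the repositioning as a reduce step. This is a single-machine illustration under assumed names, not Mahout code; it uses Euclidean distance and stops when the centroids no longer move.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def map_assign(points, centroids):
    """MAP: emit (cluster_id, point) for every point."""
    return [(closest(point, centroids), point) for point in points]

def reduce_reposition(assignments, centroids):
    """REDUCE: move each centroid to the mean of the points assigned to it."""
    updated = list(centroids)
    for cluster_id in range(len(centroids)):
        members = [point for cid, point in assignments if cid == cluster_id]
        if members:  # an empty cluster keeps its previous centroid
            updated[cluster_id] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return updated

def kmeans(points, centroids, max_iter=20):
    """Alternate the two steps until the centroids are stable or max_iter is reached."""
    for _ in range(max_iter):
        updated = reduce_reposition(map_assign(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids

if __name__ == "__main__":
    # Hypothetical (age, revenue) points, rescaled to comparable ranges beforehand.
    data = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

In a MapReduce setting every iteration is a full job that reads the points from disk and writes the new centroids back, so each pass of the loop carries disk I/O cost.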
clustering challenges
• Curse of Dimensionality
• Choice of distance / number of parameters
• Performance
• Choice # of clusters
Mahout Clustering
Challenges
• No integrated feature engineering stack: get ready to write data processing in Java
• Hadoop SequenceFile required as an input
• Iterations as Map/Reduce read and write to disks: relatively slow compared to in-memory processing
Data Processing
Image / Voice / Log / DB → Data Processing → Vectorized Data
Mahout K-Means on Text
Workflow
Text files → mahout seqdirectory → Mahout sequence files → mahout seq2sparse → TF-IDF vectors → mahout kmeans → Clusters
Mahout K-Means on Database Extract
Workflow
Database dump (CSV) → org.apache.mahout.clustering.conversion.InputDriver → Mahout vectors → mahout kmeans → Clusters
Convert a CSV File to Mahout Vector
• Real code would have
  ◦ Converting categorical variables to dimensions
  ◦ Variable rescaling
  ◦ Dropping IDs (name, forename …)
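The three preprocessing steps listed above can be sketched in Python. The output is a list of plain numeric lists, not Mahout vectors, and the `vectorize` helper, the column names and the min-max rescaling are assumptions chosen for the example.

```python
import csv
import io

def vectorize(csv_text, id_columns, categorical_columns):
    """Turn CSV rows into numeric vectors.

    Drops ID columns, one-hot encodes categorical columns (one dimension per
    distinct value), and rescales numeric columns to the [0, 1] range.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = [c for c in rows[0] if c not in id_columns]
    categories = {c: sorted({r[c] for r in rows}) for c in columns if c in categorical_columns}
    bounds = {
        c: (min(float(r[c]) for r in rows), max(float(r[c]) for r in rows))
        for c in columns if c not in categorical_columns
    }

    vectors = []
    for row in rows:
        vector = []
        for c in columns:
            if c in categories:
                # One dimension per category value, 1.0 where the row matches.
                vector.extend(1.0 if row[c] == value else 0.0 for value in categories[c])
            else:
                low, high = bounds[c]
                vector.append((float(row[c]) - low) / (high - low) if high > low else 0.0)
        vectors.append(vector)
    return vectors

if __name__ == "__main__":
    # Hypothetical database extract.
    sample = "name,age,revenue,city\nDupont,20,1000,Paris\nDurand,40,3000,Lyon\nMartin,30,2000,Paris\n"
    for vec in vectorize(sample, id_columns={"name"}, categorical_columns={"city"}):
        print(vec)
```

Rescaling matters for distance-based clustering: without it, a column measured in thousands would dominate one measured in tens.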
Mahout Algorithms

Algorithm                 Parameters                           Implicit assumption                    Output
K-Means                   K (number of clusters), convergence  Circles                                Point -> ClusterId
Fuzzy K-Means             K (number of clusters), convergence  Circles                                Point -> ClusterId*, Probability
Expectation Maximization  K (number of clusters), convergence  Gaussian distribution                  Point -> ClusterId*, Probability
Mean-Shift Clustering     Distance boundaries, convergence     Gradient-like distribution             Point -> ClusterId
Top Down Clustering       Two clustering algorithms            Hierarchy                              Point -> Large ClusterId, Small ClusterId
Dirichlet Process         Model distribution                   Points are a mixture of distributions  Point -> ClusterId, Probability
Spectral Clustering       -                                    -                                      Point -> ClusterId
MinHash Clustering        Number of hash / keys, hash type     High dimension                         Point -> Hash*
Comparing Clustering
KMeans
MeanShift
Dirichlet
Fuzzy KMeans
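Canopy Optimization
[Diagram: pick a random point; two distance thresholds, T1 and T2, drawn around it separate the points that are surely in the cluster from the points that are surely not in the cluster.]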

More Related Content

What's hot

Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013Dataiku
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
 
Dataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2Cdiscount
 
Back to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchBack to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchKlaas Bosteels
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku
 
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
 
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin BuzzwordsDataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin BuzzwordsDataiku
 
How to Build a Successful Data Team - Florian Douetteau @ PAPIs Connect
How to Build a Successful Data Team - Florian Douetteau @ PAPIs ConnectHow to Build a Successful Data Team - Florian Douetteau @ PAPIs Connect
How to Build a Successful Data Team - Florian Douetteau @ PAPIs ConnectPAPIs.io
 
A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...DataWorks Summit
 
Gartner peer forum sept 2011 orbitz
Gartner peer forum sept 2011   orbitzGartner peer forum sept 2011   orbitz
Gartner peer forum sept 2011 orbitzRaghu Kashyap
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelDataiku
 
PASS Summit Data Storytelling with R Power BI and AzureML
PASS Summit Data Storytelling with R Power BI and AzureMLPASS Summit Data Storytelling with R Power BI and AzureML
PASS Summit Data Storytelling with R Power BI and AzureMLJen Stirrup
 
Machine learning in real-time - the next frontier
Machine learning in real-time - the next frontierMachine learning in real-time - the next frontier
Machine learning in real-time - the next frontierSnowplow Analytics
 
What makes an effective data team?
What makes an effective data team?What makes an effective data team?
What makes an effective data team?Snowplow Analytics
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the CloudCaserta
 
Benchmarking Digital Readiness: Moving at the Speed of the Market
Benchmarking Digital Readiness: Moving at the Speed of the MarketBenchmarking Digital Readiness: Moving at the Speed of the Market
Benchmarking Digital Readiness: Moving at the Speed of the MarketApigee | Google Cloud
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseCaserta
 

What's hot (20)

Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 
Dataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine Learning
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2
 
Back to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchBack to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from Scratch
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013

BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes

  • 1. BIG DATA: How do elephants make babies? Florian Douetteau, CEO, Dataiku
  • 2. Agenda: Big Data & Hadoop Overview; Practical Big Data Coding (Pig / Hive / Cascading); PagesJaunes Big Data Use Case; Machine Learning for Big Data
  • 4. Collocation: a familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association. Examples: Big Apple, Big Mama, Big Data. (Dataiku, 1/8/14)
  • 5. “Big” Data in 1999: C code (struct Element { Key key; void* stat_data; } ….); optimized data structures; perfect hashing; HP-UNIX servers with 4 GB RAM; 100 GB of data; web crawler with socket reuse; HTTP 0.9. 1 Month
  • 6. Big Data in 2013: Hadoop; Java / Pig / Hive / Scala / Clojure / …; a dozen NoSQL data stores; MPP databases; real-time. Turnaround: 1 hour.
  • 7. Data Analytics: The Stakes [chart: data volume and business value per sector and year. Sectors: Web Search 1999, Logistics 2004, Banking CRM 2008, Web Search 2010, Social Gaming 2011, Online Advertising 2012, E-Commerce 2013. Volumes range from 1 TB to 1000 TB, values from 10M $ to 1B $.]
  • 8. Meet Hal Alowne (Dataiku - Data Tuesday, 1/8/14)
      ◦ Hal Alowne, BI Manager, Dim's Private Showroom, a European e-commerce web site: 100M$ revenue, 1 million customers, 1 data analyst (Hal himself)
      ◦ Dim Sum, CEO & Founder, Dim's Private Showroom: “Hey Hal! We need a big data platform like the big guys. Let's just do as they do!”
      ◦ Big Data Copy Cat Project
      ◦ The big guys: 10B$+ revenue, 100M+ customers, 100+ data scientists
  • 9. QUESTION #1 IS IT EASY OR NOT ?
  • 17. I SEND IT TO THE MARKETING CLOUD
  • 18. QUESTION #4 IS IT BIG OR NOT ?
  • 19. WE ALL LIVE IN A BIG DATA LAKE
  • 20. ALL MY DATA PROBABLY FITS IN HERE
  • 21. QUESTION #5 (at last) HUMAN OR NOT ?
  • 24. MERIT = TIME + ROI. TIME: 6 MONTHS. ROI: APPS. (Dataiku, 1/9/14)
      ◦ The usual path: find the right people (6 months?), choose the technology (6 months?), make it work (6 months?)
      ◦ 2013: build the lab (6 months): train people, reuse working patterns. Build a lab in 6 months (rather than 18 months).
      ◦ 2014: deploy apps that actually deliver value: targeted newsletter, recommender systems, adapted product / promotions
  • 25. Statistics and Machine Learning are complex! Try to understand myself
  • 26. (Some books you might want to read)
  • 27. CHOOSE TECHNOLOGY [map of the technology landscape]
      ◦ Regions: NoSQL-Slavia, Scalability Central, Machine Learning Mystery Land, Real-time Island, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House
      ◦ Tools shown: Hadoop, Elastic Search, Ceph, SOLR, Riak, Cassandra, MongoDB, Membase, Scikit-Learn, GraphLAB, prediction.io, jubatus, Mahout, WEKA, Sphere, Kafka, Flume, Spark, Storm, MLBase, RapidMiner, Vertica, Netezza, QlickView, Kibana, SpotFire, D3, Cascading, Tableau, SPSS, Panda, Pig, R, SAS, InfiniDB, Drill, GreenPlum, Impala, LibSVM, Talend
  • 28. Large E-Retailer
      ◦ The Business Intelligence stack has scalability and maintenance issues
      ◦ The back office implements business rules that are being challenged
      ◦ The existing infrastructure cannot cope with per-user information
      ◦ Main pain point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day
  • 29. Large E-Retailer: The Datalab
      ◦ Relieve the current DWH and accelerate production of some aggregates/KPIs
      ◦ Be the backbone for a new personalized user experience on the website: more recommendations, more profiling, etc.
      ◦ Train existing people in machine learning and segmentation
      ◦ Results: 1h12 to compute the aggregate, available every morning; new home page personalization deployed in a few weeks
      ◦ Setup: Hadoop cluster (24 cores), Google Compute Engine, Python + R + Vertica, 12 TB dataset, 6-week project
  • 30. Example (Social Gaming): Social Gaming Communities
      ◦ Correlation between community size and engagement / virality
      ◦ Meaningful patterns: 2 players / family / group
      ◦ What is the minimum number of friends to have in the application to get additional engagement?
      ◦ [graph: a very large community, some mid-size communities, lots of small clusters (mostly 2 players)]
  • 31. How do I (pre)process data? [data flow diagram]
      ◦ Inputs: Implicit User Data (Views, Searches…), Explicit User Data (Click, Buy, …), User Information (Location, Graph…), Content Data (Title, Categories, Price, …), A/B Test Data
      ◦ Intermediate results: Per User Stats, Per Content Stats, User Similarity, Content Similarity, Transformation Matrix
      ◦ Outputs: Transformation Predictor, Rank Predictor, Predictor Runtime, Online User Information
      ◦ Data sizes shown: 500TB, 50TB, 1TB, 200GB
  • 32. Always the same: pour data in; compute something smart about it; make it available
  • 33. The Questions
      ◦ Pour data in: How often? What kind of interaction? How much?
      ◦ Compute something smart about it: How complex? Do you need all data at once? How incremental?
      ◦ Make available: Interaction? Random access?
  • 34. At the Beginning was the elephant
  • 36. MapReduce: how to count words in many, many boxes (Dataiku - Innovation Services, 1/8/14)
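The counting idea on this slide can be sketched in plain Python. This is only an in-memory illustration of the map, shuffle and reduce phases, not Hadoop code; the function names and the sample "boxes" are made up for the example.

```python
from collections import defaultdict

def map_phase(box):
    """Mapper: emit a (word, 1) pair for every word found in one box of text."""
    return [(word.lower(), 1) for word in box.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reducer: sum the partial counts for a single word."""
    return word, sum(counts)

def word_count(boxes):
    """Run the three phases over a list of text boxes (hypothetical input splits)."""
    pairs = [pair for box in boxes for pair in map_phase(box)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())

boxes = ["big data big elephant", "elephant makes big babies"]
counts = word_count(boxes)
```

Each box could be processed by a different mapper, which is what makes the approach scale to "many, many boxes".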
  • 38. After Hadoop [diagram: massive batch MapReduce over HDFS, surrounded by what came after: random access, in-memory multicore machine learning, faster in-memory computation, real-time distributed computation, faster SQL analytics queries]
  • 39. MapReduce: simplicity is a complexity
  • 40. Agenda (Dataiku - Pig, Hive and Cascading)
      ◦ Hadoop and Context (-0:03)
      ◦ Pig, Hive, Cascading, … (-0:09)
      ◦ How they work (-0:15)
      ◦ Comparing the tools (-0:35)
      ◦ Make them work together (-0:40)
      ◦ Wrap-up and questions (-Beer)
  • 41. Pig History
      ◦ Yahoo Research in 2006
      ◦ Inspired by Sawzall, a Google paper from 2003
      ◦ An Apache project since 2007
      ◦ Initial motivation: search log analytics. How long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …
        words = LOAD '/training/hadoopwordcount/output' USING PigStorage('\t') AS (word:chararray, count:int);
        sorted_words = ORDER words BY count DESC;
        first_words = LIMIT sorted_words 10;
        DUMP first_words;
  • 42. Hive History
      ◦ Developed by Facebook in January 2007
      ◦ Open source in August 2008
      ◦ Initial motivation: provide a SQL-like abstraction to perform statistics on status updates
        create external table wordcounts (word string, count int)
          row format delimited fields terminated by '\t'
          location '/training/hadoop-wordcount/output';
        select * from wordcounts order by count desc limit 10;
        select SUM(count) from wordcounts where word like 'th%';
  • 43. Cascading History
      ◦ Authored by Chris Wensel, 2008
      ◦ Associated projects: Cascalog (Cascading in Clojure); Scalding (Cascading in Scala, Twitter in 2012); Lingual (to be released soon), a SQL layer on top of Cascading
  • 44. Agenda
      ◦ Hadoop and Context (-0:03)
      ◦ Pig, Hive, Cascading, … (-0:09)
      ◦ How they work (-0:15)
      ◦ Comparing the tools (-0:35)
      ◦ Make them work together (-0:40)
      ◦ Wrap-up and questions (-Beer)
  • 45. Pig / Hive: Mapping to MapReduce jobs
        events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
        events_filtered = FILTER events BY type;
        by_user = GROUP events_filtered BY user;
        price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
        high_pbu = FILTER price_by_user BY total_price > 1000;
      ◦ Job 1, Mapper: LOAD, FILTER
      ◦ Shuffle and sort by user
      ◦ Job 1, Reducer: GROUP, FOREACH, FILTER
      ◦ (* VAT excluded)
  • 46. Pig / Hive: Mapping to MapReduce jobs
        events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
        events_filtered = FILTER events BY type;
        by_user = GROUP events_filtered BY user;
        price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
        high_pbu = FILTER price_by_user BY total_price > 1000;
        recent_high = ORDER high_pbu BY max_ts DESC;
        STORE recent_high INTO '/output';
      ◦ Job 1, Mapper: LOAD, FILTER
      ◦ Shuffle and sort by user
      ◦ Job 1, Reducer: GROUP, FOREACH, FILTER
      ◦ Job 2, Mapper: LOAD (from tmp)
      ◦ Shuffle and sort by max_ts
      ◦ Job 2, Reducer: STORE
  • 47. Pig: how does it work? [diagram: a data execution plan compiled into 10 MapReduce jobs, executed in parallel (or not)]
  • 48. Hive Joins: how to join with MapReduce? [diagram: one table of (uid, name) rows with names Dupont and Durand, and one table of (uid, type) rows with values Type1 and Type2. The mappers output every row tagged with a tbl_idx that identifies its source table. Rows are shuffled by uid and sorted by (uid, tbl_idx), then Reducer 1 and Reducer 2 each emit the joined (uid, name, type) rows for their uids.]
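The reduce-side join described on this slide can be sketched as follows. This is a plain-Python simulation under stated assumptions, not what Hive actually executes: the sample rows, the `map_tagged` and `reduce_join` names, and the choice of tags 1 and 2 are illustrative.

```python
from itertools import groupby

users = [(1, "Dupont"), (2, "Durand")]                # table 1: (uid, name)
events = [(1, "Type1"), (1, "Type2"), (2, "Type1")]   # table 2: (uid, type)

def map_tagged(rows, tbl_idx):
    """Mapper: key each row by (uid, tbl_idx) so its source table survives the shuffle."""
    return [((uid, tbl_idx), value) for uid, value in rows]

def reduce_join(uid, tagged_values):
    """Reducer: values arrive sorted by tbl_idx, so names (1) are seen before types (2)."""
    names, joined = [], []
    for tbl_idx, value in tagged_values:
        if tbl_idx == 1:
            names.append(value)
        else:
            joined.extend((uid, name, value) for name in names)
    return joined

mapped = map_tagged(users, 1) + map_tagged(events, 2)
mapped.sort(key=lambda kv: kv[0])   # shuffle by uid, sort by (uid, tbl_idx)

result = []
for uid, group in groupby(mapped, key=lambda kv: kv[0][0]):
    result.extend(reduce_join(uid, [(key[1], value) for key, value in group]))
```

Sorting on the table index means the reducer only has to buffer the rows of the first table for each uid, while the second table streams through.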
  • 49. Agenda
      ◦ Hadoop and Context (-0:03)
      ◦ Pig, Hive, Cascading, … (-0:09)
      ◦ How they work (-0:15)
      ◦ Comparing the tools (-0:35)
      ◦ Make them work together (-0:40)
      ◦ Wrap-up and questions (-Beer)
  • 50. Comparing without Comparable
      ◦ Philosophy: Procedural vs Declarative; Data Model and Schema
      ◦ Productivity: Headachability; Checkpointing; Testing and environment
      ◦ Integration: Partitioning; Formats Integration; External Code Integration
      ◦ Performance and optimization
  • 51. Procedural vs Declarative
      ◦ Transformation as a sequence of operations
        Users = load 'users' as (name, age, ipaddr);
        Clicks = load 'clicks' as (user, url, value);
        ValuableClicks = filter Clicks by value > 0;
        UserClicks = join Users by name, ValuableClicks by user;
        Geoinfo = load 'geoinfo' as (ipaddr, dma);
        UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
        ByDMA = group UserGeo by dma;
        ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
        store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
      ◦ Transformation as a set of formulas
        insert into ValuableClicksPerDMA
        select dma, count(*)
        from geoinfo join (
          select name, ipaddr
          from users join clicks on (users.name = clicks.user)
          where value > 0
        ) using (ipaddr)
        group by dma;
  • 52. Data Type and Model: Rationale
      ◦ All three extend the basic data model with extended data types: array-like [event1, event2, event3]; map-like {type1:value1, type2:value2, …}
      ◦ Different approaches: resilient schema; static typing; no static typing
  • 53. Hive: Data Type and Schema (Dataiku Training – Hadoop for Data Science)
        CREATE TABLE visit (
          user_name STRING,
          user_id INT,
          user_details STRUCT<age:INT, zipcode:INT>
        );
      Simple types:
      ◦ TINYINT, SMALLINT, INT, BIGINT: 1, 2, 4 and 8 bytes
      ◦ FLOAT, DOUBLE: 4 and 8 bytes
      ◦ BOOLEAN
      ◦ STRING: arbitrary length, replaces VARCHAR
      ◦ TIMESTAMP
      Complex types:
      ◦ ARRAY: array of typed items (0-indexed)
      ◦ MAP: associative map
      ◦ STRUCT: complex class-like objects
  • 54. Pig: Data Types and Schema
        rel = LOAD '/folder/path/' USING PigStorage('\t') AS (col:type, col:type, col:type);
      Simple types:
      ◦ int, long, float, double: 32 and 64 bits, signed
      ◦ chararray: a string
      ◦ bytearray: an array of … bytes
      ◦ boolean: a boolean
      Complex types:
      ◦ tuple: an ordered fieldname:value map
      ◦ bag: a set of tuples
  • 55. Cascading: Data Type and Schema
      ◦ Support for any Java type, provided it can be serialized in Hadoop
      ◦ No support for typing
      Simple types:
      ◦ Int, Long, Float, Double: 32 and 64 bits, signed
      ◦ String: a string
      ◦ byte[]: an array of … bytes
      ◦ Boolean: a boolean
      Complex type:
      ◦ Object: must be « Hadoop serializable »
  • 56. Style Summary (style; typing; data model; metadata store)
      ◦ Pig: procedural; static + dynamic; scalar + tuple + bag (fully recursive); no (HCatalog)
      ◦ Hive: declarative; static + dynamic, enforced at execution time; scalar + list + map; integrated
      ◦ Cascading: procedural; weak; scalar + Java objects; no
  • 57. Comparing without Comparable
      ◦ Philosophy: Procedural vs Declarative; Data Model and Schema
      ◦ Productivity: Headachability; Checkpointing; Testing, error management and environment
      ◦ Integration: Partitioning; Formats Integration; External Code Integration
      ◦ Performance and optimization
  • 58. Headachability: Motivation. Does debugging the tool lead to bad headaches?
  • 59. Headaches: Pig
      ◦ Out of memory errors (reducer)
      ◦ Exceptions in built-in / extended functions (handling of null)
      ◦ Null vs “”
      ◦ Nested FOREACH and scoping
      ◦ Date management (Pig 0.10)
      ◦ Implicit field ordering
  • 60. A Pig Error
  • 61. Headaches: Hive
      ◦ Out of memory errors in reducers
      ◦ Few debugging options
      ◦ Null / “”
      ◦ No built-in “first”
  • 62. Headaches: Cascading
      ◦ Weak typing errors (comparing Int and String …)
      ◦ Illegal operation sequence (group after group …)
      ◦ Implicit field ordering
  • 63. Testing: Motivation
      ◦ How to perform unit tests?
      ◦ How to have different versions of the same script (parameters)?
  • 64. Testing: Pig
      ◦ System variables
      ◦ Comment to test
      ◦ No meta programming
      ◦ pig -x local to execute on local files
  • 65. Testing / Environment: Cascading
      ◦ JUnit tests are possible
      ◦ Ability to use code to actually comment out some variables
  • 66. Checkpointing: Motivation
      ◦ Lots of iteration while developing on Hadoop
      ◦ Sometimes jobs fail
      ◦ Sometimes you need to restart from the start …
      ◦ [diagram: Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output; fix and relaunch]
  • 67. Pig: Manual Checkpointing
      ◦ STORE command to manually store files
      ◦ [diagram: Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output; comment out the beginning of the script and relaunch]
  • 68. Cascading: Automated Checkpointing. Ability to re-run a flow automatically from the last saved checkpoint: addCheckpoint(…)
  • 69. Cascading: Topological Scheduler
      ◦ Check the timestamp of each intermediate file
      ◦ Execute only if more recent
      ◦ [diagram: Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output]
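The timestamp rule above resembles what `make` does, and can be sketched in a few lines of Python. This is not Cascading's implementation or API: `is_stale`, `run_flow` and the `(inputs, output, action)` step format are hypothetical names chosen for the illustration.

```python
import os

def is_stale(inputs, output):
    """A step must run if its output is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    out_time = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_time for path in inputs)

def run_flow(steps):
    """Run steps in dependency order, skipping those whose output is up to date.

    `steps` is a list of (inputs, output, action) tuples, assumed to be already
    topologically sorted. Returns the outputs that were actually rebuilt.
    """
    rebuilt = []
    for inputs, output, action in steps:
        if is_stale(inputs, output):
            action(inputs, output)
            rebuilt.append(output)
    return rebuilt
```

With this rule, fixing a late step such as "Filtering" and relaunching the flow would skip the earlier steps whose intermediate files are still newer than their inputs.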
  • 70. Productivity Summary (headaches; checkpointing / replay; testing / metaprogramming)
      ◦ Pig: lots; manual save; difficult meta programming, easy local testing
      ◦ Hive: few, but without debugging options; none (that's SQL); none (that's SQL)
      ◦ Cascading: weak typing complexity; checkpointing, partial updates; possible
  • 71. Comparing without Comparable
      ◦ Philosophy: Procedural vs Declarative; Data Model and Schema
      ◦ Productivity: Headachability; Checkpointing; Testing and environment
      ◦ Integration: Formats Integration; Partitioning; External Code Integration
      ◦ Performance and optimization
  • 72. Formats Integration: Motivation
      ◦ Ability to integrate different file formats: text delimited; Sequence File (binary Hadoop format); Avro, Thrift, ...
      ◦ Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)
      Format impact on size and performance (size on disk; Hive processing time on 24 cores):
      ◦ Text file, uncompressed: 18.7 GB; 1m32s
      ◦ 1 text file, gzipped: 3.89 GB; 6m23s (no parallelization)
      ◦ JSON, compressed: 7.89 GB; 2m42s
      ◦ Multiple text files, gzipped: 4.02 GB; 43s
      ◦ Sequence File, block, gzip: 5.32 GB; 1m18s
      ◦ Text file, LZO indexed: 7.03 GB; 1m22s
  • 73. Format Integration
      ◦ Hive: SerDe (Serializer-Deserializer)
      ◦ Pig: Storage
      ◦ Cascading: Tap
  • 74. Partitions: Motivation
      ◦ No support for “UPDATE” patterns: any increment is performed by adding or deleting a partition
      ◦ Common partition schemes on Hadoop: by date (/apache_logs/dt=2013-01-23); by data center (/apache_logs/dc=redbus01/…); by country; …; or any combination of the above
  • 75. Hive Partitioning
      Partitioned tables:
        CREATE TABLE event (
          user_id INT,
          type STRING,
          message STRING)
        PARTITIONED BY (day STRING, server_id STRING);
      Disk structure:
        /hive/event/day=2013-01-27/server_id=s1/file0
        /hive/event/day=2013-01-27/server_id=s1/file1
        /hive/event/day=2013-01-27/server_id=s2/file0
        /hive/event/day=2013-01-27/server_id=s2/file1
        …
        /hive/event/day=2013-01-28/server_id=s2/file0
        /hive/event/day=2013-01-28/server_id=s2/file1
      Loading one partition:
        INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
        SELECT * FROM event_tmp;
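Because partition values are encoded in directory names, a query restricted to one day or one server only needs to read the matching directories. The following Python sketch illustrates that pruning idea on the paths from the slide; `parse_partition` and `prune` are hypothetical helpers, not part of Hive.

```python
def parse_partition(path):
    """Extract the key=value partition components from a file path."""
    return dict(part.split("=", 1) for part in path.split("/") if "=" in part)

def prune(paths, **wanted):
    """Keep only the files whose partition values match every requested key."""
    return [path for path in paths
            if all(parse_partition(path).get(key) == value
                   for key, value in wanted.items())]

paths = [
    "/hive/event/day=2013-01-27/server_id=s1/file0",
    "/hive/event/day=2013-01-27/server_id=s2/file0",
    "/hive/event/day=2013-01-28/server_id=s2/file0",
]
selected = prune(paths, day="2013-01-27", server_id="s2")
```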
  • 76. Cascading: Partition
      ◦ No direct support for partitions
      ◦ Support for “Glob” Tap, to read from files using patterns
      ◦ You can code your own custom or virtual partition schemes
  • 77. External Code Integration: Simple UDF in Pig, Hive and Cascading
  • 78. Hive: Complex UDF (Aggregators)
  • 79. Cascading: Direct Code Evaluation. Uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO
  • 80. Spring Batch Cascading Integration
      ◦ Allows calling a Cascading flow from a Spring Batch job
      ◦ No full integration with Spring MessageSource or MessageHandler yet (only for local flows)
  • 81. Integration Summary (partition / incremental updates; external code; format integration)
      ◦ Pig: no direct support; simple; doable, with a rich community
      ◦ Hive: fully integrated, SQL-like; very simple, but complex dev setup; doable, with an existing community
      ◦ Cascading: with coding; complex UDFs but regular, and Java expressions are embeddable; doable, with a growing community
  • 82. Comparing without Comparable
      ◦ Philosophy: Procedural vs Declarative; Data Model and Schema
      ◦ Productivity: Headachability; Checkpointing; Testing and environment
      ◦ Integration: Formats Integration; Partitioning; External Code Integration
      ◦ Performance and optimization
  • 83. Optimization
      ◦ Several common MapReduce optimization patterns: combiners; MapJoin; job fusion; job parallelism; reducer parallelism
      ◦ Different support per framework: fully automatic; pragma / directives / options; coding style / code to write
  • 84. Combiner: perform partial aggregation at the mapper stage
        SELECT date, COUNT(*) FROM product GROUP BY date
      ◦ [diagram, without combiner: each mapper forwards every (date, product) row, such as (2012-02-14, 4354), (2012-02-15, 21we2), (2012-02-14, qa334), (2012-02-15, 23aq2), to the reducer, which outputs 2012-02-14: 20, 2012-02-15: 35, 2012-02-16: 1]
  • 85. Combiner: perform partial aggregation at the mapper stage
        SELECT date, COUNT(*) FROM product GROUP BY date
      ◦ [diagram, with combiner: one mapper emits the partial counts 2012-02-14: 8 and 2012-02-15: 12, another emits 2012-02-14: 12, 2012-02-15: 23 and 2012-02-16: 1; the reducer sums them into 2012-02-14: 20, 2012-02-15: 35, 2012-02-16: 1]
      ◦ Reduced network bandwidth. Better parallelism.
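The effect of a combiner can be illustrated with a small Python simulation. This is a sketch under stated assumptions rather than Hadoop code: the two input splits, the product id "x1" and the function names are invented for the example.

```python
from collections import Counter

def map_with_combiner(rows):
    """Mapper + combiner: count rows per date locally, emitting one partial count per date."""
    return Counter(date for date, _product in rows)

def reduce_counts(partials):
    """Reducer: add up the partial counts received from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

split_1 = [("2012-02-14", "4354"), ("2012-02-15", "21we2"), ("2012-02-14", "qa334")]
split_2 = [("2012-02-15", "23aq2"), ("2012-02-16", "x1")]

partials = [map_with_combiner(split_1), map_with_combiner(split_2)]
totals = reduce_counts(partials)

# Pairs sent over the network with the combiner, versus raw rows without it.
shuffled_pairs = sum(len(partial) for partial in partials)
raw_rows = len(split_1) + len(split_2)
```

The saving grows with the number of rows per date in each split, which is why COUNT and SUM style aggregates benefit the most.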
  • 86. Join Optimization: Map Join
      ◦ Hive: set hive.auto.convert.join = true;
      ◦ Pig
      ◦ Cascading (no aggregation support after HashJoin)
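The principle behind a map join is that the small table is loaded into memory on every mapper, so the large table can be joined as it streams by, with no shuffle or reduce phase. A minimal Python sketch, with made-up sample data and a hypothetical `map_side_join` helper:

```python
def map_side_join(big_rows, small_rows):
    """Join inside the mapper: hold the small table in a hash map and
    stream the big table through it, without any shuffle."""
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    for key, value in big_rows:
        for match in lookup.get(key, []):
            yield key, value, match

geoinfo = [("10.0.0.1", "Paris"), ("10.0.0.2", "Rennes")]                       # small table
clicks = [("10.0.0.2", "/home"), ("10.0.0.1", "/cart"), ("10.0.0.9", "/home")]  # big table

joined = list(map_side_join(clicks, geoinfo))
```

This only works when the small side fits in the memory of each mapper.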
  • 87. Number of Reducers
      ◦ Critical for performance
      ◦ Estimated from the size of the input files
      ◦ Hive: divide the size by hive.exec.reducers.bytes.per.reducer (default 1GB)
      ◦ Pig: divide the size by pig.exec.reducers.bytes.per.reducer (default 1GB)
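The estimate described above is a simple division, sketched here in Python. The 1 GB default comes from the slide; the rounding up, the minimum of one reducer and the `max_reducers` cap of 999 are assumptions made for this illustration, and `estimate_reducers` is a hypothetical name.

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=10**9, max_reducers=999):
    """Estimate the reducer count as input size / bytes per reducer, rounded up,
    with at least one reducer and an assumed upper cap."""
    estimate = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(estimate, max_reducers))

small_job = estimate_reducers(5 * 10**9)      # about 5 GB of input
large_job = estimate_reducers(12 * 10**12)    # about 12 TB of input
```

When the estimate is a poor fit, for example when a filter discards most of the input before the reduce phase, the reducer count is usually set by hand instead.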
  • 88. Performance & Optimization Summary (combiner optimization; join optimization; number of reducers optimization)
      ◦ Pig: automatic; option; estimate or DIY
      ◦ Cascading: DIY; HashJoin; DIY
      ◦ Hive: partial DIY; automatic (Map Join); estimate or DIY
  • 89. BIG DATA AND MACHINE LEARNING USE CASE: Search quality. ERWAN PIGNEUL, TEAM LEADER, PROJECT MANAGER
  • 90. PAGESJAUNES CONTEXT
      ◦ Core business: local search for professionals
      ◦ PagesJaunes uses a specific interpretation engine that requires manual indexing
      ◦ This handles the most frequent queries well, but it does not cover the long tail
  • 91. HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYSING USER BEHAVIOUR? [funnel diagram with the labels: 200M searches; 20 M; 1.4M; 10 occurrences; queries; 0.5M prioritized queries; analysis; corrections; automation]
  • 93. TECHNICAL LESSONS
      ◦ HADOOP / PIG / HIVE: effective; challenges some test/prod assumptions (problems appear on large volumes); beware, it is still young (compatibility, …)
      ◦ DATAIKU STUDIO: accelerates big data development; schedules the processing, integrating all our jobs and managing dependencies; easy machine learning
      ◦ ELASTICSEARCH: indexed volume and search speed
  • 94. EFFECTIVENESS OF THE APPROACH [chart: evolution of the fragility of the query ‘Parc enfant’ compared with the overall average, on a scale from Fragile to Not fragile]
  • 96. Goal for Today • Quick Introduction To Clustering • How does it work in Practice • How does it work in Mahout • Overview of Mahout Algorithms
  • 99. clustering applications • Fraud: Detect Outliers • CRM : Mine for customer segments • Image Processing : Similar Images • Search : Similar documents • Search : Allocate Topics
  • 100. K-Means
      ◦ Guess an initial placement for the centroids
      ◦ Assign each point to the closest center (MAP)
      ◦ Reposition each center (REDUCE)
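The three steps above can be written as a short Python sketch in which the assignment plays the role of the map phase and the repositioning plays the role of the reduce phase. This is an in-memory illustration with invented sample points, not Mahout's implementation; all function names are hypothetical.

```python
def closest(point, centroids):
    """Index of the centroid nearest to `point` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans_step(points, centroids):
    """One iteration: map = assign each point to a centroid, reduce = average each group."""
    groups = {}
    for point in points:                        # map phase
        groups.setdefault(closest(point, centroids), []).append(point)
    new_centroids = list(centroids)
    for idx, members in groups.items():         # reduce phase
        new_centroids[idx] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids

def kmeans(points, centroids, iterations=10):
    """Repeat the step until the centroids stop moving or the iteration budget runs out."""
    for _ in range(iterations):
        updated = kmeans_step(points, centroids)
        if updated == centroids:                # converged
            break
        centroids = updated
    return centroids

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

On Hadoop each iteration is a full MapReduce job, so every pass reads its input from disk and writes the new centroids back.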
  • 110. clustering challenges • Curse of Dimensionality • Choice of distance / number of parameters • Performance • Choice # of clusters
  • 111. Mahout Clustering Challenges • No Integrated Feature Engineering Stack: Get ready to write data processing in Java • Hadoop SequenceFile required as an input • Iterations as Map/Reduce read and write to disks: Relatively slow compared to in-memory processing
  • 112. Data Processing [diagram: Image, Voice, Log / DB → Data Processing → Vectorized Data]
  • 113. Mahout K-Means on Text: Workflow. Text files → mahout seqdirectory → Mahout sequence files → mahout seq2sparse → TF-IDF vectors → mahout kmeans → clusters
  • 114. Mahout K-Means on Database Extract: Workflow. Database dump (CSV) → org.apache.mahout.clustering.conversion.InputDriver → Mahout vectors → mahout kmeans → clusters
  • 115. Convert a CSV File to Mahout Vector. Real code would include:
      ◦ Converting categorical variables to dimensions
      ◦ Variable rescaling
      ◦ Dropping IDs (name, forename …)
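The three preparation steps listed above can be sketched in Python. This illustrates the idea only and does not produce Mahout vectors: the `csv_to_vectors` function, the one-hot encoding, the min-max rescaling and the sample data are all assumptions made for the example. ID columns are dropped simply by not listing them.

```python
import csv
import io

def csv_to_vectors(text, categorical, numeric):
    """Turn CSV rows into numeric vectors: one-hot encode the categorical columns,
    min-max rescale the numeric columns to [0, 1], and ignore every other column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    levels = {col: sorted({row[col] for row in rows}) for col in categorical}
    ranges = {}
    for col in numeric:
        values = [float(row[col]) for row in rows]
        ranges[col] = (min(values), max(values))
    vectors = []
    for row in rows:
        vector = []
        for col in categorical:     # one dimension per category level
            vector.extend(1.0 if row[col] == level else 0.0 for level in levels[col])
        for col in numeric:         # rescale to [0, 1]
            low, high = ranges[col]
            vector.append((float(row[col]) - low) / (high - low) if high > low else 0.0)
        vectors.append(vector)
    return vectors

data = "name,country,age\nAlice,FR,20\nBob,UK,40\nCarol,FR,30\n"
vectors = csv_to_vectors(data, categorical=["country"], numeric=["age"])
```

Rescaling matters for K-Means because a variable with a large numeric range would otherwise dominate the distance computation.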
  • 116. Mahout Algorithms (parameters; implicit assumption; output)
      ◦ K-Means: K (number of clusters), convergence; circles; Point - ClusterId
      ◦ Fuzzy K-Means: K (number of clusters), convergence; circles; Point - ClusterId*, Probability
      ◦ Expectation Maximization: K (number of clusters), convergence; Gaussian distribution; Point - ClusterId*, Probability
      ◦ Mean-Shift Clustering: distance boundaries, convergence; gradient-like distribution; Point - ClusterId
      ◦ Top Down Clustering: two clustering algorithms; hierarchy; Point - Large ClusterId, Small ClusterId
      ◦ Dirichlet Process: model distribution; points are a mixture of distributions; Point - ClusterId, Probability
      ◦ Spectral Clustering: -; -; Point - ClusterId
      ◦ MinHash Clustering: number of hashes / keys, hash type; high dimension; Point - Hash*
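  • 118. Canopy Optimization [diagram: pick a random point and draw two circles around it; points inside the inner radius T2 are surely in the cluster, points outside the outer radius T1 are surely not in the cluster]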
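A common formulation of canopy clustering uses a loose threshold T1 and a tight threshold T2, which matches the two circles on the slide. The Python sketch below follows that formulation under stated assumptions: it takes the first remaining point instead of a random one so that the result is deterministic, and the `canopy` function and sample points are invented for the illustration.

```python
import math

def canopy(points, t1, t2):
    """Canopy clustering sketch with a loose threshold t1 and a tight threshold t2 (t1 > t2)."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)   # the slide picks a random point; first point used here
        members = [center]
        still_remaining = []
        for point in remaining:
            distance = math.dist(center, point)
            if distance < t1:
                members.append(point)           # within T1: belongs to this canopy
            if distance >= t2:
                still_remaining.append(point)   # outside T2: may still seed or join another canopy
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
canopies = canopy(points, t1=4.0, t2=2.0)
```

The canopy centers are cheap to compute in a single pass and can then be used as the initial centroids for K-Means, or to avoid comparing points that never share a canopy.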