BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes

BIG DATA
How do elephants
make babies

Florian Douetteau
CEO, Dataiku

Agenda
• Big Data & Hadoop Overview
• Practical Big Data Coding: Pig / Hive / Cascading
• PagesJaunes Big Data Use Case
• Machine Learning For Big Data

Collocation

Dataiku


A familiar grouping of words,
especially words that habitually
appear together and thereby
convey meaning by association.

Big Apple

Big Mama

Big Data
1/8/14

“Big” Data in 1999
struct Element {
    Key key;
    void* stat_data;
};
….

C
Optimized data structures
Perfect hashing
HP-UX servers – 4 GB RAM
100 GB data
Web crawler – socket reuse
HTTP 0.9

1 Month

Big Data in 2013







Hadoop
Java / Pig / Hive / Scala / Clojure / …
A dozen NoSQL data stores
MPP Databases
Real-Time

1 Hour

Data Analytics: The Stakes
1 TB
1B $

1 TB
?$
1 TB
100M $

Web Search
1999
Logistics
2004
Dataiku

10 TB
10M $
100 TB
?$

Banking
CRM
2008

50TB
1B$
1000TB
500M $
E-Commerce
2013

Social Gaming
2011
Web
Search
2010

Online
Advertising
2012

Meet Hal Alowne

Hal Alowne
BI Manager
Dim's Private Showroom
European E-commerce Web site
• 100M$ Revenue
• 1 Million customers
• 1 Data Analyst (Hal himself)

Dataiku - Data Tuesday

Dim Sum
CEO & Founder
Dim's Private Showroom

"Hey Hal! We need
a big data platform
like the big guys.
Let's just do as they do!"

Big Data
Copy Cat Project

Big Guys
• 10B$+ Revenue
• 100M+ customers
• 100+ Data Scientists

QUESTION #1
IS IT EASY OR NOT
?

I SEND IT
TO THE
MARKETING
CLOUD

WE ALL LIVE
IN A BIG DATA
LAKE

ALL MY DATA
PROBABLY FITS
IN HERE

QUESTION #5 (at last)
HUMAN OR NOT ?

MACHINE
LEARNING
WILL SAVE
US ALL

MERIT = TIME + ROI
TIME : 6 MONTHS

ROI : APPS
2014

2013

Find the right
people
(6 months?)

Choose the
technology
(6 months?)

Make it work
(6 months?)

2013

Build the lab
(6 months)
• Train People
• Reuse working patterns

Build a lab in 6 months
(rather than 18 months)

Dataiku

Targeted
Newsletter
Recommender
Systems

Adapted Product
/ Promotions
Deploy apps
that actually deliver value
1/9/14

Statistics and Machine Learning are complex!
Try to
understand
myself

(Some books you might want to read)

CHOOSE TECHNOLOGY
NoSQL-Slavia

Hadoop

Elastic Search

Ceph

SOLR

Riak

Machine Learning
Mystery Land

Scalability Central

Cassandra

MongoDB
Membase

Scikit-Learn
GraphLAB
prediction.io jubatus
Mahout
WEKA

Sphere

Kafka Flume
Real-time island
Spark Storm

SQL Columnar Republic

MLBase

RapidMiner

Vertica

Netezza

QlikView
Kibana
SpotFire D3

Cascading

Tableau

Dataiku - Pig, Hive and Cascading

SPSS

Pandas

Pig

Visualization County

R

SAS

InfiniDB Drill
GreenPlum
Impala

LibSVM

Talend

Data Clean Wasteland

Statistician Old
House

Large E-Retailer
• Business Intelligence stack has scalability and maintenance issues
• Backoffice implements business rules that are challenged
• Existing infrastructure cannot cope with per-user information

Main Pain Point:
23 hours 52 minutes to compute Business Intelligence aggregates for one day.

Large E-Retailer: The Datalab
• Relieve their current DWH and accelerate production of some aggregates/KPIs
• Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc.
• Train existing people around machine learning and segmentation experience

1h12 to perform the aggregate, available every morning

New home page personalization deployed in a few weeks

Hadoop cluster (24 cores)
Google Compute Engine
Python + R + Vertica
12 TB dataset
6-week project

Example (Social Gaming)
Social Gaming Communities

Correlation
◦ between community size and engagement / virality

Meaningful patterns
◦ 2 players / Family / Group

What is the minimum number of friends to have in the application to get additional engagement?

A very large community
Some mid-size communities
Lots of small clusters (mostly 2 players)

How do I (pre)process data?

Inputs
• Implicit User Data (Views, Searches…): 500TB
• Explicit User Data (Click, Buy, …): 50TB
• User Information (Location, Graph…): 1TB
• Content Data (Title, Categories, Price, …): 200GB
• A/B Test Data
• Online User Information

Intermediate results
• Transformation Matrix
• Per User Stats
• Per Content Stats
• User Similarity
• Content Similarity

Outputs
• Transformation Predictor
• Rank Predictor
• Predictor Runtime


Always the same
Pour Data In

Compute Something
Smart About It

Make Available

The Questions
Pour Data In

How often ?
What kind of
interaction?
How much ?

Compute Something
Smart About It

How complex ?
Do you need all
data at once ?
How incremental
?

Make Available

Interaction ?
Random Access ?

At the Beginning was the
elephant

MapReduce
How to count words in many, many boxes
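A minimal sketch of the idea in plain Python, with no Hadoop involved. Each "box" stands for one input split; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample text are illustrative assumptions, not part of any framework API.

```python
from collections import defaultdict

def map_phase(box):
    """Mapper: emit a (word, 1) pair for every word found in one box."""
    for line in box:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group the emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reducer: sum the partial counts for one word."""
    return word, sum(counts)

def word_count(boxes):
    pairs = [pair for box in boxes for pair in map_phase(box)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())

# Hypothetical sample input: two boxes of text lines.
boxes = [["big data big apple"], ["big mama", "data data"]]
counts = word_count(boxes)
print(counts)
```

Each mapper only sees its own box, and each reducer only sees one word, which is what lets the real framework spread the work over many machines.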


After Hadoop
Random Access
In Memory
MultiCore
Machine Learning

Faster in Memory
Computation

Massive Batch
Map Reduce Over HDFS

Real-Time
Distributed
Computation
Faster SQL Analytics
Queries

MapReduce
Simplicity is a complexity


Agenda










Hadoop and Context (-0:03)
Pig, Hive, Cascading, … (-0:09)
How they work (-0:15)
Comparing the tools (-0:35)
Make them work together (-0:40)
Wrap-up and questions (-Beer)

Pig History



Yahoo Research in 2006
Inspired by Sawzall, described in a Google paper from 2005
2007 as an Apache Project

Initial motivation
◦ Search log analytics: how long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …

words = LOAD '/training/hadoopwordcount/output' USING PigStorage('\t')
        AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;

Hive History


Developed by Facebook in January 2007



Open source in August 2008



Initial Motivation

◦ Provide a SQL like abstraction to perform statistics on
status updates

create external table wordcounts (
    word string,
    count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';
select * from wordcounts order by count desc limit 10;
select SUM(count) from wordcounts where word like 'th%';

Cascading History


Authored by Chris Wensel in 2008

Associated Projects
◦ Cascalog: Cascading in Clojure
◦ Scalding: Cascading in Scala (Twitter in 2012)
◦ Lingual (to be released soon): SQL layer on top of Cascading


Pig Hive

Mapping to MapReduce jobs

events          = LOAD '/events' USING PigStorage('\t') AS
                  (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user         = GROUP events_filtered BY user;
price_by_user   = FOREACH by_user GENERATE type, SUM(price) AS total_price,
                  MAX(timestamp) AS max_ts;
high_pbu        = FILTER price_by_user BY total_price > 1000;

Job 1: Mapper
LOAD
FILTER

Shuffle and sort by user

Job 1: Reducer
GROUP
FOREACH
FILTER

* VAT excluded

Pig Hive

Mapping to MapReduce jobs

events          = LOAD '/events' USING PigStorage('\t') AS
                  (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user         = GROUP events_filtered BY user;
price_by_user   = FOREACH by_user GENERATE type, SUM(price) AS total_price,
                  MAX(timestamp) AS max_ts;
high_pbu        = FILTER price_by_user BY total_price > 1000;
recent_high     = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';

Job 1: Mapper
LOAD
FILTER

Shuffle and sort by user

Job 1: Reducer
GROUP
FOREACH
FILTER

Job 2: Mapper
LOAD (from tmp)

Shuffle and sort by max_ts

Job 2: Reducer
STORE

Pig
How does it work
Data Execution Plan compiled into 10
map reduce jobs executed in parallel
(or not)


Hive Joins

How to join with MapReduce?

[Figure: the mappers read two tables, (uid, name) and (uid, type), and tag each row with a table index (tbl_idx). The mappers' output is shuffled by uid and sorted by (uid, tbl_idx), so that Reducer 1 and Reducer 2 each receive all rows for their uids. Sample values: Dupont, Durand; Type1, Type2.]
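The join described on this slide can be sketched in plain Python as follows. This is an illustration of the reduce-side join pattern, not Hive's actual implementation; the sample rows (which uid belongs to Dupont or Durand) and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample tables.
users = [(1, "Dupont"), (2, "Durand")]                # (uid, name)  -> tbl_idx 1
events = [(1, "Type1"), (2, "Type1"), (2, "Type2")]   # (uid, type)  -> tbl_idx 2

def map_join(users, events):
    """Mappers: tag every row with the index of the table it comes from."""
    for uid, name in users:
        yield (uid, 1), name
    for uid, typ in events:
        yield (uid, 2), typ

def shuffle_and_sort(pairs):
    """Shuffle by uid, then sort each group by (tbl_idx, value)."""
    partitions = defaultdict(list)
    for (uid, tbl_idx), value in pairs:
        partitions[uid].append((tbl_idx, value))
    for uid in sorted(partitions):
        yield uid, sorted(partitions[uid])

def reduce_join(uid, tagged_values):
    """Reducer: rows of table 1 arrive first, so names can be buffered
    and combined with every row of table 2 that follows."""
    names = []
    for tbl_idx, value in tagged_values:
        if tbl_idx == 1:
            names.append(value)
        else:
            for name in names:
                yield uid, name, value

joined = [row
          for uid, values in shuffle_and_sort(map_join(users, events))
          for row in reduce_join(uid, values)]
print(joined)
```

Sorting on (uid, tbl_idx) is what lets the reducer keep only the smaller side in memory while streaming the other.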

Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment



Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration



Performance and optimization


Procedural Vs Declarative


Transformation as a
sequence of operations

Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';



Transformation as a set of
formulas

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;


Data type and Model
Rationale


All three Extend basic data model with extended
data types
◦ array-like [ event1, event2, event3]
◦ map-like { type1:value1, type2:value2, …}



Different approach
◦ Resilient Schema
◦ Static Typing
◦ No Static Typing


Hive
Data Type and Schema
CREATE TABLE visit (
    user_name    STRING,
    user_id      INT,
    user_details STRUCT<age:INT, zipcode:INT>
);

Simple type | Details
TINYINT, SMALLINT, INT, BIGINT | 1, 2, 4 and 8 bytes
FLOAT, DOUBLE | 4 and 8 bytes
BOOLEAN |
STRING | Arbitrary-length, replaces VARCHAR
TIMESTAMP |

Complex type | Details
ARRAY | Array of typed items (0-indexed)
MAP | Associative map
STRUCT | Complex class-like objects

Data types and Schema
Pig
rel = LOAD '/folder/path/'
      USING PigStorage('\t')
      AS (col:type, col:type, col:type);

Simple type | Details
int, long, float, double | 32 and 64 bits, signed
chararray | A string
bytearray | An array of … bytes
boolean | A boolean

Complex type | Details
tuple | a tuple is an ordered fieldname:value map
bag | a bag is a set of tuples

Data Type and Schema
Cascading

Support for any Java types, provided they can be serialized in Hadoop
No support for typing

Simple type | Details
Int, Long, Float, Double | 32 and 64 bits, signed
String | A string
byte[] | An array of … bytes
Boolean | A boolean

Complex type | Details
Object | Object must be « Hadoop serializable »

Style Summary

Tool | Style | Typing | Data Model | Metadata store
Pig | Procedural | Static + Dynamic | scalar + tuple + bag (fully recursive) | No (HCatalog)
Hive | Declarative | Static + Dynamic, enforced at execution time | scalar + list + map | Integrated
Cascading | Procedural | Weak | scalar + java objects | No




Philosophy



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing, error management and environment



Integration
◦ Partitioning





Headachability
Motivation


Does debugging the
tool lead to bad
headaches ?


Headaches
Pig

• Out Of Memory Error (Reducer)
• Exceptions in Builtin / Extended Functions (handling of null)
• Null vs ""
• Nested Foreach and scoping
• Date Management (pig 0.10)
• Field implicit ordering


A Pig Error


Headaches
Hive


Out of Memory Errors in
Reducers



Few Debugging Options



Null / “”



No builtin “first”


Headaches
Cascading


Weak Typing Errors (comparing
Int and String … )



Illegal Operation Sequence
(Group after group …)



Field Implicit Ordering


Testing
Motivation



How to perform unit tests ?
How to have different versions of the same script
(parameter) ?


Testing
Pig





System Variables
Comment to test
No Meta Programming
pig -x local to execute on local files


Testing / Environment
Cascading



JUnit tests are possible
Ability to use code to actually comment out some
variables


Checkpointing
Motivation





Lots of iterations while developing on Hadoop
Sometimes jobs fail
Sometimes you need to restart from the start …

Parse Logs

Per Page Stats

Page User Correlation

FIX and
relaunch

Filtering

Output

Pig
Manual Checkpointing


STORE Command to manually
store files

Parse Logs

Per Page Stats


// COMMENT Beginning
of script and relaunch

Filtering

Output

Cascading
Automated Checkpointing


Ability to re-run a
flow automatically
from the last saved
checkpoint

addCheckpoint(…
)


Cascading
Topological Scheduler




Check each file intermediate timestamp
Execute only if more recent

Parse Logs

Per Page Stats


Filtering


Output
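The timestamp rule above ("execute only if more recent") can be sketched in a few lines of Python. This is a make-style illustration working on local files, not Cascading's scheduler; `is_stale`, `run_pipeline` and the step tuple layout are hypothetical names chosen for the example.

```python
import os

def is_stale(output_path, input_paths):
    """A step must run if its output is missing or older than any input."""
    if not os.path.exists(output_path):
        return True
    out_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > out_mtime for p in input_paths)

def run_pipeline(steps):
    """Run steps given in dependency order.

    Each step is (name, input_paths, output_path, action); the action is
    expected to (re)write output_path. Returns the names that actually ran.
    """
    executed = []
    for name, inputs, output, action in steps:
        if is_stale(output, inputs):
            action()
            executed.append(name)
    return executed
```

With steps such as Parse Logs, Per Page Stats, Filtering and Output listed in order, fixing and relaunching after a failure only re-executes the steps whose inputs changed.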

Productivity Summary

Tool | Headaches | Checkpointing/Replay | Testing / Metaprogramming
Pig | Lots | Manual Save | Difficult metaprogramming, easy local testing
Hive | Few, but without debugging options | None (That's SQL) | None (That's SQL)
Cascading | Weak Typing Complexity | Checkpointing, Partial Updates | Possible



Philosophy



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment



Integration
◦ Partitioning





Formats Integration
Motivation

Ability to integrate different file formats
◦ Text Delimited
◦ Sequence File (Binary Hadoop format)
◦ Avro, Thrift ..

Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, Database, …)
Format impact on size and performance

Format | Size on Disk (GB) | HIVE Processing time (24 cores)
Text File, uncompressed | 18.7 | 1m32s
1 Text File, Gzipped | 3.89 | 6m23s (no parallelization)
JSON compressed | 7.89 | 2m42s
multiple text file gzipped | 4.02 | 43s
Sequence File, Block, Gzip | 5.32 | 1m18s
Text File, LZO Indexed | 7.03 | 1m22s

Format Integration





Hive: SerDe (Serializer/Deserializer)
Pig : Storage
Cascading: Tap


Partitions
Motivation

No support for "UPDATE" patterns; any increment is performed by adding or deleting a partition

Common partition schemas on Hadoop
◦ By Date /apache_logs/dt=2013-01-23
◦ By Data center /apache_logs/dc=redbus01/…
◦ By Country
◦ …
◦ Or any combination of the above
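As a small illustration of these key=value directory schemes, the sketch below builds partition paths in Python. The helper names (`partition_path`, `daily_partitions`) are hypothetical; only the `/apache_logs/dt=…` and `dc=…` layout comes from the slide.

```python
from datetime import date, timedelta

def partition_path(root, **keys):
    """Build a path such as /apache_logs/dt=2013-01-23/dc=redbus01."""
    parts = [f"{name}={value}" for name, value in keys.items()]
    return "/".join([root.rstrip("/")] + parts)

def daily_partitions(root, start, days):
    """List the date partitions covering `days` consecutive days."""
    return [partition_path(root, dt=(start + timedelta(days=i)).isoformat())
            for i in range(days)]

paths = daily_partitions("/apache_logs", date(2013, 1, 23), 3)
combined = partition_path("/apache_logs", dt="2013-01-23", dc="redbus01")
print(paths)
print(combined)
```

An incremental load then amounts to writing one new directory per day (or per day and data center) and dropping old directories, with no in-place update.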


Hive Partitioning
Partitioned tables

CREATE TABLE event (
    user_id INT,
    type STRING,
    message STRING)
PARTITIONED BY (day STRING, server_id STRING);

Disk structure
/hive/event/day=2013-01-27/server_id=s1/file0
…

INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;

Cascading Partition
No direct support for partitions
Support for a "Glob" Tap, to read from files using patterns

➔ You can code your own custom or virtual partition schemes


External Code Integration
Simple UDF
Pig

Hive

Cascading


Hive Complex UDF
(Aggregators)


Cascading
Direct Code Evaluation

Uses Janino, a very cool project:
http://docs.codehaus.org/display/JANINO


Spring Batch
Cascading Integration




Allows calling a Cascading flow from Spring Batch

No full Integration with Spring MessageSource or
MessageHandler yet (only for local flows)


Integration
Summary

Tool | Partition/Incremental Updates | External Code | Format Integration
Pig | No Direct Support | Simple | Doable and rich community
Hive | Fully integrated, SQL Like | Very simple, but complex dev setup | Doable and existing community
Cascading | With Coding | Complex UDFs but regular, and Java Expression embeddable | Doable and growing community

Optimization

Several Common Map Reduce Optimization Patterns
◦ Combiners
◦ MapJoin
◦ Job Fusion
◦ Job Parallelism
◦ Reducer Parallelism

Different support per framework
◦ Fully Automatic
◦ Pragma / Directives / Options
◦ Coding style / Code to write

Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date

[Figure: without a combiner, each mapper sends every (date, product id) row, e.g. "2012-02-14 4354", "2012-02-15 21we2", "2012-02-14 qa334", "2012-02-15 23aq2", to the reducer, which outputs "2012-02-14 20", "2012-02-15 35", "2012-02-16 1".]

[Figure: with a combiner, each mapper sends one partial count per date, "2012-02-14 8" and "2012-02-15 12" from one mapper, "2012-02-14 12", "2012-02-15 23" and "2012-02-16 1" from the other; the reducer sums them into "2012-02-14 20", "2012-02-15 35", "2012-02-16 1".]

Reduced network bandwidth. Better parallelism
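A minimal Python sketch of the same partial aggregation, assuming made-up product rows and hypothetical function names. It is an illustration of the pattern, not how Hive, Pig or Cascading implement combiners internally.

```python
from collections import Counter

def map_with_combiner(rows):
    """One mapper: count rows per date locally before sending anything."""
    return Counter(day for day, _product_id in rows)

def reduce_counts(partials):
    """Reducer: sum the partial counts received from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

# Hypothetical input splits, one list per mapper.
mapper_1 = [("2012-02-14", "4354"), ("2012-02-15", "21we2"), ("2012-02-14", "qa334")]
mapper_2 = [("2012-02-15", "23aq2"), ("2012-02-16", "x1")]

partials = [map_with_combiner(mapper_1), map_with_combiner(mapper_2)]
result = reduce_counts(partials)
rows_read = len(mapper_1) + len(mapper_2)
records_sent = sum(len(p) for p in partials)
print(result, rows_read, records_sent)
```

The mappers read 5 rows but send only 4 records over the network; on real data with many rows per date the reduction is far larger.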

Join Optimization
Map Join
Hive
set hive.auto.convert.join = true;
Pig

Cascading

(no aggregation support after HashJoin)


Number of Reducers


Critical for performance



Estimated from the size of the input files

◦ Hive: input size divided by hive.exec.reducers.bytes.per.reducer (default 1GB)
◦ Pig: input size divided by pig.exec.reducers.bytes.per.reducer (default 1GB)
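The estimate above is a simple division, sketched here in Python. It assumes the 1GB default quoted on the slide (taken as 10^9 bytes) and ignores any other limit the frameworks may apply; `estimate_reducers` is a hypothetical helper, not a Hive or Pig function.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=10**9):
    """Round the input size up to a whole number of reducers (at least 1)."""
    if input_bytes <= 0:
        return 1
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, -(-input_bytes // bytes_per_reducer))

# An 18.7 GB uncompressed text file, with 1 GB per reducer.
reducers = estimate_reducers(18_700_000_000)
print(reducers)
```

Lowering the bytes-per-reducer setting raises the number of reducers for the same input, which is the "DIY" tuning knob when the estimate is poor.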


Performance Optimization
Summary

Tool | Combiner Optimization | Join Optimization | Number of reducers optimization
Pig | Automatic | Option | Estimate or DIY
Cascading | DIY | HashJoin | DIY
Hive | Partial DIY | Automatic (Map Join) | Estimate or DIY

BIG DATA AND MACHINE LEARNING
USE CASE

Search quality

ERWAN PIGNEUL
TEAM LEADER – PROJECT MANAGER

PAGESJAUNES CONTEXT
CORE BUSINESS: LOCAL SEARCH FOR BUSINESSES AND PROFESSIONALS

PAGESJAUNES USES A SPECIFIC INTERPRETATION ENGINE
THAT REQUIRES MANUAL INDEXING

THIS HANDLES THE MOST FREQUENT QUERIES WELL
BUT DOES NOT COVER THE LONG TAIL

HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS
BY ANALYZING USER BEHAVIOR?

20 M
1.4M

10
occurrences

queries

Analysis
corrections

200M
searches

0.5M prioritized
queries

automation

SOLUTION
pagesjaunes.fr

crawl

hadoop
PIG+Hive

Interpretation
engine

Scikit-learn

indexing

Directory
Other reference datasets

Export

TECHNICAL LESSONS LEARNED
HADOOP / PIG / HIVE:
Effective
Challenges some test/prod practices (problems only appear on large volumes)
Caution, it is still young (compatibility, …)

DATAIKU STUDIO:
Big data development accelerator
Scheduler that integrates all our jobs and manages dependencies
Easy machine learning

ELASTICSEARCH:
Indexed volume and search speed

EFFECTIVENESS OF THE APPROACH
Evolution of the fragility of the query 'Parc enfant'

Fragile

Query
'Parc
enfant'
Overall
average

Not fragile

Goal for Today
• Quick Introduction To Clustering
• How does it work in Practice
• How does it work in Mahout
• Overview of Mahout Algorithms

Clustering
[Figure: scatter plot with axes Age and Revenue, showing one cluster and its centroid "c".]
One Cluster
Centroid == Center of the cluster

Clustering applications
• Fraud: Detect Outliers
• CRM: Mine for customer segments
• Image Processing: Similar Images
• Search: Similar documents
• Search: Allocate Topics

K-Means
Guess an initial placement for centroids

Assign each point to closest Center

Reposition Center

MAP

REDUCE
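The three steps above can be sketched in plain Python, with the assignment step playing the role of MAP and the repositioning step the role of REDUCE. This is a single-machine illustration, not Mahout's implementation; the sample (age, revenue) points and the initial centroids are made up.

```python
import math
from collections import defaultdict

def closest(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans_iteration(points, centroids):
    # MAP: emit (index of closest centroid, point) for every point.
    assigned = defaultdict(list)
    for p in points:
        assigned[closest(p, centroids)].append(p)
    # REDUCE: reposition each centroid at the mean of its assigned points.
    new_centroids = list(centroids)
    for i, members in assigned.items():
        new_centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids

def kmeans(points, centroids, max_iter=20, tol=1e-6):
    """Iterate until the centroids stop moving (convergence) or max_iter."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < tol:
            break
    return centroids

# Hypothetical (age, revenue) points and an initial guess for K = 2 centroids.
points = [(20, 1000), (22, 1200), (21, 1100), (50, 4000), (52, 4200), (51, 4100)]
centroids = kmeans(points, [(20, 1000), (50, 4000)])
print(centroids)
```

On Hadoop each iteration is one MapReduce job, so every pass reads the points from disk again and writes the new centroids back.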

Clustering challenges
• Curse of Dimensionality
• Choice of distance / number of parameters
• Performance
• Choice # of clusters

Mahout Clustering
Challenges
• No Integrated Feature Engineering Stack: Get ready to write data processing in Java
• Hadoop SequenceFile required as an input
• Iterations as Map/Reduce read and write to disks: Relatively slow compared to in-memory processing

Data Processing

Image

Voice

Log / DB

Data Processing

Vectorized
Data

Mahout K-Means on Text
Workflow
Text Files
  ↓ mahout seqdirectory
Mahout Sequence Files
  ↓ mahout seq2sparse
Tfidf Vectors
  ↓ mahout kmeans
Clusters

Mahout K-Means on
Database Extract Workflow
Database Dump (CSV)
  ↓ org.apache.mahout.clustering.conversion.InputDriver
Mahout Vectors
  ↓ mahout kmeans
Clusters

Convert a CSV File to
Mahout Vector
• Real code would have
  • Converting categorical variables to dimensions
  • Variable rescaling
  • Dropping IDs (name, forename …)
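The three preprocessing steps listed above can be illustrated with a short Python sketch. It uses plain lists rather than Mahout's vector classes, min-max scaling as one possible rescaling, and a made-up CSV extract; `vectorize` and its parameters are hypothetical names.

```python
import csv
import io

def vectorize(csv_text, id_columns, categorical_columns):
    """Turn CSV rows into numeric vectors: drop IDs, rescale, one-hot encode."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    numeric = [c for c in rows[0]
               if c not in id_columns and c not in categorical_columns]
    # Min and max of each numeric column, used for min-max rescaling to [0, 1].
    bounds = {c: (min(float(r[c]) for r in rows), max(float(r[c]) for r in rows))
              for c in numeric}
    # One dimension per distinct value of each categorical column.
    categories = {c: sorted({r[c] for r in rows}) for c in categorical_columns}
    vectors = []
    for r in rows:
        vec = []
        for c in numeric:
            lo, hi = bounds[c]
            vec.append((float(r[c]) - lo) / (hi - lo) if hi > lo else 0.0)
        for c in categorical_columns:
            vec.extend(1.0 if r[c] == v else 0.0 for v in categories[c])
        vectors.append(vec)
    return vectors

# Hypothetical database extract.
data = """name,age,revenue,country
Dupont,20,1000,FR
Durand,40,3000,DE
Martin,30,2000,FR
"""
vectors = vectorize(data, id_columns={"name"}, categorical_columns=["country"])
print(vectors)
```

Without rescaling, the revenue column would dominate the distance computation simply because its values are larger than the ages.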

Mahout Algorithms

Algorithm | Parameters | Implicit Assumption | Output
K-Means | K (number of clusters), Convergence | Circles | Point - ClusterId
Fuzzy K-Means | K (number of clusters), Convergence | Circles | Point - ClusterId*, Probability
Expectation Maximization | K (number of clusters), Convergence | Gaussian distribution | Point - ClusterId*, Probability
Mean-Shift Clustering | Distance boundaries, Convergence | Gradient-like distribution | Point - ClusterId
Top Down Clustering | Two clustering algorithms | Hierarchy | Point - Large ClusterId, Small ClusterId
Dirichlet Process | Model distribution | Points are a mixture of distributions | Point - ClusterId, Probability
Spectral Clustering | - | - | Point - ClusterId
MinHash Clustering | Number of hash / keys, Hash type | High dimension | Point - Hash*

Comparing Clustering
[Figure: side-by-side clustering results for KMeans, MeanShift, Dirichlet and Fuzzy KMeans.]

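Canopy Optimization
[Figure: pick a random point as a canopy center, with two distance thresholds T1 and T2. Points within T2 of the center are surely in the cluster; points beyond T1 are surely not in the cluster.]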
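A rough Python sketch of canopy generation as the figure suggests it, assuming T1 is the loose (outer) threshold and T2 the tight (inner) one. It is not Mahout's code: the "random" pick is replaced by taking the first remaining point so the result is reproducible, and the 2D points are made up.

```python
import math

def canopies(points, t1, t2):
    """Group points into overlapping canopies using thresholds t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    result = []
    while remaining:
        # Stand-in for "pick a random point": take the first remaining one.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)          # possibly in this cluster
            if d >= t2:
                still_remaining.append(p)  # may still seed or join another canopy
        remaining = still_remaining
        result.append((center, members))
    return result

points = [(0, 0), (1, 0), (5, 0), (10, 0), (11, 0)]
found = canopies(points, t1=6, t2=2)
centers = [center for center, _ in found]
print(centers)
```

The canopy centers can then serve as initial centroids for K-Means, which avoids guessing K and the starting positions blindly.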
