BIG DATA
How do elephant
make babies

Florian Douetteau
CEO, Dataiku
Agenda
•

Big Data & Hadoop Overview

•

Practical Big Data Coding: Pig / Hive / Cascading

•

PagesJaunes Big Data Use Ca...
Motivation

3
Dataiku 1/8/14
Collocation

A familiar grouping of words,
especially words that habitually
appear together and ...
“Big” Data in 1999
struct Element {
Key key;
void* stat_data ;
}
….

C
Optimized Data structures
Perfect Hashing
HP-UNIX S...
Big Data in 2013







Hadoop
Java / Pig / Hive / Scala / Clojure / …
A Dozen NoSQL data stores
MPP Databases
Real-Ti...
Data Analytics: The Stakes
1 TB
1B $

1 TB
?$
1 TB
100M $

Web Search
1999
Logistics
2004
Dataiku

10 TB
10M $
100 TB
?$

...
Meet Hal Alowne

Hal Alowne
BI Manager
Dim's Private Showroom
European E-commerce Web site
• 100M$ Revenue
• 1 Million cus...
QUESTION #1
IS IT EASY OR NOT
?
SUBTLE
PATTERNS
"MORE
BUSINESS"
BUTTONS
QUESTION #2
WHO TO HIRE
?
DATA SCIENTIST
AT NIGHT
DATA CLEANER
THE DAY
PARADOX #3
WHERE ?
MY DATA
IS WORTH
MILLIONS
I SEND IT
TO THE
MARKETING
CLOUD
QUESTION #4
IS IT BIG OR NOT
WE ALL LIVE
IN A BIG DATA
LAKE
ALL MY DATA
PROBABLY FITS
IN HERE
QUESTION #5 (at last)
HUMAN OR NOT ?
MACHINE
LEARNING
WILL SAVE
US ALL
I JUST WANT
MORE
REPORTS
MERIT = TIME + ROI
TIME : 6 MONTHS

ROI : APPS
2014

2013

Find the right
people
(6 months?)

Choose the
technology
(6 mon...
Statistics and Machine Learning is complex
!
 Try to
understand
myself

25
Dataiku

1/9/14
(Some Book you might want to read)

26
Dataiku

1/9/14
CHOOSE TECHNOLOGY
NoSQL-Slavia

Hadoop

Elastic Search

Ceph

SOLR

Riak

Machine Learning
Mystery Land

Scalability Centr...
Large E-Retailer






Business Intelligence stack has
scalability and maintenance
issues
Backoffice implements
business...
Large E-Retailer : The
Datalab
•

•

•

Relieve their current DWH and
accelerate production of some
aggregates/KPIs
Be the...
Example (Social Gaming)
Social Gaming Communities


Correlation
◦ between community size and
engagement / virality



So...
How do I (pre)process data?
Implicit User Data
(Views, Searches…)

Online User
Information
Transformation
Predictor

500TB...
Always the same
Pour Data In

Compute Something
Smart About It

Make Available
The Questions
Pour Data In

How often ?
What kind of
interaction?
How much ?

Compute Something
Smart About It

How comple...
At the Beginning was the
elephant
MapReduce
How to count words in many, many boxes

37
Dataiku - Innovation Services

1/8/14
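The word-counting idea can be sketched in a few lines of plain Python. This is only an illustration of the map / shuffle / reduce phases, not Hadoop code; the function names and the sample "boxes" are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(box):
    """Map: emit a (word, 1) pair for every word found in one box of text."""
    return [(word.lower(), 1) for word in box.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the partial counts collected for each word."""
    return {word: sum(values) for word, values in groups.items()}

def word_count(boxes):
    # Each box could be mapped on a different machine; here they run in turn.
    pairs = [pair for box in boxes for pair in map_phase(box)]
    return reduce_phase(shuffle(pairs))

# Hypothetical input: three small "boxes" of text.
boxes = ["big data", "big elephant", "data data"]
counts = word_count(boxes)
print(counts)
```

Each box is mapped independently, which is what lets the real framework spread the boxes over many machines.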
ELEPHANT MAKE BABIES
After Hadoop
Random Access
In Memory
MultiCore
Machine Learning

Faster in Memory
Computation

Massive Batch
Map Reduce Ov...
MapReduce
Simplicity is a complexity

40
Dataiku - Innovation Services

1/8/14
Agenda









Dataiku - Pig, Hive and Cascading

Hadoop and Context (-0:03)
Pig, Hive, Cascading, … (-0:09)
How th...
Pig History



Yahoo Research in 2006
Inspired by Sawzall, a Google paper from
2003
2007 as an Apache Project



Initi...
Hive History


Developed by Facebook in January 2007



Open source in August 2008



Initial Motivation

◦ Provide a S...
Cascading History


Authored by Chris Wensel 2008



Associated Projects

◦ Cascalog : Cascading in Clojure
◦ Scalding :...
Agenda









Dataiku - Pig, Hive and Cascading

Hadoop and Context (-0:03)
Pig, Hive, Cascading, … (-0:09)
How th...
Pig and Hive

Mapping to MapReduce jobs
events = LOAD '/events' USING PigStorage('\t') AS
(type:chararray, user:chararray, pr...
Pig and Hive

Mapping to MapReduce jobs
events = LOAD '/events' USING PigStorage('\t') AS
(type:chararray, user:chararray, price:int,...
Pig
How does it work
Data Execution Plan compiled into 10
map reduce jobs executed in parallel
(or not)

Dataiku - Pig, Hi...
Hive Joins

How to join with MapReduce ?
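[Diagram: mappers tag each row with its table index (tbl_idx); rows are shuffled by uid and sorted by (uid, tbl_idx); reducers then join names (Dupont, Durand) with types (Type1, Type2) for each uid]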
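A reduce-side join of this kind can be sketched in plain Python. This is an illustration only, not what Hive generates; the two sample tables are hypothetical and only loosely follow the values on the slide.

```python
from itertools import groupby

# Hypothetical sample tables, loosely based on the values shown on the slide.
names = [(1, "Dupont"), (2, "Durand")]               # (uid, name)
types = [(1, "Type1"), (1, "Type2"), (2, "Type1")]   # (uid, type)

def map_rows(rows, tbl_idx):
    """Map: tag every row with the index of the table it came from."""
    return [((uid, tbl_idx), value) for uid, value in rows]

def reduce_join(tagged):
    """Reduce: rows are sorted by (uid, tbl_idx), so for each uid the
    name rows (tbl_idx 1) arrive before the type rows (tbl_idx 2)."""
    joined = []
    for uid, group in groupby(sorted(tagged), key=lambda kv: kv[0][0]):
        current_names = []
        for (_, tbl_idx), value in group:
            if tbl_idx == 1:
                current_names.append(value)
            else:
                joined.extend((uid, name, value) for name in current_names)
    return joined

tagged = map_rows(names, 1) + map_rows(types, 2)
result = reduce_join(tagged)
print(result)
```

Sorting on the table index is the key design choice: the reducer only has to buffer the rows of the first table for the current uid, and can stream the rows of the second one.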
Agenda









Dataiku - Pig, Hive and Cascading

Hadoop and Context (-0:03)
Pig, Hive, Cascading, … (-0:09)
How th...
Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headacha...
Procedural Vs Declarative


Transformation as a
sequence of operations

Users
= load 'users' as (name, age, ipaddr);
Clic...
Data type and Model
Rationale


All three Extend basic data model with extended
data types
◦ array-like [ event1, event2,...
Hive
Data Type and Schema
CREATE TABLE visit (
user_name STRING,
user_id INT,
user_details STRUCT<age:INT, zipcode:INT>
);
...
Data types and Schema
Pig
rel = LOAD '/folder/path/'
USING PigStorage('\t')
AS (col:type, col:type, col:type);
Simple type
...
Data Type and Schema
Cascading




Support for Any Java Types, provided they can be
serialized in Hadoop
No support for ...
Style Summary
Style

Typing

Data Model

Metadata
store

Pig

Procedural

Static +
Dynamic

scalar +
tuple+ bag
(fully
rec...
Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headacha...
Headachability
Motivation


Does debugging the
tool lead to bad
headaches ?

Dataiku - Pig, Hive and Cascading
Headaches
Pig


Out Of Memory Error (Reducer)



Exception in Building /
Extended Functions
(handling of null)



Null ...
A Pig Error

Dataiku - Pig, Hive and Cascading
Headaches
Hive


Out of Memory Errors in
Reducers



Few Debugging Options



Null / “”



No builtin “first”

Dataiku...
Headaches
Cascading


Weak Typing Errors (comparing
Int and String … )



Illegal Operation Sequence
(Group after group ...
Testing
Motivation



How to perform unit tests ?
How to have different versions of the same script
(parameter) ?

Datai...
Testing
Pig





System Variables
Comment to test
No Meta Programming
pig -x local to execute on local files

Dataiku ...
Testing / Environment
Cascading



JUnit tests are possible
Ability to use code to actually comment out some
variables

...
Checkpointing
Motivation





Lots of iteration while developing on Hadoop
Sometimes jobs fail
Sometimes need to restart...
Pig
Manual Checkpointing


STORE Command to manually
store files

Parse Logs

Per Page Stats

Page User Correlation

// C...
Cascading
Automated Checkpointing


Ability to re-run a
flow automatically
from the last saved
checkpoint

addCheckpoint(...
Cascading
Topological Scheduler




Check each file intermediate timestamp
Execute only if more recent

Parse Logs

Per ...
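The timestamp rule can be illustrated with a make-style check in plain Python. This is a sketch of the idea only and does not use the Cascading API; the function name and the file names are assumptions made for the example.

```python
import os
import tempfile

def needs_rerun(source_path, target_path):
    """Return True when the target is missing or older than its source."""
    if not os.path.exists(target_path):
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(target_path)

with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical intermediate files for two steps of the pipeline.
    logs = os.path.join(workdir, "parsed_logs")
    stats = os.path.join(workdir, "per_page_stats")

    open(logs, "w").close()
    missing_target = needs_rerun(logs, stats)   # no output yet: run the step

    open(stats, "w").close()
    os.utime(logs, (1000, 1000))                # source older than output
    os.utime(stats, (2000, 2000))
    up_to_date = needs_rerun(logs, stats)       # output is fresh: skip

    os.utime(logs, (3000, 3000))                # source changed afterwards
    stale = needs_rerun(logs, stats)            # output is stale: rerun

print(missing_target, up_to_date, stale)
```

Applied step by step in dependency order, this check reruns only the steps whose inputs changed since their output was written.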
Productivity Summary
Headaches
Pig

Hive

Cascading

Checkpointing/Replay

Testing /
Metaprogramming

Lots

Manual Save...
Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headacha...
Formats Integration
Motivation


Ability to integrate different file formats



Ability to integrate with external data ...
Format Integration





Hive: Serde (Serialize-Deserializer)
Pig : Storage
Cascading: Tap

Dataiku - Pig, Hive and Casc...
Partitions
Motivation




No support for “UPDATE” patterns, any increment is
performed by adding or deleting a partition...
Hive Partitioning
Partitioned tables

CREATE TABLE event (
user_id INT,
type STRING,
message STRING)
PARTITIONED BY (day S...
Cascading Partition
No Direct support for partition
Support for “Glob” Tap, to read from files using patterns


...
External Code Integration
Simple UDF
Pig

Hive

Cascading

Dataiku - Pig, Hive and Cascading
Hive Complex UDF
(Aggregators)

Dataiku - Pig, Hive and Cascading
Cascading
Direct Code Evaluation

Uses Janino, a very cool project:
http://docs.codehaus.org/display/JANINO

Dataiku - Pig...
Spring Batch
Cascading Integration




Allows calling a Cascading flow from a Spring Batch job

No full Integration with Spri...
Integration
Summary

Partition/Incremental Updates
External Code
Pig

No Direct Support

Hive

Cascading

Dataiku - Pig, ...
Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headacha...
Optimization


Several Common Map Reduce Optimization Patterns
◦
◦
◦
◦
◦



Combiners
MapJoin
Job Fusion
Job Parallelism...
Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
2012-02-14 4354

Map

...
Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
Map
2012-02-14 4354

2...
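The effect of a combiner on the GROUP BY date query can be sketched in plain Python. This only illustrates partial aggregation at the mapper; the sample records reuse values from the slide, except the product id "x1", which is hypothetical.

```python
from collections import Counter

def map_with_combiner(records):
    """Map + combine: count dates locally and emit one (date, partial_count)
    pair per distinct date instead of one pair per record."""
    return Counter(date for date, _product_id in records)

def reduce_counts(partials):
    """Reduce: add up the partial counts received from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

# Two mappers, each reading its own split of the product table.
mapper_1 = [("2012-02-14", "4354"), ("2012-02-15", "21we2"), ("2012-02-14", "qa334")]
mapper_2 = [("2012-02-15", "23aq2"), ("2012-02-16", "x1")]

partials = [map_with_combiner(mapper_1), map_with_combiner(mapper_2)]
totals = reduce_counts(partials)
pairs_sent = sum(len(partial) for partial in partials)
print(totals, pairs_sent)
```

Five input records become four pairs on the network here; on real data with many rows per date the reduction is far larger.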
Join Optimization
Map Join
Hive
set hive.auto.convert.join = true;
Pig

Cascading

( no aggregation support after HashJoi...
Number of Reducers


Critical for performance



Estimated from the size of the input files

◦ Hive
 divide size per hive.exe...
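The size-based estimate amounts to a rounded-up division. A minimal sketch in Python, assuming the 1 GB default bytes-per-reducer value quoted in the deck; the function name is hypothetical, and real engines may also cap the result with a maximum reducer setting that is not modeled here.

```python
BYTES_PER_REDUCER = 1 << 30  # 1 GB, the default value quoted in the deck

def estimate_reducers(input_bytes, bytes_per_reducer=BYTES_PER_REDUCER):
    """Estimate the reducer count as input size divided by bytes per reducer,
    rounded up, with a floor of one reducer."""
    return max(1, (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer)

small = estimate_reducers(500 * 1024**2)    # 500 MB of input
medium = estimate_reducers(5 * 1024**3 // 2)  # 2.5 GB of input
large = estimate_reducers(12 * 1024**4)     # 12 TB of input
print(small, medium, large)
```

The estimate only looks at input size, which is why a job whose data shrinks or explodes after the map stage may still need a manual setting.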
Performance and Optimization
Summary

Combiner
Optimization

Pig
Cascading
Hive

Dataiku - Pig, Hive and Cascading

Join
Opti...
Date • Presentation title

BIG DATA AND MACHINE LEARNING
USE CASE

Search quality
•

ERWAN PIGNEUL

•

TEAM ...
PAGESJAUNES CONTEXT
CORE BUSINESS: LOCAL SEARCH FOR PROFESSIONALS

PAGESJAUNES USES AN INTERPRETATION ENGINE...
HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS
THROUGH USER BEHAVIOR ANALYSIS?



20 M
1,4M



10
occurrenc...
SOLUTION
pagesjaunes.fr

crawl

hadoop
PIG+Hive

Interpretation
engine

scikit-learn

indexing

Other
Directory
reference...
TECHNICAL LESSONS LEARNED
HADOOP / PIG / HIVE:
Effective
Challenges some test/prod practices (problems appearing ...
EFFECTIVENESS OF THE APPROACH
Evolution of the fragility of the query ‘Parc enfant’

Fragile

Query
‘Parc
enfant’
Overall
average...
Mahout 102
Clustering
Goal for Today
•

Quick Introduction To Clustering

•

How does it work in Practice

•

How does it work in Mahout

•

Ove...
Clustering
Revenue

c

Age
Clustering
Revenue

One Cluster
Centroid
== Center of
the cluster

c

Age
clustering applications
•

Fraud: Detect Outliers

•

CRM : Mine for customer segments

•

Image Processing : Similar Imag...
K-Means
Guess an initial placement for centroids

Assign each point to closest Center

Reposition Center

MAP

REDUCE
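One K-Means iteration split into a MAP part (assignment) and a REDUCE part (repositioning) can be sketched in plain Python. This is an illustration only, not Mahout code; the function names, sample points and initial centroids are assumptions made for the example.

```python
from collections import defaultdict

def closest(point, centroids):
    """Index of the centroid nearest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def kmeans_step(points, centroids):
    # MAP: assign each point to its closest center.
    assigned = defaultdict(list)
    for point in points:
        assigned[closest(point, centroids)].append(point)
    # REDUCE: reposition each center at the mean of its assigned points.
    new_centroids = []
    for i, old in enumerate(centroids):
        members = assigned.get(i)
        if not members:
            new_centroids.append(old)  # keep the center of an empty cluster
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids

# Hypothetical 2-D points and an initial guess for two centroids.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):
    centroids = kmeans_step(points, centroids)
print(centroids)
```

A fixed number of iterations is used here for brevity; a convergence test on how far the centers moved would be the usual stopping rule.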
clustering challenges
•

Curse of Dimensionality

•

Choice of distance / number of parameters

•

Performance

•

Choice ...
Mahout Clustering
Challenges
•

No Integrated Feature Engineering Stack:
Get ready to write data processing in Java

•

Ha...
Data Processing

Image

Voice

Log / DB

Data Processing

Vectorized
Data
Mahout K-Means on Text
Workflow
Text
Files
mahout
seqdirectory

Mahout Sequence Files
mahout
seq2sparse

Tfidf Vectors
maho...
Mahout K-Means on
Database Extract Workflow
Database Dump (CSV)
org.apache.mahout.clustering.conversion.InputDriver

Mahou...
Convert a CSV File to
Mahout Vector
•

Real Code would have
•

Converting Categorical
variables to dimensions

•

Variable...
Mahout Algorithms
Parameters

Implicit Assumption

Output

K-Means

K (number of clusters)
Convergence

Circles

Point - Cl...
Comparing Clustering
KMeans

MeanShift

Dirichlet

Fuzzy
KMeans
Canopy Optimization
T2

T2

Surely in
Cluster

T1

Pick a random point

Surely not in cluster
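The T1 / T2 idea can be sketched in plain Python, assuming T1 is the loose threshold and T2 the tight one (T1 > T2). This is not Mahout's implementation: the function name and sample points are hypothetical, and the first remaining point is used as the center instead of a random one so that the run is repeatable.

```python
import math

def canopies(points, t1, t2):
    """Group points into canopies using a loose threshold t1 and a tight
    threshold t2 (t1 > t2). A point may belong to several canopies."""
    assert t1 > t2
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)  # the slide picks a random point instead
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)          # within T1: part of this canopy
            if d >= t2:
                still_remaining.append(p)  # beyond T2: may seed another canopy
        remaining = still_remaining
        result.append((center, members))
    return result

# Hypothetical points on a line, to make the distances easy to follow.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
found = canopies(points, t1=4.0, t2=2.0)
print(found)
```

The canopy centers found this way can then serve as cheap initial centroids for K-Means, which is how the optimization is typically used.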
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Presentation made at Rennes in January for the handsome BreizhJUG. This is a mixed presentation for big data technologies, which covers topics such as : Why Hadoop ? What next ? Machine Learning for big data in practice.

Published in: Technology

Transcript of "BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes"

  1. 1. BIG DATA How do elephant make babies Florian Douetteau CEO, Dataiku
  2. 2. Agenda • Big Data & Hadoop Overview • Practical Big Data Coding: Pig / Hive / Cascading • PagesJaunes Big Data Use Case • Machine Learning For Big Data
  3. 3. Motivation 3 Dataiku 1/8/14
  4. 4. Collocation Dataiku C o l l o c a t A familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association. Big Apple Big Mama Big Data 4 1/8/14
  5. 5. “Big” Data in 1999 struct Element { Key key; void* stat_data ; } …. C Optimized Data structures Perfect Hashing HP-UNIX Servers – 4GB Ram 100 GB data Web Crawler – Socket reuse HTTP 0.9 Dataiku 1 Month 5 1/8/14
  6. 6. Big Data in 2013 Hadoop Java / Pig / Hive / Scala / Clojure / … A Dozen NoSQL data stores MPP Databases Real-Time 1 Hour 6 Dataiku 1/8/14
  7. 7. Data Analytics: The Stakes 1 TB 1B $ 1 TB ?$ 1 TB 100M $ Web Search 1999 Logistics 2004 Dataiku 10 TB 10M $ 100 TB ?$ Banking CRM 2008 50TB 1B$ 1000TB 500M $ E-Commerce 2013 Social Gaming 2011 Web Search 2010 Online Advertising 2012 1/8/14 7
  8. 8. Meet Hal Alowne Hal Alowne BI Manager Dim's Private Showroom European E-commerce Web site • 100M$ Revenue • 1 Million customers • 1 Data Analyst (Hal Himself) Dim Sum CEO & Founder Dim's Private Showroom "Hey Hal! We need a big data platform like the big guys. Let's just do as they do!" Big Data Copy Cat Project Big Guys • 10B$+ Revenue • 100M+ customers • 100+ Data Scientists Dataiku - Data Tuesday 1/8/14 8
  9. 9. QUESTION #1 IS IT EASY OR NOT ?
  10. 10. SUBTLE PATTERNS
  11. 11. "MORE BUSINESS" BUTTONS
  12. 12. QUESTION #2 WHO TO HIRE ?
  13. 13. DATA SCIENTIST AT NIGHT
  14. 14. DATA CLEANER THE DAY
  15. 15. PARADOX #3 WHERE ?
  16. 16. MY DATA IS WORTH MILLIONS
  17. 17. I SEND IT TO THE MARKETING CLOUD
  18. 18. QUESTION #4 IS IT BIG OR NOT
  19. 19. WE ALL LIVE IN A BIG DATA LAKE
  20. 20. ALL MY DATA PROBABLY FITS IN HERE
  21. 21. QUESTION #5 (at last) HUMAN OR NOT ?
  22. 22. MACHINE LEARNING WILL SAVE US ALL
  23. 23. I JUST WANT MORE REPORTS
  24. 24. MERIT = TIME + ROI TIME : 6 MONTHS ROI : APPS 2014 2013 Find the right people (6 months?) Choose the technology (6 months?) Make it work (6 months?) 2013 Build the lab (6 months) • Train People • Reuse working patterns Build a lab in 6 months (rather than 18 months) Dataiku Targeted Newsletter Recommender Systems Adapted Product / Promotions Deploy apps 24 that actually deliver value 1/9/14
  25. 25. Statistics and Machine Learning is complex ! Try to understand myself 25 Dataiku 1/9/14
  26. 26. (Some Book you might want to read) 26 Dataiku 1/9/14
  27. 27. CHOOSE TECHNOLOGY NoSQL-Slavia Hadoop Elastic Search Ceph SOLR Riak Machine Learning Mystery Land Scalability Central Cassandra MongoDB Membase Scikit-Learn GraphLAB prediction.io jubatus Mahout WEKA Sphere Kafka Flume Real-time island Spark Storm SQL Columnar Republic MLBase RapidMiner Vertica Netezza QlikView Kibana SpotFire D3 Cascading Tableau Dataiku - Pig, Hive and Cascading SPSS Panda Pig Visualization County R SAS InfiniDB Drill GreenPlum Impala LibSVM Talend Data Clean Wasteland Statistician Old House
  28. 28. Large E-Retailer Business Intelligence stack has scalability and maintenance issues Backoffice implements business rules that are challenged Existing infrastructure cannot cope with per-user information Main Pain Point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day. 29 Dataiku 1/9/14
  29. 29. Large E-Retailer : The Datalab • • • Relieve their current DWH and accelerate production of some aggregates/KPIs Be the backbone for new personalized user experience on their website: more recommendations, more profiling, etc., Train existing people around machine learning and segmentation experience 1h12 to perform the aggregate, available every morning New home page personalization deployed in a few weeks Hadoop Cluster (24 cores) Google Compute Engine Python + R + Vertica 12 TB dataset 6 weeks projects 30 Dataiku - Data Tuesday 1/9/14
  30. 30. Example (Social Gaming) Social Gaming Communities Correlation ◦ between community size and engagement / virality Some mid-size communities Meaningful patterns ◦ 2 players / Family / Group What is the minimum number of friends to have in the application to get additional engagement ? A very large community Lots of small clusters (mostly 2 players) 31 Dataiku 1/9/14
  31. 31. How do I (pre)process data? Implicit User Data (Views, Searches…) Online User Information Transformation Predictor 500TB Transformation Matrix Explicit User Data Predictor Runtime (Click, Buy, …) Per User Stats Rank Predictor 50TB Per Content Stats User Information (Location, Graph…) User Similarity 1TB Content Data (Title, Categories, Price, …) 200GB Content Similarity A/B Test Data Dataiku - Pig, Hive and Cascading
  32. 32. Always the same Pour Data In Compute Something Smart About It Make Available
  33. 33. The Questions Pour Data In How often ? What kind of interaction? How much ? Compute Something Smart About It How complex ? Do you need all data at once ? How incremental ? Make Available Interaction ? Random Access ?
  34. 34. At the Beginning was the elephant
  35. 35. MapReduce How to count words in many, many boxes 37 Dataiku - Innovation Services 1/8/14
  36. 36. ELEPHANT MAKE BABIES
  37. 37. After Hadoop Random Access In Memory MultiCore Machine Learning Faster in Memory Computation Massive Batch Map Reduce Over HDFS Real-Time Distributed Computation Faster SQL Analytics Queries
  38. 38. MapReduce Simplicity is a complexity 40 Dataiku - Innovation Services 1/8/14
  39. 39. Agenda Dataiku - Pig, Hive and Cascading Hadoop and Context (-0:03) Pig, Hive, Cascading, … (-0:09) How they work (-0:15) Comparing the tools (-0:35) Make them work together (-0:40) Wrap-up and questions (-Beer)
  40. 40. Pig History Yahoo Research in 2006 Inspired by Sawzall, a Google paper from 2003 2007 as an Apache Project Initial motivation ◦ Search Log Analytics: how long is the average user session? how many links does a user click on before leaving a website? how do click patterns vary in the course of a day/week/month? … words = LOAD '/training/hadoopwordcount/output' USING PigStorage('\t') AS (word:chararray, count:int); sorted_words = ORDER words BY count DESC; first_words = LIMIT sorted_words 10; DUMP first_words; Dataiku - Pig, Hive and Cascading
  41. 41. Hive History Developed by Facebook in January 2007 Open source in August 2008 Initial Motivation ◦ Provide a SQL like abstraction to perform statistics on status updates create external table wordcounts ( word string, count int ) row format delimited fields terminated by '\t' location '/training/hadoop-wordcount/output'; select * from wordcounts order by count desc limit 10; select SUM(count) from wordcounts where word like 'th%'; Dataiku - Pig, Hive and Cascading
  42. 42. Cascading History Authored by Chris Wensel in 2008 Associated Projects ◦ Cascalog : Cascading in Clojure ◦ Scalding : Cascading in Scala (Twitter in 2012) ◦ Lingual ( to be released soon): SQL layer on top of cascading Dataiku - Pig, Hive and Cascading
  43. 43. Agenda Dataiku - Pig, Hive and Cascading Hadoop and Context (-0:03) Pig, Hive, Cascading, … (-0:09) How they work (-0:15) Comparing the tools (-0:35) Make them work together (-0:40) Wrap-up and questions (-Beer)
  44. 44. Pig and Hive Mapping to MapReduce jobs events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int); events_filtered = FILTER events BY type; by_user = GROUP events_filtered BY user; price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price, MAX(timestamp) as max_ts; high_pbu = FILTER price_by_user BY total_price > 1000; Job 1 : Mapper LOAD FILTER Job 1 : Reducer1 Shuffle and sort by user GROUP FOREACH FILTER * VAT excluded Dataiku - Innovation Services 1/8/14 46
  45. 45. Pig and Hive Mapping to MapReduce jobs events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int); events_filtered = FILTER events BY type; by_user = GROUP events_filtered BY user; price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price, MAX(timestamp) as max_ts; high_pbu = FILTER price_by_user BY total_price > 1000; recent_high = ORDER high_pbu BY max_ts DESC; STORE recent_high INTO '/output'; Job 1: Mapper LOAD FILTER Job 1 :Reducer Shuffle and sort by user Job 2: Mapper LOAD (from tmp) GROUP FOREACH FILTER Job 2: Reducer Shuffle and sort by max_ts STORE 47 Dataiku - Innovation Services 1/8/14
  46. 46. Pig How does it work Data Execution Plan compiled into 10 map reduce jobs executed in parallel (or not) Dataiku - Pig, Hive and Cascading
  47. 47. Hive Joins How to join with MapReduce ? [Diagram: mappers tag each row with its table index (tbl_idx); rows are shuffled by uid and sorted by (uid, tbl_idx); Reducer 1 and Reducer 2 then join names (Dupont, Durand) with types (Type1, Type2) for each uid] 49 Dataiku - Innovation Services 1/8/14
  48. 48. Agenda Dataiku - Pig, Hive and Cascading Hadoop and Context (-0:03) Pig, Hive, Cascading, … (-0:09) How they work (-0:15) Comparing the tools (-0:35) Make them work together (-0:40) Wrap-up and questions (-Beer)
  49. 49. Comparing without Comparable  Philosophy ◦ Procedural Vs Declarative ◦ Data Model and Schema  Productivity ◦ Headachability ◦ Checkpointing ◦ Testing and environment  Integration ◦ Partitioning ◦ Formats Integration ◦ External Code Integration  Performance and optimization Dataiku - Pig, Hive and Cascading
  50. 50. Procedural Vs Declarative Transformation as a sequence of operations Users = load 'users' as (name, age, ipaddr); Clicks = load 'clicks' as (user, url, value); ValuableClicks = filter Clicks by value > 0; UserClicks = join Users by name, ValuableClicks by user; Geoinfo = load 'geoinfo' as (ipaddr, dma); UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr; ByDMA = group UserGeo by dma; ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo); store ValuableClicksPerDMA into 'ValuableClicksPerDMA'; Transformation as a set of formulas insert into ValuableClicksPerDMA select dma, count(*) from geoinfo join ( select name, ipaddr from users join clicks on (users.name = clicks.user) where value > 0 ) using ipaddr group by dma; Dataiku - Pig, Hive and Cascading
  51. 51. Data type and Model Rationale  All three Extend basic data model with extended data types ◦ array-like [ event1, event2, event3] ◦ map-like { type1:value1, type2:value2, …}  Different approach ◦ Resilient Schema ◦ Static Typing ◦ No Static Typing Dataiku - Pig, Hive and Cascading
  52. 52. Hive Data Type and Schema CREATE TABLE visit ( user_name STRING, user_id INT, user_details STRUCT<age:INT, zipcode:INT> ); Simple types: TINYINT, SMALLINT, INT, BIGINT (1, 2, 4 and 8 bytes); FLOAT, DOUBLE (4 and 8 bytes); BOOLEAN; STRING (arbitrary-length, replaces VARCHAR); TIMESTAMP. Complex types: ARRAY (array of typed items, 0-indexed); MAP (associative map); STRUCT (complex class-like objects). 54 Dataiku Training – Hadoop for Data Science 1/8/14
  53. 53. Data types and Schema Pig rel = LOAD '/folder/path/' USING PigStorage('\t') AS (col:type, col:type, col:type); Simple types: int, long, float, double (32 and 64 bits, signed); chararray (a string); bytearray (an array of … bytes); boolean (a boolean). Complex types: tuple (an ordered fieldname:value map); bag (a set of tuples). 55 Dataiku Training – Hadoop for Data Science 1/8/14
  54. 54. Data Type and Schema Cascading Support for any Java types, provided they can be serialized in Hadoop No support for typing Simple types: Int, Long, Float, Double (32 and 64 bits, signed); String (a string); byte[] (an array of … bytes); Boolean (a boolean). Complex type: Object (must be « Hadoop serializable »). Dataiku - Pig, Hive and Cascading
  55. 55. Style Summary. Pig: style: procedural; typing: static + dynamic; data model: scalar + tuple + bag (fully recursive); metadata store: no (HCatalog). Hive: style: declarative; typing: static + dynamic, enforced at execution time; data model: scalar + list + map; metadata store: integrated. Cascading: style: procedural; typing: weak; data model: scalar + Java objects; metadata store: no. Dataiku - Pig, Hive and Cascading
  56. 56. Comparing without Comparable  Philosophy ◦ Procedural Vs Declarative ◦ Data Model and Schema  Productivity ◦ Headachability ◦ Checkpointing ◦ Testing, error management and environment  Integration ◦ Partitioning ◦ Formats Integration ◦ External Code Integration  Performance and optimization Dataiku - Pig, Hive and Cascading
  57. 57. Headachability Motivation Does debugging the tool lead to bad headaches ? Dataiku - Pig, Hive and Cascading
  58. 58. Headaches Pig  Out Of Memory Error (Reducer)  Exception in Building / Extended Functions (handling of null)  Null vs “”  Nested Foreach and scoping  Date Management (pig 0.10)  Field implicit ordering Dataiku - Pig, Hive and Cascading
  59. 59. A Pig Error Dataiku - Pig, Hive and Cascading
  60. 60. Headaches Hive  Out of Memory Errors in Reducers  Few Debugging Options  Null / “”  No builtin “first” Dataiku - Pig, Hive and Cascading
  61. 61. Headaches Cascading  Weak Typing Errors (comparing Int and String … )  Illegal Operation Sequence (Group after group …)  Field Implicit Ordering Dataiku - Pig, Hive and Cascading
  62. 62. Testing Motivation   How to perform unit tests ? How to have different versions of the same script (parameter) ? Dataiku - Pig, Hive and Cascading
  63. 63. Testing Pig System Variables Comment to test No Meta Programming pig -x local to execute on local files Dataiku - Pig, Hive and Cascading
  64. 64. Testing / Environment Cascading JUnit tests are possible Ability to use code to actually comment out some variables Dataiku - Pig, Hive and Cascading
  65. 65. Checkpointing Motivation Lots of iteration while developing on Hadoop Sometimes jobs fail Sometimes need to restart from the start … Parse Logs Per Page Stats Page User Correlation Filtering Output FIX and relaunch Dataiku - Pig, Hive and Cascading
  66. 66. Pig Manual Checkpointing  STORE Command to manually store files Parse Logs Per Page Stats Page User Correlation // COMMENT Beginning of script and relaunch Dataiku - Pig, Hive and Cascading Filtering Output
  67. 67. Cascading Automated Checkpointing  Ability to re-run a flow automatically from the last saved checkpoint addCheckpoint(… ) Dataiku - Pig, Hive and Cascading
  68. 68. Cascading Topological Scheduler   Check each file intermediate timestamp Execute only if more recent Parse Logs Per Page Stats Page User Correlation Filtering Dataiku - Pig, Hive and Cascading Output
  69. 69. Productivity Summary. Pig: headaches: lots; checkpointing/replay: manual save; testing/metaprogramming: difficult metaprogramming, easy local testing. Hive: headaches: few, but without debugging options; checkpointing/replay: none (that's SQL); testing/metaprogramming: none (that's SQL). Cascading: headaches: weak typing complexity; checkpointing/replay: checkpointing, partial updates; testing/metaprogramming: possible. Dataiku - Pig, Hive and Cascading
  70. 70. Comparing without Comparable  Philosophy ◦ Procedural Vs Declarative ◦ Data Model and Schema  Productivity ◦ Headachability ◦ Checkpointing ◦ Testing and environment  Integration ◦ Formats Integration ◦ Partitioning ◦ External Code Integration  Performance and optimization Dataiku - Pig, Hive and Cascading
  71. 71. Formats Integration Motivation Ability to integrate different file formats ◦ Text Delimited ◦ Sequence File (Binary Hadoop format) ◦ Avro, Thrift .. Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, Database, …) Format impact on size and performance (size on disk in GB; Hive processing time on 24 cores): Text File, uncompressed: 18.7 GB, 1m32s; 1 Text File, Gzipped: 3.89 GB, 6m23s (no parallelization); JSON compressed: 7.89 GB, 2m42s; multiple text files gzipped: 4.02 GB, 43s; Sequence File, Block, Gzip: 5.32 GB, 1m18s; Text File, LZO Indexed: 7.03 GB, 1m22s. Dataiku - Pig, Hive and Cascading
  72. 72. Format Integration    Hive: Serde (Serialize-Deserializer) Pig : Storage Cascading: Tap Dataiku - Pig, Hive and Cascading
  73. 73. Partitions Motivation   No support for “UPDATE” patterns, any increment is performed by adding or deleting a partition Common partition schemas on Hadoop ◦ ◦ ◦ ◦ ◦ By Date /apache_logs/dt=2013-01-23 By Data center /apache_logs/dc=redbus01/… By Country … Or any combination of the above Dataiku - Pig, Hive and Cascading
  74. 74. Hive Partitioning Partitioned tables CREATE TABLE event ( user_id INT, type STRING, message STRING) PARTITIONED BY (day STRING, server_id STRING); Disk structure /hive/event/day=2013-01-27/server_id=s1/file0 /hive/event/day=2013-01-27/server_id=s1/file1 /hive/event/day=2013-01-27/server_id=s2/file0 /hive/event/day=2013-01-27/server_id=s2/file1 … /hive/event/day=2013-01-28/server_id=s2/file0 /hive/event/day=2013-01-28/server_id=s2/file1 INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1') SELECT * FROM event_tmp; Dataiku Training – Hadoop for Data Science 1/8/14 76
  75. 75. Cascading Partition No direct support for partitions Support for “Glob” Tap, to read from files using patterns ➔ You can code your own custom or virtual partition schemes Dataiku - Pig, Hive and Cascading
  76. 76. External Code Integration Simple UDF Pig Hive Cascading Dataiku - Pig, Hive and Cascading
  77. 77. Hive Complex UDF (Aggregators) Dataiku - Pig, Hive and Cascading
  78. 78. Cascading Direct Code Evaluation Uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO Dataiku - Pig, Hive and Cascading
  79. 79. Spring Batch Cascading Integration Allows calling a Cascading flow from a Spring Batch job No full integration with Spring MessageSource or MessageHandler yet (only for local flows) Dataiku - Pig, Hive and Cascading
  80. 80. Integration Summary. Pig: partition/incremental updates: no direct support; external code: simple; format integration: doable and rich community. Hive: partition/incremental updates: fully integrated, SQL-like; external code: very simple, but complex dev setup; format integration: doable and existing community. Cascading: partition/incremental updates: with coding; external code: complex UDFs but regular, and Java expressions embeddable; format integration: doable and growing community. Dataiku - Pig, Hive and Cascading
  81. 81. Comparing without Comparable  Philosophy ◦ Procedural Vs Declarative ◦ Data Model and Schema  Productivity ◦ Headachability ◦ Checkpointing ◦ Testing and environment  Integration ◦ Formats Integration ◦ Partitioning ◦ External Code Integration  Performance and optimization Dataiku - Pig, Hive and Cascading
  82. 82. Optimization  Several Common Map Reduce Optimization Patterns ◦ ◦ ◦ ◦ ◦  Combiners MapJoin Job Fusion Job Parallelism Reducer Parallelism Different support per framework ◦ Fully Automatic ◦ Pragma / Directives / Options ◦ Coding style / Code to write Dataiku - Pig, Hive and Cascading
  83. 83. Combiner Perform Partial Aggregate at Mapper Stage SELECT date, COUNT(*) FROM product GROUP BY date 2012-02-14 4354 Map … 2012-02-14 4354 2012-02-15 21we2 … Reduc e 2012-02-14 20 2012-02-15 21we2 2012-02-15 35 2012-02-14 qa334 … 2012-02-15 23aq2 2012-02-14 qa334 … 2012-02-15 23aq2 Dataiku - Pig, Hive and Cascading 2012-02-16 1
  84. 84. Combiner Perform Partial Aggregate at Mapper Stage SELECT date, COUNT(*) FROM product GROUP BY date Map 2012-02-14 4354 2012-02-14 8 … 2012-02-15 12 Reduc e 2012-02-14 20 2012-02-15 21we2 2012-02-15 35 2012-02-14 qa334 … 2012-02-15 23aq2 2012-02-14 12 2012-02-15 23 2012-02-16 1 Reduced network bandwith. Better parallelism Dataiku - Pig, Hive and Cascading 2012-02-16 1
  85. 85. Join Optimization Map Join Hive set hive.auto.convert.join = true; Pig Cascadin g ( no aggregation support after HashJoin) Dataiku - Pig, Hive and Cascading
  86. Number of Reducers
      • Critical for performance
      • Estimated from the size of the input files
        ◦ Hive: input size divided by hive.exec.reducers.bytes.per.reducer (default 1GB)
        ◦ Pig: input size divided by pig.exec.reducers.bytes.per.reducer (default 1GB)
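The estimate described on this slide amounts to a ceiling division. The sketch below assumes the 1GB default quoted above, expressed as 10^9 bytes; the upper bound `max_reducers` and its value of 999 are assumptions added for illustration, and the function name is hypothetical.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=10**9, max_reducers=999):
    """Estimate the reducer count as ceil(input size / bytes per reducer)."""
    if bytes_per_reducer <= 0:
        raise ValueError("bytes_per_reducer must be positive")
    wanted = -(-input_bytes // bytes_per_reducer)  # integer ceiling division
    # Always run at least one reducer, and never more than the assumed cap.
    return max(1, min(max_reducers, wanted))
```

For example, `estimate_reducers(25 * 10**9)` gives 25 reducers for a 25GB input, while a tiny input still gets one reducer.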
  87. Performance Optimization Summary
      Framework | Combiner Optimization | Join Optimization | Number of Reducers Optimization
      Pig | Automatic | Option | Estimate or DIY
      Cascading | DIY | HashJoin | DIY
      Hive | Partial DIY | Automatic (Map Join) | Estimate or DIY
  88. Big Data and Machine Learning Use Case: Search Quality • Erwan Pigneul • Team Leader, Project Manager
  89. Context
      • PagesJaunes core business: local search for professionals
      • PagesJaunes uses a specific interpretation engine that requires manual indexing
      • This handles the most frequently run queries well, but it does not cover the long tail
  90. How can we improve the relevance of our answers by analyzing user behavior?
      [Diagram labels: 200M searches, 20M, 1.4M, 10 occurrences, queries, analysis, 0.5M prioritized queries, corrections, automation]
  91. Solution
      [Architecture diagram labels: pagesjaunes.fr, crawl, Hadoop, Pig + Hive, interpretation engine, scikit-learn, indexing, directory, other reference data, export]
  92. Technical Lessons
      • Hadoop / Pig / Hive
        ◦ Effective
        ◦ Calls into question some test/production practices (problems appear on large volumes)
        ◦ Be careful, it is still young (compatibility, …)
      • Dataiku Studio
        ◦ Accelerates big data development
        ◦ Schedules the processing, integrating all our jobs and managing their dependencies
        ◦ Easy machine learning
      • Elasticsearch
        ◦ Indexed volume and search speed
  93. Effectiveness of the Approach: evolution of the fragility of the query 'Parc enfant'
      [Chart: vertical axis from "Not fragile" to "Fragile"; series: query 'Parc enfant', overall average]
  94. Mahout 102 Clustering
  95. Goal for Today
      • Quick introduction to clustering
      • How does it work in practice
      • How does it work in Mahout
      • Overview of Mahout algorithms
  96. Clustering [Scatter plot: revenue against age]
  97. Clustering: one cluster; centroid == center of the cluster [Scatter plot: revenue against age, with the centroid marked "c"]
  98. Clustering Applications
      • Fraud: detect outliers
      • CRM: mine for customer segments
      • Image processing: similar images
      • Search: similar documents
      • Search: allocate topics
  99. K-Means
      • Guess an initial placement for the centroids
      • Assign each point to the closest center (Map)
      • Reposition each center (Reduce)
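The three steps above can be sketched in plain Python, with the assignment written as a "map" phase and the repositioning as a "reduce" phase. This is a minimal single-machine illustration assuming Euclidean distance and a fixed number of iterations; it is not Mahout's implementation, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def closest(point, centroids):
    """Return the index of the centroid nearest to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Refine the initial centroid guess by alternating assignment and repositioning."""
    for _ in range(iterations):
        # "Map": emit (closest centroid index, point) for every point.
        groups = defaultdict(list)
        for point in points:
            groups[closest(point, centroids)].append(point)
        # "Reduce": move each centroid to the mean of its assigned points.
        # A centroid with no points keeps its previous position.
        centroids = [
            tuple(sum(coord) / len(groups[i]) for coord in zip(*groups[i]))
            if groups[i] else centroids[i]
            for i in range(len(centroids))
        ]
    return centroids
```

In a MapReduce setting every iteration is a separate job that reads the points from disk and writes the new centroids back, which is why the iteration count matters for performance.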
  100. Clustering Challenges
      • Curse of dimensionality
      • Choice of distance / number of parameters
      • Performance
      • Choice of the number of clusters
  101. Mahout Clustering Challenges
      • No integrated feature engineering stack: get ready to write data processing in Java
      • A Hadoop SequenceFile is required as input
      • Each iteration is a Map/Reduce job that reads from and writes to disk: relatively slow compared to in-memory processing
  102. Data Processing: Image, Voice, Log / DB → Data Processing → Vectorized Data
  103. Mahout K-Means on Text Workflow: Text Files → mahout seqdirectory → Mahout Sequence Files → mahout seq2sparse → TF-IDF Vectors → mahout kmeans → Clusters
  104. Mahout K-Means on Database Extract Workflow: Database Dump (CSV) → org.apache.mahout.clustering.conversion.InputDriver → Mahout Vectors → mahout kmeans → Clusters
  105. Convert a CSV File to Mahout Vector
      • Real code would also include
        ◦ Converting categorical variables to dimensions
        ◦ Variable rescaling
        ◦ Dropping IDs (name, forename …)
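The three preparation steps listed on this slide can be illustrated with a small Python sketch. It only produces plain lists of floats, not Mahout vectors, and it makes several assumptions: min-max rescaling, one dimension per category value, and hypothetical column names supplied by the caller.

```python
import csv
import io

def vectorize(csv_text, id_columns, categorical_columns):
    """Turn CSV rows into numeric vectors: drop IDs, rescale numbers, expand categories."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    numeric = [c for c in rows[0] if c not in id_columns and c not in categorical_columns]
    # One dimension per distinct value of each categorical column.
    levels = {c: sorted({r[c] for r in rows}) for c in categorical_columns}
    # Observed range of each numeric column, used for min-max rescaling to [0, 1].
    bounds = {c: (min(float(r[c]) for r in rows), max(float(r[c]) for r in rows))
              for c in numeric}
    vectors = []
    for r in rows:
        vector = []
        for c in numeric:
            low, high = bounds[c]
            vector.append((float(r[c]) - low) / (high - low) if high > low else 0.0)
        for c in categorical_columns:
            vector.extend(1.0 if r[c] == level else 0.0 for level in levels[c])
        vectors.append(vector)
    return vectors

# Hypothetical extract: "name" is an ID column, "country" is categorical.
sample = "name,age,revenue,country\nAlice,30,1000,FR\nBob,50,3000,UK\nCarol,40,2000,FR\n"
vectors = vectorize(sample, id_columns=["name"], categorical_columns=["country"])
```

Without the rescaling step, a distance-based algorithm such as K-Means would be dominated by whichever column has the largest raw values, here revenue.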
  106. Mahout Algorithms
      Algorithm | Parameters | Implicit Assumption | Output
      K-Means | K (number of clusters), convergence | Circles | Point - ClusterId
      Fuzzy K-Means | K (number of clusters), convergence | Circles | Point - ClusterId*, Probability
      Expectation Maximization | K (number of clusters), convergence | Gaussian distribution | Point - ClusterId*, Probability
      Mean-Shift Clustering | Distance boundaries, convergence | Gradient-like distribution | Point - ClusterId
      Top Down Clustering | Two clustering algorithms | Hierarchy | Point - Large ClusterId, Small ClusterId
      Dirichlet Process | Model distribution | Points are a mixture of distributions | Point - ClusterId, Probability
      Spectral Clustering | - | - | Point - ClusterId
      MinHash Clustering | Number of hashes / keys, hash type | High dimension | Point - Hash*
  107. Comparing Clustering [Figure panels: KMeans, MeanShift, Dirichlet, Fuzzy KMeans]
  108. Canopy Optimization
      • Pick a random point
      • Points within distance T2 of it are surely in the cluster
      • Points farther than T1 from it are surely not in the cluster
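A common way to turn these two thresholds into an algorithm is canopy clustering, sketched below under the assumption that T1 > T2 and that distance is Euclidean. The slide picks a random point; this sketch takes the first remaining point instead so that the result is reproducible. The function name is hypothetical and this is not Mahout's implementation.

```python
import math

def canopies(points, t1, t2):
    """Group points into (possibly overlapping) canopies using a loose and a tight threshold."""
    if not t1 > t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)  # the slide picks a random point here
        members = [center]
        still_remaining = []
        for point in remaining:
            distance = math.dist(center, point)
            if distance < t1:
                members.append(point)          # loosely bound: belongs to this canopy
            if distance >= t2:
                still_remaining.append(point)  # not tightly bound: may seed another canopy
        remaining = still_remaining
        result.append((center, members))
    return result
```

The canopy centers are typically used as the initial centroids for K-Means, which replaces the random initial guess with a cheap single pass over the data.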