Hadoop Is A Batch
Pig, Hive, Cascading …
 Paris Jug May 2013
 Florian Douetteau
Florian Douetteau <florian.douetteau@dataiku.com>
 CEO at Dataiku
 Freelance at Criteo (Online Ads)
 CTO at IsCool Ent. (#1 French Social Gamer)
 VP R&D Exalead (Search Engine Technology)
About me
15/05/2013 Dataiku Training – Hadoop for Data Science
 Hadoop and Context (->0:03)
 Pig, Hive, Cascading, … (->0:09)
 How they work (->0:15)
 Comparing the tools (->0:35)
 Make them work together (->0:40)
 Wrap-up and questions (->Beer)
Agenda
Dataiku - Pig, Hive and Cascading
CHOOSE TECHNOLOGY
Hadoop
Ceph
Sphere
Cassandra
Spark
Scikit-Learn
Mahout
WEKA
MLBase
LibSVM
SAS
RapidMiner
SPSS
Pandas
QlikView
Tableau
SpotFire
HTML5/D3
InfiniDB
Vertica
GreenPlum
Impala
Netezza
Elastic Search
SOLR
MongoDB
Riak
Membase
Pig
Cascading
Talend
Machine Learning
Mystery Land
Scalability Central
NoSQL-Slavia
SQL Columnar Republic
Visualization County
Data Clean Wasteland
Statistician Old House
R
How do I (pre)process data?
Implicit User Data
(Views, Searches…)
Content Data
(Title, Categories, Price, …)
Explicit User Data
(Click, Buy, …)
User Information
(Location, Graph…)
500TB
50TB
1TB
200GB
Transformation
Matrix
Transformation
Predictor
Per User Stats
Per Content Stats
User Similarity
Rank Predictor
Content Similarity
A/B Test Data
Predictor Runtime
Online User Information
 Analyse Raw Logs
(Trackers, Web Logs)
 Extract IP, Page, …
 Detect and remove
robots
 Build Statistics
◦ Number of page views per product
◦ Best referrers
◦ Traffic Analysis
◦ Funnel
◦ SEO Analysis
◦ …
Typical Use Case 1
Web Analytics Processing
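The steps of this use case can be sketched on a single machine as follows. This is only an illustration, not code from the talk: the Apache "combined" log layout, the `ROBOT_MARKERS` list and the `page_views` helper are all assumptions.

```python
import re
from collections import Counter

# Assumed Apache "combined" log layout; the talk does not specify the format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (?P<page>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical robot markers; a real job would rely on a maintained list.
ROBOT_MARKERS = ("bot", "crawler", "spider")


def page_views(lines):
    """Count page views per page, skipping unparseable lines and robots."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed line: ignore it
        agent = match.group("agent").lower()
        if any(marker in agent for marker in ROBOT_MARKERS):
            continue  # detected robot: remove it
        counts[match.group("page")] += 1
    return counts


SAMPLE = [
    '1.2.3.4 - - [15/May/2013:10:00:00 +0200] "GET /product/42 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '1.2.3.5 - - [15/May/2013:10:00:01 +0200] "GET /product/42 HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '1.2.3.6 - - [15/May/2013:10:00:02 +0200] "GET /home HTTP/1.1" 200 128 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    print(page_views(SAMPLE))
```

On Hadoop the same logic would typically be split into a parsing/filtering map step and a counting reduce step.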
 Extract query logs
 Perform query normalization
 Compute n-grams
 Compute search “sessions”
 Compute log-likelihood ratio for n-grams across sessions
Typical Use Case 2
Mining Search Logs for Synonyms
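A minimal sketch of two of the steps above, under stated assumptions: normalization is reduced to lowercasing and whitespace splitting, and the log-likelihood ratio is computed as the G statistic of a 2x2 contingency table of session counts. The function names `ngrams` and `log_likelihood_ratio` are illustrative, not taken from the talk.

```python
import math


def ngrams(query, n=2):
    """Word n-grams of a query after a naive normalization (assumed)."""
    tokens = query.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def log_likelihood_ratio(k11, k12, k21, k22):
    """G statistic for a 2x2 table of session counts.

    k11: sessions containing both n-grams
    k12: sessions containing only the first
    k21: sessions containing only the second
    k22: sessions containing neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                g += o * math.log(o / expected)
    return 2.0 * g


if __name__ == "__main__":
    print(ngrams("Cheap Flights Paris"))
    print(log_likelihood_ratio(50, 5, 5, 50))
```

N-gram pairs that co-occur in sessions far more often than independence predicts get a high score, which makes them synonym candidates.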
 Compute user–product association matrix
 Compute different similarity ratios (Ochiai, cosine, …)
 Filter out bad predictions
 For each user, select the best recommendable products
Typical Use Case 3
Product Recommender
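As an illustration of the similarity step, here is a small in-memory sketch of the Ochiai coefficient between products, computed from (user, product) pairs. On binary data it equals the cosine similarity. The helper names and the sample events are hypothetical.

```python
import math
from collections import defaultdict


def ochiai(users_a, users_b):
    """Ochiai coefficient between two sets of users: |A & B| / sqrt(|A| * |B|)."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))


def product_similarities(events):
    """events: iterable of (user, product) pairs. Returns {(p, q): score} for score > 0."""
    users_by_product = defaultdict(set)
    for user, product in events:
        users_by_product[product].add(user)
    products = sorted(users_by_product)
    similarities = {}
    for i, p in enumerate(products):
        for q in products[i + 1:]:
            score = ochiai(users_by_product[p], users_by_product[q])
            if score > 0:  # drop pairs with no user in common
                similarities[(p, q)] = score
    return similarities


# Hypothetical sample data.
EVENTS = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"),
    ("u3", "A"), ("u3", "C"),
]

if __name__ == "__main__":
    print(product_similarities(EVENTS))
```

The all-pairs loop is quadratic in the number of products; at the scale discussed in the talk this step is what gets distributed.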
 Yahoo Research in 2006
 Inspired by Sawzall, a Google paper from 2003
 2007 as an Apache project
 Initial motivation
◦ Search log analytics: how long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary over the course of a day/week/month? …
Pig History
words = LOAD '/training/hadoop-wordcount/output'
    USING PigStorage('\t')
    AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
 Developed by Facebook in January 2007
 Open source in August 2008
 Initial Motivation
◦ Provide a SQL like abstraction to perform
statistics on status updates
Hive History
create external table wordcounts (
    word string,
    count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';
select * from wordcounts order by count desc limit 10;
select SUM(count) from wordcounts where word like 'th%';
 Authored by Chris Wensel in 2008
 Associated projects
◦ Cascalog: Cascading in Clojure
◦ Scalding: Cascading in Scala (Twitter in 2012)
◦ Lingual (to be released soon): SQL layer on top of Cascading
Cascading History
MapReduce
Simplicity is a complexity
5/15/2013 Dataiku - Innovation Services
Pig & Hive
Mapping to Mapreduce jobs
* VAT excluded
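-- The FILTER comparison is missing on the slide; 'buy' is an assumed, illustrative value.
events = LOAD '/events' USING PigStorage('\t') AS
    (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type == 'buy';
by_user = GROUP events_filtered BY user;
-- After GROUP, "group" holds the user key and fields are reached through the grouped relation.
price_by_user = FOREACH by_user GENERATE group AS user,
    SUM(events_filtered.price) AS total_price,
    MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;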
Job 1: Mapper (LOAD, FILTER) -> Shuffle and sort by user (GROUP) -> Job 1: Reducer (FOREACH, FILTER)
Pig & Hive
Mapping to Mapreduce jobs
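-- The FILTER comparison is missing on the slide; 'buy' is an assumed, illustrative value.
events = LOAD '/events' USING PigStorage('\t') AS
    (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type == 'buy';
by_user = GROUP events_filtered BY user;
-- After GROUP, "group" holds the user key and fields are reached through the grouped relation.
price_by_user = FOREACH by_user GENERATE group AS user,
    SUM(events_filtered.price) AS total_price,
    MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
recent_high = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';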
Job 1: Mapper (LOAD, FILTER) -> Shuffle and sort by user (GROUP) -> Job 1: Reducer (FOREACH, FILTER)
Job 2: Mapper (LOAD from tmp) -> Shuffle and sort by max_ts (ORDER) -> Job 2: Reducer (STORE)
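The mapping of the script onto two MapReduce jobs can be imitated in plain Python. This is a rough single-process sketch, not how Pig is implemented: `job1`, `job2`, the `'buy'` event type and the sample events are hypothetical.

```python
from collections import defaultdict


def job1(events, event_type, threshold):
    """Job 1: filter in the map phase, shuffle by user, aggregate in the reduce phase."""
    # Map phase: LOAD + FILTER, emitting (user, (price, timestamp)).
    shuffled = defaultdict(list)
    for type_, user, price, timestamp in events:
        if type_ == event_type:
            shuffled[user].append((price, timestamp))  # shuffle: group by key
    # Reduce phase: FOREACH (SUM, MAX) followed by the second FILTER.
    output = []
    for user in sorted(shuffled):
        values = shuffled[user]
        total_price = sum(price for price, _ in values)
        max_ts = max(ts for _, ts in values)
        if total_price > threshold:
            output.append((user, total_price, max_ts))
    return output


def job2(rows):
    """Job 2: re-read the temporary output and sort it by max_ts, descending."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical sample data: (type, user, price, timestamp).
EVENTS = [
    ("buy", "alice", 800, 10),
    ("buy", "alice", 300, 30),
    ("view", "alice", 0, 40),
    ("buy", "bob", 2000, 20),
    ("buy", "carol", 500, 50),
]

if __name__ == "__main__":
    print(job2(job1(EVENTS, "buy", 1000)))
```

A second job is needed because the final ORDER requires a shuffle on a different key (max_ts) than the GROUP (user).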
Pig
How does it work
Data execution plan compiled into 10 MapReduce jobs, executed in parallel (or not)
TResolution = LOAD '$PREFIX/dwh_dim_external_tracking_resolution/dt=$DAY' USING PigStorage('\u0001');
TResolution = FOREACH TResolution GENERATE $0 AS SKResolutionId, $1 as ResolutionId;

TSiteMap = LOAD '$PREFIX/dwh_dim_sitemapnode/dt=$DAY' USING PigStorage('\u0001');
TSiteMap = FOREACH TSiteMap GENERATE $0 AS SKSimteMapNodeId, $2 as SiteMapNodeId;

TCustomer = LOAD '$PREFIX/customer_relation/dt=$DAY' USING PigStorage('\u0001')
    as (SKCustomerId:chararray,
        CustomerId:chararray);

F1 = FOREACH F1 GENERATE *, (date_time IS NOT NULL ? CustomFormatToISO(date_time, 'yyyy-MM-dd HH:mm:ss'

F2 = FOREACH F1 GENERATE *,
    CONCAT(CONCAT(CONCAT(CONCAT(visid_high,'-'), visid_low), '-'), visit_num) as VisitId,
    (referrer matches '.*cdiscount.com.*' OR referrer matches 'cdscdn.com' ? NULL :referrer ) as Referrer,
    (iso IS NOT NULL ? ISODaysBetween(iso, '1899-12-31T00:00:00') : NULL)
        AS SkDateId,
    (iso IS NOT NULL ? ISOSecondsBetween(iso, ISOToDay(iso)) : NULL)
        AS SkTimeId,
    ((event_list is not null and event_list matches '.*\\b202\\b.*') ? 'Y' : 'N') as is_202,
    ((event_list is not null and event_list matches '.*\\b10\\b.*') ? 'Y' : 'N') as is_10,
    ((event_list is not null and event_list matches '.*\\b12\\b.*') ? 'Y' : 'N') as is_12,
    ((event_list is not null and event_list matches '.*\\b13\\b.*') ? 'Y' : 'N') as is_13,
    ((event_list is not null and event_list matches '.*\\b14\\b.*') ? 'Y' : 'N') as is_14,
    ((event_list is not null and event_list matches '.*\\b11\\b.*') ? 'Y' : 'N') as is_11,
    ((event_list is not null and event_list matches '.*\\b1\\b.*') ? 'Y' : 'N') as is_1,
    REGEX_EXTRACT(pagename, 'F-(.*):.*', 1) AS ProductReferenceId,
    NULL AS OriginFile;

SET DEFAULT_PARALLEL 24;

F3 = JOIN F2 BY post_search_engine LEFT, TSearchEngine BY SearchEngineId USING 'replicated' PARALLEL 20 ;
F3 = FOREACH F3 GENERATE *, (SKSearchEngineId IS NULL ? '-1' : SKSearchEngineId) as SKSearchEngineId;
--F3 = FOREACH F2 GENERATE *, NULL AS SKSearchEngineId, NULL AS SearchEngineId;

F4 = JOIN F3 BY browser LEFT, TBrowser BY BrowserId USING 'replicated' PARALLEL 20;
F4 = FOREACH F4 GENERATE *, (SKBrowserId IS NULL ? '-1' : SKBrowserId) as SKBrowserId;
--F4 = FOREACH F3 GENERATE *, NULL AS SKBrowserId, NULL AS BrowserId;

F5 = JOIN F4 BY os LEFT, TOperatingSystem BY OperatingSystemId USING 'replicated' PARALLEL 20;
F5 = FOREACH F5 GENERATE *, (SKOperatingSystemId IS NULL ? '-1' : SKOperatingSystemId) as SKOperatingSystemId;
--F5 = FOREACH F4 GENERATE *, NULL AS SKOperatingSystemId, NULL AS OperatingSystemId;

F6 = JOIN F5 BY resolution LEFT, TResolution BY ResolutionId USING 'replicated' PARALLEL 20;
F6 = FOREACH F6 GENERATE *, (SKResolutionId IS NULL ? '-1' : SKResolutionId) as SKResolutionId;
--F6 = FOREACH F5 GENERATE *, NULL AS SKResolutionId, NULL AS ResolutionId;

F7 = JOIN F6 BY post_evar4 LEFT, TSiteMap BY SiteMapNodeId USING 'replicated' PARALLEL 20;
F7 = FOREACH F7 GENERATE *, (SKSimteMapNodeId IS NULL ? '-1' : SKSimteMapNodeId) as SKSimteMapNodeId;
--F7 = FOREACH F6 GENERATE *, NULL AS SKSimteMapNodeId, NULL AS SiteMapNodeId;

SPLIT F7 INTO WITHOUT_CUSTOMER IF post_evar30 IS NULL, WITH_CUSTOMER IF post_evar30 IS NOT NULL;

F8 = JOIN WITH_CUSTOMER BY post_evar30 LEFT, TCustomer BY CustomerId USING 'skewed' PARALLEL 20;
WITHOUT_CUSTOMER = FOREACH WITHOUT_CUSTOMER GENERATE *, NULL as SKCustomerId, NULL as CustomerId;

--F8_UNION = FOREACH F7 GENERATE *, NULL as SKCustomerId, NULL as CustomerId;
F8_UNION = UNION F8, WITHOUT_CUSTOMER;
--DESCRIBE F8;
--DESCRIBE WITHOUT_CUSTOMER;
--DESCRIBE F8_UNION;

F9 = FOREACH F8_UNION GENERATE
    visid_high,
    visid_low,
    VisitId,
    post_evar30,
    SKCustomerId,
    visit_num,
    SkDateId,
    SkTimeId,
    post_evar16,
    post_evar52,
    visit_page_num,
    is_202,
    is_10,
    is_12,
Hive Joins
How to join with MapReduce?

Mappers output
tbl_idx  uid  name
1        1    Dupont
1        2    Durand

tbl_idx  uid  type
2        1    Type1
2        1    Type2
2        2    Type1

Shuffle by uid
Sort by (uid, tbl_idx)

Reducer 1 (input)
uid  tbl_idx  name    type
1    1        Dupont
1    2                Type1
1    2                Type2

Reducer 2 (input)
uid  tbl_idx  name    type
2    1        Durand
2    2                Type1

Reducer 1 (output)
uid  name    type
1    Dupont  Type1
1    Dupont  Type2

Reducer 2 (output)
uid  name    type
2    Durand  Type1
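The join shown above can be replayed in a few lines of Python. This is a single-process sketch of the idea, not Hive's implementation; `map_phase` and `reduce_side_join` are made-up names, and the data is the one from the slide.

```python
from itertools import groupby

USERS = [(1, "Dupont"), (2, "Durand")]                # (uid, name)
TYPES = [(1, "Type1"), (1, "Type2"), (2, "Type1")]    # (uid, type)


def map_phase(users, types):
    """Tag every record with the index of the table it comes from."""
    for uid, name in users:
        yield (uid, 1, name)
    for uid, type_ in types:
        yield (uid, 2, type_)


def reduce_side_join(users, types):
    # Shuffle by uid, sort by (uid, tbl_idx): rows of table 1 arrive first.
    records = sorted(map_phase(users, types), key=lambda r: (r[0], r[1]))
    joined = []
    for uid, group in groupby(records, key=lambda r: r[0]):
        names = []  # rows of table 1 buffered for this uid
        for _, tbl_idx, value in group:
            if tbl_idx == 1:
                names.append(value)
            else:
                # Each row of table 2 is combined with the buffered table 1 rows.
                joined.extend((uid, name, value) for name in names)
    return joined


if __name__ == "__main__":
    for row in reduce_side_join(USERS, TYPES):
        print(row)
```

Sorting on (uid, tbl_idx) is what lets the reducer buffer only one side of the join while streaming the other.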
 Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
 Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
 Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
 Performance and optimization
Comparing without Comparable
 Transformation as a sequence of operations
 Transformation as a set of formulas
Procedural Vs Declarative
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
 All three extend the basic data model with extended data types
◦ array-like [ event1, event2, event3]
◦ map-like { type1:value1, type2:value2, …}
 Different approach
◦ Resilient Schema
◦ Static Typing
◦ No Static Typing
Data type and Model
Rationale
Hive
Data Type and Schema
Simple type                      Details
TINYINT, SMALLINT, INT, BIGINT   1, 2, 4 and 8 bytes
FLOAT, DOUBLE                    4 and 8 bytes
BOOLEAN
STRING                           Arbitrary-length, replaces VARCHAR
TIMESTAMP

Complex type                     Details
ARRAY                            Array of typed items (0-indexed)
MAP                              Associative map
STRUCT                           Complex class-like objects
CREATE TABLE visit (
user_name STRING,
user_id INT,
user_details STRUCT<age:INT, zipcode:INT>
);
rel = LOAD '/folder/path/'
    USING PigStorage('\t')
    AS (col:type, col:type, col:type);
Data types and Schema
Pig
Simple type                  Details
int, long, float, double     32 and 64 bits, signed
chararray                    A string
bytearray                    An array of … bytes
boolean                      A boolean

Complex type                 Details
tuple                        An ordered sequence of fields (fieldname:value)
bag                          An unordered collection of tuples (duplicates allowed)
 Support for Any Java Types, provided they can be
serialized in Hadoop
 No support for Typing
Data Type and Schema
Cascading
Simple type                  Details
Int, Long, Float, Double     32 and 64 bits, signed
String                       A string
byte[]                       An array of … bytes
Boolean                      A boolean

Complex type                 Details
Object                       Object must be « Hadoop serializable »
Style Summary
            Style         Typing                                         Data model                               Metadata store
Pig         Procedural    Static + dynamic                               scalar + tuple + bag (fully recursive)   No (HCatalog)
Hive        Declarative   Static + dynamic, enforced at execution time   scalar + list + map                      Integrated
Cascading   Procedural    Weak                                           scalar + Java objects                    No
 Does debugging the tool lead to bad headaches?
Headachability
Motivation
 Out Of Memory Error (Reducer)
 Exceptions in built-in / extended functions (handling of null)
 Null vs “”
 Nested Foreach and scoping
 Date Management (pig 0.10)
 Field implicit ordering
Headaches
Pig
A Pig Error
 Out of Memory Errors in
Reducers
 Few Debugging Options
 Null / “”
 No builtin “first”
Headaches
Hive
 Weak Typing Errors (comparing
Int and String … )
 Illegal Operation Sequence
(Group after group …)
 Field Implicit Ordering
Headaches
Cascading
 How to perform unit tests ?
 How to have different versions of the same script
(parameter) ?
Testing
Motivation
 System Variables
 Comment to test
 No Meta Programming
 pig -x local to execute on local files
Testing
Pig
 JUnit tests are possible
 Ability to use code to actually comment out some
variables
Testing / Environment
Cascading
 Lots of iteration while developing on Hadoop
 Sometimes jobs fail
 Sometimes need to restart from the start …
Checkpointing
Motivation
Parse Logs -> Per Page Stats -> Page User Correlation -> Filtering -> Output
FIX and relaunch
 STORE Command to manually
store files
Pig
Manual Checkpointing
Parse Logs -> Per Page Stats -> Page User Correlation -> Filtering -> Output
// COMMENT Beginning
of script and relaunch
 Ability to re-run a
flow automatically
from the last saved
checkpoint
Cascading
Automated Checkpointing
addCheckpoint(…)
 Check each file intermediate timestamp
 Execute only if more recent
Cascading
Topological Scheduler
Parse Logs -> Per Page Stats -> Page User Correlation -> Filtering -> Output
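The timestamp rule above is the same idea as in make. A minimal sketch follows, assuming a linear chain of steps and timestamps passed in as a plain dictionary. It does not use Cascading's API; `stale_steps`, the dataset names and the timestamps are hypothetical.

```python
def stale_steps(steps, timestamps):
    """Return the names of the steps that must run.

    steps: list of (name, inputs, output) in topological order.
    timestamps: dataset -> last modification time; a missing output means "never built".
    """
    to_run = []
    rebuilt = set()  # outputs that will be refreshed during this run
    for name, inputs, output in steps:
        out_ts = timestamps.get(output)
        needs_run = (
            out_ts is None
            or any(i in rebuilt for i in inputs)
            or any(timestamps.get(i, 0) > out_ts for i in inputs)
        )
        if needs_run:
            to_run.append(name)
            rebuilt.add(output)
    return to_run


# Assumed linear flow, following the stage names of the slide.
STEPS = [
    ("parse_logs", ["raw_logs"], "parsed"),
    ("per_page_stats", ["parsed"], "page_stats"),
    ("page_user_correlation", ["page_stats"], "correlation"),
    ("filtering", ["correlation"], "filtered"),
    ("output", ["filtered"], "final"),
]

# Hypothetical timestamps: "correlation" is older than its input "page_stats".
TIMESTAMPS = {
    "raw_logs": 100, "parsed": 110, "page_stats": 120,
    "correlation": 105, "filtered": 130, "final": 140,
}

if __name__ == "__main__":
    print(stale_steps(STEPS, TIMESTAMPS))
```

Only the stale step and everything downstream of it is scheduled; up-to-date steps at the start of the flow are skipped.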
Productivity Summary
            Headaches                            Checkpointing / Replay           Testing / Metaprogramming
Pig         Lots                                 Manual save                      Difficult metaprogramming, easy local testing
Hive        Few, but without debugging options   None (that’s SQL)                None (that’s SQL)
Cascading   Weak typing, complexity              Checkpointing, partial updates   Possible
 Ability to integrate different file formats
◦ Text Delimited
◦ Sequence File (Binary Hadoop format)
◦ Avro, Thrift ..
 Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)
Formats Integration
Motivation
Format                         Size on disk (GB)   Hive processing time (24 cores)
Text file, uncompressed        18.7                1m32s
1 text file, gzipped           3.89                6m23s (no parallelization)
JSON, compressed               7.89                2m42s
Multiple text files, gzipped   4.02                43s
Sequence file, block, gzip     5.32                1m18s
Text file, LZO indexed         7.03                1m22s
Format impact on size and performance
 Hive: SerDe (Serializer/Deserializer)
 Pig : Storage
 Cascading: Tap
Format Integration
 No support for “UPDATE” patterns, any increment is
performed by adding or deleting a partition
 Common partition schemas on Hadoop
◦ By Date /apache_logs/dt=2013-01-23
◦ By Data center /apache_logs/dc=redbus01/…
◦ By Country
◦ …
◦ Or any combination of the above
Partitions
Motivation
Hive Partitioning
Partitioned tables
CREATE TABLE event (
user_id INT,
type STRING,
message STRING)
PARTITIONED BY (day STRING, server_id STRING);
Disk structure
/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1
INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27',
    server_id='s1')
SELECT * FROM event_tmp;
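The directory layout above encodes the partition values in the path itself. A small sketch of building and parsing such paths, for illustration only: `partition_path` and `parse_partition` are hypothetical helpers, not part of Hive.

```python
def partition_path(table_root, partition):
    """Build a partition directory from an ordered list of (column, value) pairs."""
    parts = [f"{column}={value}" for column, value in partition]
    return "/".join([table_root.rstrip("/")] + parts)


def parse_partition(path, table_root):
    """Recover the partition values from a file path below table_root."""
    relative = path[len(table_root.rstrip("/")) + 1:]
    values = {}
    for segment in relative.split("/"):
        if "=" in segment:  # the trailing file name has no "column=value" form
            column, value = segment.split("=", 1)
            values[column] = value
    return values


FILES = [
    "/hive/event/day=2013-01-27/server_id=s1/file0",
    "/hive/event/day=2013-01-27/server_id=s2/file0",
    "/hive/event/day=2013-01-28/server_id=s2/file0",
]

if __name__ == "__main__":
    print(partition_path("/hive/event", [("day", "2013-01-27"), ("server_id", "s1")]))
    # Partition pruning: keep only the files of one day without opening them.
    print([f for f in FILES if parse_partition(f, "/hive/event")["day"] == "2013-01-27"])
```

Because the values live in the path, a query restricted to one day only has to list and read the matching directories.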
 No Direct support for partition
 Support for “Glob” Tap, to read from files using patterns
 -> You can code your own custom or virtual partition schemes
Cascading Partition
External Code Integration
Simple UDF
Pig Hive
Cascading
Hive Complex UDF
(Aggregators)
Cascading
Direct Code Evaluation
Uses Janino, a very cool project:
http://docs.codehaus.org/display/JANINO
 Allows calling a Cascading flow from a Spring Batch job
Spring Batch
Cascading Integration
 No full Integration with Spring MessageSource or
MessageHandler yet (only for local flows)
Integration
Summary
            Partition / Incremental updates   External code                                                Format integration
Pig         No direct support                 Simple                                                       Doable and rich community
Hive        Fully integrated, SQL-like        Very simple, but complex dev setup                           Doable and existing community
Cascading   With coding                       Complex UDFs but regular, and Java expressions embeddable    Doable and growing community
 Several Common Map Reduce Optimization
Patterns
◦ Combiners
◦ MapJoin
◦ Job Fusion
◦ Job Parallelism
◦ Reducer Parallelism
 Different support per framework
◦ Fully Automatic
◦ Pragma / Directives / Options
◦ Coding style / Code to write
Optimization
SELECT date, COUNT(*) FROM product GROUP BY date
Combiner
Perform Partial Aggregate at Mapper Stage
Map
2012-02-14 4354
…
2012-02-15 21we2
2012-02-14 qa334
…
2012-02-15 23aq2
2012-02-14 4354
…
2012-02-15 21we2
2012-02-14 qa334
…
2012-02-15 23aq2
Reduce
2012-02-14 20
2012-02-15 35
2012-02-16 1
SELECT date, COUNT(*) FROM product GROUP BY date
Combiner
Perform Partial Aggregate at Mapper Stage
Map
2012-02-14 4354
…
2012-02-15 21we2
2012-02-14 qa334
…
2012-02-15 23aq2
Combiner (partial counts, one group per mapper)
2012-02-14 12
2012-02-15 23
2012-02-16 1

2012-02-14 8
2012-02-15 12
Reduce
2012-02-14 20
2012-02-15 35
2012-02-16 1
Reduced network bandwidth. Better parallelism.
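The effect of the combiner can be reproduced with the counts from the slide. This is a single-process sketch with hypothetical function names; the row contents are irrelevant to the count, so each row is reduced to its date.

```python
from collections import Counter


def map_with_combiner(dates):
    """Partial aggregate at the mapper: one (date, count) record per distinct date."""
    return Counter(dates)


def reduce_counts(partials):
    """Final aggregate: sum the partial counts coming from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Two mapper inputs sized to match the partial counts shown on the slide.
MAPPER_1 = ["2012-02-14"] * 12 + ["2012-02-15"] * 23 + ["2012-02-16"] * 1
MAPPER_2 = ["2012-02-14"] * 8 + ["2012-02-15"] * 12

if __name__ == "__main__":
    partials = [map_with_combiner(MAPPER_1), map_with_combiner(MAPPER_2)]
    print(reduce_counts(partials))
    # Records sent over the network: with combiner vs without.
    print(sum(len(p) for p in partials), len(MAPPER_1) + len(MAPPER_2))
```

With these inputs the shuffle carries 5 records instead of 56. This works because COUNT is associative and commutative, so partial counts can be summed in any order.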
Join Optimization
Map Join
set hive.auto.convert.join = true;
Hive
Pig
Cascading
( no aggregation support after HashJoin)
 Critical for performance
 Estimated from the size of the input files
◦ Hive
 divide the input size by hive.exec.reducers.bytes.per.reducer (default 1GB)
◦ Pig
 divide the input size by pig.exec.reducers.bytes.per.reducer (default 1GB)
Number of Reducers
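The estimate above is a simple division. A sketch of the arithmetic, assuming 1GB means 10^9 bytes and that an upper bound on the number of reducers applies; the `max_reducers` value of 999 is an assumption chosen for illustration, and `estimate_reducers` is a hypothetical helper.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=10**9, max_reducers=999):
    """Estimate the number of reducers from the total input size.

    bytes_per_reducer: the "bytes per reducer" setting (1 GB as on the slide).
    max_reducers: assumed upper bound, used here only for illustration.
    """
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    # An 18.7 GB input, like the uncompressed text file in the format comparison.
    print(estimate_reducers(18_700_000_000))
```

The estimate only looks at input size, so a job whose map phase filters most of the data away (or expands it) may still need the number of reducers set by hand.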
            Combiner optimization   Join optimization      Number of reducers optimization
Pig         Automatic               Option                 Estimate or DIY
Cascading   DIY                     HashJoin               DIY
Hive        Partial DIY             Automatic (Map Join)   Estimate or DIY
Performance & Optimization
Summary
Follow the Flow
Tracker Log
MongoDB
MySQL
MySQL
Syslog
Product
Catalog
Order
Apache Logs
Session
Product Transformation
Category Affinity
Category Targeting
Customer Profile
Product Recommender
S3
Search Logs (External) Search Engine
Optimization
(Internal) Search
Ranking
MongoDB
MySQL
Partner FTP
Sync In Sync Out
Pig
Pig
Hive
Hive
ElasticSearch
E.g. Product Recommender
Page Views
Orders
Catalog
Bots, Special Users
Filtered Page Views
User Affinity
Product Popularity
User Similarity (Per Category)
Recommendation Graph
Recommendation
Order Summary
User Similarity (Per Brand)
Machine Learning
 Schema Maintenance between tools
 Proper incremental and efficient synchronization
between tools and NoSQL Store and Logs Systems
 Proper “management” partition (daily jobs, …)
 Job Sequence and Management
◦ How to properly handle a new field? Missing data? Recompute everything?
Pain Points
On Large Projects
 HCatalog provides interoperability between Hive and Pig in terms of schema
Integration Option
HCatalog
Hive Pig
HCatalog
 1970 Shell script
 1977 Makefile
 1980 Makedeps
 1999 Cons/CMake
 2001 Maven
 2004 Ivy
 2008 Gradle
 Shell Script
 2008 HaMake
 2009 Oozie
 … ETL Hadoop
 Next … ?
Similar to “Build”
 Want to keep close to SQL ?
◦ Hive
 Want to write large flows ?
◦ Pig
 Want to integrate into large-scale programming projects?
◦ Cascading (cascalog / scalding)
Presentation Available On
http://www.slideshare.net/Dataiku

More Related Content

What's hot

Semantic Web & TYPO3
Semantic Web & TYPO3Semantic Web & TYPO3
Semantic Web & TYPO3André Wuttig
 
6 things about perl 6
6 things about perl 66 things about perl 6
6 things about perl 6brian d foy
 
Get into the FLOW with Extbase
Get into the FLOW with ExtbaseGet into the FLOW with Extbase
Get into the FLOW with ExtbaseJochen Rau
 
Scientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutScientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutData Science London
 
The History of PHPersistence
The History of PHPersistenceThe History of PHPersistence
The History of PHPersistenceHugo Hamon
 
Database Design Patterns
Database Design PatternsDatabase Design Patterns
Database Design PatternsHugo Hamon
 
Indices APIs - Elasticsearch Reference
Indices APIs - Elasticsearch ReferenceIndices APIs - Elasticsearch Reference
Indices APIs - Elasticsearch ReferenceDaniel Ku
 
vfsStream - effective filesystem mocking
vfsStream - effective filesystem mocking vfsStream - effective filesystem mocking
vfsStream - effective filesystem mocking Sebastian Marek
 
CO2 sequestration in a different manner
CO2 sequestration in a different mannerCO2 sequestration in a different manner
CO2 sequestration in a different mannerGreen Minerals B.V.
 
Web Development with CoffeeScript and Sass
Web Development with CoffeeScript and SassWeb Development with CoffeeScript and Sass
Web Development with CoffeeScript and SassBrian Hogan
 
Simple Photo Processing and Web Display with Perl
Simple Photo Processing and Web Display with PerlSimple Photo Processing and Web Display with Perl
Simple Photo Processing and Web Display with PerlKent Cowgill
 
The Joy of Smartmatch
The Joy of SmartmatchThe Joy of Smartmatch
The Joy of SmartmatchAndrew Shitov
 
vfsStream - a better approach for file system dependent tests
vfsStream - a better approach for file system dependent testsvfsStream - a better approach for file system dependent tests
vfsStream - a better approach for file system dependent testsFrank Kleine
 

What's hot (20)

Perl Web Client
Perl Web ClientPerl Web Client
Perl Web Client
 
Semantic Web & TYPO3
Semantic Web & TYPO3Semantic Web & TYPO3
Semantic Web & TYPO3
 
Wsomdp
WsomdpWsomdp
Wsomdp
 
Nubilus Perl
Nubilus PerlNubilus Perl
Nubilus Perl
 
6 things about perl 6
6 things about perl 66 things about perl 6
6 things about perl 6
 
Get into the FLOW with Extbase
Get into the FLOW with ExtbaseGet into the FLOW with Extbase
Get into the FLOW with Extbase
 
Scientific Article Recommendations with Mahout
Scientific Article Recommendations with MahoutScientific Article Recommendations with Mahout
Scientific Article Recommendations with Mahout
 
Perl6 grammars
Perl6 grammarsPerl6 grammars
Perl6 grammars
 
The History of PHPersistence
The History of PHPersistenceThe History of PHPersistence
The History of PHPersistence
 
Database Design Patterns
Database Design PatternsDatabase Design Patterns
Database Design Patterns
 
Perl6 in-production
Perl6 in-productionPerl6 in-production
Perl6 in-production
 
Php
PhpPhp
Php
 
Indices APIs - Elasticsearch Reference
Indices APIs - Elasticsearch ReferenceIndices APIs - Elasticsearch Reference
Indices APIs - Elasticsearch Reference
 
vfsStream - effective filesystem mocking
vfsStream - effective filesystem mocking vfsStream - effective filesystem mocking
vfsStream - effective filesystem mocking
 
CO2 sequestration in a different manner
CO2 sequestration in a different mannerCO2 sequestration in a different manner
CO2 sequestration in a different manner
 
Web Development with CoffeeScript and Sass
Web Development with CoffeeScript and SassWeb Development with CoffeeScript and Sass
Web Development with CoffeeScript and Sass
 
Simple Photo Processing and Web Display with Perl
Simple Photo Processing and Web Display with PerlSimple Photo Processing and Web Display with Perl
Simple Photo Processing and Web Display with Perl
 
The Joy of Smartmatch
The Joy of SmartmatchThe Joy of Smartmatch
The Joy of Smartmatch
 
Perl 6 by example
Perl 6 by examplePerl 6 by example
Perl 6 by example
 
vfsStream - a better approach for file system dependent tests
vfsStream - a better approach for file system dependent testsvfsStream - a better approach for file system dependent tests
vfsStream - a better approach for file system dependent tests
 

Similar to Hadoop Batch Tools Pig Hive Cascading

Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas NApache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas N
Apache Hadoop India Summit 2011 talk "Oozie - Workflow for Hadoop" by Andreas NYahoo Developer Network
 
WordPress Realtime - WordCamp São Paulo 2015
WordPress Realtime - WordCamp São Paulo 2015WordPress Realtime - WordCamp São Paulo 2015
WordPress Realtime - WordCamp São Paulo 2015Fernando Daciuk
 
Operation Oriented Web Applications / Yokohama pm7
Operation Oriented Web Applications / Yokohama pm7Operation Oriented Web Applications / Yokohama pm7
Operation Oriented Web Applications / Yokohama pm7Masahiro Nagano
 
WIRED and the WP REST API
WIRED and the WP REST APIWIRED and the WP REST API
WIRED and the WP REST APIkvignos
 
PHP and Rich Internet Applications
PHP and Rich Internet ApplicationsPHP and Rich Internet Applications
PHP and Rich Internet Applicationselliando dias
 
Drupal Development (Part 2)
Drupal Development (Part 2)Drupal Development (Part 2)
Drupal Development (Part 2)Jeff Eaton
 
WordPressでIoTをはじめよう
WordPressでIoTをはじめようWordPressでIoTをはじめよう
WordPressでIoTをはじめようYuriko IKEDA
 
Using R for Building a Simple and Effective Dashboard
Using R for Building a Simple and Effective DashboardUsing R for Building a Simple and Effective Dashboard
Using R for Building a Simple and Effective DashboardAndrea Gigli
 
Practical pig
Practical pigPractical pig
Practical pigtrihug
 
Cena-DTA PHP Conference 2011 Slides
Cena-DTA PHP Conference 2011 SlidesCena-DTA PHP Conference 2011 Slides
Cena-DTA PHP Conference 2011 SlidesAsao Kamei
 
Real-time search in Drupal with Elasticsearch @Moldcamp
Real-time search in Drupal with Elasticsearch @MoldcampReal-time search in Drupal with Elasticsearch @Moldcamp
Real-time search in Drupal with Elasticsearch @MoldcampAlexei Gorobets
 
Accessible Web Components_Techshare India 2014
Accessible Web Components_Techshare India 2014Accessible Web Components_Techshare India 2014
Accessible Web Components_Techshare India 2014BarrierBreak
 
Serverless Functions and Vue.js
Serverless Functions and Vue.jsServerless Functions and Vue.js
Serverless Functions and Vue.jsSarah Drasner
 
Think Generic - Add API's To Your Custom Modules
Think Generic - Add API's To Your Custom ModulesThink Generic - Add API's To Your Custom Modules
Think Generic - Add API's To Your Custom ModulesJens Sørensen
 
Let's write secure Drupal code! Drupal MountainCamp 2019
Let's write secure Drupal code! Drupal MountainCamp 2019Let's write secure Drupal code! Drupal MountainCamp 2019
Let's write secure Drupal code! Drupal MountainCamp 2019Balázs Tatár
 
Sydney Oracle Meetup - execution plans
Sydney Oracle Meetup - execution plansSydney Oracle Meetup - execution plans
Sydney Oracle Meetup - execution planspaulguerin
 
R57shell
R57shellR57shell
R57shellady36
 

Hadoop Batch Tools Pig Hive Cascading

  • 1. Hadoop Is A Batch: Pig, Hive, Cascading … Paris JUG, May 2013. Florian Douetteau
  • 2. About me: Florian Douetteau <florian.douetteau@dataiku.com>
      ▪ CEO at Dataiku
      ▪ Freelance at Criteo (Online Ads)
      ▪ CTO at IsCool Ent. (#1 French Social Gamer)
      ▪ VP R&D Exalead (Search Engine Technology)
      15/05/2013, Dataiku Training – Hadoop for Data Science
  • 3. Agenda
      ▪ Hadoop and Context (->0:03)
      ▪ Pig, Hive, Cascading, … (->0:09)
      ▪ How they work (->0:15)
      ▪ Comparing the tools (->0:35)
      ▪ Make them work together (->0:40)
      ▪ Wrap-up and questions (->Beer)
      Dataiku - Pig, Hive and Cascading
  • 4. CHOOSE TECHNOLOGY
      Technologies shown: Hadoop, Ceph, Sphere, Cassandra, Spark, Scikit-Learn, Mahout, WEKA, MLBase, LibSVM, SAS, RapidMiner, SPSS, Pandas, QlikView, Tableau, SpotFire, HTML5/D3, InfiniDB, Vertica, GreenPlum, Impala, Netezza, Elastic Search, SOLR, MongoDB, Riak, Membase, Pig, Cascading, Talend, R
      Regions shown: Machine Learning Mystery Land; Scalability Central; NoSQL-Slavia; SQL Columnar Republic; Visualization County; Data Clean Wasteland; Statistician Old House
  • 5. How do I (pre)process data? (diagram)
      Data sources: Implicit User Data (Views, Searches…); Content Data (Title, Categories, Price, …); Explicit User Data (Click, Buy, …); User Information (Location, Graph…)
      Data volumes shown: 500TB, 50TB, 1TB, 200GB
      Stages and outputs shown: Transformation Matrix; Transformation Predictor; Per User Stats; Per Content Stats; User Similarity; Rank Predictor; Content Similarity; A/B Test Data; Predictor Runtime; Online User Information
  • 6. Typical Use Case 1: Web Analytics Processing
      ▪ Analyse raw logs (trackers, web logs)
      ▪ Extract IP, page, …
      ▪ Detect and remove robots
      ▪ Build statistics
          ◦ Number of page views per product
          ◦ Best referrers
          ◦ Traffic analysis
          ◦ Funnel
          ◦ SEO analysis
          ◦ …
  • 7. Typical Use Case 2: Mining Search Logs for Synonyms
      ▪ Extract query logs
      ▪ Perform query normalization
      ▪ Compute n-grams
      ▪ Compute search “sessions”
      ▪ Compute the log-likelihood ratio for n-grams across sessions
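As a rough illustration of the n-gram and log-likelihood steps of this use case, the sketch below extracts word n-grams from a query and scores a 2x2 table of session counts with the log-likelihood ratio (G² statistic). The function names and the meaning given to the four counts are assumptions made for the example; they are not taken from the talk.

```python
import math


def ngrams(query, n):
    """Return the word n-grams of a lowercased, whitespace-tokenized query."""
    tokens = query.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def _x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    """Unnormalized entropy term used by the log-likelihood ratio."""
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 contingency table of session counts.

    k11: sessions containing both n-grams, k12: only the first,
    k21: only the second, k22: neither (assumed layout for this sketch).
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))
```

A score near zero means the two n-grams co-occur about as often as chance would predict; large scores flag pairs worth inspecting as synonym candidates.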
  • 8. Typical Use Case 3: Product Recommender
      ▪ Compute the User – Product association matrix
      ▪ Compute different similarity ratios (Ochiai, Cosine, …)
      ▪ Filter out bad predictions
      ▪ For each user, select the best recommendable products
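The four steps above can be sketched in memory as follows. This is a minimal illustration, assuming a binary association (a user either interacted with a product or did not) and using the Ochiai coefficient |A ∩ B| / sqrt(|A| · |B|), which coincides with cosine similarity on binary vectors. The function names, the `min_similarity` threshold and the `top_n` cut-off are hypothetical choices, not values from the talk.

```python
import math
from collections import defaultdict


def ochiai(users_a, users_b):
    """Ochiai coefficient between two sets of users."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))


def recommend(events, min_similarity=0.5, top_n=2):
    """events: iterable of (user, product) pairs. Returns {user: [products]}."""
    # Step 1: user-product association, stored as two inverted indexes.
    users_by_product = defaultdict(set)
    products_by_user = defaultdict(set)
    for user, product in events:
        users_by_product[product].add(user)
        products_by_user[user].add(product)

    recommendations = {}
    for user, owned in products_by_user.items():
        scores = defaultdict(float)
        for seen in owned:
            for candidate, buyers in users_by_product.items():
                if candidate in owned:
                    continue
                # Step 2: similarity between a known product and a candidate.
                sim = ochiai(users_by_product[seen], buyers)
                # Step 3: filter out weak predictions.
                if sim >= min_similarity:
                    scores[candidate] = max(scores[candidate], sim)
        # Step 4: keep the best candidates for this user.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        recommendations[user] = [product for product, _ in ranked[:top_n]]
    return recommendations
```

At the data volumes discussed in the talk, each of these steps would be a distributed join or aggregation rather than a nested loop; the sketch only shows the logic.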
  • 9. Agenda: Hadoop and Context (->0:03); Pig, Hive, Cascading, … (->0:09); How they work (->0:15); Comparing the tools (->0:35); Make them work together (->0:40); Wrap-up and questions (->Beer)
  • 10. Pig History
      ▪ Yahoo Research in 2006
      ▪ Inspired by Sawzall, a Google paper from 2003
      ▪ 2007 as an Apache project
      ▪ Initial motivation
          ◦ Search log analytics: how long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …

      -- Load the word-count output and print the 10 most frequent words.
      words = LOAD '/training/hadoop-wordcount/output' USING PigStorage('\t') AS (word:chararray, count:int);
      sorted_words = ORDER words BY count DESC;
      first_words = LIMIT sorted_words 10;
      DUMP first_words;
  • 11. Hive History
      ▪ Developed by Facebook in January 2007
      ▪ Open-sourced in August 2008
      ▪ Initial motivation
          ◦ Provide a SQL-like abstraction to perform statistics on status updates

      -- Expose the word-count output as an external table.
      create external table wordcounts (
          word string,
          count int
      )
      row format delimited fields terminated by '\t'
      location '/training/hadoop-wordcount/output';

      -- The 10 most frequent words.
      select * from wordcounts order by count desc limit 10;

      -- Total occurrences of the words starting with "th".
      select SUM(count) from wordcounts where word like 'th%';
  • 12. Cascading History
      ▪ Authored by Chris Wensel, 2008
      ▪ Associated projects
          ◦ Cascalog: Cascading in Clojure
          ◦ Scalding: Cascading in Scala (Twitter in 2012)
          ◦ Lingual (to be released soon): SQL layer on top of Cascading
  • 13. Agenda: Hadoop and Context (->0:03); Pig, Hive, Cascading, … (->0:09); How they work (->0:15); Comparing the tools (->0:35); Make them work together (->0:40); Wrap-up and questions (->Beer)
  • 14. MapReduce: Simplicity is a complexity (5/15/2013, Dataiku - Innovation Services)
  • 15. Pig & Hive: Mapping to MapReduce jobs
      * VAT excluded

      events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
      -- The filter condition on type is abbreviated on the slide.
      events_filtered = FILTER events BY type;
      by_user = GROUP events_filtered BY user;
      -- After GROUP, the key is named "group" and the rows sit in the bag events_filtered.
      price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
      high_pbu = FILTER price_by_user BY total_price > 1000;

      Job 1 (Mapper, then shuffle and sort by user, then Reducer) runs, in order: LOAD, FILTER, GROUP, FOREACH, FILTER
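To make the mapping concrete, here is a small in-memory simulation of how this script fits into a single MapReduce job: the mapper loads and filters, the shuffle groups by user, and the reducer aggregates and applies the final filter. It is a sketch only. The `"buy"` filter condition is a hypothetical stand-in for the condition the slide abbreviates, and the function names are invented for the example.

```python
from collections import defaultdict


def mapper(event):
    """LOAD + FILTER: emit (user, (price, timestamp)) for the events we keep."""
    event_type, user, price, timestamp = event
    if event_type == "buy":          # hypothetical filter condition
        yield user, (price, timestamp)


def shuffle(pairs):
    """Group the mapper output by key and hand the keys over in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(user, values, threshold=1000):
    """GROUP + FOREACH + FILTER: aggregate per user, keep totals above threshold."""
    total_price = sum(price for price, _ in values)
    max_ts = max(ts for _, ts in values)
    if total_price > threshold:
        yield user, total_price, max_ts


def run_job(events):
    mapped = [pair for event in events for pair in mapper(event)]
    return [row for user, values in shuffle(mapped) for row in reducer(user, values)]
```

Because the last FILTER only needs values computed in the reducer, it runs there instead of triggering a second job.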
  • 16. Pig & Hive: Mapping to MapReduce jobs

      events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
      -- The filter condition on type is abbreviated on the slide.
      events_filtered = FILTER events BY type;
      by_user = GROUP events_filtered BY user;
      -- After GROUP, the key is named "group" and the rows sit in the bag events_filtered.
      price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
      high_pbu = FILTER price_by_user BY total_price > 1000;
      recent_high = ORDER high_pbu BY max_ts DESC;
      STORE recent_high INTO '/output';

      Job 1 (Mapper, then shuffle and sort by user, then Reducer) runs, in order: LOAD, FILTER, GROUP, FOREACH, FILTER
      Job 2 (Mapper, then shuffle and sort by max_ts, then Reducer) runs, in order: LOAD (from tmp), STORE
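The ORDER BY needs a different shuffle key (max_ts instead of user), which is why it becomes a second job reading the temporary output of the first. The sketch below simulates that second job. The range partitioner with fixed, descending `boundaries` is an assumption added to show how several reducers can still produce one globally sorted result; the slide itself only shows a shuffle and sort by max_ts.

```python
def partition(max_ts, boundaries):
    """Range partitioner: reducer 0 receives the largest timestamps (DESC order).

    boundaries must be sorted in descending order (assumption of this sketch).
    """
    for index, boundary in enumerate(boundaries):
        if max_ts >= boundary:
            return index
    return len(boundaries)


def run_order_job(rows, boundaries):
    """rows: (user, total_price, max_ts) tuples read back from job 1's output."""
    # Map phase: use max_ts as the key and route each row to a reducer by range.
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for user, total_price, max_ts in rows:
        buckets[partition(max_ts, boundaries)].append((max_ts, (user, total_price)))

    # Shuffle and sort: each reducer sees its keys sorted by max_ts, descending.
    output = []
    for bucket in buckets:
        ordered = sorted(bucket, key=lambda kv: kv[0], reverse=True)
        for max_ts, (user, total_price) in ordered:
            output.append((user, total_price, max_ts))   # STORE
    return output
```

Concatenating the reducer outputs in reducer order yields the full descending order, since every key in reducer 0 is larger than every key in reducer 1.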
  • 17. Pig: How does it work
      Data Execution Plan compiled into 10 map reduce jobs executed in parallel (or not)

      TResolution = LOAD '$PREFIX/dwh_dim_external_tracking_resolution/dt=$DAY' USING PigStorage('\u0001');
      TResolution = FOREACH TResolution GENERATE $0 AS SKResolutionId, $1 as ResolutionId;

      TSiteMap = LOAD '$PREFIX/dwh_dim_sitemapnode/dt=$DAY' USING PigStorage('\u0001');
      TSiteMap = FOREACH TSiteMap GENERATE $0 AS SKSimteMapNodeId, $2 as SiteMapNodeId;

      TCustomer = LOAD '$PREFIX/customer_relation/dt=$DAY' USING PigStorage('\u0001')
          as (SKCustomerId:chararray,
              CustomerId:chararray);

      F1 = FOREACH F1 GENERATE *, (date_time IS NOT NULL ? CustomFormatToISO(date_time, 'yyyy-MM-dd HH:mm:ss'

      F2 = FOREACH F1 GENERATE *,
          CONCAT(CONCAT(CONCAT(CONCAT(visid_high,'-'), visid_low), '-'), visit_num) as VisitId,
          (referrer matches '.*cdiscount.com.*' OR referrer matches 'cdscdn.com' ? NULL :referrer ) as Referrer,
          (iso IS NOT NULL ? ISODaysBetween(iso, '1899-12-31T00:00:00') : NULL)
              AS SkDateId,
          (iso IS NOT NULL ? ISOSecondsBetween(iso, ISOToDay(iso)) : NULL)
              AS SkTimeId,
          ((event_list is not null and event_list matches '.*\b202\b.*') ? 'Y' : 'N') as is_202,
          ((event_list is not null and event_list matches '.*\b10\b.*') ? 'Y' : 'N') as is_10,
          ((event_list is not null and event_list matches '.*\b12\b.*') ? 'Y' : 'N') as is_12,
          ((event_list is not null and event_list matches '.*\b13\b.*') ? 'Y' : 'N') as is_13,
          ((event_list is not null and event_list matches '.*\b14\b.*') ? 'Y' : 'N') as is_14,
          ((event_list is not null and event_list matches '.*\b11\b.*') ? 'Y' : 'N') as is_11,
          ((event_list is not null and event_list matches '.*\b1\b.*') ? 'Y' : 'N') as is_1,
          REGEX_EXTRACT(pagename, 'F-(.*):.*', 1) AS ProductReferenceId,
          NULL AS OriginFile;

      SET DEFAULT_PARALLEL 24;

      F3 = JOIN F2 BY post_search_engine LEFT, TSearchEngine BY SearchEngineId USING 'replicated' PARALLEL 20 ;
      F3 = FOREACH F3 GENERATE *, (SKSearchEngineId IS NULL ? '-1' : SKSearchEngineId) as SKSearchEngineId;
      --F3 = FOREACH F2 GENERATE *, NULL AS SKSearchEngineId, NULL AS SearchEngineId;

      F4 = JOIN F3 BY browser LEFT, TBrowser BY BrowserId USING 'replicated' PARALLEL 20;
      F4 = FOREACH F4 GENERATE *, (SKBrowserId IS NULL ? '-1' : SKBrowserId) as SKBrowserId;

      --F4 = FOREACH F3 GENERATE *, NULL AS SKBrowserId, NULL AS BrowserId;

      F5 = JOIN F4 BY os LEFT, TOperatingSystem BY OperatingSystemId USING 'replicated' PARALLEL 20;
      F5 = FOREACH F5 GENERATE *, (SKOperatingSystemId IS NULL ? '-1' : SKOperatingSystemId) as SKOperatingSystemId;

      --F5 = FOREACH F4 GENERATE *, NULL AS SKOperatingSystemId, NULL AS OperatingSystemId;

      F6 = JOIN F5 BY resolution LEFT, TResolution BY ResolutionId USING 'replicated' PARALLEL 20;
      F6 = FOREACH F6 GENERATE *, (SKResolutionId IS NULL ? '-1' : SKResolutionId) as SKResolutionId;

      --F6 = FOREACH F5 GENERATE *, NULL AS SKResolutionId, NULL AS ResolutionId;

      F7 = JOIN F6 BY post_evar4 LEFT, TSiteMap BY SiteMapNodeId USING 'replicated' PARALLEL 20;
      F7 = FOREACH F7 GENERATE *, (SKSimteMapNodeId IS NULL ? '-1' : SKSimteMapNodeId) as SKSimteMapNodeId;

      --F7 = FOREACH F6 GENERATE *, NULL AS SKSimteMapNodeId, NULL AS SiteMapNodeId;

      SPLIT F7 INTO WITHOUT_CUSTOMER IF post_evar30 IS NULL, WITH_CUSTOMER IF post_evar30 IS NOT NULL;

      F8 = JOIN WITH_CUSTOMER BY post_evar30 LEFT, TCustomer BY CustomerId USING 'skewed' PARALLEL 20;
      WITHOUT_CUSTOMER = FOREACH WITHOUT_CUSTOMER GENERATE *, NULL as SKCustomerId, NULL as CustomerId;

      --F8_UNION = FOREACH F7 GENERATE *, NULL as SKCustomerId, NULL as CustomerId;
      F8_UNION = UNION F8, WITHOUT_CUSTOMER;
      --DESCRIBE F8;
      --DESCRIBE WITHOUT_CUSTOMER;
      --DESCRIBE F8_UNION;

      F9 = FOREACH F8_UNION GENERATE
          visid_high,
          visid_low,
          VisitId,
          post_evar30,
          SKCustomerId,
          visit_num,
          SkDateId,
          SkTimeId,
          post_evar16,
          post_evar52,
          visit_page_num,
          is_202,
          is_10,
          is_12,
  • 18. Hive Joins: How to join with MapReduce?

      Mappers output (each row is tagged with the index of its table):

          tbl_idx  uid  name
          1        1    Dupont
          1        2    Durand

          tbl_idx  uid  type
          2        1    Type1
          2        1    Type2
          2        2    Type1

      Shuffle by uid, sort by (uid, tbl_idx)

      Reducer 1 input:

          Uid  Tbl_idx  Name    Type
          1    1        Dupont
          1    2                Type1
          1    2                Type2

      Reducer 1 output:

          Uid  Name    Type
          1    Dupont  Type1
          1    Dupont  Type2

      Reducer 2 input:

          Uid  Tbl_idx  Name    Type
          2    1        Durand
          2    2                Type1

      Reducer 2 output:

          Uid  Name    Type
          2    Durand  Type1
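The join shown above can be simulated in a few lines. This is a sketch of a reduce-side join under simplifying assumptions: the data fits in memory, uids are integers, and rows are routed to reducers with `uid % num_reducers`. The function names are invented for the example.

```python
from collections import defaultdict
from itertools import groupby


def map_side(users, types):
    """Tag every record with its table index and key it by uid."""
    for uid, name in users:
        yield (uid, 1), name        # tbl_idx 1: (uid, name) table
    for uid, type_ in types:
        yield (uid, 2), type_       # tbl_idx 2: (uid, type) table


def shuffle_and_sort(pairs, num_reducers=2):
    """Shuffle by uid, then sort each reducer's input by (uid, tbl_idx)."""
    reducers = defaultdict(list)
    for (uid, tbl_idx), value in pairs:
        reducers[uid % num_reducers].append(((uid, tbl_idx), value))
    return [sorted(records, key=lambda kv: kv[0])
            for _, records in sorted(reducers.items())]


def reduce_side(records):
    """Rows of table 1 arrive first for each uid: buffer them, stream table 2."""
    for uid, group in groupby(records, key=lambda kv: kv[0][0]):
        names = []
        for (_, tbl_idx), value in group:
            if tbl_idx == 1:
                names.append(value)
            else:
                for name in names:
                    yield uid, name, value


def join(users, types):
    pairs = map_side(users, types)
    return [row for records in shuffle_and_sort(pairs)
            for row in reduce_side(records)]
```

Sorting on (uid, tbl_idx) rather than uid alone is what lets the reducer buffer only the rows of the first table and stream the second one.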
  • 19. Agenda
     Hadoop and Context (->0:03)
     Pig, Hive, Cascading, … (->0:09)
     How they work (->0:15)
     Comparing the tools (->0:35)
     Make them work together (->0:40)
     Wrap-up and questions (->Beer)
  • 20. Comparing without Comparable
     Philosophy
      ◦ Procedural vs. Declarative
      ◦ Data Model and Schema
     Productivity
      ◦ Headachability
      ◦ Checkpointing
      ◦ Testing and environment
     Integration
      ◦ Partitioning
      ◦ Formats Integration
      ◦ External Code Integration
     Performance and optimization
  • 21. Procedural vs. Declarative
     Procedural (Pig): transformation as a sequence of operations
      Users = load 'users' as (name, age, ipaddr);
      Clicks = load 'clicks' as (user, url, value);
      ValuableClicks = filter Clicks by value > 0;
      UserClicks = join Users by name, ValuableClicks by user;
      Geoinfo = load 'geoinfo' as (ipaddr, dma);
      UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
      ByDMA = group UserGeo by dma;
      ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
      store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
     Declarative (SQL): transformation as a set of formulas
      insert into ValuableClicksPerDMA
      select dma, count(*)
      from geoinfo join (
        select name, ipaddr
        from users join clicks on (users.name = clicks.user)
        where value > 0
      ) using ipaddr
      group by dma;
  • 22. Data Type and Model Rationale
     All three extend the basic data model with complex data types
      ◦ array-like [ event1, event2, event3 ]
      ◦ map-like { type1:value1, type2:value2, … }
     Different approaches
      ◦ Resilient Schema
      ◦ Static Typing
      ◦ No Static Typing
  • 23. Hive Data Type and Schema
      Simple type | Details
      TINYINT, SMALLINT, INT, BIGINT | 1, 2, 4 and 8 bytes
      FLOAT, DOUBLE | 4 and 8 bytes
      BOOLEAN |
      STRING | Arbitrary-length, replaces VARCHAR
      TIMESTAMP |
      Complex type | Details
      ARRAY | Array of typed items (0-indexed)
      MAP | Associative map
      STRUCT | Complex class-like objects
    CREATE TABLE visit (
      user_name STRING,
      user_id INT,
      user_details STRUCT<age:INT, zipcode:INT>
    );
  • 24. Data Types and Schema: Pig
    rel = LOAD '/folder/path/' USING PigStorage('\t') AS (col:type, col:type, col:type);
      Simple type | Details
      int, long, float, double | 32 and 64 bits, signed
      chararray | A string
      bytearray | An array of … bytes
      boolean | A boolean
      Complex type | Details
      tuple | An ordered fieldname:value map
      bag | A collection of tuples
  • 25. Data Type and Schema: Cascading
     Support for any Java type, provided it can be serialized in Hadoop
     No support for typing
      Simple type | Details
      Int, Long, Float, Double | 32 and 64 bits, signed
      String | A string
      byte[] | An array of … bytes
      Boolean | A boolean
      Complex type | Details
      Object | Object must be « Hadoop serializable »
  • 26. Style Summary
      Tool | Style | Typing | Data Model | Metadata store
      Pig | Procedural | Static + Dynamic | scalar + tuple + bag (fully recursive) | No (HCatalog)
      Hive | Declarative | Static + Dynamic, enforced at execution time | scalar + list + map | Integrated
      Cascading | Procedural | Weak | scalar + Java objects | No
  • 27. Comparing without Comparable
     Philosophy
      ◦ Procedural vs. Declarative
      ◦ Data Model and Schema
     Productivity
      ◦ Headachability
      ◦ Checkpointing
      ◦ Testing, error management and environment
     Integration
      ◦ Partitioning
      ◦ Formats Integration
      ◦ External Code Integration
     Performance and optimization
  • 28. Headachability: Motivation
     Does debugging the tool lead to bad headaches?
  • 29. Headaches: Pig
     Out of Memory errors (reducer)
     Exceptions in built-in / extended functions (handling of null)
     Null vs. “”
     Nested FOREACH and scoping
     Date management (Pig 0.10)
     Implicit field ordering
  • 30. A Pig Error
  • 31. Headaches: Hive
     Out of Memory errors in reducers
     Few debugging options
     Null vs. “”
     No built-in “first”
  • 32. Headaches: Cascading
     Weak typing errors (comparing Int and String, …)
     Illegal operation sequences (group after group, …)
     Implicit field ordering
  • 33. Testing: Motivation
     How to perform unit tests?
     How to have different versions of the same script (parameters)?
  • 34. Testing: Pig
     System variables
     Comment out code to test
     No metaprogramming
     pig -x local to execute on local files
  • 35. Testing / Environment: Cascading
     JUnit tests are possible
     Ability to use code to actually comment out some variables
  • 36. Checkpointing: Motivation
     Lots of iterations while developing on Hadoop
     Sometimes jobs fail
     Sometimes you need to restart from the start …
    [Figure: flow with the stages Parse Logs, Per Page Stats, Page User Correlation, Filtering and Output; after a failure: fix and relaunch]
  • 37. Pig: Manual Checkpointing
     STORE command to manually store files
     Comment out the beginning of the script and relaunch
    [Figure: flow with the stages Parse Logs, Per Page Stats, Page User Correlation, Filtering and Output]
  • 38. Cascading: Automated Checkpointing
     Ability to re-run a flow automatically from the last saved checkpoint
    addCheckpoint(…)
  • 39. Cascading: Topological Scheduler
     Check the timestamp of each intermediate file
     Execute a step only if its input is more recent
    [Figure: flow with the stages Parse Logs, Per Page Stats, Page User Correlation, Filtering and Output]
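The timestamp check described on this slide resembles what `make` does, and can be sketched as follows. This is an illustration only, not Cascading's API: `is_stale`, `run_flow`, `touch` and `rebuild` are hypothetical helpers, and local files with forced modification times stand in for HDFS outputs.

```python
import os
import tempfile


def is_stale(output_path, input_paths):
    """Return True if the output is missing or older than any of its inputs."""
    if not os.path.exists(output_path):
        return True
    output_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(path) > output_mtime for path in input_paths)


def run_flow(steps):
    """Run steps given in topological order; return the names actually executed."""
    executed = []
    for name, inputs, output, action in steps:
        if is_stale(output, inputs):
            action(inputs, output)
            executed.append(name)
    return executed


def touch(path, mtime):
    """Create a file and force its modification time (demo helper)."""
    with open(path, "w") as handle:
        handle.write("data")
    os.utime(path, (mtime, mtime))


def rebuild(inputs, output):
    """Stand-in for a real job: rewrite the output so it is newer than its inputs."""
    touch(output, max(os.path.getmtime(path) for path in inputs) + 1)


with tempfile.TemporaryDirectory() as tmp:
    logs = os.path.join(tmp, "logs")
    parsed = os.path.join(tmp, "parsed_logs")
    stats = os.path.join(tmp, "per_page_stats")
    touch(logs, 1000)
    touch(parsed, 2000)  # newer than logs: up to date
    touch(stats, 1500)   # older than parsed_logs: must be recomputed
    steps = [
        ("parse_logs", [logs], parsed, rebuild),
        ("per_page_stats", [parsed], stats, rebuild),
    ]
    first_run = run_flow(steps)
    second_run = run_flow(steps)
```

On the first pass only the stale step runs; a second pass finds everything up to date and runs nothing.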
  • 40. Productivity Summary
      Tool | Headaches | Checkpointing / Replay | Testing / Metaprogramming
      Pig | Lots | Manual save | Difficult metaprogramming, easy local testing
      Hive | Few, but without debugging options | None (that’s SQL) | None (that’s SQL)
      Cascading | Weak typing complexity | Checkpointing, partial updates | Possible
  • 41. Comparing without Comparable
     Philosophy
      ◦ Procedural vs. Declarative
      ◦ Data Model and Schema
     Productivity
      ◦ Headachability
      ◦ Checkpointing
      ◦ Testing and environment
     Integration
      ◦ Formats Integration
      ◦ Partitioning
      ◦ External Code Integration
     Performance and optimization
  • 42. Formats Integration: Motivation
     Ability to integrate different file formats
      ◦ Text Delimited
      ◦ Sequence File (binary Hadoop format)
      ◦ Avro, Thrift, …
     Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)
    Format impact on size and performance:
      Format | Size on disk (GB) | Hive processing time (24 cores)
      Text file, uncompressed | 18.7 | 1m32s
      1 text file, gzipped | 3.89 | 6m23s (no parallelization)
      JSON, compressed | 7.89 | 2m42s
      Multiple text files, gzipped | 4.02 | 43s
      Sequence File, block, gzip | 5.32 | 1m18s
      Text file, LZO indexed | 7.03 | 1m22s
  • 43. Format Integration
     Hive: SerDe (Serializer/Deserializer)
     Pig: Storage
     Cascading: Tap
  • 44. Partitions: Motivation
     No support for “UPDATE” patterns: any increment is performed by adding or deleting a partition
     Common partition schemes on Hadoop
      ◦ By date: /apache_logs/dt=2013-01-23
      ◦ By data center: /apache_logs/dc=redbus01/…
      ◦ By country
      ◦ …
      ◦ Or any combination of the above
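The `key=value` directory convention above, and the idea of updating data by replacing whole partitions, can be sketched in a few lines of Python. The helper names (`partition_path`, `parse_partition`, `overwrite_partition`) are hypothetical, and a dict stands in for the file system; this is an illustration of the layout, not a Hadoop API.

```python
def partition_path(root, **keys):
    """Build a path such as /apache_logs/dt=2013-01-23 from partition keys."""
    segments = [f"{name}={value}" for name, value in keys.items()]
    return "/".join([root.rstrip("/")] + segments)


def parse_partition(path):
    """Recover the partition keys encoded in a path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def overwrite_partition(store, path, rows):
    """Incremental update: replace one partition wholesale, leave the others alone."""
    store[path] = list(rows)


store = {}  # stand-in for the file system: partition path -> rows
day1 = partition_path("/apache_logs", dt="2013-01-23")
day2 = partition_path("/apache_logs", dt="2013-01-24")
overwrite_partition(store, day1, ["GET /a", "GET /b"])
overwrite_partition(store, day2, ["GET /c"])
overwrite_partition(store, day1, ["GET /a"])  # recompute a single day
```

Because a partition is only ever added, dropped or rewritten as a whole, recomputing one day never touches the data of another.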
  • 45. Hive Partitioning: Partitioned Tables
    CREATE TABLE event (
      user_id INT,
      type STRING,
      message STRING)
    PARTITIONED BY (day STRING, server_id STRING);
    Disk structure:
      /hive/event/day=2013-01-27/server_id=s1/file0
      /hive/event/day=2013-01-27/server_id=s1/file1
      /hive/event/day=2013-01-27/server_id=s2/file0
      /hive/event/day=2013-01-27/server_id=s2/file1
      …
      /hive/event/day=2013-01-28/server_id=s2/file0
      /hive/event/day=2013-01-28/server_id=s2/file1
    INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
    SELECT * FROM event_tmp;
  • 46. Cascading: Partition
     No direct support for partitions
     Support for a “Glob” Tap, to read from files matching patterns
     You can code your own custom or virtual partition schemes
  • 47. External Code Integration: Simple UDF
    [Figure: code examples for Pig, Hive and Cascading]
  • 48. Hive Complex UDF (Aggregators)
  • 49. Cascading: Direct Code Evaluation
    Uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO
  • 50. Spring Batch Cascading Integration
     Allows calling a Cascading flow from a Spring Batch job
     No full integration with Spring MessageSource or MessageHandler yet (only for local flows)
  • 51. Integration Summary
      Tool | Partition / Incremental Updates | External Code | Format Integration
      Pig | No direct support | Simple | Doable, and rich community
      Hive | Fully integrated, SQL-like | Very simple, but complex dev setup | Doable, and existing community
      Cascading | With coding | Complex UDFs but regular, and Java expressions embeddable | Doable, and growing community
  • 52. Comparing without Comparable
     Philosophy
      ◦ Procedural vs. Declarative
      ◦ Data Model and Schema
     Productivity
      ◦ Headachability
      ◦ Checkpointing
      ◦ Testing and environment
     Integration
      ◦ Formats Integration
      ◦ Partitioning
      ◦ External Code Integration
     Performance and optimization
  • 53. Optimization
     Several common MapReduce optimization patterns
      ◦ Combiners
      ◦ MapJoin
      ◦ Job Fusion
      ◦ Job Parallelism
      ◦ Reducer Parallelism
     Different support per framework
      ◦ Fully automatic
      ◦ Pragma / directives / options
      ◦ Coding style / code to write
  • 54. Combiner: Perform Partial Aggregate at Mapper Stage
    SELECT date, COUNT(*) FROM product GROUP BY date
    [Figure: Map → Reduce. Two mappers read raw rows (2012-02-14 4354 … 2012-02-15 21we2 and 2012-02-14 qa334 … 2012-02-15 23aq2) and pass them on as is; the reducer outputs 2012-02-14: 20, 2012-02-15: 35, 2012-02-16: 1.]
  • 55. Combiner: Perform Partial Aggregate at Mapper Stage
    SELECT date, COUNT(*) FROM product GROUP BY date
    [Figure: Map → Reduce with partial aggregates. The first mapper emits 2012-02-14: 12, 2012-02-15: 23, 2012-02-16: 1; the second emits 2012-02-14: 8, 2012-02-15: 12; the reducer outputs 2012-02-14: 20, 2012-02-15: 35, 2012-02-16: 1.]
    Reduced network bandwidth. Better parallelism.
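The partial aggregation shown on the slide can be reproduced with a short Python sketch. The per-mapper row counts are taken from the slide; the function names (`map_with_combiner`, `reduce_counts`) and the placeholder product value `"p"` are assumptions made for the illustration.

```python
from collections import Counter


def map_with_combiner(split):
    """Map + combine: count rows per date inside a single mapper's split."""
    return Counter(date for date, _product in split)


def reduce_counts(partials):
    """Reduce: sum the partial counts received from every mapper."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    return dict(totals)


# Two input splits with the per-date row counts shown on the slide.
split_1 = [("2012-02-14", "p")] * 12 + [("2012-02-15", "p")] * 23 + [("2012-02-16", "p")]
split_2 = [("2012-02-14", "p")] * 8 + [("2012-02-15", "p")] * 12

partials = [map_with_combiner(split_1), map_with_combiner(split_2)]
totals = reduce_counts(partials)

# Records crossing the network without and with the combiner.
rows_without_combiner = len(split_1) + len(split_2)
rows_with_combiner = sum(len(partial) for partial in partials)
```

In this toy case the shuffle carries 5 partial counts instead of 56 raw rows, which is the bandwidth saving the slide refers to.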
  • 56. Join Optimization: Map Join
     Hive: set hive.auto.convert.join = true;
     Pig
     Cascading (no aggregation support after HashJoin)
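A map join avoids the shuffle altogether by loading the small table into memory in every mapper. The sketch below illustrates the idea in plain Python; it is not the code path of Hive, Pig or Cascading, and `map_join` and the sample tables are hypothetical.

```python
def map_join(big_rows, small_table, key_index=0):
    """Inner-join a streamed big table against a small in-memory table."""
    # Build the hash table once; in a real map join each mapper holds a copy.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    # Stream the big table: no shuffle, no reducer.
    for row in big_rows:
        for value in lookup.get(row[key_index], []):
            yield row + (value,)


page_views = [(1, "/home"), (2, "/cart"), (3, "/faq")]  # large side, streamed
users = [(1, "Dupont"), (2, "Durand")]                  # small side, fits in memory
joined = list(map_join(page_views, users))
```

The technique only applies when one side is small enough to fit in each mapper's memory, which is why Pig calls the equivalent join 'replicated'.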
  • 57. Number of Reducers
     Critical for performance
     Estimated from the size of the input files
      ◦ Hive: input size divided by hive.exec.reducers.bytes.per.reducer (default 1GB)
      ◦ Pig: input size divided by pig.exec.reducers.bytes.per.reducer (default 1GB)
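The estimate above is a simple division, sketched here in Python. The 1 GB figure is the default quoted on the slide; the function name `estimate_reducers` and the cap `max_reducers=999` are assumptions for the illustration, since the slide does not mention an upper bound.

```python
import math

BYTES_PER_REDUCER = 1_000_000_000  # 1 GB, the default quoted on the slide


def estimate_reducers(input_bytes, bytes_per_reducer=BYTES_PER_REDUCER, max_reducers=999):
    """Estimate a reducer count from the total input size.

    max_reducers is an assumed cap for this sketch, not a value from the slide.
    """
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


uncompressed_text = estimate_reducers(18_700_000_000)  # an 18.7 GB input
tiny_input = estimate_reducers(0)
huge_input = estimate_reducers(5_000_000_000_000)      # 5 TB hits the cap
```

When the estimate is wrong for a given job, for example because a filter drops most of the input before the shuffle, the count can be set by hand, as with the PARALLEL clause in Pig.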
  • 58. Performance & Optimization Summary
      Tool | Combiner Optimization | Join Optimization | Number of reducers optimization
      Pig | Automatic | Option | Estimate or DIY
      Cascading | DIY | HashJoin | DIY
      Hive | Partial DIY | Automatic (Map Join) | Estimate or DIY
  • 59. Agenda
     Hadoop and Context (->0:03)
     Pig, Hive, Cascading, … (->0:09)
     How they work (->0:15)
     Comparing the tools (->0:35)
     Make them work together (->0:40)
     Wrap-up and questions (->Beer)
  • 60. Follow the Flow
    [Figure: data flow diagram. Labels: Tracker Log, MongoDB, MySQL, Syslog, Product Catalog, Order, Apache Logs, Session, Product Transformation, Category Affinity, Category Targeting, Customer Profile, Product Recommender, S3, Search Logs, (External) Search Engine Optimization, (Internal) Search Ranking, Partner FTP, Sync In, Sync Out, Pig, Hive, ElasticSearch]
  • 61. E.g. Product Recommender
    [Figure: data flow diagram. Labels: Page Views, Orders, Catalog, Bots and Special Users, Filtered Page Views, User Affinity, Product Popularity, User Similarity (Per Category), User Similarity (Per Brand), Order Summary, Recommendation Graph, Recommendation, Machine Learning]
  • 62. Pain Points on Large Projects
     Schema maintenance between tools
     Proper incremental and efficient synchronization between tools, NoSQL stores and log systems
     Proper partition “management” (daily jobs, …)
     Job sequencing and management
      ◦ How to properly handle a new field? Missing data? Recompute everything?
  • 63. Integration Option: HCatalog
     HCatalog provides schema interoperability between Hive and Pig
    [Figure: Hive and Pig both connected to HCatalog]
  • 64. Similar to “Build”
     1970 Shell script
     1977 Makefile
     1980 Makedeps
     1999 Cons/CMake
     2001 Maven
     2004 Ivy
     2008 Gradle
     Shell Script
     2008 HaMake
     2009 Oozie
     … ETL Hadoop
     Next … ?
  • 65. Agenda
     Hadoop and Context (->0:03)
     Pig, Hive, Cascading, … (->0:09)
     How they work (->0:15)
     Comparing the tools (->0:35)
     Make them work together (->0:40)
     Wrap-up and questions (->Beer)
  • 66.  Want to stay close to SQL?
      ◦ Hive
     Want to write large flows?
      ◦ Pig
     Want to integrate into large-scale programming projects?
      ◦ Cascading (Cascalog / Scalding)
    Presentation available on http://www.slideshare.net/Dataiku