Transcript

  • 1. Pig, Hive, Cascading: Hadoop In Practice. Devoxx 2013. Florian Douetteau.
  • 2. About me: Florian Douetteau <florian.douetteau@dataiku.com>
    - CEO at Dataiku
    - Freelance at Criteo (online ads)
    - CTO at IsCool Ent. (#1 French social gamer)
    - VP R&D at Exalead (search engine technology)
  • 3. Agenda
    - Hadoop and context (-> 0:03)
    - Pig, Hive, Cascading, … (-> 0:06)
    - How they work (-> 0:09)
    - Comparing the tools (-> 0:25)
    - Wrap-up and questions (-> 0:30)
  • 4. Choose technology. [Slide shows a tongue-in-cheek map of the data-technology landscape, with regions such as NoSQL-Slavia, Scalability Central, Machine Learning Mystery Land, SQL Columnar Republic, Statistician Old House, Visualization County and Data Clean Wasteland, populated by tools including ElasticSearch, SOLR, MongoDB, Cassandra, Riak, Membase, Hadoop, Ceph, Sphere, Spark, Scikit-Learn, Mahout, WEKA, MLBase, LibSVM, InfiniDB, Vertica, GreenPlum, Impala, Netezza, SAS, SPSS, R, RapidMiner, Pandas, QlikView, Tableau, SpotFire, HTML5/D3, Pig, Cascading and Talend.]
  • 5. How do I (pre)process data? [Slide shows a typical pre-processing pipeline: implicit user data (views, searches, … ~500 TB), explicit user data (clicks, buys, … ~50 TB), user information (location, graph, … ~1 TB), content data (title, categories, price, … ~200 GB) and A/B test data feed transformations that build per-user stats, per-content stats, user similarity, content similarity and a predictor matrix; a rank predictor then uses these at runtime together with online user information.]
  • 6. Typical use case 1: web analytics processing
    - Analyse raw logs (trackers, web logs)
    - Extract IP, page, …
    - Detect and remove robots
    - Build statistics:
      ◦ Number of page views, per product
      ◦ Best referrers
      ◦ Traffic analysis
      ◦ Funnel
      ◦ SEO analysis
      ◦ …
  • 7. Typical use case 2: mining search logs for synonyms
    - Extract query logs
    - Perform query normalization
    - Compute n-grams
    - Compute search “sessions”
    - Compute log-likelihood ratio for n-grams across sessions
  • 8. Typical use case 3: product recommender
    - Compute the user-product association matrix
    - Compute different similarity ratios (Ochiai, cosine, …)
    - Filter out bad predictions
    - For each user, select the best recommendable products
  • 9. Agenda
    - Hadoop and context
    - Pig, Hive, Cascading, …
    - How they work
    - Comparing the tools
  • 10. Pig history
    - Yahoo! Research, 2006
    - Inspired by Sawzall, a 2003 Google paper
    - Apache project since 2007
    - Initial motivation: search log analytics. How long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …
    Example (word count, top 10):
      words = LOAD '/training/hadoop-wordcount/output' USING PigStorage('\t') AS (word:chararray, count:int);
      sorted_words = ORDER words BY count DESC;
      first_words = LIMIT sorted_words 10;
      DUMP first_words;
  • 11. Hive history
    - Developed by Facebook, January 2007
    - Open sourced in August 2008
    - Initial motivation: provide a SQL-like abstraction to compute statistics on status updates
    Example (word count, top 10):
      CREATE EXTERNAL TABLE wordcounts (word STRING, count INT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        LOCATION '/training/hadoop-wordcount/output';

      SELECT * FROM wordcounts ORDER BY count DESC LIMIT 10;
      SELECT SUM(count) FROM wordcounts WHERE word LIKE 'th%';
  • 12. Cascading history
    - Authored by Chris Wensel, 2008
    - Associated projects:
      ◦ Cascalog: Cascading in Clojure
      ◦ Scalding: Cascading in Scala (Twitter, 2012)
      ◦ Lingual (to be released soon): SQL layer on top of Cascading
  • 13. Agenda
    - Hadoop and context
    - Pig, Hive, Cascading, …
    - How they work
    - Comparing the tools
  • 14. MapReduce: simplicity is a complexity
  • 15. Pig & Hive: mapping to MapReduce jobs
      events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
      events_filtered = FILTER events BY type;
      by_user = GROUP events_filtered BY user;
      price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price, MAX(timestamp) AS max_ts;
      high_pbu = FILTER price_by_user BY total_price > 1000;
    This compiles into a single MapReduce job: LOAD and FILTER run in the mapper, then a shuffle and sort by user, then GROUP, FOREACH and the final FILTER run in the reducer. (* VAT excluded)
  • 16. Pig & Hive: mapping to MapReduce jobs (continued)
      events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
      events_filtered = FILTER events BY type;
      by_user = GROUP events_filtered BY user;
      price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price, MAX(timestamp) AS max_ts;
      high_pbu = FILTER price_by_user BY total_price > 1000;
      recent_high = ORDER high_pbu BY max_ts DESC;
      STORE recent_high INTO '/output';
    Adding the ORDER and STORE yields two MapReduce jobs: job 1 as before (LOAD and FILTER in the mapper, shuffle and sort by user, GROUP/FOREACH/FILTER in the reducer, result written to a temporary file), then job 2 loads the temporary output in its mapper, shuffles and sorts by max_ts, and STOREs the final result in its reducer.
  • 17. Pig: how does it work
    The execution plan for this real-world script is compiled into 10 MapReduce jobs, executed in parallel (or not). Excerpt of the script shown on the slide, cleaned up:
      TResolution = LOAD '$PREFIX/dwh_dim_external_tracking_resolution/dt=$DAY' USING PigStorage('\u0001');
      TResolution = FOREACH TResolution GENERATE $0 AS SKResolutionId, $1 AS ResolutionId;
      TSiteMap = LOAD '$PREFIX/dwh_dim_sitemapnode/dt=$DAY' USING PigStorage('\u0001');
      TSiteMap = FOREACH TSiteMap GENERATE $0 AS SKSimteMapNodeId, $2 AS SiteMapNodeId;
      TCustomer = LOAD '$PREFIX/customer_relation/dt=$DAY' USING PigStorage('\u0001') AS (SKCustomerId:chararray, CustomerId:chararray);
      -- ... field cleanup: date parsing, VisitId concatenation, referrer filtering, event_list flags, REGEX_EXTRACT of the product reference ...
      SET DEFAULT_PARALLEL 24;
      F3 = JOIN F2 BY post_search_engine LEFT, TSearchEngine BY SearchEngineId USING 'replicated' PARALLEL 20;
      F3 = FOREACH F3 GENERATE *, (SKSearchEngineId IS NULL ? -1 : SKSearchEngineId) AS SKSearchEngineId;
      F4 = JOIN F3 BY browser LEFT, TBrowser BY BrowserId USING 'replicated' PARALLEL 20;
      F5 = JOIN F4 BY os LEFT, TOperatingSystem BY OperatingSystemId USING 'replicated' PARALLEL 20;
      F6 = JOIN F5 BY resolution LEFT, TResolution BY ResolutionId USING 'replicated' PARALLEL 20;
      F7 = JOIN F6 BY post_evar4 LEFT, TSiteMap BY SiteMapNodeId USING 'replicated' PARALLEL 20;
      SPLIT F7 INTO WITHOUT_CUSTOMER IF post_evar30 IS NULL, WITH_CUSTOMER IF post_evar30 IS NOT NULL;
      F8 = JOIN WITH_CUSTOMER BY post_evar30 LEFT, TCustomer BY CustomerId USING 'skewed' PARALLEL 20;
      WITHOUT_CUSTOMER = FOREACH WITHOUT_CUSTOMER GENERATE *, NULL AS SKCustomerId, NULL AS CustomerId;
      F8_UNION = UNION F8, WITHOUT_CUSTOMER;
      F9 = FOREACH F8_UNION GENERATE visid_high, visid_low, VisitId, post_evar30, SKCustomerId, visit_num, SkDateId, SkTimeId, post_evar16, post_evar52, visit_page_num, is_202, is_10, is_12, ...;
  • 18. Cascading: from code to jobs
    (The slide shows Cascading code and the MapReduce jobs its planner generates; the code did not survive in the transcript, so a sketch of what such a flow looks like follows below.)
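    As a rough illustration (not code from the slides), here is the flow from slides 15-16 written against the Cascading 2.x Java API: read tab-delimited events, filter by type, group by user and sum prices. ExpressionFilter, Sum, GroupBy and Every are real Cascading operations; the paths and the "buy" filter expression are illustrative assumptions.

      import cascading.flow.FlowDef;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.operation.aggregator.Sum;
      import cascading.operation.expression.ExpressionFilter;
      import cascading.pipe.Each;
      import cascading.pipe.Every;
      import cascading.pipe.GroupBy;
      import cascading.pipe.Pipe;
      import cascading.scheme.hadoop.TextDelimited;
      import cascading.tap.SinkMode;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;
      import cascading.tuple.Fields;

      public class EventsPerUser {
        public static void main(String[] args) {
          // Source and sink: tab-delimited files on HDFS (paths are illustrative).
          Tap source = new Hfs(new TextDelimited(new Fields("type", "user", "price", "timestamp"), "\t"), "/events");
          Tap sink = new Hfs(new TextDelimited(true, "\t"), "/output", SinkMode.REPLACE);

          Pipe events = new Pipe("events");
          // Keep only "buy" events: ExpressionFilter removes tuples for which the expression is true.
          Pipe assembly = new Each(events, new Fields("type"), new ExpressionFilter("!\"buy\".equals(type)", String.class));
          // Group by user and sum the price column: this becomes the shuffle + reduce phase.
          assembly = new GroupBy(assembly, new Fields("user"));
          assembly = new Every(assembly, new Fields("price"), new Sum(new Fields("total_price"), long.class));

          // The planner turns this pipe assembly into MapReduce jobs.
          FlowDef flowDef = FlowDef.flowDef()
              .addSource(events, source)       // bind the head pipe to the source tap
              .addTailSink(assembly, sink);    // bind the tail of the assembly to the sink tap
          new HadoopFlowConnector().connect(flowDef).complete();
        }
      }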
  • 19. Hive joins: how to join with MapReduce?
    Each mapper tags its rows with a table index (tbl_idx): the names table (uid, name) gets index 1, the types table (uid, type) gets index 2. Rows are shuffled by uid and sorted by (uid, tbl_idx), so for a given uid every reducer sees the name row first, then the type rows, and can emit the joined (uid, name, type) rows. In the slide's example, reducer 1 emits (1, Dupont, Type1) and (1, Dupont, Type2); reducer 2 emits (2, Durand, Type1). A hedged MapReduce sketch of this pattern follows below.
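    This is not code from the talk, only a minimal Java sketch (Hadoop 2 mapreduce API) of the reduce-side join pattern the slide describes, assuming two tab-delimited inputs at the hypothetical paths /join/users.txt (uid, name) and /join/types.txt (uid, type). Hive additionally relies on the secondary sort by (uid, tbl_idx) so it only buffers the smaller side; for brevity this sketch buffers both sides in the reducer.

      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.List;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class ReduceSideJoin {

        // Mapper for users.txt ("uid<TAB>name"): emit key = uid, value = "1<TAB>name".
        public static class UserMapper extends Mapper<Object, Text, Text, Text> {
          @Override
          protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
            String[] cols = value.toString().split("\t");
            ctx.write(new Text(cols[0]), new Text("1\t" + cols[1]));
          }
        }

        // Mapper for types.txt ("uid<TAB>type"): emit key = uid, value = "2<TAB>type".
        public static class TypeMapper extends Mapper<Object, Text, Text, Text> {
          @Override
          protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
            String[] cols = value.toString().split("\t");
            ctx.write(new Text(cols[0]), new Text("2\t" + cols[1]));
          }
        }

        // Reducer: all rows for one uid arrive together; cross the two sides to emit joined rows.
        public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
          @Override
          protected void reduce(Text uid, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
            List<String> names = new ArrayList<String>();
            List<String> types = new ArrayList<String>();
            for (Text v : values) {
              String[] parts = v.toString().split("\t", 2);
              if ("1".equals(parts[0])) names.add(parts[1]); else types.add(parts[1]);
            }
            for (String name : names)
              for (String type : types)
                ctx.write(uid, new Text(name + "\t" + type));
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "reduce-side join");
          job.setJarByClass(ReduceSideJoin.class);
          MultipleInputs.addInputPath(job, new Path("/join/users.txt"), TextInputFormat.class, UserMapper.class);
          MultipleInputs.addInputPath(job, new Path("/join/types.txt"), TextInputFormat.class, TypeMapper.class);
          job.setReducerClass(JoinReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
          FileOutputFormat.setOutputPath(job, new Path("/join/output"));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }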
  • 20. Agenda
    - Hadoop and context
    - Pig, Hive, Cascading, …
    - How they work
    - Comparing the tools
  • 21. Comparing without Comparable
    - Philosophy
      ◦ Procedural vs declarative
      ◦ Data model and schema
    - Productivity
      ◦ Headachability
      ◦ Checkpointing
      ◦ Testing and environment
    - Integration
      ◦ Partitioning
      ◦ Formats integration
      ◦ External code integration
    - Performance and optimization
  • 22. Procedural vs declarative
    Transformation as a sequence of operations (Pig):
      Users = load 'users' as (name, age, ipaddr);
      Clicks = load 'clicks' as (user, url, value);
      ValuableClicks = filter Clicks by value > 0;
      UserClicks = join Users by name, ValuableClicks by user;
      Geoinfo = load 'geoinfo' as (ipaddr, dma);
      UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
      ByDMA = group UserGeo by dma;
      ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
      store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
    Transformation as a set of formulas (SQL):
      insert into ValuableClicksPerDMA
      select dma, count(*)
      from geoinfo join (
        select name, ipaddr from users join clicks on (users.name = clicks.user)
        where value > 0
      ) using ipaddr
      group by dma;
  • 23. Data type and model: rationale
    - All three extend the basic data model with extended data types:
      ◦ array-like: [event1, event2, event3]
      ◦ map-like: {type1: value1, type2: value2, …}
    - Different approaches:
      ◦ Resilient schema
      ◦ Static typing
      ◦ No static typing
  • 24. Hive: data types and schema
      CREATE TABLE visit (
        user_name    STRING,
        user_id      INT,
        user_details STRUCT<age:INT, zipcode:INT>
      );
    Simple types:
      TINYINT, SMALLINT, INT, BIGINT: 1, 2, 4 and 8 bytes
      FLOAT, DOUBLE: 4 and 8 bytes
      BOOLEAN
      STRING: arbitrary length, replaces VARCHAR
      TIMESTAMP
    Complex types:
      ARRAY: array of typed items (0-indexed)
      MAP: associative map
      STRUCT: complex class-like objects
  • 25. Data types and schema: Pig
      rel = LOAD '/folder/path/' USING PigStorage('\t') AS (col:type, col:type, col:type);
    Simple types:
      int, long, float, double: 32- and 64-bit, signed
      chararray: a string
      bytearray: an array of … bytes
      boolean: a boolean
    Complex types:
      tuple: an ordered fieldname:value map
      bag: a set of tuples
  • 26. Data type and schema: Cascading
    - Supports any Java type, provided it can be serialized by Hadoop
    - No support for typing
    Simple types:
      Int, Long, Float, Double: 32- and 64-bit, signed
      String: a string
      byte[]: an array of … bytes
      Boolean: a boolean
    Complex types:
      Object: must be “Hadoop serializable”
    (A small tuple sketch follows below.)
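    A tiny illustrative sketch (not from the slides) of Cascading's untyped tuple model: fields are just names, values are plain Java objects, and a custom class would need a Hadoop serialization registered for it. The field names and values here are made up.

      import cascading.tuple.Fields;
      import cascading.tuple.Tuple;
      import cascading.tuple.TupleEntry;

      public class TupleModelSketch {
        public static void main(String[] args) {
          // Field names carry no type information; values are arbitrary Java objects.
          Fields fields = new Fields("user", "score", "tags");
          Tuple tuple = new Tuple("florian", 42L, new Tuple("hadoop", "pig"));  // nested tuple as a "complex" value

          TupleEntry entry = new TupleEntry(fields, tuple);
          long score = entry.getLong("score");  // coercion happens at access time, not at declaration time
          System.out.println(entry.getString("user") + " -> " + score);
        }
      }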
  • 27. Style summary
    - Pig: style: procedural; typing: static + dynamic; data model: scalar + tuple + bag (fully recursive); metadata store: no (HCatalog)
    - Hive: style: declarative; typing: static + dynamic, enforced at execution time; data model: scalar + list + map; metadata store: integrated
    - Cascading: style: procedural; typing: weak; data model: scalar + Java objects; metadata store: no
  • 28. Comparing without Comparable
    - Philosophy
      ◦ Procedural vs declarative
      ◦ Data model and schema
    - Productivity
      ◦ Headachability
      ◦ Checkpointing
      ◦ Testing, error management and environment
    - Integration
      ◦ Partitioning
      ◦ Formats integration
      ◦ External code integration
    - Performance and optimization
  • 29. Headachability: motivation
    - Does debugging the tool lead to bad headaches?
  • 30. Headaches: Pig
    - Out-of-memory errors (reducer)
    - Exceptions in builtin / extended functions (handling of null)
    - Null vs “”
    - Nested FOREACH and scoping
    - Date management (Pig 0.10)
    - Implicit field ordering
  • 31. A Pig error [slide shows a screenshot of a Pig error message]
  • 32. Headaches: Hive
    - Out-of-memory errors in reducers
    - Few debugging options
    - Null vs “”
    - No builtin “first”
  • 33. Headaches: Cascading
    - Weak typing errors (comparing an Int and a String, …)
    - Illegal operation sequences (group after group, …)
    - Implicit field ordering
  • 34. Testing: motivation
    - How to perform unit tests?
    - How to have different versions of the same script (parameters)?
  • 35. Testing: Pig
    - System variables
    - Comment code out to test
    - No metaprogramming
    - pig -x local to execute on local files
  • 36. Testing / environment: Cascading
    - JUnit tests are possible (see the sketch below)
    - Ability to use code to actually comment out some variables
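    A minimal sketch (not from the slides) of what such a JUnit test can look like, using Cascading's local mode (the cascading-local taps and schemes) so no Hadoop cluster is needed. Constructors may differ slightly between Cascading versions, and the file paths are illustrative.

      import static org.junit.Assert.assertTrue;

      import java.io.File;

      import org.junit.Test;

      import cascading.flow.Flow;
      import cascading.flow.local.LocalFlowConnector;
      import cascading.pipe.Pipe;
      import cascading.scheme.local.TextDelimited;
      import cascading.tap.SinkMode;
      import cascading.tap.local.FileTap;
      import cascading.tuple.Fields;

      public class CopyFlowTest {

        @Test
        public void flowRunsInLocalMode() {
          // Local-mode taps read and write plain files, so the flow runs in-process.
          FileTap source = new FileTap(new TextDelimited(new Fields("line")), "src/test/resources/events.txt");
          FileTap sink = new FileTap(new TextDelimited(new Fields("line")), "target/test-output/events.txt", SinkMode.REPLACE);

          Pipe copy = new Pipe("copy");  // identity pipe: just moves tuples from source to sink

          Flow flow = new LocalFlowConnector().connect(source, sink, copy);
          flow.complete();

          assertTrue(new File("target/test-output/events.txt").exists());
        }
      }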
  • 37. Checkpointing: motivation
    - Lots of iterations while developing on Hadoop
    - Sometimes jobs fail
    - Sometimes you need to restart from the start…
    Pipeline: parse logs -> per-page stats -> page/user correlation -> filtering -> output. Fix and relaunch.
  • 38. Pig: manual checkpointing
    - STORE command to manually store intermediate files
    Pipeline: parse logs -> per-page stats -> page/user correlation -> filtering -> output. Comment out the beginning of the script and relaunch.
  • 39. Cascading: automated checkpointing
    - Ability to re-run a flow automatically from the last saved checkpoint: addCheckpoint(…) (see the sketch below)
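    A hedged fragment (not from the slides) showing the shape of this in the Cascading 2.x API: a Checkpoint element is inserted into the pipe assembly and bound to a Tap in the FlowDef, and giving the flow a run ID is what lets a relaunch resume from completed checkpoints. The pipe and tap variables (parseLogs, perPageStats, logsTap, outputTap) are assumed to exist, and exact method signatures may vary by version.

      // Fragment only: assumes a head pipe "parseLogs", an assembly "perPageStats" built on it,
      // and taps "logsTap" / "outputTap" (all hypothetical names).
      Checkpoint stats = new Checkpoint("perPageStats", perPageStats);  // cascading.pipe.Checkpoint
      Pipe filtering = new Pipe("filtering", stats);                    // downstream work continues from the checkpoint

      FlowDef flowDef = FlowDef.flowDef()
          .addSource(parseLogs, logsTap)
          .addCheckpoint(stats, new Hfs(new TextDelimited(true, "\t"), "/tmp/checkpoints/per_page_stats"))
          .addTailSink(filtering, outputTap)
          .setRunID("web-analytics-2013-01-27");  // a stable run ID lets a relaunch resume from saved checkpoints

      new HadoopFlowConnector().connect(flowDef).complete();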
  • 40. Cascading: topological scheduler
    - Checks the timestamp of each intermediate file
    - Executes a step only if its inputs are more recent
    Pipeline: parse logs -> per-page stats -> page/user correlation -> filtering -> output. (A sketch using a Cascade follows below.)
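    This behaviour comes from Cascading's Cascade: connect several flows and the CascadeConnector runs them in topological (dependency) order, skipping flows whose sinks are already newer than their sources. A hedged fragment, assuming the three Flow objects (parseLogsFlow, perPageStatsFlow, correlationFlow) already exist:

      // Fragment only: the Flow variables are hypothetical.
      Cascade cascade = new CascadeConnector().connect(parseLogsFlow, perPageStatsFlow, correlationFlow);
      cascade.complete();  // runs the flows in dependency order; up-to-date flows are skipped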
  • 41. Productivity summary
    - Pig: headaches: lots; checkpointing/replay: manual save; testing/metaprogramming: difficult
    - Hive: headaches: few, but without debugging options; checkpointing/replay: none (that's SQL); testing/metaprogramming: none (that's SQL)
    - Cascading: headaches: weak typing complexity; checkpointing/replay: checkpointing, partial updates; testing/metaprogramming: possible
  • 42. Comparing without Comparable
    - Philosophy
      ◦ Procedural vs declarative
      ◦ Data model and schema
    - Productivity
      ◦ Headachability
      ◦ Checkpointing
      ◦ Testing and environment
    - Integration
      ◦ Formats integration
      ◦ Partitioning
      ◦ External code integration
    - Performance and optimization
  • 43. Formats integration: motivation
    - Ability to integrate different file formats:
      ◦ Text delimited
      ◦ Sequence file (binary Hadoop format)
      ◦ Avro, Thrift, …
    - Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)
    Format impact on size and performance (Hive processing time, 24 cores):
      Text file, uncompressed: 18.7 GB on disk, 1m32s
      Text file, gzipped: 3.89 GB, 6m23s (no parallelization)
      JSON, compressed: 7.89 GB, 2m42s
      Multiple text files, gzipped: 4.02 GB, 43s
      Sequence file, block, gzip: 5.32 GB, 1m18s
      Text file, LZO indexed: 7.03 GB, 1m22s
  • 44. Format integration
    - Hive: SerDe (serializer/deserializer)
    - Pig: Storage
    - Cascading: Tap
  • 45. Partitions: motivation
    - No support for “UPDATE” patterns; any increment is performed by adding or deleting a partition
    - Common partition schemes on Hadoop:
      ◦ By date: /apache_logs/dt=2013-01-23
      ◦ By data center: /apache_logs/dc=redbus01/…
      ◦ By country
      ◦ …
      ◦ Or any combination of the above
  • 46. Hive partitioning: partitioned tables
      CREATE TABLE event (
        user_id INT,
        type STRING,
        message STRING)
      PARTITIONED BY (day STRING, server_id STRING);
    Disk structure:
      /hive/event/day=2013-01-27/server_id=s1/file0
      /hive/event/day=2013-01-27/server_id=s1/file1
      /hive/event/day=2013-01-27/server_id=s2/file0
      /hive/event/day=2013-01-27/server_id=s2/file1
      …
      /hive/event/day=2013-01-28/server_id=s2/file0
      /hive/event/day=2013-01-28/server_id=s2/file1
    Loading a partition:
      INSERT OVERWRITE TABLE event PARTITION (day='2013-01-27', server_id='s1')
      SELECT * FROM event_tmp;
  • 47. Cascading partition
    - No direct support for partitions
    - Support for “glob” taps, to read from files using patterns (see the sketch below)
    - You can code your own custom or virtual partition schemes
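    For illustration (not from the slides), reading one day of logs across all data-center partitions with a glob pattern might look like this. GlobHfs is a real Cascading tap; the path layout is illustrative, loosely following the partition schemes of slide 45.

      import cascading.scheme.hadoop.TextLine;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.GlobHfs;

      public class GlobTapSketch {
        public static void main(String[] args) {
          // One source tap over every data-center partition for a given day.
          Tap logs = new GlobHfs(new TextLine(), "/apache_logs/dt=2013-01-23/dc=*/part-*");
          System.out.println(logs.getIdentifier());
        }
      }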
  • 48. External code integration: simple UDFs
    (The slide compares a simple UDF in Pig, Hive and Cascading; the code shown did not survive in the transcript, so a hedged sketch of the Pig and Hive versions follows below.)
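    As a stand-in for the missing code, here is roughly what a simple string-uppercasing UDF looks like in Pig (an EvalFunc) and in Hive (a UDF), both written in Java against the standard extension APIs of each tool. The class names are made up for the example; in practice the two classes would live in separate source files.

      import java.io.IOException;

      import org.apache.hadoop.hive.ql.exec.UDF;
      import org.apache.hadoop.io.Text;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.Tuple;

      public class SimpleUdfs {

        // Pig: extend EvalFunc and implement exec(); register the jar and DEFINE an alias, then call it from Pig Latin.
        public static class Upper extends EvalFunc<String> {
          @Override
          public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) return null;
            return ((String) input.get(0)).toUpperCase();
          }
        }

        // Hive: extend UDF and provide an evaluate() method; register with CREATE TEMPORARY FUNCTION.
        public static class UpperUDF extends UDF {
          public Text evaluate(Text s) {
            return s == null ? null : new Text(s.toString().toUpperCase());
          }
        }
      }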
  • 49. Hive complex UDFs (aggregators)
    (The aggregator code shown on the slide did not survive in the transcript; a hedged sketch follows below.)
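    The sketch below shows the shape of a Hive aggregator using the classic UDAF/UDAFEvaluator interface that was current at the time (init / iterate / terminatePartial / merge / terminate). The aggregate itself (maximum string length) and the class names are made up for the example.

      import org.apache.hadoop.hive.ql.exec.UDAF;
      import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

      // Hypothetical usage: CREATE TEMPORARY FUNCTION max_length AS 'MaxLength';
      //                     SELECT max_length(word) FROM wordcounts;
      public class MaxLength extends UDAF {

        public static class MaxLengthEvaluator implements UDAFEvaluator {
          private int max;

          public void init() { max = 0; }

          // Called once per input row on the map side.
          public boolean iterate(String value) {
            if (value != null) max = Math.max(max, value.length());
            return true;
          }

          // Partial aggregate sent from mappers (combiner-like behaviour).
          public Integer terminatePartial() { return max; }

          // Merge a partial aggregate coming from another task.
          public boolean merge(Integer other) {
            if (other != null) max = Math.max(max, other);
            return true;
          }

          // Final result.
          public Integer terminate() { return max; }
        }
      }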
  • 50. Cascading: direct code evaluation
    (The code shown on the slide did not survive in the transcript; a hedged sketch using ExpressionFunction follows below.)
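    Cascading's "direct code evaluation" is typically done with its Janino-backed expression operations, roughly like this; the field names and the VAT rate are illustrative, and "pipe" is assumed to be an existing pipe whose tuples carry an int "price" field.

      // Fragment only: compute a derived field from a Java expression compiled at runtime.
      pipe = new Each(pipe, new Fields("price"),
          new ExpressionFunction(new Fields("price_with_vat"), "price * 1.196", Integer.TYPE),
          Fields.ALL);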
  • 51. Integration summary
    - Pig: partition/incremental updates: no direct support; external code integration: simple; formats: doable, rich community
    - Hive: partition/incremental updates: fully integrated, SQL-like; external code integration: very simple, but complex dev setup; formats: doable, existing community
    - Cascading: partition/incremental updates: with coding; external code integration: complex UDFs but regular, Java expressions embeddable; formats: doable, growing community
  • 52. Comparing without Comparable
    - Philosophy
      ◦ Procedural vs declarative
      ◦ Data model and schema
    - Productivity
      ◦ Headachability
      ◦ Checkpointing
      ◦ Testing and environment
    - Integration
      ◦ Formats integration
      ◦ Partitioning
      ◦ External code integration
    - Performance and optimization
  • 53. Optimization
    - Several common MapReduce optimization patterns:
      ◦ Combiners
      ◦ Map join
      ◦ Job fusion
      ◦ Job parallelism
      ◦ Reducer parallelism
    - Different support per framework:
      ◦ Fully automatic
      ◦ Pragmas / directives / options
      ◦ Coding style / code to write
  • 54. Combiner: perform partial aggregates at the mapper stage
      SELECT date, COUNT(*) FROM product GROUP BY date
    Without a combiner, every (date, product) row is shuffled to the reducers, which compute the counts (in the slide's example: 2012-02-14: 20, 2012-02-15: 35, 2012-02-16: 1).
  • 55. Combiner: perform partial aggregates at the mapper stage
      SELECT date, COUNT(*) FROM product GROUP BY date
    With a combiner, each mapper pre-aggregates its own rows (in the slide's example one mapper emits 2012-02-14: 8 and 2012-02-15: 12; another emits 2012-02-14: 12, 2012-02-15: 23 and 2012-02-16: 1) and the reducers only merge the partial counts. Reduced network bandwidth, better parallelism.
  • 56. Join optimization: map join
    - Hive: set hive.auto.convert.join = true;
    - Pig: replicated join (JOIN … USING 'replicated')
    - Cascading: HashJoin (no aggregation support after a HashJoin); a sketch follows below
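    A hedged Cascading fragment (not from the slides) of the map-side equivalent: HashJoin streams the left pipe and keeps the right pipe in memory, so the right side must be small. The pipe and field names are illustrative.

      // Fragment only: "events" and "users" are existing pipes; the "users" side is small
      // enough to be held in memory on each mapper.
      Pipe joined = new HashJoin(
          events, new Fields("user"),   // large, streamed side
          users, new Fields("name"),    // small side, loaded into memory
          new InnerJoin());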
  • 57. Number of reducers
    - Critical for performance
    - Estimated from the size of the input:
      ◦ Hive: divide input size by hive.exec.reducers.bytes.per.reducer (default 1 GB)
      ◦ Pig: divide input size by pig.exec.reducers.bytes.per.reducer (default 1 GB)
  • 58. Performance & optimization summary
    - Pig: combiner optimization: automatic; join optimization: option; number of reducers: estimate or DIY
    - Cascading: combiner optimization: DIY; join optimization: HashJoin; number of reducers: DIY
    - Hive: combiner optimization: partial DIY; join optimization: automatic (map join); number of reducers: estimate or DIY
  • 59. Agenda
    - Hadoop and context (-> 0:03)
    - Pig, Hive, Cascading, … (-> 0:06)
    - How they work (-> 0:09)
    - Comparing the tools (-> 0:25)
    - Wrap-up and questions (-> 0:30)
  • 60.
    - Want to keep close to SQL? Hive
    - Want to write large flows? Pig
    - Want to integrate into large-scale programming projects? Cascading (Cascalog / Scalding)