Dataiku - Paris JUG 2013 - Hadoop is a batch
Transcript

  • 1. Hadoop Is A Batch: Pig, Hive, Cascading … Paris JUG, May 2013. Florian Douetteau
  • 2. About me. Florian Douetteau <florian.douetteau@dataiku.com>: CEO at Dataiku; Freelance at Criteo (Online Ads); CTO at IsCool Ent. (#1 French Social Gamer); VP R&D at Exalead (Search Engine Technology). 15/05/2013, Dataiku Training – Hadoop for Data Science
  • 3. Agenda: Hadoop and Context (->0:03); Pig, Hive, Cascading, … (->0:09); How they work (->0:15); Comparing the tools (->0:35); Make them work together (->0:40); Wrap-up and questions (->Beer). Dataiku - Pig, Hive and Cascading
  • 4. Choose Technology (map figure). Technologies shown: Hadoop, Ceph, Sphere, Cassandra, Spark, Scikit-Learn, Mahout, WEKA, MLBase, LibSVM, SAS, RapidMiner, SPSS, Panda, QlikView, Tableau, SpotFire, HTML5/D3, InfiniDB, Vertica, GreenPlum, Impala, Netezza, Elastic Search, SOLR, MongoDB, Riak, Membase, Pig, Cascading, Talend, R. Regions: Machine Learning Mystery Land, Scalability Central, NoSQL-Slavia, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House.
  • 5. How do I (pre)process data? (diagram). Inputs: Implicit User Data (Views, Searches…); Content Data (Title, Categories, Price, …); Explicit User Data (Click, Buy, …); User Information (Location, Graph…). Sizes shown: 500TB, 50TB, 1TB, 200GB. Processing blocks: Transformation Matrix, Transformation Predictor, Per User Stats, Per Content Stats, User Similarity, Rank Predictor, Content Similarity, A/B Test Data, Predictor Runtime, Online User Information.
  • 6. Typical Use Case 1: Web Analytics Processing
      ◦ Analyse raw logs (trackers, web logs)
      ◦ Extract IP, page, …
      ◦ Detect and remove robots
      ◦ Build statistics: number of page views per product; best referrers; traffic analysis; funnel; SEO analysis; …
  • 7. Typical Use Case 2: Mining Search Logs for Synonyms
      ◦ Extract query logs
      ◦ Perform query normalization
      ◦ Compute n-grams
      ◦ Compute search “sessions”
      ◦ Compute the log-likelihood ratio for n-grams across sessions
  • 8. Typical Use Case 3: Product Recommender
      ◦ Compute the User – Product association matrix
      ◦ Compute different similarity ratios (Ochiai, Cosine, …)
      ◦ Filter out bad predictions
      ◦ For each user, select the best recommendable products
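The similarity step of this use case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `bought_by` data and the function names are hypothetical and do not come from the talk. It computes the Ochiai coefficient on sets of users and the cosine similarity on weighted vectors; on purely binary data the two coincide.

```python
import math

def ochiai(users_a, users_b):
    """Ochiai coefficient between two sets of user ids."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {user: weight}."""
    dot = sum(weight * vec_b.get(user, 0.0) for user, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical association matrix: product -> set of users who bought it.
bought_by = {
    "p1": {"u1", "u2", "u3", "u4"},
    "p2": {"u1", "u2"},
    "p3": {"u5"},
}

def as_binary_vector(users):
    """Turn a set of users into a {user: 1.0} vector for the cosine function."""
    return {user: 1.0 for user in users}
```

On a real dataset the pairwise computation is the expensive part, which is why the talk treats it as a batch job rather than an in-memory loop.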
  • 9. Agenda (same as slide 3)
  • 10. Pig History
      ◦ Yahoo Research in 2006
      ◦ Inspired from Sawzall, a Google paper from 2003
      ◦ 2007 as an Apache Project
      ◦ Initial motivation: search log analytics. How long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …
        words = LOAD '/training/hadoop-wordcount/output' USING PigStorage('\t') AS (word:chararray, count:int);
        sorted_words = ORDER words BY count DESC;
        first_words = LIMIT sorted_words 10;
        DUMP first_words;
  • 11. Hive History
      ◦ Developed by Facebook in January 2007
      ◦ Open source in August 2008
      ◦ Initial motivation: provide a SQL-like abstraction to perform statistics on status updates
        create external table wordcounts (word string, count int) row format delimited fields terminated by '\t' location '/training/hadoop-wordcount/output';
        select * from wordcounts order by count desc limit 10;
        select SUM(count) from wordcounts where word like 'th%';
  • 12. Cascading History
      ◦ Authored by Chris Wensel, 2008
      ◦ Associated projects: Cascalog (Cascading in Clojure); Scalding (Cascading in Scala, Twitter in 2012); Lingual (to be released soon): SQL layer on top of Cascading
  • 13. Agenda (same as slide 3)
  • 14. MapReduce: Simplicity is a complexity. 5/15/2013, Dataiku - Innovation Services
  • 15. Pig & Hive: Mapping to MapReduce jobs
        events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
        events_filtered = FILTER events BY type;
        by_user = GROUP events_filtered BY user;
        price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
        high_pbu = FILTER price_by_user BY total_price > 1000;
      Job 1 (Mapper, then Reducer): LOAD, FILTER, GROUP, FOREACH, FILTER; shuffle and sort by user
  • 16. Pig & Hive: Mapping to MapReduce jobs
        events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
        events_filtered = FILTER events BY type;
        by_user = GROUP events_filtered BY user;
        price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
        high_pbu = FILTER price_by_user BY total_price > 1000;
        recent_high = ORDER high_pbu BY max_ts DESC;
        STORE recent_high INTO '/output';
      Job 1 (Mapper, then Reducer): LOAD, FILTER, GROUP, FOREACH, FILTER; shuffle and sort by user
      Job 2 (Mapper, then Reducer): LOAD (from tmp), STORE; shuffle and sort by max_ts
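To make the two-job plan concrete, here is a small pure-Python simulation of how such a script could be split into map and reduce steps. It is a sketch, not what Pig actually generates: the sample rows are invented, and the filter predicate (`type == "buy"`) is an assumption, since the slide does not spell out the condition.

```python
from collections import defaultdict

# Hypothetical sample rows: (type, user, price, timestamp).
EVENTS = [
    ("buy", "alice", 700, 10),
    ("buy", "alice", 600, 30),
    ("view", "alice", 0, 40),
    ("buy", "bob", 2000, 20),
    ("buy", "carol", 50, 50),
]

def job1_map(row):
    """Job 1 mapper: LOAD and the first FILTER, keyed by the GROUP BY field."""
    event_type, user, price, timestamp = row
    if event_type == "buy":  # assumed predicate, not given on the slide
        yield user, (price, timestamp)

def shuffle(pairs):
    """Group mapper output by key and sort the keys, as the shuffle phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def job1_reduce(user, values):
    """Job 1 reducer: aggregates per user, then applies the second FILTER."""
    total_price = sum(price for price, _ in values)
    max_ts = max(timestamp for _, timestamp in values)
    if total_price > 1000:
        yield user, total_price, max_ts

def run_plan(events):
    """Run job 1, then job 2, whose shuffle sorts the rows by max_ts descending."""
    mapped = [pair for row in events for pair in job1_map(row)]
    tmp = [out for user, values in shuffle(mapped)
           for out in job1_reduce(user, values)]
    return sorted(tmp, key=lambda row: row[2], reverse=True)

RESULT = run_plan(EVENTS)
```

The point of the sketch is that the ORDER BY cannot share a shuffle with the GROUP BY: it needs a different sort key, hence a second job reading the temporary output of the first.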
  • 17. Pig: How does it work. Data Execution Plan compiled into 10 map reduce jobs executed in parallel (or not)
        TResolution = LOAD '$PREFIX/dwh_dim_external_tracking_resolution/dt=$DAY' USING PigStorage('\u0001');
        TResolution = FOREACH TResolution GENERATE $0 AS SKResolutionId, $1 as ResolutionId;

        TSiteMap = LOAD '$PREFIX/dwh_dim_sitemapnode/dt=$DAY' USING PigStorage('\u0001');
        TSiteMap = FOREACH TSiteMap GENERATE $0 AS SKSimteMapNodeId, $2 as SiteMapNodeId;

        TCustomer = LOAD '$PREFIX/customer_relation/dt=$DAY' USING PigStorage('\u0001')
            as (SKCustomerId:chararray,
                CustomerId:chararray);

        F1 = FOREACH F1 GENERATE *, (date_time IS NOT NULL ? CustomFormatToISO(date_time, 'yyyy-MM-dd HH:mm:ss …

        F2 = FOREACH F1 GENERATE *,
            CONCAT(CONCAT(CONCAT(CONCAT(visid_high, '-'), visid_low), '-'), visit_num) as VisitId,
            (referrer matches '.*cdiscount.com.*' OR referrer matches 'cdscdn.com' ? NULL : referrer) as Referrer,
            (iso IS NOT NULL ? ISODaysBetween(iso, '1899-12-31T00:00:00') : NULL)
                AS SkDateId,
            (iso IS NOT NULL ? ISOSecondsBetween(iso, ISOToDay(iso)) : NULL)
                AS SkTimeId,
            ((event_list is not null and event_list matches '.*\\b202\\b.*') ? 'Y' : 'N') as is_202,
            ((event_list is not null and event_list matches '.*\\b10\\b.*') ? 'Y' : 'N') as is_10,
            ((event_list is not null and event_list matches '.*\\b12\\b.*') ? 'Y' : 'N') as is_12,
            ((event_list is not null and event_list matches '.*\\b13\\b.*') ? 'Y' : 'N') as is_13,
            ((event_list is not null and event_list matches '.*\\b14\\b.*') ? 'Y' : 'N') as is_14,
            ((event_list is not null and event_list matches '.*\\b11\\b.*') ? 'Y' : 'N') as is_11,
            ((event_list is not null and event_list matches '.*\\b1\\b.*') ? 'Y' : 'N') as is_1,
            REGEX_EXTRACT(pagename, 'F-(.*):.*', 1) AS ProductReferenceId,
            NULL AS OriginFile;

        SET DEFAULT_PARALLEL 24;

        F3 = JOIN F2 BY post_search_engine LEFT, TSearchEngine BY SearchEngineId USING 'replicated' PARALLEL 20;
        F3 = FOREACH F3 GENERATE *, (SKSearchEngineId IS NULL ? -1 : SKSearchEngineId) as SKSearchEngineId;
        --F3 = FOREACH F2 GENERATE *, NULL AS SKSearchEngineId, NULL AS SearchEngineId;

        F4 = JOIN F3 BY browser LEFT, TBrowser BY BrowserId USING 'replicated' PARALLEL 20;
        F4 = FOREACH F4 GENERATE *, (SKBrowserId IS NULL ? -1 : SKBrowserId) as SKBrowserId;

        --F4 = FOREACH F3 GENERATE *, NULL AS SKBrowserId, NULL AS BrowserId;

        F5 = JOIN F4 BY os LEFT, TOperatingSystem BY OperatingSystemId USING 'replicated' PARALLEL 20;
        F5 = FOREACH F5 GENERATE *, (SKOperatingSystemId IS NULL ? -1 : SKOperatingSystemId) as SKOperatingSystemId;

        --F5 = FOREACH F4 GENERATE *, NULL AS SKOperatingSystemId, NULL AS OperatingSystemId;

        F6 = JOIN F5 BY resolution LEFT, TResolution BY ResolutionId USING 'replicated' PARALLEL 20;
        F6 = FOREACH F6 GENERATE *, (SKResolutionId IS NULL ? -1 : SKResolutionId) as SKResolutionId;

        --F6 = FOREACH F5 GENERATE *, NULL AS SKResolutionId, NULL AS ResolutionId;

        F7 = JOIN F6 BY post_evar4 LEFT, TSiteMap BY SiteMapNodeId USING 'replicated' PARALLEL 20;
        F7 = FOREACH F7 GENERATE *, (SKSimteMapNodeId IS NULL ? -1 : SKSimteMapNodeId) as SKSimteMapNodeId;

        --F7 = FOREACH F6 GENERATE *, NULL AS SKSimteMapNodeId, NULL AS SiteMapNodeId;

        SPLIT F7 INTO WITHOUT_CUSTOMER IF post_evar30 IS NULL, WITH_CUSTOMER IF post_evar30 IS NOT NULL;

        F8 = JOIN WITH_CUSTOMER BY post_evar30 LEFT, TCustomer BY CustomerId USING 'skewed' PARALLEL 20;
        WITHOUT_CUSTOMER = FOREACH WITHOUT_CUSTOMER GENERATE *, NULL as SKCustomerId, NULL as CustomerId;

        --F8_UNION = FOREACH F7 GENERATE *, NULL as SKCustomerId, NULL as CustomerId;
        F8_UNION = UNION F8, WITHOUT_CUSTOMER;
        --DESCRIBE F8;
        --DESCRIBE WITHOUT_CUSTOMER;
        --DESCRIBE F8_UNION;

        F9 = FOREACH F8_UNION GENERATE
            visid_high,
            visid_low,
            VisitId,
            post_evar30,
            SKCustomerId,
            visit_num,
            SkDateId,
            SkTimeId,
            post_evar16,
            post_evar52,
            visit_page_num,
            is_202,
            is_10,
            is_12,
  • 18. Hive Joins: How to join with MapReduce?
      Mappers output:
        tbl_idx | uid | name
        1 | 1 | Dupont
        1 | 2 | Durand

        tbl_idx | uid | type
        2 | 1 | Type1
        2 | 1 | Type2
        2 | 2 | Type1
      Shuffle by uid, sort by (uid, tbl_idx)
      Reducer 1 input:
        uid | tbl_idx | name | type
        1 | 1 | Dupont |
        1 | 2 | | Type1
        1 | 2 | | Type2
      Reducer 1 output:
        uid | name | type
        1 | Dupont | Type1
        1 | Dupont | Type2
      Reducer 2 input:
        uid | tbl_idx | name | type
        2 | 1 | Durand |
        2 | 2 | | Type1
      Reducer 2 output:
        uid | name | type
        2 | Durand | Type1
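The reduce-side join on this slide can be reproduced with a short Python simulation. It is a minimal sketch of the tagging idea, not Hive's implementation; the function names are illustrative, and the data is the one shown on the slide.

```python
from collections import defaultdict

USERS = [(1, "Dupont"), (2, "Durand")]              # (uid, name), table index 1
TYPES = [(1, "Type1"), (1, "Type2"), (2, "Type1")]  # (uid, type), table index 2

def map_phase(users, types):
    """Tag every row with its table index so the reducer can tell them apart."""
    for uid, name in users:
        yield (uid, 1), name
    for uid, typ in types:
        yield (uid, 2), typ

def reduce_side_join(users, types):
    """Shuffle by uid, sort by (uid, tbl_idx), then join inside each reducer."""
    partitions = defaultdict(list)
    for (uid, tbl_idx), value in map_phase(users, types):
        partitions[uid].append((tbl_idx, value))

    joined = []
    for uid in sorted(partitions):
        names = []
        # Sorting puts table 1 rows first, so all names are known
        # before the first table 2 row arrives.
        for tbl_idx, value in sorted(partitions[uid]):
            if tbl_idx == 1:
                names.append(value)
            else:
                joined.extend((uid, name, value) for name in names)
    return joined

JOINED = reduce_side_join(USERS, TYPES)
```

Sorting on the composite key is what lets the reducer buffer only the rows of the first table and stream the second one.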
  • 19. Agenda (same as slide 3)
  • 20. Comparing without Comparable: Philosophy (Procedural vs Declarative; Data Model and Schema); Productivity (Headachability; Checkpointing; Testing and environment); Integration (Partitioning; Formats Integration; External Code Integration); Performance and optimization
  • 21. Procedural vs Declarative
      ◦ Procedural: transformation as a sequence of operations
      ◦ Declarative: transformation as a set of formulas
      Declarative (SQL):
        insert into ValuableClicksPerDMA
        select dma, count(*)
        from geoinfo join (
            select name, ipaddr
            from users join clicks on (users.name = clicks.user)
            where value > 0
        ) using ipaddr
        group by dma;
      Procedural (Pig):
        Users = load 'users' as (name, age, ipaddr);
        Clicks = load 'clicks' as (user, url, value);
        ValuableClicks = filter Clicks by value > 0;
        UserClicks = join Users by name, ValuableClicks by user;
        Geoinfo = load 'geoinfo' as (ipaddr, dma);
        UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
        ByDMA = group UserGeo by dma;
        ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
        store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
  • 22. Data Type and Model: Rationale
      ◦ All three extend the basic data model with extended data types: array-like [event1, event2, event3]; map-like {type1:value1, type2:value2, …}
      ◦ Different approaches: Resilient Schema; Static Typing; No Static Typing
  • 23. Hive: Data Type and Schema
        Simple type | Details
        TINYINT, SMALLINT, INT, BIGINT | 1, 2, 4 and 8 bytes
        FLOAT, DOUBLE | 4 and 8 bytes
        BOOLEAN |
        STRING | Arbitrary-length, replaces VARCHAR
        TIMESTAMP |

        Complex type | Details
        ARRAY | Array of typed items (0-indexed)
        MAP | Associative map
        STRUCT | Complex class-like objects

        CREATE TABLE visit (user_name STRING, user_id INT, user_details STRUCT<age:INT, zipcode:INT>);
  • 24. Pig: Data Types and Schema
        rel = LOAD '/folder/path/' USING PigStorage('\t') AS (col:type, col:type, col:type);

        Simple type | Details
        int, long, float, double | 32 and 64 bits, signed
        chararray | A string
        bytearray | An array of … bytes
        boolean | A boolean

        Complex type | Details
        tuple | A tuple is an ordered fieldname:value map
        bag | A bag is a set of tuples
  • 25. Cascading: Data Type and Schema
      ◦ Support for any Java type, provided it can be serialized in Hadoop
      ◦ No support for typing
        Simple type | Details
        Int, Long, Float, Double | 32 and 64 bits, signed
        String | A string
        byte[] | An array of … bytes
        Boolean | A boolean

        Complex type | Details
        Object | Object must be « Hadoop serializable »
  • 26. Style Summary
        Tool | Style | Typing | Data Model | Metadata store
        Pig | Procedural | Static + Dynamic | scalar + tuple + bag (fully recursive) | No (HCatalog)
        Hive | Declarative | Static + Dynamic, enforced at execution time | scalar + list + map | Integrated
        Cascading | Procedural | Weak | scalar + Java objects | No
  • 27. Comparing without Comparable: Philosophy (Procedural vs Declarative; Data Model and Schema); Productivity (Headachability; Checkpointing; Testing, error management and environment); Integration (Partitioning; Formats Integration; External Code Integration); Performance and optimization
  • 28. Headachability: Motivation. Does debugging the tool lead to bad headaches?
  • 29. Headaches: Pig
      ◦ Out Of Memory Error (Reducer)
      ◦ Exception in Building / Extended Functions (handling of null)
      ◦ Null vs “”
      ◦ Nested Foreach and scoping
      ◦ Date Management (pig 0.10)
      ◦ Field implicit ordering
  • 30. A Pig Error
  • 31. Headaches: Hive
      ◦ Out of Memory Errors in Reducers
      ◦ Few Debugging Options
      ◦ Null / “”
      ◦ No builtin “first”
  • 32. Headaches: Cascading
      ◦ Weak Typing Errors (comparing Int and String …)
      ◦ Illegal Operation Sequence (Group after group …)
      ◦ Field Implicit Ordering
  • 33. Testing: Motivation
      ◦ How to perform unit tests?
      ◦ How to have different versions of the same script (parameter)?
  • 34. Testing: Pig
      ◦ System variables
      ◦ Comment to test
      ◦ No metaprogramming
      ◦ pig -x local to execute on local files
  • 35. Testing / Environment: Cascading
      ◦ JUnit tests are possible
      ◦ Ability to use code to actually comment out some variables
  • 36. Checkpointing: Motivation
      ◦ Lots of iteration while developing on Hadoop
      ◦ Sometimes jobs fail
      ◦ Sometimes you need to restart from the start …
      Diagram labels: Page User Correlation, Output, Filtering, Parse Logs, Per Page Stats; FIX and relaunch
  • 37. Pig: Manual Checkpointing
      ◦ STORE command to manually store files
      Diagram labels: Page User Correlation, Output, Filtering, Parse Logs, Per Page Stats; // COMMENT beginning of script and relaunch
  • 38. Cascading: Automated Checkpointing
      ◦ Ability to re-run a flow automatically from the last saved checkpoint: addCheckpoint(…)
  • 39. Cascading: Topological Scheduler
      ◦ Check the timestamp of each intermediate file
      ◦ Execute only if more recent
      Diagram labels: Page User Correlation, Output, Filtering, Parse Logs, Per Page Stats
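The timestamp rule above is the same idea as in make: a step is re-run only when its output is missing or older than one of its inputs. The following Python sketch illustrates that rule on local files; it is not Cascading's API, and `needs_run` and `run_stale_steps` are hypothetical names chosen for this example.

```python
import os

def needs_run(input_paths, output_path):
    """True when the output is missing or older than at least one input."""
    if not os.path.exists(output_path):
        return True
    output_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(path) > output_mtime for path in input_paths)

def run_stale_steps(steps):
    """Run only the stale steps.

    `steps` is a list of (name, input_paths, output_path, action) tuples,
    assumed to be already sorted in topological order. Returns the names
    of the steps that were executed.
    """
    executed = []
    for name, input_paths, output_path, action in steps:
        if needs_run(input_paths, output_path):
            action()
            executed.append(name)
    return executed
```

Because the steps are visited in dependency order, a step that rewrites its output automatically makes its downstream steps stale as well.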
  • 40. Productivity Summary
        Tool | Headaches | Checkpointing / Replay | Testing / Metaprogramming
        Pig | Lots | Manual save | Difficult metaprogramming, easy local testing
        Hive | Few, but without debugging options | None (that’s SQL) | None (that’s SQL)
        Cascading | Weak typing, complexity | Checkpointing, partial updates | Possible
  • 41. Comparing without Comparable: Philosophy (Procedural vs Declarative; Data Model and Schema); Productivity (Headachability; Checkpointing; Testing and environment); Integration (Formats Integration; Partitioning; External Code Integration); Performance and optimization
  • 42. Formats Integration: Motivation
      ◦ Ability to integrate different file formats: Text Delimited; Sequence File (binary Hadoop format); Avro, Thrift, …
      ◦ Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)
      Format impact on size and performance:
        Format | Size on disk (GB) | Hive processing time (24 cores)
        Text File, uncompressed | 18.7 | 1m32s
        1 Text File, Gzipped | 3.89 | 6m23s (no parallelization)
        JSON compressed | 7.89 | 2m42s
        Multiple text files, gzipped | 4.02 | 43s
        Sequence File, Block, Gzip | 5.32 | 1m18s
        Text File, LZO Indexed | 7.03 | 1m22s
  • 43. Format Integration
      ◦ Hive: SerDe (Serializer-Deserializer)
      ◦ Pig: Storage
      ◦ Cascading: Tap
  • 44. Partitions: Motivation
      ◦ No support for “UPDATE” patterns: any increment is performed by adding or deleting a partition
      ◦ Common partition schemes on Hadoop: by date (/apache_logs/dt=2013-01-23); by data center (/apache_logs/dc=redbus01/…); by country; …; or any combination of the above
  • 45. Hive Partitioning: Partitioned tables
        CREATE TABLE event (user_id INT, type STRING, message STRING) PARTITIONED BY (day STRING, server_id STRING);
      Disk structure:
        /hive/event/day=2013-01-27/server_id=s1/file0
        /hive/event/day=2013-01-27/server_id=s1/file1
        /hive/event/day=2013-01-27/server_id=s2/file0
        /hive/event/day=2013-01-27/server_id=s2/file1
        …
        /hive/event/day=2013-01-28/server_id=s2/file0
        /hive/event/day=2013-01-28/server_id=s2/file1
        INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1') SELECT * FROM event_tmp;
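The key=value directory layout shown above is easy to mimic, which helps when reading or writing partitions from code outside Hive. The sketch below builds a partition path and selects the files of one partition by inspecting path segments; the helper names `partition_path` and `prune` are hypothetical, and the file list simply repeats the paths from the slide.

```python
def partition_path(table_root, partition):
    """Build 'root/key=value/...' from an ordered list of (key, value) pairs."""
    segments = [f"{key}={value}" for key, value in partition]
    return "/".join([table_root.rstrip("/")] + segments)

def prune(paths, **wanted):
    """Keep only the paths whose key=value segments match every wanted value."""
    selected = []
    for path in paths:
        segments = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if all(segments.get(key) == value for key, value in wanted.items()):
            selected.append(path)
    return selected

# File list copied from the disk structure shown on the slide.
FILES = [
    "/hive/event/day=2013-01-27/server_id=s1/file0",
    "/hive/event/day=2013-01-27/server_id=s1/file1",
    "/hive/event/day=2013-01-27/server_id=s2/file0",
    "/hive/event/day=2013-01-27/server_id=s2/file1",
    "/hive/event/day=2013-01-28/server_id=s2/file0",
    "/hive/event/day=2013-01-28/server_id=s2/file1",
]
```

Selecting by path is what makes an incremental load cheap: a daily job touches only the directories of the day it processes.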
  • 46. Cascading Partition
      ◦ No direct support for partitions
      ◦ Support for the “Glob” Tap, to read from files using patterns
      ◦ You can code your own custom or virtual partition schemes
  • 47. External Code Integration: Simple UDF (Pig, Hive, Cascading)
  • 48. Hive Complex UDF (Aggregators)
  • 49. Cascading: Direct Code Evaluation. Uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO
  • 50. Spring Batch: Cascading Integration
      ◦ Allows calling a Cascading flow from Spring Batch
      ◦ No full integration with Spring MessageSource or MessageHandler yet (only for local flows)
  • 51. Integration Summary
        Tool | Partition / Incremental Updates | External Code | Format Integration
        Pig | No direct support | Simple | Doable and rich community
        Hive | Fully integrated, SQL-like | Very simple, but complex dev setup | Doable and existing community
        Cascading | With coding | Complex UDFs but regular, and Java expression embeddable | Doable and growing community
  • 52. Comparing without Comparable: Philosophy (Procedural vs Declarative; Data Model and Schema); Productivity (Headachability; Checkpointing; Testing and environment); Integration (Formats Integration; Partitioning; External Code Integration); Performance and optimization
  • 53. Optimization
      ◦ Several common MapReduce optimization patterns: Combiners; MapJoin; Job Fusion; Job Parallelism; Reducer Parallelism
      ◦ Different support per framework: Fully automatic; Pragma / Directives / Options; Coding style / Code to write
  • 54. Combiner: Perform Partial Aggregate at Mapper Stage
        SELECT date, COUNT(*) FROM product GROUP BY date
      Map input rows: 2012-02-14 4354…; 2012-02-15 21we2; 2012-02-14 qa334…; 2012-02-15 23aq2
      Map output rows: 2012-02-14 4354…; 2012-02-15 21we2; 2012-02-14 qa334…; 2012-02-15 23aq2
      Reduce output: 2012-02-14 20; 2012-02-15 35; 2012-02-16 1
  • 55. Combiner: Perform Partial Aggregate at Mapper Stage
        SELECT date, COUNT(*) FROM product GROUP BY date
      Map input rows: 2012-02-14 4354…; 2012-02-15 21we2; 2012-02-14 qa334…; 2012-02-15 23aq2
      Partial counts, first mapper: 2012-02-14 12; 2012-02-15 23; 2012-02-16 1
      Partial counts, second mapper: 2012-02-14 8; 2012-02-15 12
      Reduce output: 2012-02-14 20; 2012-02-15 35; 2012-02-16 1
      Reduced network bandwidth. Better parallelism.
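The partial aggregation described on these two slides can be simulated as follows. This is a sketch in plain Python rather than Hadoop code: the rows are synthetic, generated only so that the per-mapper counts match the numbers on the slide (12, 23 and 1 for the first mapper, 8 and 12 for the second).

```python
from collections import Counter

def make_rows(counts):
    """Generate synthetic (date, row_id) rows for the given per-date counts."""
    return [(date, i) for date, n in counts.items() for i in range(n)]

MAPPER_1_ROWS = make_rows({"2012-02-14": 12, "2012-02-15": 23, "2012-02-16": 1})
MAPPER_2_ROWS = make_rows({"2012-02-14": 8, "2012-02-15": 12})

def map_with_combiner(rows):
    """Count rows per date inside the mapper, emitting one record per date."""
    return Counter(date for date, _ in rows)

def reduce_counts(partials):
    """Sum the partial counts coming from all mappers."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

PARTIALS = [map_with_combiner(MAPPER_1_ROWS), map_with_combiner(MAPPER_2_ROWS)]
TOTALS = reduce_counts(PARTIALS)

# Records crossing the network with and without the combiner.
SHUFFLED_WITH_COMBINER = sum(len(partial) for partial in PARTIALS)
SHUFFLED_WITHOUT_COMBINER = len(MAPPER_1_ROWS) + len(MAPPER_2_ROWS)
```

In this toy run 5 records are shuffled instead of 56. The trick only works because COUNT can be computed from partial counts; an aggregate that is not decomposable this way cannot use a combiner directly.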
  • 56. Join Optimization: Map Join
      ◦ Hive: set hive.auto.convert.join = true;
      ◦ Pig
      ◦ Cascading (no aggregation support after HashJoin)
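A map join avoids the shuffle by loading the small table into memory in every mapper and probing it while streaming the big table. The Python sketch below shows that idea only; it is not the code Hive, Pig or Cascading run, and the function name and sample rows are made up for the illustration.

```python
def map_join(big_rows, small_rows):
    """Inner join of (key, value) rows, done entirely on the map side.

    The small table is turned into an in-memory hash table, then the big
    table is streamed once. No shuffle or reducer is needed.
    """
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    for key, big_value in big_rows:
        for small_value in lookup.get(key, []):
            yield key, big_value, small_value

# Hypothetical data: a large click log and a small browser dimension table.
CLICKS = [("b1", "/home"), ("b2", "/cart"), ("b1", "/pay"), ("b9", "/home")]
BROWSERS = [("b1", "Firefox"), ("b2", "Chrome")]

JOINED = list(map_join(CLICKS, BROWSERS))
```

The approach holds only while the small table fits in each mapper's memory, which is why the frameworks expose it as an option or a size-based automatic conversion.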
  • 57. Number of Reducers
      ◦ Critical for performance
      ◦ Estimated from the size of the input files
      ◦ Hive divides the size by hive.exec.reducers.bytes.per.reducer (default 1GB)
      ◦ Pig divides the size by pig.exec.reducers.bytes.per.reducer (default 1GB)
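The estimate described above amounts to a ceiling division. The sketch below assumes 1GB means 10^9 bytes and adds an upper bound of 999 reducers; that cap is an assumption added for illustration (both tools have a configurable maximum, but the slide does not give its value), and `estimate_reducers` is a hypothetical helper name.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=10**9, max_reducers=999):
    """Estimate the reducer count as ceil(input size / bytes per reducer).

    The result is clamped between 1 and max_reducers. The default cap of
    999 is an assumption of this sketch, not a figure from the talk.
    """
    wanted = -(-input_bytes // bytes_per_reducer)  # integer ceiling division
    return max(1, min(max_reducers, wanted))
```

For example, the 18.7 GB uncompressed text file from the formats table would get 19 reducers under these assumptions.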
  • 58. Performance & Optimization Summary
        Tool | Combiner optimization | Join optimization | Number of reducers optimization
        Pig | Automatic | Option | Estimate or DIY
        Cascading | DIY | HashJoin | DIY
        Hive | Partial DIY | Automatic (Map Join) | Estimate or DIY
  • 59. Agenda (same as slide 3)
  • 60. Follow the Flow (flow diagram). Labels: Tracker Log, MongoDB, MySQL, MySQL, Syslog, Product Catalog, Order, Apache Logs, Session, Product Transformation, Category Affinity, Category Targeting, Customer Profile, Product Recommender, S3, Search Logs, (External) Search Engine Optimization, (Internal) Search Ranking, MongoDB, MySQL, Partner FTP, Sync In, Sync Out, Pig, Pig, Hive, Hive, ElasticSearch
  • 61. E.g. Product Recommender (flow diagram). Labels: Page Views, Orders, Catalog, Bots, Special Users, Filtered Page Views, User Affinity, Product Popularity, User Similarity (Per Category), Recommendation Graph, Recommendation, Order Summary, User Similarity (Per Brand), Machine Learning
  • 62. Pain Points on Large Projects
      ◦ Schema maintenance between tools
      ◦ Proper incremental and efficient synchronization between tools, NoSQL stores and log systems
      ◦ Proper partition “management” (daily jobs, …)
      ◦ Job sequence and management: how to properly handle a new field? Missing data? Recompute everything?
  • 63. Integration Option: HCatalog. HCatalog provides interoperability between Hive and Pig in terms of schema. (Diagram: Hive, Pig, HCatalog)
  • 64. Similar to “Build”. 1970 Shell script; 1977 Makefile; 1980 Makedeps; 1999 Cons/CMake; 2001 Maven; 2004 Ivy; 2008 Gradle. Shell Script; 2008 HaMake; 2009 Oozie; … ETL Hadoop; Next … ?
  • 65. Agenda (same as slide 3)
  • 66. Wrap-up
      ◦ Want to keep close to SQL? Hive
      ◦ Want to write large flows? Pig
      ◦ Want to integrate in large scale programming projects? Cascading (cascalog / scalding)
      Presentation available on http://www.slideshare.net/Dataiku