Cascading and BigData Problems
 

A series of case-studies using Cascading.

  • Startups expecting to need 'web scale' implementations are committing to technologies that might not be a good fit. Doing so can be a dramatic waste of time and money at the point when they can least afford it. Do you really have a Big Data problem? Do you have a plan for what you are going to do with it? Chris will try to explain where he sees Hadoop being used most successfully and will offer some guidelines on when to consider adopting it and any complementary technologies.

Presentation Transcript

  • Cascading and BigData Problems
    Chris K Wensel, Concurrent, Inc.
    Copyright Concurrent, Inc. 2011. All rights reserved.
  • About Me
    • Concurrent, Inc., Founder
      • Cascading support and tools
      • http://concurrentinc.com/
    • Cascading, Lead Developer (started Sept 2007)
      • An alternative API to MapReduce
      • http://cascading.org/
    • Formerly Hadoop mentoring and training
      • Sun - Apple - HP - LexisNexis - startups - etc.
    • Formerly Systems Architect & Consultant
      • Thomson/Reuters - TeleAtlas - startups - etc.
  • Overview
    • Case Studies
    • What’s in common?
    • Where does Hadoop fit?
    • Processing vs Innovation
  • Case Studies
    • ShareThis
    • BestBuy
    • FlightCaster
    • Etsy
    • Ion Flux
  • Summary
    • All running in production with Hadoop
    • All use AWS, most use Elastic MapReduce
    • All production processing was implemented in Cascading
    • Various other tools used at different stages of development
  • ShareThis
    • Cascading + AWS (pre-EMR)
    • Daily event log processing, initially multiple TB and growing
    • Details in the O’Reilly Hadoop book by Tom White
  • Lessons
    [Diagram labels: logprocessor, crawler, indexer; "every X hrs", "every Y hrs", "on crawl completion"]
    • Mark data as bad and why, never discard
      • useful for upstream debugging
    • Data is seasonal, cyclical, and bursty
    • Tune your app and cluster to the workload
    • (garbage collect Hadoop clusters)
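The "mark data as bad and why, never discard" lesson can be illustrated with a minimal Python sketch. It is not taken from the ShareThis pipeline: the tab-separated layout, the three-field expectation, and the names `tag_records`, `bad`, and `reason` are all assumptions made for illustration.

```python
from typing import Iterable, Iterator

def tag_records(lines: Iterable[str], expected_fields: int = 3) -> Iterator[dict]:
    """Yield every input line; flag problems instead of dropping the record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        reason = None
        if len(fields) != expected_fields:
            reason = f"expected {expected_fields} fields, got {len(fields)}"
        elif not fields[0]:
            reason = "empty key"
        # The raw line is always kept so upstream owners can debug it later.
        yield {"raw": line, "fields": fields, "bad": reason is not None, "reason": reason}

# Hypothetical event log lines: one clean, one truncated, one with no key.
sample = ["u1\tclick\t2011-01-01", "u2\tclick", "\tview\t2011-01-02"]
tagged = list(tag_records(sample))
good = [r for r in tagged if not r["bad"]]
bad = [r for r in tagged if r["bad"]]
```

Downstream steps would read only `good`, while `bad` travels with the output, reasons attached, for upstream debugging.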
  • BestBuy - Behavioral Ad-Targeting
    • Cascading + AWS (Elastic MapReduce)
    • Daily automated User Behavior Segmentation
    • 6 wks dev, 3 TB/day, $13k/mo
    • 500% increase in return on ad spend compared with a similar campaign a year before
    • http://aws.amazon.com/solutions/case-studies/razorfish/
  • Cluster
    [Diagram: E-Commerce Site and Ad System connected through Amazon Web Services; Elastic MapReduce slaves run the Map/Reduce behavior app over HDFS, with input and output in S3]
    • 200+ nodes, 9-12 hour runs
    • 30+ days of history + 3TB daily
    • Remote HTTP update of ad-server
      • of only changed data
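Sending "only changed data" to the ad server amounts to diffing the latest result against the previous one. The sketch below is a guess at the general shape, not the BestBuy implementation; the user-to-segment dictionaries and the name `changed_entries` are hypothetical.

```python
_MISSING = object()  # sentinel so a stored value of None still compares correctly

def changed_entries(previous: dict, current: dict) -> dict:
    """Return the upserts and deletes needed to turn `previous` into `current`."""
    upserts = {k: v for k, v in current.items() if previous.get(k, _MISSING) != v}
    deletes = sorted(k for k in previous if k not in current)
    return {"upsert": upserts, "delete": deletes}

# Hypothetical user -> segment assignments from two consecutive daily runs.
yesterday = {"u1": "sports", "u2": "tv", "u3": "audio"}
today = {"u1": "sports", "u2": "gaming", "u4": "tv"}
delta = changed_entries(yesterday, today)
```

Only `delta` would be posted to the remote system; the unchanged `u1` entry never crosses the wire.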
  • Road Blocks
    • No one really understood the data
      • Character formats (UTF-8 vs ...)
      • Zero byte chars
      • Unique columns not unique
      • Outliers in the data
    • Creating test data
    • QAing the data
      • result data was also big
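Several of these road blocks can be surfaced early with a small profiling pass over a sample of the raw input. The following is a sketch only: the tab-separated layout, the key column position, and the `profile` function are assumptions, and a real check would run as a distributed job over the full data set.

```python
from collections import Counter
from typing import List

def profile(raw_lines: List[bytes], key_index: int = 0) -> dict:
    """Report lines that are not UTF-8, lines with NUL bytes, and repeated keys."""
    report = {"not_utf8": [], "nul_bytes": [], "duplicate_keys": []}
    keys = Counter()
    for lineno, raw in enumerate(raw_lines, start=1):
        if b"\x00" in raw:
            report["nul_bytes"].append(lineno)
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            report["not_utf8"].append(lineno)
            continue  # cannot safely split a line we could not decode
        keys[text.split("\t")[key_index]] += 1
    report["duplicate_keys"] = sorted(k for k, n in keys.items() if n > 1)
    return report

# Hypothetical sample: line 2 is Latin-1, line 3 has a zero byte, key "a1" repeats.
rows = [b"a1\tok", b"a2\tcaf\xe9", b"a3\tnul\x00here", b"a1\tagain"]
report = profile(rows)
```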
  • FlightCaster - Predicting Flight Delays
    • Clojure + Cascading + AWS
    • Scours data on every domestic flight for the past 10 years and matches it to real-time conditions
    • Machine learning on Cascading, scoring on app server
    • 3 mos dev, 10 GB/day, <$2k/mo
  • Lessons
    • Even with a good abstraction, you must intuit the underlying model (MapReduce) to improve throughput
    • i.e. Logical vs Physical plans
      • we still need DBAs after decades of query planner dev
  • Etsy - Online Marketplace
    • JRuby + Cascading + AWS
    • 1B page-views & multiple TB of log data per month
    • 40-50 cascading.jruby jobs a night
    • http://codeascraft.etsy.com/2010/02/24/analyzing-etsys-data-with-hadoop-and-cascading/
    • http://www.concurrentinc.com/casestudies/etsy
  • Initially
    • JRuby for the ‘analysts’
    • Log pre-processing
    • DB snapshot diffs
    • Nightly and ad-hoc analytics
  • Data Driven Products
    • Search index/scoring (under dev)
    • Taste Test
    • Facebook gift recommender
    • Suggested shops
    • Top query list, etc.
    • Many more on the way
  • Ion Flux - Gene Sequencing
    • Cascading + AWS
    • Sequence Alignment
    • http://aws.amazon.com/solutions/case-studies/ion-flux/
  • Cluster
    • 10-30 nodes, using new HPC instances
    • 200-500 cores
    • Runs up to 50 hours
  • Architecture
    [Diagram: Ion Flux system architecture. Labeled components: Clinical Lab; Ion Torrent - Torrent Sequencer; Ion Torrent - Torrent Server (EC2); Ion Flux - Flux Capacitor; Ion Flux - LIMS; Ion Flux - Pipeline Controller; Ion Flux - Annotation Server; Ion Flux - Variant Server (EC2); Ion Flux - Client Website; Database (RDS); AWS - S3 Storage; AWS - EMR Cluster. The "Heavy Lifting" Ion Flux - Sequencing Pipeline runs on Cascading and includes the stages Create Cluster, Bootstrap Cluster, Configure Nodes, Start Pipeline, TMAP Alignment, SRMA, Sort by position, Split to Bins, PILEUP Variants, Cleanup Cluster, and Shutdown Cluster.]
  • Common Architecture
    [Diagram labels: loggers, raw data, intermediate data, valuable data; Analyst, Producer, Consumer, Developer; Value]
    • New data continuously arriving
    • Actively incorporating the new with the old
    • Updating backend systems
  • Common Constraints
    • Speed of light
    • Understanding the data
    • Creating tests and validating the results
    • Lifecycle phases have different environments
      • dev vs. integration vs. prod
    • Better algorithms, less cost, more complexity
  • Apps Have Many Stages
    • Heavy Lifting
    • Modeling & Learning
    • Processing
    • Scoring
  • Heavy Lifting
    • ETL-style processes hampered by physics
    • Moving/Transferring/Packaging data
    • Data cleansing and value normalization
  • Modeling & Learning
    • Also known as “Data Mining”
    • Ask lots of questions to understand the data
    • Machine learning, or
    • Ad-hoc queries
    • Where the innovation happens
  • Processing
    • Transforming and/or combining multiple data sets into new data sets or models
      • Analytics
      • Statistics
      • Enrichment
      • Entity extraction
      • Indexing (w/ scoring)
      • Feature reduction
      • Matching
  • Scoring
    • Apply what’s learned
    • Sometimes batch (as part of Processing)
      • indices with search result ranking
    • Sometimes transactional, req/resp
      • prediction, recommendations, etc.
  • In Summary
    [Diagram labels: stages collection, cleansing, processing, delivery; data progression event data, signal, info, knowledge; activities normalization, scoring, mining]
    • The point of computing systems is to make data more valuable
  • Where does Hadoop fit?
  • Hadoop
    [Diagram: a cluster of racks and nodes presenting a Global Compute-space and a Global Namespace]
    • Distributed replicated storage for large files
    • Distributed fault-tolerant execution of batch processes
    • Scale out vs (legacy) scale up
    • Java API allows complex analysis, more freedom
  • MapReduce
    • A “divide and conquer” strategy for parallelizing workloads against collections of data
    • Map & Reduce are two user-defined functions chained via Key Value Pairs
    • It’s really Map->Group->Reduce where Group is built in
  • Keys and Values
    • Map translates input keys and values to new keys and values
      [K1,V1] -> Map -> [K2,V2]*
    • System groups each unique key with all its values
      [K2,V2] -> Group -> [K2,{V2,V2,....}]
    • Reduce translates the values of each unique key to new keys and values
      [K2,{V2,V2,....}] -> Reduce -> [K3,V3]*
    • * = zero or more
  • Word Count
    • Mapper
      [0, "when in the course of human events"] -> Map -> ["when",1] ["in",1] ["the",1] [...,1]
    • Group
      ["when",1] ["when",1] ["when",1] ["when",1] ["when",1] -> Group -> ["when",{1,1,1,1,1}]
    • Reducer
      ["when",{1,1,1,1,1}] -> Reduce -> ["when",5]
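The Map->Group->Reduce word count can be simulated in a few lines of single-process Python. This is a teaching sketch, not Hadoop's Java API: `run_job`, `map_fn`, and `reduce_fn` are names chosen here, and the built-in Group step is stood in for by a dictionary of lists.

```python
from collections import defaultdict

def map_fn(offset, line):
    """[K1,V1] -> [K2,V2]*: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """[K2,{V2,...}] -> [K3,V3]*: sum the ones collected for a word."""
    yield word, sum(counts)

def run_job(records, mapper, reducer):
    groups = defaultdict(list)
    for k1, v1 in records:                      # Map
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)               # Group (built in on Hadoop)
    output = []
    for k2 in sorted(groups):                   # Reduce, one call per unique key
        output.extend(reducer(k2, groups[k2]))
    return output

# Keys are byte offsets into the input, as in the slide's [0, "when in ..."].
lines = [(0, "when in the course of human events"), (35, "when in doubt")]
counts = dict(run_job(lines, map_fn, reduce_fn))
```

The user supplies only `map_fn` and `reduce_fn`; everything inside `run_job` is what the framework provides.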
  • Divide and Conquer Parallelism
    • The ‘records’ entering the Map and the ‘groups’ entering the Reduce are independent
    • That is, there is no expectation of order or requirement to share state between records/groups
    • So arbitrary numbers of Map and Reduce function instances can be created against arbitrary portions of input data
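Because records and groups are independent, the result should not depend on how the input is divided. The sketch below checks that property for word count. Python threads stand in for cluster nodes purely for illustration (they show independence rather than real speed-up), and `word_count`, `map_split`, and `reduce_group` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def map_split(lines):
    """One Map instance: sees only its own split and shares no state."""
    return [(word, 1) for line in lines for word in line.split()]

def reduce_group(item):
    """One Reduce call: sees only one key and that key's values."""
    word, ones = item
    return word, sum(ones)

def word_count(lines, num_splits):
    # Carve the input into arbitrary portions, one per Map instance.
    splits = [lines[i::num_splits] for i in range(num_splits)]
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_split, splits))
        groups = {}
        for pairs in mapped:
            for word, one in pairs:
                groups.setdefault(word, []).append(one)
        reduced = list(pool.map(reduce_group, groups.items()))
    return dict(reduced)

text = ["a b a", "b c", "a", "c c b"]
one_split = word_count(text, 1)
three_splits = word_count(text, 3)
```

Both calls produce the same counts, which is what lets the framework pick the number of Map and Reduce instances freely.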
  • Cluster
    [Diagram: a cluster of racks and nodes, with map and reduce instances spread across the nodes]
    • Multiple instances of each Map and Reduce function are distributed throughout the cluster
  • Another View
    [K1,V1] -> Map -> [K2,V2] -> Combine -> Group -> [K2,{V2,...}] -> Reduce -> [K3,V3]
    [Diagram: Mapper Tasks (same code) each read one split of the input file (split1, split2, split3, split4, ...); a Shuffle feeds Reducer Tasks, which write part-00000, part-00001, ... part-000N into an output directory]
    • Mappers must complete before Reducers can begin
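The Combine and Shuffle steps in this view can be sketched as follows. This is a single-process simulation under stated assumptions: `zlib.crc32` is a stand-in for the framework's partitioning function, and `mapper_task` and `reducer_task` are names invented here.

```python
import zlib
from collections import Counter, defaultdict

def mapper_task(split, num_reducers):
    """Map + Combine on one split, then partition the output by key."""
    # Combine: pre-aggregate locally so fewer pairs cross the network.
    combined = Counter(word for line in split for word in line.split())
    partitions = defaultdict(dict)
    for word, count in combined.items():
        # Every occurrence of a key must land on the same reducer.
        partitions[zlib.crc32(word.encode("utf-8")) % num_reducers][word] = count
    return partitions

def reducer_task(partition_id, mapper_outputs):
    """Shuffle in this reducer's partition from every mapper, then Reduce."""
    totals = Counter()
    for partitions in mapper_outputs:
        totals.update(partitions.get(partition_id, {}))
    return dict(totals)

splits = [["a b a"], ["b c", "a"]]
num_reducers = 2
# All mapper tasks finish before any reducer task starts.
mapper_outputs = [mapper_task(s, num_reducers) for s in splits]
parts = {f"part-{r:05d}": reducer_task(r, mapper_outputs) for r in range(num_reducers)}
```

Each reducer produces one `part-NNNNN` entry, mirroring the one-file-per-reducer output directory in the diagram.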
  • Architectural Components
    [Diagram: the Client submits jobs to the JobTracker, which hands tasks to TaskTrackers; each TaskTracker runs mapper and reducer child JVMs; the NameNode (with a Secondary) handles namespace (ns) operations; DataNodes handle read/write of data blocks]
    • Solid boxes are unique applications
    • Dashed boxes are child JVM instances on same node as parent
    • Dotted boxes are blocks of managed files on same node as parent
  • Deployment Topology
    [Diagram: Client sends jobs to the JobTracker node, which sends tasks to nodes running a TaskTracker and a DataNode; NameNode and Secondary are on their own nodes; JobTracker and NameNode are annotated "Not uncommon to be same node"]
    • Job Client may run on any node
    • NameNode and JobTracker may run on same node (Master)
    • DataNode and TaskTracker instances should run on same node (Slaves)
    • NameNode and SecondaryNode shouldn’t typically run on same node
  • Complex job assemblies
    • Real applications are many MapReduce jobs chained together
    • Linked by intermediate (usually temporary) files
    • Executed in order, by hand, from the ‘client’ application
    [Diagram: File -> Count Job (Map -> Reduce) -> File -> Sort Job (Map -> Reduce) -> File]
    [ k, v ] = key and value pair
    [ k, [v] ] = key and associated values collection
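The Count Job followed by a Sort Job, linked by a temporary file, can be mimicked locally. In this sketch plain functions stand in for MapReduce jobs and a temporary directory stands in for HDFS; the file names and the `run_chain` driver are hypothetical.

```python
import os
import tempfile
from collections import Counter

def count_job(in_path, out_path):
    """Job 1: read lines of text, write word<TAB>count lines."""
    counts = Counter()
    with open(in_path) as src:
        for line in src:
            counts.update(line.split())
    with open(out_path, "w") as dst:
        for word, n in counts.items():
            dst.write(f"{word}\t{n}\n")

def sort_job(in_path, out_path):
    """Job 2: read job 1's output, write it ordered by descending count."""
    with open(in_path) as src:
        rows = [line.rstrip("\n").split("\t") for line in src]
    rows.sort(key=lambda r: (-int(r[1]), r[0]))
    with open(out_path, "w") as dst:
        for word, n in rows:
            dst.write(f"{word}\t{n}\n")

def run_chain(text):
    """The 'client': runs the jobs in order and wires up the files by hand."""
    with tempfile.TemporaryDirectory() as work:
        raw, tmp, final = (os.path.join(work, name)
                           for name in ("input.txt", "counts.tmp", "sorted.txt"))
        with open(raw, "w") as f:
            f.write(text)
        count_job(raw, tmp)    # intermediate file exists only to link the jobs
        sort_job(tmp, final)
        with open(final) as f:
            return f.read()

result = run_chain("b a b\nc b a\n")
```

Even with two jobs, the driver has to know every path and the execution order, which is the bookkeeping that grows with the job count.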
  • Match two sets using prefix filtering
    [Diagram: a chain of MapReduce jobs linked by files. Labeled jobs: Tokenize, Count Job, Join Tokens/Counts Job, Sort/Prefix Filter Job, Self Join Job, Unique Pairs Job, Join LHS Job, Join RHS / Match Job]
  • Real World Apps
    [Diagram: job graph of a single application, nodes labeled [1/75] through [75/75], most of them "map+reduce" and some "map" only]
    • 1 app, 75 jobs
    • green = map + reduce
    • purple = map
    • blue = join/merge
    • orange = map split
  • Cascading
    [Diagram: Word Count/Sort Flow. Data -> Parse -> Group -> Count -> Sort -> Data, planned onto two Map/Reduce jobs; tuples [ f1,f2,.. ] pass between operations]
    [ f1, f2,... ] = tuples with field names
    • Alternative model & API to MapReduce
      • pipe/filters of re-usable operations
    • For rapidly implementing Data Processing Systems
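The pipe/filter idea, with tuples carrying field names through re-usable operations, can be approximated in plain Python. To be clear, this is not the Cascading API (which is Java); `assemble`, `parse`, `group_count`, and `sort_by_count` are hypothetical names, and dictionaries stand in for tuples with named fields.

```python
from collections import Counter
from functools import reduce

def parse(tuples):
    """Operation: turn each {"line": ...} tuple into {"word": ...} tuples."""
    for t in tuples:
        for word in t["line"].split():
            yield {"word": word}

def group_count(tuples):
    """Operation: group on the "word" field and add a "count" field."""
    counts = Counter(t["word"] for t in tuples)
    for word, n in counts.items():
        yield {"word": word, "count": n}

def sort_by_count(tuples):
    """Operation: order tuples by descending "count", then by "word"."""
    yield from sorted(tuples, key=lambda t: (-t["count"], t["word"]))

def assemble(*operations):
    """Chain operations into one flow; each consumes the previous one's tuples."""
    return lambda source: reduce(lambda stream, op: op(stream), operations, source)

flow = assemble(parse, group_count, sort_by_count)
result = list(flow([{"line": "b a b"}, {"line": "c b a"}]))
```

Each operation can be exercised on its own with a handful of in-memory tuples, independent of any cluster, and the same operations can be re-assembled into other flows.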
  • Cascading
    • Allows for unit testing independent of integration
    • Re-usable libraries
    • Integration is first class
    • Homogeneous framework for scheduling
    • Any JVM-based host language
  • Elastic MapReduce
    [Diagram: a User drives Amazon Web Services Elastic MapReduce through the CLI or Console; a Master and Slaves run the Map/Reduce jobs; HDFS holds temp data; S3 holds the input, the output, and the jar]
    • Clusters typically single purpose
    • S3 used for storage between runs
  • Architecture Isn’t Innovation
    [Diagram labels: operationalization spanning collection, cleansing, processing, delivery; data progression event data, signal, info, knowledge; activities normalization, scoring, mining; innovation]
    • Rate of innovation and arrival of answers are proportional
  • Big vs Lots
    [Diagram labels: Lots of Data, "Big" Data; Data Mining*, Data Warehousing, Data Processing; legend: ! = Hadoop, ? = RDBMS, R, etc.]
    • Big - too much to fit in/on any one thing
    • Lots - complexity arising from keeping track of all the bits
  • At Rest vs In Motion
    [Diagram, Data At Rest: loggers -> raw data -> ETL -> data warehousing -> ETL -> data mining (Analyst)]
    [Diagram, Data In Motion: loggers -> raw data -> data processing -> valuable data -> Consumer process]
    • Hub/Spoke vs Incremental Layers
    • Static Schema vs Dynamic Views
    • Monolithic vs Distributed
  • Hadoop for Processing
    [Diagram labels: Value Creation, Scalability, Simplicity]
    • Delivering Value from Innovation
    • Scalability, Not Performance
    • Simplifies Infrastructure
  • Simplicity
    [Diagram: a cluster of racks and nodes; the CPUs form a Global Compute-space and the disks form a Global Namespace]
    • Virtualization across resources, not within (PaaS)
      • A single FileSystem across disks - no DBA
      • A single Execution System across CPUs - less IT
    • One app installed and managed across hardware
  • Scalability
    [Diagram: Users run Clients that submit jobs to a cluster of racks and nodes]
    • Scalability - continued reliability and met expectations as demand changes
    • Application Scalability - data grows, app/infra expand
    • Organizational Scalability - simpler infra and apps
  • Delivering Value
    [Diagram labels: events, loggers, raw data, data processing, reporting; Hadoop + Cascading (etl), Hadoop + Cascading (analytics); Producer, Consumer; product, operational; Value]
    • Unconstrained processing model
    • Data processing requires integration
    • Processing must not fail or fall behind
  • Data In Motion
    [Diagram: loggers -> raw data -> data processing -> valuable data -> Consumer process]
    • Data always arriving, results being delivered
    • Not paying the upfront cost of indexing
    • No upfront schema design
    • “ETL” is built into the processing pipeline
  • Where to Innovate?
    [Diagram labels: Lots of Data, "Big" Data; Data Mining*, Data Warehousing, Data Processing; legend: ! = Hadoop, ? = RDBMS, R, etc.]
    • Whether Hadoop makes sense as your innovation platform depends on the problem
  • Hadoop for Innovating
    [Diagram labels: value, innovation, latency, degrees of freedom]
    • Need to ask similar questions repeatedly
      • Indexes help here
    • Need a reasonably high abstraction
      • Existing libraries and a simple syntax
      • Third-party tool support
  • Innovation Abstractions
    • Syntax
      • Pig
      • Hive - now has some indexing support
    • Language (easier to operationalize)
      • Cascalog
      • Cascading.jruby
      • 3 new Scala languages pending release
  • Data At Rest
    [Diagram: loggers -> raw data -> ETL -> data warehousing -> ETL -> data mining (Analyst)]
    • Hadoop becomes a warehouse (with Schemas)
    • and without indexes, high latency queries
    • ETL becomes an independent architecture
  • Don’t throw out the baby with the bath water
    • Need low latency responses
    • Need support for existing tools
    • Need to not retrain analysts
      • RDBMS (Aster, Greenplum, Vertica, Oracle)
      • SAS
      • MicroStrategy
      • R
      • Tableau
  • Baling Wire & Bubble Gum
    • Integrating them with Hadoop adds brittleness and inefficiencies
      • Hadoop Streaming
      • RHIPE, etc.
  • Operationalizing
    [Diagram labels: operationalization spanning collection, cleansing, processing, delivery; data progression event data, signal, info, knowledge; activities normalization, scoring, mining; innovation]
    • Minimize the number of processing tech (debt)
    • Don’t lose sight of the physical model/plan
    • XML is not a programming language
    • String concatenation isn’t programming
  • Resources
    • Chris K Wensel
      • chris@wensel.net
      • @cwensel
    • Cascading & Cascalog
      • http://cascading.org
      • @cascading
    • Concurrent, Inc.
      • http://concurrentinc.com
      • @concurrent
      • http://concurrentinc.com/careers