Sv big datascience_cliffclick_5_2_2013
Presentation Transcript

  • Big Data for Big Questions
    Cliff Click, CTO, 0xdata
    cliffc@0xdata.com
    http://0xdata.com
    http://cliffc.org/blog
  • ● Motivation: What & Why Big Math?
    ● Better Mousetrap
    ● Demo
    ● Fork: Deep Dive into Math Hacking ...or... K/V Store
    Source: https://github.com/0xdata/h2o
  • 42!
  • 42! What was the question again?
    Oh yeah, it was:
    ● How do I place ads based on a clickstream?
    ● Detect fraud in a credit-card swipe stream?
    ● Detect cancer from sensor data?
    ● Predict equipment failure ahead of time?
    ● Find people (un)like me?
    ● ... or ... or ... or ... ????
  • How do I figure it all out?
    ● Well... what are my tools?
    ● Domain Knowledge (me! The Expert)
    ● Math & Science! Data Science, and
    ● Data – lots and lots and lots of it
    ● Old logs, new logs, databases, historical records, click-streams, CSV files, dumps
    ● Often TBs, sometimes PBs of it
  • Data: The Main Player
    ● Data: I got lots of it
    ● But it's a messy, mixed-up lot
    ● Stored in HDFS, S3, DB2, or scattered about
    ● Incompatible formats, older & newer bits
    ● Missing stuff, or "known broken" fields
    ● And it's Big
    ● Too big for my laptop, or even one server
  • Data: Cleaning It Up
    ● Just the parts I want:
    ● SQL, Hive, HBase, grep
    ● Data is Big, so this is slow
    ● Wrong format:
    ● Awk, shell scripts, files, disk-to-disk
    ● Inspection (do I got it right yet?)
    ● Grep/awk, histograms, plots/prints
    ● Visualization tools
  • From Facts to Knowledge
    ● Data cleaned up: lots of neat rows of facts
    ● Lots of rows: millions and billions...
    ● But facts are not knowledge
    ● Too much to "get it" by looking
    ● Time for a mathematical Model!
    ● Here again, Big limits my tools
    ● Either can't deal, or deal very very slowly
  • Modeling: math(data)
    ● Modeling gives a simpler view
    ● A way to understand
    ● And predict in real time
    ● Modeling is Math!
    ● Generalized Linear Modeling – Oldest, most well known & used
    ● Random Forest
    ● K-Means Clustering
  • Big Data vs Modeling
    ● Model: a concise description of my data
    ● A more accurate model predicts better
    ● Generally More Data builds a better Model
    ● But only if the tool can handle it
    ● (some datasets are not helped, but it rarely hurts)
    ● Tools can't handle Big: so down-sample, and use a better (more complex) algorithm
  • Big Data vs Better Algorithm
    ● Don't want to choose Big vs Better
    ● Down-sampling loses information
    ● Want a way to manipulate Big Data like it's small: interactive & fast. Subtle when I need it and brute force when I don't
    ● Build the Better Algorithm and use Big Data
    ● Seeing 10x more data yields prediction increases, e.g. from 75% to 85%
  • Building The Better Big Data Mousetrap
    ● Want fast: means DRAM instead of disk
    ● Fall back to disk, if data >>> DRAM
    ● Want fast: use all CPUs
    ● Problems are mostly data-parallel anyway
    ● Want ease-of-programming:
    ● "parallelism without effort"
    ● Well understood programming model
  • Building The Better Big Data Mousetrap
    ● Want ease-of-use:
    ● Python, JSON, REST/HTML interfaces
    ● Full R semantics (via the fastr project)
    ● Data ingest:
    ● where: HDFS, S3, NFS, URL, URI, browser
    ● what: CSV, Hive, RData
  • Building The Better Big Data Mousetrap
    ● Want ease-of-admin:
    ● e.g. java -jar h2o.jar
    ● auto-cluster (no config at all) or Hadoop Job
    ● Want ease-of-upgrade: adding more servers gives
    ● More CPU (faster exec)
    ● More DRAM (larger data in DRAM)
    ● More network/disk bandwidth (faster ingest)
  • H2O: An Engine for Big Math
    ● Built in layers – pick your abstraction level
    ● Analysts, starters: REST, browser – "clicky clicky" load data, build model, score
    ● Scientists: R, JSON, Python to drive the engine – Complex math
    ● Math hackers: building new algos – Full (distributed) Java Memory Model – "codes like Java, runs distributed"
    ● Core Engineering: call us, we're hiring
  • Core Engineering: K/V Store
    ● Classic distributed Key/Value store
    ● get/put/atomic-transaction
    ● Full JMM semantics, exact consistency
    ● Full caching as-needed – Cached keys "get" in 150 nanos – Misses limited by network speed
    ● Hardware-like cache coherency protocol
    ● Distributed fork/join (thanks Doug Lea)
  • Core Engineering: D/F/J
    ● Distributed fork/join (JSR 166y)
    ● Recursive-descent for data-parallel
    ● Distribution handled by the core – Log-tree scatter/gather across the cluster
    ● Supports map/reduce-style directly
    ● But also "do this on all nodes" style
    ● Or random graph hacking
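The recursive-descent data-parallel style above can be sketched with the JDK's own fork/join framework, a single-node stand-in for H2O's distributed D/F/J (the class name, chunk size, and summing task are illustrative, not H2O code):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Recursive descent: split the index range in half until chunks are
// small, sum each chunk directly, and combine halves on the way up.
public class ParallelSum extends RecursiveTask<Long> {
    static final int CHUNK = 1_000;   // leaf size; tune per workload
    final long[] data;
    final int lo, hi;

    ParallelSum(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override protected Long compute() {
        if (hi - lo <= CHUNK) {       // small enough: do the work inline
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;    // split in half
        ParallelSum left = new ParallelSum(data, lo, mid);
        left.fork();                  // run the left half in parallel
        long right = new ParallelSum(data, mid, hi).compute();
        return right + left.join();   // combine: the "reduce" step
    }

    public static long sum(long[] data) {
        return ForkJoinPool.commonPool()
                           .invoke(new ParallelSum(data, 0, data.length));
    }
}
```

In H2O the same shape is distributed: the split step also scatters work across nodes in a log-tree, and join gathers results back up.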
  • Math Hacking
    ● "Tastes like (distributed) Java" (actual inner loop, auto-parallel, auto-distributed)
    ● Big "vector math" is easy
    ● The obvious for-loop "just works"

      for( int i=0; i<rows; i++ ) {
        double X = ary.datad(bits,i,_colA);
        double Y = ary.datad(bits,i,_colB);
        _sumX  += X;
        _sumY  += Y;
        _sumX2 += X*X;
      }
  • Math Hacking
    ● Dense-vector algorithms are easy
    ● Generalized Linear Modeling: 2 weeks
    ● K-means: 2 days
    ● Histogram: 2 hours
    ● Random Forest: not dense vectors
    ● Still makes good use of D/F/J
    ● All-CPUs, all-nodes still light up – Very fast tree building
  • Science: dancing with the data
    ● Like the belle of the ball, the main algos (GLM, k-means, RF) only arrive when the data is properly dressed
    ● Munging data: dropping junk columns, replacing missing bits, adding features
    ● H2O provides a tool-kit
    ● Big vector calculator: "d := a+b*c"
    ● DRAM speeds: "msec per GByte"
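The "big vector calculator" expression d := a+b*c amounts to an elementwise fused multiply-add across all CPUs. A minimal single-node sketch in plain Java (a parallel stream stands in for H2O's distributed, chunked execution; the class and method names are illustrative):

```java
import java.util.stream.IntStream;

// Elementwise d[i] = a[i] + b[i]*c[i], data-parallel over the index
// range. Each index writes a distinct slot, so no synchronization is
// needed despite running on all cores.
public class VecCalc {
    public static double[] fma(double[] a, double[] b, double[] c) {
        double[] d = new double[a.length];
        IntStream.range(0, a.length)
                 .parallel()
                 .forEach(i -> d[i] = a[i] + b[i] * c[i]);
        return d;
    }
}
```

At "msec per GByte" speeds, a one-line expression like this is fast enough to make data munging interactive even on large vectors.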
  • Science: APIs
    ● Need to script, automate repetitive tasks
    ● R via fastr and the bigmemory package
    ● Full R semantics, 5x R speed single-threaded
    ● But your vectors can be very very big...
    ● https://github.com/allr/fastr
    ● REST / URL / JSON
    ● Drive from e.g. Python, scripts, curl, wget – e.g. the h2o testing harness is all Python
  • Demos & Quick Starts
    ● Full browser interface
    ● Tutorials
    ● Handful of clicks to run e.g. RF or GLM on gigabytes of data
    ● Auto-cluster in seconds
    ● On EC2 (or your laptops right now)
    ● Good enough for serious work
    ● (and we have customers using this interface!)
  • Demo Time!
  • H2O: An Engine for Big Math
    ● Focus on Big Math
    ● Easy to extend via M/R or K/V programming
    ● Auto-cluster
    ● Data-parallel exec across all CPUs
    ● DRAM caching across all servers
    ● Parallel ingest across all servers
    ● Open source: https://github.com/0xdata/h2o
  • Math Hacking: The M/R API
    ● Make a golden object
    ● Will be endlessly replicated across the cluster
    ● Set input fields: – Auto-serialized, distributed – Shallow-copy on nodes: e.g. arrays share state
    ● golden.map(key_1mb)
    ● map() called on a clone for each 1MB
    ● Set output fields now
  • Math Hacking: The M/R API
    ● gold.reduce(gold)
    ● Combine pairs of golden objects
    ● Both locally and remotely (distributed)
    ● Log-tree roll-up
    ● Output fields will be shipped over the wire
    ● Null-out input fields
    ● Transient marker available
  • Math Hacking: Example

      CalcSumsTask cst = new CalcSumsTask();
      cst._arykey = ary._key;   // BigData Table key
      cst._colA = colA;         // integer indices to columns
      cst._colB = colB;
      cst.invoke(ary._key);     // Do It!
      // Results returned directly in the cst object
      ...cst._sumX...           // use results

      public static class CalcSumsTask extends MRTask {
        Key _arykey;                  // BigData Table key
        int _colA, _colB;             // Column indices to work on
        double _sumX, _sumY, _sumX2;  // Sum of Xs, Ys, X^2s
  • Math Hacking: Example

      public static class CalcSumsTask extends MRTask {
        Key _arykey;                  // BigData Table key
        int _colA, _colB;             // Column indices to work on
        double _sumX, _sumY, _sumX2;  // Sum of Xs, Ys, X^2s

        // map called for every 1MB of data, or so
        public void map( Key key1Mb ) {
          ... boiler plate ...        // lots of unimportant details
          // Standard for-loop over the data
          for( int i=0; i<rows; i++ ) {
            double X = ary.datad(bits,i,_colA);
            double Y = ary.datad(bits,i,_colB);
            _sumX  += X;
            _sumY  += Y;
            _sumX2 += X*X;
          }
        }
  • Math Hacking: Example

      public static class CalcSumsTask extends MRTask {
        Key _arykey;                  // BigData Table key
        int _colA, _colB;             // Column indices to work on
        double _sumX, _sumY, _sumX2;  // Sum of Xs, Ys, X^2s

        // reduce called between pairs of golden objects
        // always reduce the right side into this object
        public void reduce( DRemoteTask rt ) {
          CalcSumsTask cst = (CalcSumsTask)rt;
          _sumX  += cst._sumX;
          _sumY  += cst._sumY;
          _sumX2 += cst._sumX2;
        }
      }
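The CalcSumsTask slides depend on H2O's MRTask machinery and cannot run on their own. A self-contained, single-node analog of the same "golden object" pattern (task object carries input fields, map() fills output fields per chunk, reduce() folds the right-hand task into this one; all names here are illustrative, not H2O API):

```java
// Plain-Java sketch of the golden-object map/reduce shape.
public class SumsTask {
    // "Input fields" (in H2O: auto-serialized to every node)
    final double[] xs, ys;
    final int lo, hi;                     // this clone's chunk bounds
    // "Output fields" (in H2O: shipped back over the wire)
    public double sumX, sumY, sumX2;

    SumsTask(double[] xs, double[] ys, int lo, int hi) {
        this.xs = xs; this.ys = ys; this.lo = lo; this.hi = hi;
    }

    // map: do the per-chunk work, filling output fields
    void map() {
        for (int i = lo; i < hi; i++) {
            sumX  += xs[i];
            sumY  += ys[i];
            sumX2 += xs[i] * xs[i];
        }
    }

    // reduce: fold the right-side task into this one
    void reduce(SumsTask rt) {
        sumX += rt.sumX; sumY += rt.sumY; sumX2 += rt.sumX2;
    }

    // Two chunks here; H2O runs one clone per 1MB chunk per node,
    // then rolls the pairs up in a log-tree.
    public static SumsTask run(double[] xs, double[] ys) {
        int mid = xs.length / 2;
        SumsTask left  = new SumsTask(xs, ys, 0, mid);
        SumsTask right = new SumsTask(xs, ys, mid, xs.length);
        left.map(); right.map();
        left.reduce(right);
        return left;
    }
}
```

The key property, as on slide 31, is that reduce() only ever combines pairs, so the roll-up can happen locally or remotely in any tree order.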
  • A Fast K/V Store
    ● Distributed in-memory K/V Store
    ● Peer-to-peer, no master
    ● Full JMM semantics, get/put/atomic/remove
    ● Hardware-style cache-coherency protocol
    ● Fast: 150 nanos for a cache-hitting get
    ● Fast: 50 micros for a cache-missing put
    ● No persistence (see above for fast)
    ● No locks: use atomic instead
  • K/V Design Goals
    ● JMM semantics on all get/put
    ● Cache-hitting gets as fast as possible
    ● Local hashtable lookup + a few tests
    ● puts as lazy as possible (still JMM)
    ● Typically do not block for a remote put
    ● Arbitrary transactions on single Keys
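The "no locks, use atomic instead" and "arbitrary transactions on single Keys" goals can be illustrated with the JDK's own lock-free-to-the-caller primitives. This is a local sketch, not the H2O store (the class name and counter example are hypothetical); ConcurrentHashMap guarantees its per-key update functions run atomically:

```java
import java.util.concurrent.ConcurrentHashMap;

// A single-key transaction as an atomic read-modify-write: callers
// never take an explicit lock, and concurrent updates to the same
// key are serialized by the map itself.
public class AtomicKV {
    private final ConcurrentHashMap<String, Long> store =
        new ConcurrentHashMap<>();

    // Atomically add delta to the value under key (absent counts as 0)
    public long update(String key, long delta) {
        return store.merge(key, delta, Long::sum);
    }

    public Long get(String key) { return store.get(key); }
}
```

In the distributed store the same idea holds, but the atomic step runs on the key's master node so that all racing writers are ordered in one place.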
  • K/V Coherency Protocol
    ● Many are possible
    ● Picked a {fast-enough, easy} one
    ● Faster is possible
    ● Every Key has 1 master node
    ● And everybody knows it from the Key hash
    ● Master orders racing writes
    ● Winner of NBHM insert
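"Everybody knows it from the Key hash" means any node can compute a key's master with no lookup traffic, given only the cluster size. A minimal sketch of that routing rule (names are hypothetical; H2O's actual hash and node-numbering scheme may differ):

```java
// Deterministic key-to-master routing: every node evaluates the same
// function and so agrees on the master without any coordination.
public class KeyHome {
    public static int master(String key, int nNodes) {
        // floorMod keeps the result in [0, nNodes) for any hashCode,
        // including negative ones
        return Math.floorMod(key.hashCode(), nNodes);
    }
}
```

Because the mapping is a pure function of the key, the only cluster-wide agreement required is the node count; the master then serializes racing writes for its keys.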
  • K/V Coherency Protocol
    ● Master tracks replicas
    ● Single CAS update
    ● Invalidate replicas on update
    ● Single CAS required, plus the invalidates
    ● Cache miss on replica will reload
    ● Interlocking get/put races solved with a finite state machine
  • K/V Coherency Protocol
  • Backup Slides
  • The Expert
    ● Domain Expert:
    ● What data is useful, which is trash
    ● What needs help to become useful
    ● Missing elements? Toss outliers?
    ● Build new features from old?
    ● All through this process Big Data is, well, Big, hence Slow to cp / awk / grep
    ● And Big limits my tools