Intro to Data Science with Cascading


Startup Slide meetup in Mountain View, at Outright.com on 2012-10-09
http://www.meetup.com/startupslide/events/85598842/

Please email if you need a PDF version at pnathan AT concurrentinc DOT com

Published in: Technology
  • Transcript of "Intro to Data Science with Cascading"

    1. Intro to Data Science with Cascading
       Paco Nathan, Concurrent, Inc.
       pnathan@concurrentinc.com, @pacoid
       Copyright @2012, Concurrent, Inc.
       [diagram: word count workflow: Document Collection, Scrub, Tokenize, HashJoin (Left; Stop Word List as RHS), Regex, GroupBy token, Count, Word Count]
    2. opportunity: Unstructured Data meets Enterprise Scale
       1. backstory: how we got here
       2. build: data science teams
       3. overview: typical use cases
       4. example: Cascading apps
    3. Intro to Data Science
       1. backstory: how we got here
    4. inflection point
       huge Internet successes after the 1997 holiday season: AMZN, EBAY, Inktomi (YHOO Search), then GOOG
       consider this metric: annual revenue per customer / amount of data stored, which dropped 100x within a few years after 1997
       storage and processing costs plummeted; now we must work much smarter to extract ROI from Big Data, and our methods must adapt
       the “conventional wisdom” of RDBMS and BI tools became less viable, while the business cadre stayed focused on pivot tables and pie charts, which tends toward inertia!
       MapReduce and the Hadoop open source stack grew directly out of that contention; however, that effort only solves parts of the puzzle
       [timeline markers: 1997, 1998, 2004]
    5. inflection point: consequences
       Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm), Hadoop Summit, 2012:
       “All of Fortune 500 is now on notice over the next 10-year period.”
       Amazon and Google as exemplars of massive disruption in retail, advertising, etc.
       data as the major force displacing Global 1000 over the next decade, mostly through apps: verticals, leveraging domain expertise
       Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.), XLDB, 2012:
       “Complex analytics workloads are now displacing SQL as the basis for Enterprise apps.”
    6. primary sources
       Amazon: “Early Amazon: Splitting the website”, Greg Linden
         glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
       eBay: “The eBay Architecture”, Randy Shoup, Dan Pritchett
         addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
         addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
       Inktomi (YHOO Search): “Inktomi’s Wild Ride”, Eric Brewer (0:05:31 ff)
         youtube.com/watch?v=E91oEn1bnXM
       Google: “The Birth of Google”, John Battelle
         wired.com/wired/archive/13.08/battelle.html
       Google: “Underneath the Covers at Google”, Jeff Dean (0:06:54 ff)
         youtube.com/watch?v=qsan-GQaeyk
         perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
    7. the world before… BI, SQL, and highly optimized code
    8. data innovation: circa 1996
       [diagram labels: Stakeholder, Customers; Excel pivot tables, PowerPoint slide decks; strategy; BI Analysts; Product, requirements; SQL Query, result sets; Engineering, optimized code; Web App; transactions; RDBMS]
    9. the world after… machine learning, leveraging log files
    10. data innovation: circa 2001
        [diagram labels: Stakeholder, Product, Customers; dashboards, UX; Engineering; models, servlets; recommenders, classifiers; Algorithmic Modeling; Web Apps; Middleware; aggregation; event history; SQL Query, result sets; customer transactions; Logs; DW; ETL; RDBMS]
    11. the world ahead… what our customers are doing now
    12. data innovation: circa 2013
        [diagram labels: Customers; Data Apps; business process; Domain Expert; Workflow; Prod; dashboard metrics; Web Apps, Mobile, etc.; History; services; data science; s/w dev; Data Scientist; Planner; discovery + modeling; social interactions; optimized capacity; transactions, content; Eng; endpoints; App Dev; Data Access Patterns; Hadoop, etc.; Log Events; In-Memory Data Grid; Ops; DW; batch; "real time"; Cluster Scheduler; RDBMS; legend: introduced capability, existing SDLC]
    13. a key difference…
    14. statistical thinking
        [diagram labels: Process, Variation, Data, Tools]
        employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
        this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
        particularly valuable in Big Data work when combined with hands-on experience in physics: roughly 50% of my peers come from physics or physical engineering…
        programmers typically don’t think this way… however, both systems engineers and data scientists must!
    15. most valuable skills
        approximately 80% of the costs for data-related projects get spent on data preparation, mostly on cleaning up data quality issues: ETL, log file analysis, etc.
        unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean-up
        most valuable skills:
        ‣ learn to use programmable tools that prepare data
        ‣ learn to generate compelling data visualizations
        ‣ learn to estimate the confidence for reported results
        ‣ learn to automate work, making analysis repeatable
        the rest of the skills (modeling, algorithms, etc.) are secondary
        [image credit: D3]
    16. social caveats
        “This data cannot be correct!” may be an early warning about the organization itself
        much depends on how the people whom you work alongside tend to arrive at decisions:
        ‣ probably good: Induction, Abduction, Circumscription
        ‣ probably poor: Deduction, Speculation, Justification
        in general, one good data visualization puts many ongoing verbal arguments to rest
        however, let domain experts handle “data storytelling”, not data scientists
        [image credit: xkcd]
    17. references
        Leo Breiman, “Statistical Modeling: The Two Cultures”, Statistical Science, 2001: bit.ly/eUTh9L
    18. references
        Jack Olson, Data Quality, Morgan Kaufmann, 2003: amazon.com/dp/1558608915
    19. references
        also check out RStudio: rstudio.org/ and rpubs.com/
    20. Intro to Data Science
        2. build: data science teams
    21. core values
        Data Science teams develop actionable insights, building confidence for decisions
        that work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords)
        probably somewhere in-between… solving for pattern, at scale.
        an interdisciplinary pursuit which requires teams, not sole players
    22. team process
        discovery: help people ask the right questions
        modeling: allow automation to place informed bets
        integration: deliver products at scale to customers
        apps: build smarts into product features
        systems: keep infrastructure running, cost-effective
        [image credit: Gephi]
    23. building teams
        [diagram: the functions discovery, modeling, integration, apps, systems mapped across the roles stakeholder, scientist, developer, ops]
    24. references
        DJ Patil, Data Jujitsu, O’Reilly, 2012: amazon.com/dp/B008HMN5BE
        DJ Patil, Building Data Science Teams, O’Reilly, 2011: amazon.com/dp/B005O4U3ZE
    25. Intro to Data Science
        3. overview: typical use cases
    26. using science in data science
        in a nutshell, what we do…
        ‣ estimate probability
        ‣ calculate analytic variance
        ‣ manipulate order complexity
        ‣ make use of learning theory
        ‣ collaborate with DevOps, Stakeholders
        [figure: diagram of application event names]
    27. probability estimation
        “a random variable or stochastic variable is a variable whose value is subject to variations”
        “an estimator is a rule for calculating an estimate of a given quantity based on observed data”
        estimators and probability distributions provide the essential basis for our insights
        bayesian methods, shrinkage… these are our friends
        quantile estimation, empirical CDFs…
        …versus frequentist notions
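The "quantile estimation, empirical CDFs" bullet can be made concrete with a small sketch. This is plain Java written for illustration, not code from the deck: the class name `EmpiricalDistributionSketch` and the nearest-rank quantile rule are assumptions, and other quantile definitions would be equally valid.

```java
import java.util.Arrays;

// Illustrative sketch (hypothetical class, not from the deck): an empirical CDF
// and a nearest-rank quantile estimator over a sample of observations.
final class EmpiricalDistributionSketch {
    private final double[] sorted;

    EmpiricalDistributionSketch(double[] observations) {
        if (observations.length == 0) {
            throw new IllegalArgumentException("need at least one observation");
        }
        sorted = observations.clone();   // keep the caller's array untouched
        Arrays.sort(sorted);
    }

    /** Fraction of observations less than or equal to x. */
    double cdf(double x) {
        int count = 0;
        for (double v : sorted) {
            if (v <= x) {
                count++;
            }
        }
        return (double) count / sorted.length;
    }

    /** Nearest-rank quantile for 0 < p <= 1 (assumed definition). */
    double quantile(double p) {
        if (p <= 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in (0, 1]");
        }
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        EmpiricalDistributionSketch dist =
            new EmpiricalDistributionSketch(new double[] { 3, 1, 4, 1, 5, 9, 2, 6 });
        System.out.println("cdf(4) = " + dist.cdf(4));
        System.out.println("median = " + dist.quantile(0.5));
    }
}
```

Reporting a quantile alongside the CDF keeps the estimate tied to the observed data rather than to an assumed distribution.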
    28. analytic variance
        our tools for automation leverage deep understanding of covariance
        cannot overstate the importance of sampling…
        insist on metrics described as confidence intervals, where valid
        bootstrapping, bagging… these are our friends
        Monte Carlo methods resolve “black box” problems
        point estimates may help prevent “uninformed” decisions
        do not skimp on this part, ever… a hard lesson learned from BI failures
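As an illustration of "metrics described as confidence intervals" via bootstrapping, here is a minimal percentile-bootstrap sketch in plain Java. The class `BootstrapSketch`, the sample values, and the choice of the percentile method are assumptions made for this example; the deck does not prescribe a specific procedure.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch (hypothetical class): percentile bootstrap interval for a mean.
final class BootstrapSketch {

    /** Returns {lower, upper} bounds of a percentile bootstrap interval for the mean. */
    static double[] meanInterval(double[] sample, int resamples, double confidence, long seed) {
        Random rng = new Random(seed);            // fixed seed keeps the analysis repeatable
        double[] means = new double[resamples];
        for (int r = 0; r < resamples; r++) {
            double sum = 0.0;
            for (int i = 0; i < sample.length; i++) {
                sum += sample[rng.nextInt(sample.length)];   // draw with replacement
            }
            means[r] = sum / sample.length;
        }
        Arrays.sort(means);
        double alpha = (1.0 - confidence) / 2.0;
        int lo = (int) Math.floor(alpha * (resamples - 1));
        int hi = (int) Math.ceil((1.0 - alpha) * (resamples - 1));
        return new double[] { means[lo], means[hi] };
    }

    public static void main(String[] args) {
        // made-up order values, for illustration only
        double[] orderValues = { 12.0, 15.5, 9.0, 22.0, 18.5, 11.0, 14.0, 30.0, 16.5, 13.0 };
        double[] ci = meanInterval(orderValues, 2000, 0.95, 42L);
        System.out.printf("95%% bootstrap interval for the mean: [%.2f, %.2f]%n", ci[0], ci[1]);
    }
}
```

The interval, rather than the bare sample mean, is what would go on a dashboard under this approach.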
    29. order complexity
        techniques for manipulating order complexity:
        dimensional reduction… with clustering as a common case; e.g., you may have 100 million HTML docs, but there are only ~10K useful keywords
        low-dimensional structures, PCA
        linear algebra tricks: eigenvalues, matrix decomposition, etc.
        many hard problems resolved by “divide and conquer”
        this is an area ripe for much advancement in algorithms research near-term
    30. learning theory
        in general, apps alternate between learning patterns/rules and retrieving similar things…
        statistical learning theory: rigorous, prevents you from making billion dollar mistakes, probably our future
        machine learning: scalable, enables you to make billion dollar mistakes, much commercial emphasis
        supervised vs. unsupervised
        arguably, optimization is a related area
        once Big Data projects get beyond merely digesting log files, optimization will likely become yet another buzzword :)
    31. use case: marketing funnel
        • must optimize a very large ad spend
        • different vendors report different metrics
        • seasonal variation distorts performance
        • some campaigns are much smaller than others
        • hard to predict ROI for incremental spend
        approach:
        • log aggregation, followed with cohort analysis
        • bayesian point estimates compare different-sized ad tests
        • customer lifetime value quantifies ROI of new leads
        • time series analysis normalizes for seasonal variation
        • geolocation adjusts for regional cost/benefit
        • linear programming models estimate elasticity of demand
        [image credit: Wikipedia]
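One way to read "bayesian point estimates compare different-sized ad tests" is as shrinkage toward a prior conversion rate. The sketch below assumes a Beta prior on the conversion rate and uses its posterior mean; the class `ShrinkageSketch`, the Beta(2, 18) prior, and the campaign counts are all hypothetical values chosen for illustration.

```java
// Illustrative sketch (hypothetical class): Beta-Binomial posterior mean as a
// shrunken conversion-rate estimate, so small and large ad tests are comparable.
final class ShrinkageSketch {

    /** Posterior mean of a conversion rate under an assumed Beta(priorAlpha, priorBeta) prior. */
    static double posteriorMean(long conversions, long trials, double priorAlpha, double priorBeta) {
        if (trials < 0 || conversions < 0 || conversions > trials) {
            throw new IllegalArgumentException("need 0 <= conversions <= trials");
        }
        return (conversions + priorAlpha) / (trials + priorAlpha + priorBeta);
    }

    public static void main(String[] args) {
        // made-up campaigns: the small test has the higher raw rate (0.30 vs 0.25)
        double small = posteriorMean(3, 10, 2.0, 18.0);
        double large = posteriorMean(250, 1000, 2.0, 18.0);
        System.out.printf("small test: %.4f, large test: %.4f%n", small, large);
    }
}
```

With these made-up numbers the small campaign is pulled strongly toward the prior mean of 0.10, while the large campaign barely moves, which is the behavior that makes different-sized tests comparable.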
    32. use case: ecommerce fraud
        • sparse data means lots of missing values
        • “needle in a haystack” lack of training cases
        • answers are available in large-scale batch, results are needed in real-time event processing
        • not just one pattern to detect: many, ever-changing
        approach:
        • random forest (RF) classifiers predict likely fraud
        • subsampled data to re-balance training sets
        • impute missing values based on density functions
        • train on massive log files, run on in-memory grid
        • adjust metrics to minimize customer support costs
        • detect novelty: report anomalies via notifications
        [image credit: stat.berkeley.edu]
    33. use case: customer segmentation
        • many millions of customers, hard to determine which features resonate
        • multi-modal distributions get obscured by the practice of calculating an “average”
        • not much is known about individual customers
        approach:
        • connected components for sessionization, determining uniques from logs
        • estimates for age, gender, income, geo, etc.
        • clustering algorithms to group into market segments
        • social graph infers “unknown” relationships
        • covariance/heat maps visualize segments vs. feature sets
        [image credit: Mathworks]
    34. use case: monetizing content
        • need to suggest relevant content which would otherwise get buried in the back catalog
        • big disconnect between inventory and limited performance ad market
        • enormous amounts of text, hard to categorize
        approach:
        • text analytics glean key phrases from documents
        • hierarchical clustering of character frequencies detects language
        • latent Dirichlet allocation (LDA) reduces dimension to topic models
        • recommenders suggest similar topics to customers
        • collaborative filters connect known users with less known
        [image credit: Digital Humanities]
    35. data+code “political spectrum”
        “Notes from the Mystery Machine Bus” by Steve Yegge, Google: goo.gl/SeRZa
        “conservative”                        | “liberal”
        (mostly) Enterprise                   | (mostly) Start-Up
        risk management                       | customer experiments
        assurance                             | flexibility
        well-defined schema                   | schema follows code
        explicit configuration                | convention
        type-checking compiler                | interpreted scripts
        wants no surprises                    | wants no impediments
        Java, Scala, Clojure, etc.            | PHP, Ruby, Python, etc.
        Cascading, Scalding, Cascalog, etc.   | Hive, Pig, Hadoop Streaming, etc.
    36. a selection of great tools…
        reporting: Graphite, PowerPivot, Pentaho, Jaspersoft, SAS
        visualization: ggplot2, D3, Gephi
        analytics/modeling: R, Weka, Matlab, PMML, GLPK
        text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
        apps: Cascading, Scalding, Cascalog, R markdown, SWF
        scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk
        graph: Gremlin, GraphLab, Neo4J
        column, key/val: Vertica, HBase, Drill, Redis, Dynamo, Membase
        index: Lucene/Solr, ElasticSearch
        relational: usual suspects, MySQL
        imdg: Spark, Storm, Gigaspaces
        hadoop: EMR, HW, MapR, EMC, Azure, Compute
        machine data: Splunk, collectd, Nagios
        durable storage: S3, ASV, GCS, Riak, Couch
    37. Intro to Data Science
        4. example: Cascading apps
    38. the workflow abstraction
        cascading.org/category/impatient/
        [diagram: word count workflow: Document Collection, Scrub, Tokenize, HashJoin (Left; Stop Word List as RHS), Regex, GroupBy token, Count, Word Count]
    39. layers of a workflow
        business process: domain expertise, business trade-offs, market position, operating parameters, etc.
        API language: Scala, Clojure, Python, Ruby, Java, etc. …envision whatever runs in a JVM
        optimize / schedule: major changes in technology now
        physical plan: [diagram: word count workflow]
        compute substrate: Apache Hadoop, in-memory local mode (“assembler” code) …envision GPUs, streaming, etc.
        machine data: Splunk, Nagios, Collectd, New Relic, etc.
    40. audience?
        • Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)
        • Systems Integrator POV: system integration of heterogeneous data sources and compute platforms
        • Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law
        • Data Architect POV: a physical plan for large-scale data flow management
        • Software Architect POV: a pattern language, similar to plumbing or circuit design
        • App Developer POV: API bindings for Scala, Clojure, Python, Ruby, Java
        • Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
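For the Data Scientist point of view, Amdahl's Law bounds the speedup of a workflow when only part of it can run in parallel: with parallel fraction p and n workers, speedup = 1 / ((1 - p) + p / n). A minimal sketch follows; the class `AmdahlSketch` and the example numbers are illustrative assumptions, not figures from the deck.

```java
// Illustrative sketch (hypothetical class): Amdahl's Law for a workflow in which
// a fraction p of the work parallelizes across n workers.
final class AmdahlSketch {

    /** Theoretical speedup = 1 / ((1 - p) + p / n). */
    static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // made-up example: 90% of the flow parallelizes
        for (int n : new int[] { 1, 10, 100, 1000 }) {
            System.out.printf("n=%d speedup=%.2f%n", n, speedup(0.9, n));
        }
    }
}
```

The serial fraction caps the gain: at p = 0.9 the speedup can never exceed 10, however many nodes are added, which is why the shape of the DAG matters.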
    41. 1: copy
        import java.util.Properties;

        import cascading.flow.FlowDef;
        import cascading.flow.hadoop.HadoopFlowConnector;
        import cascading.pipe.Pipe;
        import cascading.property.AppProps;
        import cascading.scheme.hadoop.TextDelimited;
        import cascading.tap.Tap;
        import cascading.tap.hadoop.Hfs;

        public class Main
          {
          public static void main( String[] args )
            {
            String inPath = args[ 0 ];
            String outPath = args[ 1 ];

            Properties props = new Properties();
            AppProps.setApplicationJarClass( props, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

            // create the source tap (tab-delimited text with a header line)
            Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

            // create the sink tap
            Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

            // specify a pipe to connect the taps
            Pipe copyPipe = new Pipe( "copy" );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
             .addSource( copyPipe, inTap )
             .addTailSink( copyPipe, outTap );

            // run the flow
            flowConnector.connect( flowDef ).complete();
            }
          }

        [diagram: Source, M, Sink] 1 mapper, 0 reducers, 10 lines code
    42. wait! ten lines of code for a file copy… seems like a lot.
    43. same JAR, any scale…
        MegaCorp Enterprise IT: PBs of data; 1000+ node private cluster; EVP calls you when app fails; runtime: days+
        Production Cluster: TBs of data; EMR w/ 50 HPC Instances; Ops monitors results; runtime: hours to days
        Staging Cluster: GBs of data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes to hours
        Your Laptop: MBs of data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds to minutes
    44. 2: word count
        [diagram: Document Collection, Tokenize, GroupBy token, Count, Word Count]
        1 mapper, 1 reducer, 18 lines code
    45. 3: wc + scrub
        [diagram: Document Collection, Tokenize, Scrub token, GroupBy token, Count, Word Count]
        1 mapper, 1 reducer, 22+10 lines code
    46. 4: wc + scrub + stop words
        [diagram: Document Collection, Tokenize, Scrub token, Regex token, HashJoin (Left; Stop Word List as RHS), GroupBy token, Count, Word Count]
        1 mapper, 1 reducer, 28+10 lines code
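The logic of the "wc + scrub + stop words" flow can be sketched without a cluster. The following is plain Java, not the Cascading API: the class `WordCountSketch`, the tokenizing regex, the stop word list, and the sample lines are all assumptions for illustration, and the in-memory set lookup merely stands in for the HashJoin against the stop word list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch (hypothetical class): tokenize, scrub, drop stop words, count.
final class WordCountSketch {

    static Map<String, Integer> count(Iterable<String> lines, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String raw : line.split("[ \\[\\]\\(\\),.]")) {        // Tokenize (assumed delimiters)
                String token = raw.trim().toLowerCase(Locale.ROOT);       // Scrub
                if (token.isEmpty() || stopWords.contains(token)) {
                    continue;                                             // stop word filter
                }
                counts.merge(token, 1, Integer::sum);                     // GroupBy token + Count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // made-up input, for illustration only
        List<String> lines = Arrays.asList(
            "A rain shadow is a dry area.",
            "The rain falls (mostly) here.");
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "the"));
        count(lines, stopWords).forEach((token, n) -> System.out.println(token + "\t" + n));
    }
}
```

In the Cascading version each commented step is a separate pipe operation, which is what lets the planner map the same logic onto mappers and reducers.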
    47. 5: tf-idf
        [diagram: the stop-word-filtered token stream from Document Collection feeds three branches: D (Unique doc_id, Insert 1, SumBy doc_id), DF (Unique token, GroupBy token, Count), and TF (GroupBy doc_id, token, Count); CoGroup and HashJoin combine them, ExprFunc computes tf-idf into the TF-IDF sink; a Word Count branch (GroupBy token, Count) remains]
        11 mappers, 9 reducers, 65+10 lines code
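The three quantities in the tf-idf flow (D, DF, TF) and the final expression can be sketched in memory. This is plain Java rather than the Cascading API; the class `TfIdfSketch` is hypothetical, and the weighting `tf * ln(D / (1 + df))` is an assumed variant, since the slide does not spell out the formula.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (hypothetical class): the TF-IDF arithmetic, computed in memory.
final class TfIdfSketch {

    /** docs maps doc_id to its scrubbed tokens; the result maps doc_id to token to score. */
    static Map<String, Map<String, Double>> score(Map<String, List<String>> docs) {
        int d = docs.size();                                      // D: total number of documents
        Map<String, Integer> df = new HashMap<>();                // DF: documents containing a token
        Map<String, Map<String, Integer>> tf = new HashMap<>();   // TF: count per (doc_id, token)

        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            Map<String, Integer> counts = new HashMap<>();
            for (String token : doc.getValue()) {
                counts.merge(token, 1, Integer::sum);
            }
            tf.put(doc.getKey(), counts);
            for (String token : counts.keySet()) {
                df.merge(token, 1, Integer::sum);                 // each doc counts once per token
            }
        }

        Map<String, Map<String, Double>> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : tf.entrySet()) {
            Map<String, Double> row = new HashMap<>();
            for (Map.Entry<String, Integer> term : doc.getValue().entrySet()) {
                // assumed weighting: tf * ln(D / (1 + df)); other variants exist
                double idf = Math.log((double) d / (1.0 + df.get(term.getKey())));
                row.put(term.getKey(), term.getValue() * idf);
            }
            scores.put(doc.getKey(), row);
        }
        return scores;
    }
}
```

Each map here corresponds to one branch of the diagram; the final loop plays the role of the join followed by the expression function.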
    48. 6: tf-idf + tdd
        [diagram: the tf-idf workflow from step 5, with Assert, Checkpoint, and Failure Traps added]
        12 mappers, 9 reducers, 76+14 lines code
    49. City of Palo Alto open data
        github.com/Cascading/CoPA/wiki
        [diagram: CoPA GIS export parsed by Regex tsv parser into tree and road branches; tree branch: Regex filter, Scrub species, HashJoin with Tree Metadata, Geohash, Checkpoint; road branch: Regex parser, filter, HashJoin with Road Metadata, Estimate Albedo, Road Segments, Geohash; CoGroup of trees and roads, Tree Distance filter, GroupBy tree_name into shade; GPS logs, Geohash, CoGroup into reco; park branch: Regex filter into park; Failure Traps]
        ‣ GIS export for parks, roads, trees (unstructured / open data)
        ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
        ‣ curated metadata, used to enrich the dataset
        ‣ could extend via mash-up with many available public data APIs
        Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
        “Find a shady spot on a summer day to walk near downtown and take a call…”
    50. CoPA: log events
    51. CoPA: results
        [plot: Estimated Tree Height (meters); density vs. avg_height, shaded by count]
        ‣ addr: 115 HAWTHORNE AVE
        ‣ lat/lng: 37.446, -122.168
        ‣ geohash: 9q9jh0
        ‣ tree: 413 site 2
        ‣ species: Liquidambar styraciflua
        ‣ avg height 23 m
        ‣ road albedo: 0.12
        ‣ distance: 10 m
        ‣ a short walk from my train stop ✔
    52. drill-down
        blog: cascading.org/
        code/wiki/gists: github.org/Cascading/
        jars: conjars.org/
        list: goo.gl/KQtUL
        DevOps products: concurrentinc.com/