Intro to Data Science for Enterprise Big Data

  1. Intro to Data Science
     Paco Nathan, Concurrent, Inc.
     pnathan@concurrentinc.com, @pacoid
     Copyright ©2012, Concurrent, Inc.
     [diagram: Word Count workflow – Document Collection → Scrub/Tokenize → HashJoin (left: Regex token; RHS: Stop Word List) → GroupBy token → Count → Word Count]
  2. opportunity Unstructured Data meets Enterprise Scale
  3. core values
     Data Science teams develop actionable insights, building confidence for decisions.
     That work may influence a few decisions worth billions (e.g., M&A), or billions of small decisions (e.g., AdWords), or probably somewhere in-between… solving for pattern, at scale.
     NB: projects require teams, not sole players.
  4. Intro to Data Science
     [diagram: Word Count workflow]
     backstory
  5. personal timeline
     [timeline, 1980s–2010s: school (Stanford); research (NASA); enterprise (Bell Labs, Moto, IBM); start-up CTO (BNTI); consult (Symbiot, Adknowledge, ShareThis, IMVU, etc.); lead data teams]
  6. inflection point: demand side
     • huge Internet successes after the 1997 holiday season… AMZN, EBAY, then GOOG, Inktomi (YHOO Search)
     • consider this metric: annual revenue per customer / operational data store size dropped more than 100x within a few years after 1997
     • storage and processing costs plummeted; now we must work much smarter to extract ROI from Big Data… our methods must adapt
     • the “conventional wisdom” of RDBMS and BI tools became less viable; the business cadre stayed focused on pivot tables and pie charts… which tends toward inertia
     • MapReduce and the Hadoop open source stack grew directly out of this context… but that only solves part of the problem
     massive disruption in retail, advertising, etc.: “All of Fortune 500 is now on notice over the next 10-year period.” – Geoffrey Moore, 2012 (Mohr Davidow Ventures)
  7. inflection point: supply side
     [charts; sources: DJ Patil, R-Bloggers]
  8. statistical thinking
     [diagram: Process → Variation → Data → Tools]
     a mode of thinking which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
     this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
     particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering… programmers typically don’t think this way
  9. most valuable skills
     • approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues
     • unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean-up
     • most valuable skills (see the sketch below):
       ‣ learn to use programmable tools that prepare data
       ‣ learn to generate compelling data visualizations (e.g., D3)
       ‣ learn to estimate the confidence for reported results
       ‣ learn to automate work, making analysis repeatable
     the rest of the skills – modeling, algorithms, etc. – are secondary
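To make those four skills concrete in one place, here is a minimal R sketch; the file "orders.csv" and its columns are hypothetical, used purely for illustration. It prepares data programmatically, reports a confidence interval rather than a bare point estimate, and is repeatable as a script.

```r
# minimal sketch, assuming a hypothetical file "orders.csv" with
# columns order_id, amount, region
orders <- read.csv("orders.csv", stringsAsFactors = FALSE)

# data prep: drop rows with missing or non-positive amounts,
# and normalize region labels, instead of fixing values by hand
orders <- orders[!is.na(orders$amount) & orders$amount > 0, ]
orders$region <- tolower(trimws(orders$region))

# estimate confidence, not just a number: 95% CI for mean order amount
ci <- t.test(orders$amount)$conf.int
cat(sprintf("mean amount: %.2f (95%% CI: %.2f to %.2f)\n",
            mean(orders$amount), ci[1], ci[2]))

# visualization: distribution of amounts per region
library(ggplot2)
ggplot(orders, aes(x = region, y = amount)) + geom_boxplot()
```

Because every step lives in code, the clean-up is documented and the analysis reruns unchanged whenever the data refreshes.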
  10. social caveats
      • the phrase “This data cannot be correct!” may be an early warning about the organization itself
      • much depends on how the people whom you work alongside tend to arrive at their decisions:
        ‣ probably good: Induction, Abduction, Circumscription
        ‣ probably poor: Deduction, Speculation, Justification
      in general, one good data visualization can put many ongoing verbal arguments to rest
      [cartoon: xkcd]
  11. reference Statistical Modeling: The Two Cultures by Leo Breiman Statistical Science, 2001 http://bit.ly/eUTh9L
  12. reference Data Quality by Jack Olson Morgan Kaufmann, 2003 http://www.amazon.com/dp/1558608915
  13. reference Building Data Science Teams by DJ Patil O’Reilly, 2011 http://www.amazon.com/dp/B005O4U3ZE
  14. reference Data Jujitsu by DJ Patil O’Reilly, 2012 http://www.amazon.com/dp/B008HMN5BE
  15. reference RStudio download and run it on your laptop http://rstudio.org/
  16. Intro to Data Science
      [diagram: Word Count workflow]
      build: data science teams
  17. process
      • discovery: help people ask the right questions
      • modeling: allow automation to place informed bets
      • integration: deliver products at scale to customers
      • apps: leverage smarts in product features
      • systems: keep infrastructure running, cost-effective
      [graph visualization: Gephi]
  18. matrix = needs × roles
      [matrix: needs (discovery, modeling, integration, apps, systems) across roles (stakeholder, scientist, developer, ops)]
  19. matrix: usage
      [matrix: needs × roles]
      a conceptual tool for managing Data Science teams
      overlay your project requirements (needs) with your team’s strengths (roles); that will show very quickly where to focus
      NB: bring in individuals who cover 2-3 needs, particularly for team leads
  20. matrix: needs
      [matrix: needs × roles]
      one dimension is “needs”: discovery, modeling, integration, apps, systems
      these are the primary phases of leveraging Big Data… stakeholders represent the domain: the key aspect to leverage
      analysts usually drive from discovery toward integration, while the engineers tend to drive from systems toward integration
      NB: effective, hands-on management in Data Science must live in the space of integration, not delegate it
  21. matrix: roles
      [matrix: needs × roles]
      one dimension is “roles”: stakeholder, scientist, developer, ops
      each role leverages different disciplines, opportunities, and risks… there’s great power in pairing people with complementary skills, in team environments where they can recognize each other’s priorities and perspectives
      blurring these roles is wonderful when you find great people capable of doing so, e.g., DevOps… however, when businesses get into trouble, they will tend to “push down” these roles, blurring boundaries in ways which stress teams and limit scalability
  22. matrix: example team
      [matrix: needs × roles, with one example team’s coverage marked]
  23. matrix: example team
      [matrix: needs × roles, with one example team’s coverage marked]
      summary: this team seems heavy on systems, and may need more overlap between modeling and integration, particularly among team leads
  24. typical hand-offs
      [diagram: data flow and hand-offs – vendor data sources and customer interactions feed the production cluster and query hosts; a data warehouse serves BI & dashboards, reporting, and presentations for decision support; analysts analyze and visualize for business stakeholders; modeling produces classifiers, recommenders, and predictive analytics, automated by engineers via internal APIs, crons, etc.; priorities labeled: availability, integrity, discovery, modeling, communications]
  25. data priorities
      • Availability: top priority, providing access to data as needed. Lack of availability causes large hidden costs to a business.
      • Integrity
      • Discovery
      • Modeling
      • Communications
      [diagram: typical hand-offs]
  26. data priorities
      • Availability
      • Integrity: work within Engineering to ensure that customer data, internal metrics, third-party sources, etc., get collected and maintained in ways which are meaningful and consistent for required business use cases.
      • Discovery
      • Modeling
      • Communications
      [diagram: typical hand-offs]
  27. data priorities
      • Availability
      • Integrity
      • Discovery: analyze and visualize data on behalf of business stakeholders. Leverage statistics so that we not only say “What” decisions to take, but can answer “Why?” and “How good are they?”
      • Modeling
      • Communications
      [diagram: typical hand-offs]
  28. data priorities
      • Availability
      • Integrity
      • Discovery
      • Modeling: use business learnings in automated, scalable ways. For example, manage an automated bid system. Principally “algorithmic modeling”, not “data modeling”.
      • Communications
      [diagram: typical hand-offs]
  29. data priorities
      • Availability
      • Integrity
      • Discovery
      • Modeling
      • Communications: work closely with stakeholders so that insights gleaned from data+analysis are understood, and important to the business. The sum of learnings from this ongoing process represents our primary value.
      [diagram: typical hand-offs]
  30. Intro to Data Science
      [diagram: Word Count workflow]
      theory: wrangle the data
  31. CAP theorem
      [diagram: triangle of C (strong consistency), A (high availability), P (partition tolerance), with eventual consistency along the A–P side]
  32. CAP theorem
      “You can have at most two of these properties for any shared-data system… the choice of which feature to discard determines the nature of your system.” – Eric Brewer, 2000 (Inktomi)
      • revenue transactions in ecommerce typically require strong consistency and partition tolerance
      • most analytics jobs for business use cases generally require availability and eventual consistency, but tend not to tolerate highly partitioned data
      • ETL becomes an Achilles heel for “agile”:
        ‣ agile/experiment-driven/scale-out, which leads to…
        ‣ provably-hard-to-detect metadata drift, leading to…
        ‣ high-risk technical debt
  33. interpretation
      • purpose: theoretical limits for data access patterns
      • essence: consistency, availability, partition tolerance
      • best-case scenario: you may pick two… or spend billions struggling to obtain all three at scale (GOOG)
      • translated: cost of doing business
      https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
      http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
  34. data access patterns
      • the world is not made of data warehouses…
      • a handful of common data access patterns prevail
      • learn to recognize these for any given problem
      • typically expressed as trade-offs among:
        ‣ speed & volume (latency and throughput)
        ‣ reads & writes (access and storage)
        ‣ consistency / availability / partition tolerance
      as for roles on teams, some mixing is valuable; OTOH, too much blurring of boundaries causes stress
  35. data access patterns
      • design patterns: originated in consensus negotiation for architecture, later used in software engineering
      • consider the corollaries in large-scale data work…
      • essential advice: select data frameworks based on your data access patterns
      • in other words, decouple use cases based on needs – to avoid “one size fits all” blockers
      • let’s review some examples…
  36. access → frameworks → forfeits
      access pattern                       framework                        CAP (x = forfeited)
      financial transactions               general ledger in RDBMS          CAx
      ad-hoc queries                       RDS (hosted MySQL)               CAx
      reporting, dashboards                like Pentaho                     CAx
      log rotation/persistence             like Riak                        xxP
      search indexes                       like Lucene/Solr                 xAP
      static content, archives             S3 (durable storage)             xAP
      customer facts                       like Redis, Membase              xAP
      distributed counters, locks, sets    like Redis                       xAP*
      data objects CRUD                    key/value, like NoSQL on MySQL   CxP
      authoritative metadata               like Zookeeper                   CxP
      data prep, modeling at scale         like Hadoop/Cascading + R        CxP
      graph analysis                       like Hadoop + Redis + Gephi      CxP
      data marts                           like Hadoop/HBase                CxP
  38. Amdahl’s law
      [chart: speedup vs. number of processors, for varying parallel fractions; source: Wikipedia]
  39. interpretation
      • purpose: theoretical limits for scalable computation
      • essence: task overhead and data independence define the limits of parallelism for any given problem; however, these also suggest how well a problem can be scaled out
      • translated: return on investment
      http://en.wikipedia.org/wiki/Amdahl's_law
      http://www.bu.edu/tech/research/training/tutorials/matlab-pct/scalability/
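To make that limit concrete: Amdahl’s law bounds speedup at S(n) = 1 / ((1 - p) + p/n) when a fraction p of the work parallelizes across n workers. A small R sketch, using illustrative fractions that are not from the slides:

```r
# Amdahl's law: maximum speedup when fraction p of a job parallelizes
# across n workers
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

# illustrative values (not from the slides): even at p = 0.95,
# the serial 5% caps the speedup at 1 / 0.05 = 20x, regardless of n
n <- c(1, 2, 4, 8, 16, 64, 256, 1024)
round(rbind("p = 0.50" = amdahl(0.50, n),
            "p = 0.95" = amdahl(0.95, n)), 2)
```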
  40. parallel computation
      • parallelism allows for horizontal scale-out, which creates business “levers” in cost/performance at scale
      • NB: MapReduce provides a compute framework which is part-parallel and part-serial… that tends to complicate app development
      • most hard problems in industry have portions which do not allow data independence, or which require iteration
      • current efforts in massively parallel algorithms research may help to parallelize problems and reduce iteration – estimates are 3-5 years out for industry use
      GPUs and other hardware architecture advancements will likely make Hadoop unrecognizable 3-5 years out
  41. Intro to Data Science
      [diagram: Word Count workflow]
      theory: manage the science
  42. the science in data science
      • Estimate Probability!
      • Calculate Analytic Variance!!
      • Apply Learning Theory!!!
      • Manipulate Order Complexity!!!!
      [background art: a wall of mirrored event-log strings, e.g., “NUI:DressUpMode”, “Unique Registration”]
  43. probability estimation
      “a random variable or stochastic variable is a variable whose value is subject to variations”
      “an estimator is a rule for calculating an estimate of a given quantity based on observed data”
      estimators and probability distributions provide the essential basis for our insights
      bayesian methods, shrinkage… these are our friends
      quantile estimation, empirical CDFs… versus frequentist notions
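As a minimal sketch of the quantile-estimation and empirical-CDF side of this, in R; the log-normal sample here is simulated, standing in for real observations:

```r
# simulated observations stand in for real data; revenue-like skew
set.seed(42)
x <- rlnorm(1000, meanlog = 3, sdlog = 0.8)

# empirical CDF: a nonparametric estimator of the distribution itself
Fhat <- ecdf(x)
Fhat(50)                                   # estimated P(X <= 50)

# quantile estimation: report the shape, not just a single mean
quantile(x, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
plot(Fhat, main = "empirical CDF of simulated data")
```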
  44. analytic variance
      our tools for automation leverage deep understanding of covariance
      cannot overstate the importance of sampling… insist on metrics described as confidence intervals, where valid
      bootstrapping, bagging… these are our friends
      Monte Carlo methods resolve “black box” problems
      point estimates may help prevent “uninformed” decisions
      do not skimp on this part, ever… a hard lesson learned from BI failures
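A hedged sketch of the bootstrap in base R, turning a point estimate into a confidence interval; the data is simulated for illustration:

```r
# bootstrap sketch: resample the observed data to estimate the
# sampling variability of a statistic
set.seed(7)
x <- rlnorm(500, meanlog = 3, sdlog = 0.8)   # simulated, skewed metric

boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

# 95% percentile bootstrap CI around the point estimate
ci <- quantile(boot_means, c(0.025, 0.975))
cat(sprintf("mean = %.2f, 95%% bootstrap CI: %.2f to %.2f\n",
            mean(x), ci[1], ci[2]))
```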
  45. learning theory
      in general, apps alternate between learning patterns/rules and retrieving similar things…
      statistical learning theory – rigorous, prevents you from making billion-dollar mistakes, probably our future
      machine learning – scalable, enables you to make billion-dollar mistakes, much commercial emphasis
      supervised vs. unsupervised
      arguably, optimization is a related area; once Big Data projects get beyond merely digesting log files, optimization will likely become yet another buzzword :)
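To ground the supervised-vs.-unsupervised distinction, a small R sketch on the built-in iris data: one method learns a rule from labels, the other finds structure without them.

```r
# supervised: learn a rule from labeled examples (logistic regression)
data(iris)
fit <- glm(I(Species == "versicolor") ~ Sepal.Length + Sepal.Width,
           data = iris, family = binomial)

# unsupervised: find structure with no labels at all (k-means)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# compare the discovered clusters against the held-back labels
table(cluster = km$cluster, species = iris$Species)
```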
  46. order complexity
      techniques for manipulating order complexity:
      • dimensional reduction… with clustering as a common case – e.g., you may have 100 million HTML docs, but there are only ~10K useful keywords
      • low-dimensional structures, PCA
      • linear algebra tricks: eigenvalues, matrix decomposition, etc.
      • many hard problems resolved by “divide and conquer”
      this is an area ripe for much advancement in algorithms research near-term
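A brief R sketch of dimensional reduction via PCA, again on the built-in iris measurements: four correlated columns collapse to two components that carry most of the variance, and clustering then runs in the smaller space.

```r
# PCA sketch: reduce 4 correlated measurements to 2 principal components
data(iris)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)         # the first two components explain most variance

# project into the low-dimensional space, then cluster there
proj <- pca$x[, 1:2]
km <- kmeans(proj, centers = 3, nstart = 20)
plot(proj, col = km$cluster, pch = 19, main = "iris in PCA space")
```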
  47. Intro to Data Science
      [diagram: Word Count workflow]
      praxis
  48. some great tools…
      reporting: PowerPivot, Pentaho, Jaspersoft, SAS
      visualization: ggplot2, D3, Gephi
      analytics/modeling: R, Weka, Matlab, PMML, GLPK
      text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
      apps: Cascading, Scalding, Cascalog, R markdown, SWF
      scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk
      graph: Gremlin, GraphLab, Neo4J
      column: Vertica, HBase, Drill
      key/val: Redis, Dynamo, Membase
      index: Lucene/Solr, ElasticSearch
      relational: usual suspects, MySQL
      stream/iter: Storm, Spark
      hadoop: EMR, HW, MapR, EMC, Azure, Compute
      durable storage: ASV, S3, Riak, Couch
  49. a sample of great algorithms…
      time series analysis, seasonal variation, ARIMA, hidden markov models, geospatial, kriging, k-d trees, bayesian point estimates, funnel optimization, topics, lang id, anti-fraud, regression, linear programming, cosine similarity, LDA, TextRank, LID, TF-IDF, random forest, GLM/GAM, elasticity of demand, recommender, key phrase, doc similarity, classifier, differential equations, k-medoids, PCA, LSH, k-means||, probabilistic hashing, customer lifetime value, market segmentation, dimensional reduction, customer experiments, connected components, markov random walk, association rules, multi-arm bandit, sessionization, social graph, what if?, sample variance, affiliation networks, MCMC, bootstrapping
  50. Intro to Data Science
      Paco Nathan, Concurrent, Inc.
      pnathan@concurrentinc.com, @pacoid
      Copyright ©2012, Concurrent, Inc.
      [diagram: Word Count workflow]

Editor's Notes

  (the same note accompanies every slide) responsible for net lift, or we work on something else