Intro to Data Science for Enterprise Big Data
 

If you need a different format (PDF, PPT) instead of Keynote, please email me: pnathan AT concurrentinc DOT com

An overview of Data Science for Enterprise Big Data. In other words, how to combine structured and unstructured data, leveraging the tools of automation and mathematics, for highly scalable businesses. We discuss management strategy for building Data Science teams, basic requirements of the "science" in Data Science, and typical data access patterns for working with Big Data. We review some great algorithms, tools, and truisms for building a Data Science practice, plus some great references for further study.

Presented initially at the Enterprise Big Data meetup at Tata Consultancy Services, Santa Clara, 2012-08-20 http://www.meetup.com/Enterprise-Big-Data/events/77635202/


Usage Rights

CC Attribution-ShareAlike License
Comments

  • That's an interesting metric - ARPU/OperationalDataStoreSize
  • This is brilliant! Thank you!
  • @davekincaid we didn't get a recording last night, but we're looking to cover this material again and will! we'd also like to wrap parts of it as webcasts
  • Amazing stuff! By any chance is there a recording of your actual presentation? The narration that goes along with these would be fantastic!
  • responsible for net lift, or we work on something else

Presentation Transcript

  • Intro to Data Science. Paco Nathan, Concurrent, Inc. pnathan@concurrentinc.com @pacoid. Copyright @2012, Concurrent, Inc. (title slide; background shows a Word Count workflow: Document Collection → Scrub → Tokenize → HashJoin Left/RHS with Stop Word List, Regex → GroupBy → Count)
  • opportunity Unstructured Data meets Enterprise Scale
  • core values Data Science teams develop actionable insights, building confidence for decisions that work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords) probably somewhere in-between… solving for pattern, at scale. NB: projects require teams, not sole players
  • Intro to Data Science: backstory (section divider; Word Count workflow diagram)
  • personal timeline (figure spanning the 1980s through 2010s: school at Stanford; research at Bell Labs, IBM, NASA; enterprise at Moto; start-up CTO at BNTI; consulting; leading data teams at Symbiot, Adknowledge, ShareThis, IMVU, etc.)
  • inflection point: demand side. Huge Internet successes after the 1997 holiday season: AMZN, EBAY, then GOOG, Inktomi (YHOO Search). Consider this metric: annual revenue per customer / operational data store size dropped more than 100x within a few years after 1997. Storage and processing costs plummeted; now we must work much smarter to extract ROI from Big Data, and our methods must adapt. The “conventional wisdom” of RDBMS and BI tools became less viable; the business cadre still focused on pivot tables and pie charts, which tends toward inertia. MapReduce and the Hadoop open source stack grew directly out of this context, but that only solves parts. Massive disruption in retail, advertising, etc.: “All of Fortune 500 is now on notice over the next 10-year period.” – Geoffrey Moore, 2012 (Mohr Davidow Ventures)
  • inflection point: supply side (sources: DJ Patil; R-Bloggers)
  • statistical thinking (Process, Variation, Data, Tools): a mode of thinking which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables. This approach attempts to understand not just problems and solutions, but also the processes involved and their variances. Particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering… programmers typically don’t think this way
  • most valuable skills • approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues • unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up • most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making analysis repeatable. The rest of the skills – modeling, algorithms, etc. – are secondary
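The "programmable tools that prepare data" point can be sketched in a few lines. A minimal, hypothetical example (the field names and cleanup rules are invented for illustration, not from the talk): scripting the scrub step makes the cleanup itself repeatable, rather than a one-off manual fix.

```python
import csv
import io
import re

# Hypothetical raw extract: padded ids, currency strings, missing values.
RAW = """customer_id,revenue
 001 ,$1200
002,N/A
003,$450.50
"""

def clean_row(row):
    """Normalize ids, parse currency, flag unparseable values as None."""
    cid = row["customer_id"].strip().lstrip("0") or "0"
    rev = row["revenue"].strip()
    m = re.match(r"^\$?(\d+(?:\.\d+)?)$", rev)
    return {"customer_id": cid,
            "revenue": float(m.group(1)) if m else None}

rows = [clean_row(r) for r in csv.DictReader(io.StringIO(RAW))]
valid = [r for r in rows if r["revenue"] is not None]
print(len(valid), sum(r["revenue"] for r in valid))
```

Because the rules live in code, the same cleanup reruns identically on the next data drop, and the rejected rows remain countable, which feeds the "estimate the confidence for reported results" skill.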
  • social caveats • the phrase “This data cannot be correct!” may be an early warning about the organization itself • much depends on how the people whom you work alongside tend to arrive at their decisions: ‣ probably good: Induction, Abduction, Circumscription ‣ probably poor: Deduction, Speculation, Justification in general, one good data visualization can put many ongoing verbal arguments to rest xkcd
  • reference Statistical Modeling: The Two Cultures by Leo Breiman Statistical Science, 2001 http://bit.ly/eUTh9L
  • reference Data Quality by Jack Olson Morgan Kaufmann, 2003 http://www.amazon.com/dp/1558608915
  • reference Building Data Science Teams by DJ Patil O’Reilly, 2011 http://www.amazon.com/dp/B005O4U3ZE
  • reference Data Jujitsu by DJ Patil O’Reilly, 2012 http://www.amazon.com/dp/B008HMN5BE
  • reference RStudio download and run it on your laptop http://rstudio.org/
  • Intro to Data Science: build: data science teams (section divider; Word Count workflow diagram)
  • process: discovery – help people ask the right questions; modeling – allow automation to place informed bets; integration – deliver products at scale to customers; apps – leverage smarts in product features; systems – keep infrastructure running, cost-effective (figure includes a Gephi graph)
  • matrix = needs × roles (grid: columns are needs – discovery, modeling, integration, apps, systems; rows are roles – stakeholder, scientist, developer, ops)
  • matrix: usage. A conceptual tool for managing Data Science teams: overlay your project requirements (needs) with your team’s strengths (roles); that will show very quickly where to focus. NB: bring in individuals who cover 2-3 needs, particularly for team leads.
  • matrix: needs. One dimension is “needs”: discovery, modeling, integration, apps, systems. These are the primary phases of leveraging Big Data. Stakeholders represent the domain: the key aspect to leverage. Analysts usually drive from discovery toward integration, while the engineers tend to drive from systems toward integration. NB: effective, hands-on management in Data Science must live in the space of integration, not delegate it.
  • matrix: roles. The other dimension is “roles”: stakeholder, scientist, developer, ops. Each role leverages different disciplines, opportunities, and risks. There’s great power in pairing people with complementary skills, in team environments where they can recognize each other’s priorities and perspectives. Blurring these roles is wonderful when you find great people capable of doing so, e.g., DevOps; however, when businesses get into trouble, they will tend to “push down” these roles, blurring boundaries in ways which stress teams and limit scalability.
  • matrix: example team (grid showing where one team’s members fall across needs × roles)
  • matrix: example team. Summary: this team seems heavy on systems, and may need more overlap between modeling and integration, particularly among team leads.
  • typical hand-offs (diagram: vendor data sources and customer interactions feed a production cluster, data warehouse, and query hosts; outputs include BI & dashboards, reporting, presentations, decision support, predictive analytics, classifiers, and recommenders; engineers and analysts hand off across availability, integrity, discovery, modeling, and communications, via internal APIs, crons, etc.)
  • data priorities: Availability. Top priority: providing access to data as needed. Lack of availability causes large hidden costs to a business.
  • data priorities: Integrity. Work within Engineering to ensure that customer data, internal metrics, third-party sources, etc., get collected and maintained in ways which are meaningful and consistent for required business use cases.
  • data priorities: Discovery. Analyze and visualize data on behalf of business stakeholders. Leverage statistics so that we not only say “What” decisions to take, but can answer “Why?” and “How good are they?”
  • data priorities: Modeling. Use business learnings in automated, scalable ways. For example, manage an automated bid system. Principally “algorithmic modeling”, not “data modeling”.
  • data priorities: Communications. Work closely with stakeholders so that insights gleaned from data + analysis are understood, and important to the business. The sum of learnings from this ongoing process represents our primary value.
  • Intro to Data Science: theory: wrangle the data (section divider; Word Count workflow diagram)
  • CAP theorem (diagram: triangle of strong consistency (C), high availability (A), and partition tolerance (P), with eventual consistency along the A–P edge)
  • CAP theorem “You can have at most two of these properties for any shared-data system… the choice of which feature to discard determines the nature of your system.” – Eric Brewer, 2000 (Inktomi) • revenue transactions in ecommerce typically require strong consistency and partition tolerance • most analytics jobs for business use cases generally require availability and eventual consistency, but tend to not tolerate highly partitioned data • ETL becomes an Achilles’ heel for “agile”: ‣ agile/experiment-driven/scale-out, which leads to… ‣ provably-hard-to-detect metadata drift, leading to… ‣ high-risk technical debt
  • interpretation • purpose: theoretical limits for data access patterns • essence: ‣ consistency ‣ availability ‣ partition tolerance • best case scenario: you may pick two … or spend billions struggling to obtain all three at scale (GOOG) • translated: cost of doing business https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
  • data access patterns • the world is not made of data warehouses… • a handful of common data access patterns prevail • learn to recognize these for any given problem • typically expressed as trade-offs among: ‣ speed & volume (latency and throughput) ‣ reads & writes (access and storage) ‣ consistency / availability / partition tolerance as for roles on teams, some mixing is valuable; OTOH, too much blurring of boundaries causes stress
  • data access patterns • design patterns: originated in consensus negotiation for architecture, later used in software engineering • consider the corollaries in large-scale data work… • essential advice: select data frameworks based on your data access patterns • in other words, decouple use cases based on needs – to avoid “one size fits all” blockers • let’s review some examples…
  • access → frameworks → forfeits
    financial transactions: general ledger in RDBMS – CAx
    ad-hoc queries: RDS (hosted MySQL) – CAx
    reporting, dashboards: like Pentaho – CAx
    log rotation/persistence: like Riak – xxP
    search indexes: like Lucene/Solr – xAP
    static content, archives: S3 (durable storage) – xAP
    customer facts: like Redis, Membase – xAP
    distributed counters, locks, sets: like Redis – xAP*
    data objects CRUD: key/value – like, NoSQL on MySQL – CxP
    authoritative metadata: like Zookeeper – CxP
    data prep, modeling at scale: like Hadoop/Cascading + R – CxP
    graph analysis: like Hadoop + Redis + Gephi – CxP
    data marts: like Hadoop/HBase – CxP
  • Amdahl’s law (figure source: Wikipedia)
  • interpretation • purpose: theoretical limits for scalable computation • essence: task overhead and data independence define limits of parallelism for any given problem; however, these also suggest how well a problem can be scaled-out • translated: return on investment http://en.wikipedia.org/wiki/Amdahls_law http://www.bu.edu/tech/research/training/tutorials/matlab-pct/scalability/
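The "theoretical limits" framing is easy to make concrete: with parallel fraction p and n workers, Amdahl's law gives speedup S(n) = 1 / ((1 - p) + p / n), which is capped at 1 / (1 - p) no matter how many machines you buy. A small sketch (the 95%-parallel figure is an illustrative assumption, not from the talk):

```python
def speedup(p, n):
    """Amdahl's law: speedup for parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

# A job that is 95% parallelizable: 8 workers give ~5.9x,
# and even unlimited workers cannot exceed 1 / (1 - 0.95) = 20x.
print(round(speedup(0.95, 8), 2), round(1 / (1 - 0.95), 1))
```

This is the "return on investment" translation: the serial fraction, not the cluster size, sets the ceiling on scale-out.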
  • parallel computation • parallelism allows for horizontal scale-out, which creates business “levers” in cost/performance at scale • NB: MapReduce provides a compute framework which is part-parallel and part-serial… that tends to complicate app development • most hard problems in industry have portions which do not allow data independence, or which require iteration • current efforts in massively parallel algorithms research may help to parallelize problems and reduce iteration – estimates are 3-5 years out for industry use. GPUs and other hardware architecture advancements will likely make Hadoop unrecognizable 3-5 years out
  • Intro to Data Science: theory: manage the science (section divider; Word Count workflow diagram)
  • the science in data science • Estimate Probability! • Calculate Analytic Variance!! • Apply Learning Theory!!! • Manipulate Order Complexity!!!! (background: a mirrored cloud of raw event-log labels, e.g., NUI:DressUpMode, Website Login, Feed Pet, Buy an Item)
  • probability estimation “a random variable or stochastic variable is a variable whose value is subject to variations” “an estimator is a rule for calculating an estimate of a given quantity based on observed data” estimators and probability distributions provide the essential basis for our insights bayesian methods, shrinkage… these are our friends quantile estimation, empirical CDFs… …versus frequentist notions
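The estimator idea above can be made concrete with the standard library alone. A minimal sketch (the Gaussian sample is synthetic, for illustration): an empirical CDF and quantile estimates stand in for the unknown distribution, exactly the "rule for calculating an estimate based on observed data" the slide describes.

```python
import random
import statistics

random.seed(42)
# Synthetic observations from an unknown-to-us process.
sample = [random.gauss(100.0, 15.0) for _ in range(10_000)]

def ecdf(data, x):
    """Empirical CDF: fraction of observations <= x."""
    return sum(1 for v in data if v <= x) / len(data)

mean_hat = statistics.fmean(sample)       # estimator of the mean
pct = statistics.quantiles(sample, n=100) # empirical percentile estimates
print(round(mean_hat, 1), round(ecdf(sample, mean_hat), 2))
```

For a roughly symmetric sample the ECDF evaluated at the estimated mean lands near 0.5, and the 50th percentile estimate lands near the mean, which is the kind of internal consistency check these estimators invite.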
  • analytic variance our tools for automation leverage deep understanding of covariance cannot overstate the importance of sampling… insist on metrics described as confidence intervals, where valid bootstrapping, bagging… these are our friends Monte Carlo methods resolve “black box” problems point estimates may help prevent “uninformed” decisions do not skimp on this part, ever… a hard lesson learned from BI failures
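The bootstrapping the slide recommends fits in a few lines: resample the data with replacement, recompute the statistic each time, and report a percentile interval instead of a bare point estimate. A minimal sketch (the skewed synthetic metric is an invented stand-in for real business data):

```python
import random
import statistics

random.seed(7)
# A skewed metric, e.g., something revenue-like; mean around 50.
data = [random.expovariate(1 / 50.0) for _ in range(500)]

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for stat(data)."""
    reps = sorted(
        stat(random.choices(data, k=len(data))) for _ in range(n_boot)
    )
    return (reps[int(n_boot * alpha / 2)],
            reps[int(n_boot * (1 - alpha / 2)) - 1])

lo, hi = bootstrap_ci(data, statistics.fmean)
print(round(lo, 1), round(statistics.fmean(data), 1), round(hi, 1))
```

Reporting the (lo, hi) pair rather than the point estimate is precisely the "metrics described as confidence intervals" discipline the slide insists on.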
  • learning theory in general, apps alternate between learning patterns/rules and retrieving similar things… statistical learning theory – rigorous, prevents you from making billion dollar mistakes, probably our future machine learning – scalable, enables you to make billion dollar mistakes, much commercial emphasis supervised vs. unsupervised arguably, optimization is a related area once Big Data projects get beyond merely digesting log files, optimization will likely become yet another buzzword :)
  • order complexity techniques for manipulating order complexity: dimensional reduction… with clustering as a common case e.g., you may have 100 million HTML docs, but there are only ~10K useful keywords low-dimensional structures, PCA linear algebra tricks: eigenvalues, matrix decomposition, etc. many hard problems resolved by “divide and conquer” this is an area ripe for much advancement in algorithms research near-term
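The "linear algebra tricks" line can be illustrated in two dimensions without any library: for a symmetric 2×2 covariance matrix [[a, b], [b, c]] the eigenvalues have a closed form, and the top eigenvalue's share of total variance shows how much a single direction explains. A minimal PCA-flavored sketch (the nearly one-dimensional synthetic cloud is an invented example):

```python
import math
import random

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(2000)]
ys = [x * 2.0 + random.gauss(0, 0.1) for x in xs]  # nearly 1-D data

def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

a, b, c = cov(xs, xs), cov(xs, ys), cov(ys, ys)
# Eigenvalues of [[a, b], [b, c]]: (a + c)/2 +/- sqrt(((a - c)/2)^2 + b^2)
mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
lam1, lam2 = mid + rad, mid - rad
explained = lam1 / (lam1 + lam2)
print(round(explained, 3))  # near 1.0 for this almost 1-D cloud
```

When one eigenvalue dominates like this, the 2-D data can be replaced by its projection onto a single component with little loss, which is dimensional reduction in miniature.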
  • Intro to Data Science: praxis (section divider; Word Count workflow diagram)
  • some great tools…
    reporting: PowerPivot, Pentaho, Jaspersoft, SAS
    visualization: ggplot2, D3, Gephi
    analytics/modeling: R, Weka, Matlab, PMML, GLPK
    text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
    apps: Cascading, Scalding, Cascalog, R markdown, SWF
    scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk
    graph: Gremlin, GraphLab, Neo4J
    column: Vertica, HBase, Drill
    key/val: Redis, Dynamo, Membase
    index: Lucene/Solr, ElasticSearch
    relational: usual suspects, MySQL
    stream/iter: Storm, Spark
    hadoop: EMR, HW, MapR, EMC, Azure, Compute
    durable storage: ASV, S3, Riak, Couch
  • a sample of great algorithms… time series analysis seasonal variation geospatial hidden markov models ARIMA bayesian point estimates kriging k-d trees funnel optimization topics lang id anti-fraud regression linear programming cosine similarity LDA TextRank LID TF-IDF random forest GLM/GAM elasticity of demand recommender key phrase doc similarity classifier differential equations k-medoids PCA LSH k-means|| probabilistic hashing customer lifetime value market segmentation dimensional reduction customer experiments connected components markov random walk association rules multi-arm bandit sessionization social graph what if ? sample variance affiliation networks MCMC bootstrapping
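Two of the algorithms sampled above, TF-IDF and cosine similarity, compose into a tiny document-similarity pipeline using only the standard library. A sketch on an invented toy corpus (the documents are made up for illustration):

```python
import math
from collections import Counter

docs = ["big data teams build data products",
        "data science teams model big data",
        "cats nap in warm sunbeams"]

tokenized = [d.split() for d in docs]
n = len(tokenized)
# Document frequency: how many docs contain each term.
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc):
    """Term frequency weighted by inverse document frequency."""
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(d) for d in tokenized]
print(round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))
```

The two data-themed documents score well above the unrelated one, which is the doc-similarity building block behind the recommender and key-phrase entries in the list.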
  • Intro to Data Science. Paco Nathan, Concurrent, Inc. pnathan@concurrentinc.com @pacoid. Copyright @2012, Concurrent, Inc. (closing slide; repeats the Word Count workflow diagram)