Intro to Data Science
Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

[figure: word count flow diagram. Elements: Document Collection; Tokenize; Scrub token; Stop Word List (RHS); HashJoin Left; Regex token; GroupBy token; Count; Word Count. Phase markers: M, R]

Copyright @2012, Concurrent, Inc.
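
The flow diagram on the title slide reads as a word count pipeline: tokenize a document collection, scrub each token, drop stop words by joining against a stop word list, filter tokens with a regex, then group by token and count. A minimal Python sketch of those steps follows. It is not the deck's own implementation; the sample documents, the stop word set, the token pattern, and all function names are hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Assumed token shape: runs of lowercase letters (the deck does not specify one).
TOKEN_PATTERN = re.compile(r"[a-z]+")


def tokenize(doc):
    """Split a document on whitespace and basic punctuation."""
    return re.split(r'[\s.,;:!?()\[\]"]+', doc)


def scrub(token):
    """Normalize a raw token: trim whitespace and lowercase it."""
    return token.strip().lower()


def word_count(docs, stop_words):
    """Count the tokens that survive the stop word filter and the regex filter."""
    counts = Counter()
    for doc in docs:
        for raw in tokenize(doc):
            token = scrub(raw)
            # The stop word list plays the right-hand side of the join:
            # tokens that match it are dropped.
            if token in stop_words:
                continue
            # Regex filter: discard empty strings and non-word tokens.
            if not TOKEN_PATTERN.fullmatch(token):
                continue
            counts[token] += 1  # group by token, then count
    return counts


# Hypothetical inputs, for illustration only.
docs = [
    "A rain shadow is a dry area.",
    "The rain falls; the shadow stays dry.",
]
stop_words = {"a", "is", "the"}

counts = word_count(docs, stop_words)
for token, n in counts.most_common():
    print(token, n)
```

At scale, the same steps would be distributed across map and reduce phases rather than run in a single loop.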
opportunity




 Unstructured Data
   meets
  Enterprise Scale
core values
  Data Science teams develop actionable insights,
  building confidence for decisions

  that work may influence a few decisions worth
  billions (e.g., M&A) or billions of small decisions
  (e.g., AdWords)

  probably somewhere in-between…
  solving for pattern, at scale.

  NB: projects require
  teams, not sole players
backstory
personal timeline
 (1980s through 2010s, earliest first)

 school            Stanford
 research          IBM, NASA
 enterprise        Bell Labs, Moto
 start-up CTO      BNTI
 consult
 lead data teams   Symbiot, Adknowledge, ShareThis, IMVU, etc.
inflection point: demand side
 • huge Internet successes after 1997 holiday season…
   AMZN, EBAY, then GOOG, Inktomi (YHOO Search)

 • consider this metric:
     annual revenue per customer / operational data store size
   dropped more than 100x within a few years after 1997

 • storage and processing costs plummeted, now we must
   work much smarter to extract ROI from Big Data…
   our methods must adapt

 • “conventional wisdom” of RDBMS and BI tools became
   less viable; business cadre still focused on pivot tables
   and pie charts… which tends toward inertia

 • MapReduce and the Hadoop open source stack grew
   directly out of this context… but that only solves parts
   of the problem

 massive disruption in retail, advertising, etc.,
 “All of Fortune 500 is now on notice over the next 10-year
 period.” – Geoffrey Moore, 2012 (Mohr Davidow Ventures)

 [figure labels: 1997, 1998, 2004]
inflection point: supply side




  [two figures; sources: DJ Patil, R-Bloggers]
statistical thinking

         Process          Variation         Data         Tools




  a mode of thinking which includes both logical and analytical
  reasoning: evaluating the whole of a problem, as well as its
  component parts; attempting to assess the effects of changing
  one or more variables

  this approach attempts to understand not just problems and
  solutions, but also the processes involved and their variances
  particularly valuable in Big Data work when combined with hands-on
  experience in physics – roughly 50% of my peers come from physics
  or physical engineering… programmers typically don’t think this way
most valuable skills
 • approximately 80% of the costs for data-related projects
   get spent on data preparation – mostly on cleaning up
   data quality issues

 • unfortunately, data-related budgets for many companies tend
   to go into frameworks which can only be used after clean up

 • most valuable skills:
   ‣ learn to use programmable tools that prepare data

   ‣ learn to generate compelling data visualizations

   ‣ learn to estimate the confidence for reported results

   ‣ learn to automate work, making analysis repeatable
 [figure credit: D3]
 the rest of the skills – modeling,
 algorithms, etc. – those are secondary
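
One common way to "estimate the confidence for reported results" is a percentile bootstrap: resample the observed data with replacement, recompute the statistic each time, and report the range that holds most of the resampled estimates. The sketch below uses only the Python standard library. The sample values, the number of resamples, the seed, and the function name `bootstrap_ci` are illustrative assumptions, not taken from the deck.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for `stat` over `data`.

    Returns (low, high) bounds covering roughly 1 - alpha of the
    resampled estimates.
    """
    rng = random.Random(seed)  # fixed seed so the analysis is repeatable
    n = len(data)
    estimates = sorted(
        stat([rng.choice(data) for _ in range(n)])  # resample with replacement
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical measurements, for illustration only.
sample = [12.1, 9.8, 11.4, 10.3, 13.7, 9.1, 10.9, 12.6, 11.0, 10.2]

point = statistics.mean(sample)
low, high = bootstrap_ci(sample)
print(f"mean = {point:.2f}, approx. 95% interval = [{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside the point estimate is what lets a stakeholder judge how much weight a number can bear.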
social caveats
 • the phrase “This data cannot be correct!” may be an
  early warning about the organization itself

 • much depends on how the people whom you work
  alongside tend to arrive at their decisions:
     ‣ probably good: Induction, Abduction, Circumscription
     ‣ probably poor: Deduction, Speculation, Justification


  in general, one good
  data visualization
  can put many ongoing
  verbal arguments
  to rest
  [figure credit: xkcd]
reference

 Statistical Modeling:
 The Two Cultures
 by Leo Breiman
 Statistical Science, 2001


 http://bit.ly/eUTh9L
reference

 Data Quality
 by Jack Olson
 Morgan Kaufmann, 2003



 http://www.amazon.com/dp/1558608915
reference

 Building Data Science Teams
 by DJ Patil
 O’Reilly, 2011



 http://www.amazon.com/dp/B005O4U3ZE
reference

 Data Jujitsu
 by DJ Patil
 O’Reilly, 2012



 http://www.amazon.com/dp/B008HMN5BE
reference

 RStudio
 download and
 run it on your laptop



 http://rstudio.org/
build:
data science teams
process

    discovery     help people ask the right questions

    modeling      allow automation to place informed bets

    integration   deliver products at scale to customers

    apps          leverage smarts in product features

    systems       keep infrastructure running, cost-effective

 [figure credit: Gephi]
matrix = needs × roles
 [figure: matrix of needs (columns: discovery, modeling, integration, apps, systems) × roles (rows: stakeholder, scientist, developer, ops)]
matrix: usage
 conceptual tool for managing Data Science teams

 overlay your project requirements (needs)
 with your team’s strengths (roles)

 that will show very quickly where to focus

 NB: bring in individuals who cover 2-3 needs,
 particularly for team leads

 [figure: matrix of needs (columns: discovery, modeling, integration, apps, systems) × roles (rows: stakeholder, scientist, developer, ops)]
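
The overlay of needs and roles can be sketched as a small coverage table: record which needs each team member covers, tally coverage per role and per need, and list the needs nobody covers. Only the need and role names below come from the slides; the team members and what they cover are hypothetical, made up purely for illustration.

```python
NEEDS = ["discovery", "modeling", "integration", "apps", "systems"]
ROLES = ["stakeholder", "scientist", "developer", "ops"]

# Hypothetical team: each member has one role and covers one or more needs.
team = [
    {"name": "A", "role": "stakeholder", "covers": {"discovery"}},
    {"name": "B", "role": "scientist", "covers": {"discovery", "modeling"}},
    {"name": "C", "role": "developer", "covers": {"apps", "systems"}},
    {"name": "D", "role": "ops", "covers": {"systems"}},
]


def coverage_matrix(members):
    """Build a roles × needs grid counting how many members cover each cell."""
    matrix = {role: {need: 0 for need in NEEDS} for role in ROLES}
    for member in members:
        for need in member["covers"]:
            matrix[member["role"]][need] += 1
    return matrix


def coverage_by_need(matrix):
    """Total coverage per need, summed across all roles."""
    return {need: sum(matrix[role][need] for role in ROLES) for need in NEEDS}


def gaps(matrix):
    """Needs that no team member covers."""
    totals = coverage_by_need(matrix)
    return [need for need in NEEDS if totals[need] == 0]


matrix = coverage_matrix(team)
totals = coverage_by_need(matrix)
uncovered = gaps(matrix)

print("coverage by need:", totals)
print("uncovered needs:", uncovered)
```

With this made-up team, integration is the uncovered need, which is the kind of gap the matrix is meant to expose quickly.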
matrix: needs
 one dimension is “needs”:
 discovery, modeling, integration, apps, systems

 these are the primary phases of leveraging Big Data…

 stakeholders represent the domain: the key aspect
 to leverage

 analysts usually drive from discovery toward integration,
 while the engineers tend to drive from systems toward
 integration

 NB: effective, hands-on management in Data Science
 must live in the space of integration, not delegate it

 [figure: matrix of needs (columns: discovery, modeling, integration, apps, systems) × roles (rows: stakeholder, scientist, developer, ops)]
matrix: roles
 one dimension is “roles”:
 stakeholder, scientist, developer, ops

 each role leverages different disciplines, opportunities,
 and risks… there’s great power in pairing people with
 complementary skills, in team environments where they
 can recognize each other’s priorities and perspectives

 blurring these roles is wonderful, when you find great
 people capable of doing so, e.g., DevOps… however,
 when businesses get into trouble, they will tend to
 “push down” these roles, blurring boundaries in
 ways which stress teams and limit scalability

 [figure: matrix of needs (columns: discovery, modeling, integration, apps, systems) × roles (rows: stakeholder, scientist, developer, ops)]
matrix: example team
 [figure: example team plotted on the matrix of needs (columns: discovery, modeling, integration, apps, systems) × roles (rows: stakeholder, scientist, developer, ops)]

 summary: this team seems heavy on systems, may need more overlap
 between modeling and integration, particularly among team leads
typical hand-offs
 [figure: hand-offs diagram
   column headings: integrity, availability, discovery, communications
   elements: vendor data sources; production cluster; customer interactions;
     data warehouse; query hosts; BI & reporting; dashboards; presentations;
     decision support; classifiers; recommenders; predictive analytics;
     analyze, visualize; modeling; internal API, crons, etc.; automation;
     engineers, analysts; business stakeholders; people]
data priorities
  •   Availability
      Top priority, providing access to data as needed.
      Lack of availability causes large hidden costs to a business.

  •   Integrity

  •   Discovery

  •   Modeling

  •   Communications

  [figure: hand-offs diagram, repeated from “typical hand-offs”]
data priorities
 •   Availability

 •   Integrity
     Work within Engineering to ensure that customer data,
     internal metrics, third-party sources, etc., get collected and
     maintained in ways which are meaningful and consistent
     for required business use cases.

 •   Discovery

 •   Modeling

 •   Communications

 [figure: hand-offs diagram, repeated from “typical hand-offs”]
data priorities
 •   Availability

 •   Integrity

 •   Discovery
     Analyze and visualize data on behalf of business stakeholders.
     Leverage statistics so that we not only say “What” decisions to
     take, but can answer “Why?” and “How good are they?”

 •   Modeling

 •   Communications

 [figure: hand-offs diagram, repeated from “typical hand-offs”]
data priorities
 •   Availability

 •   Integrity

 •   Discovery

 •   Modeling
     Use business learnings in automated, scalable ways.
     For example, manage an automated bid system.

     Principally “algorithmic modeling”, not “data modeling”.

 •   Communications

 [figure: hand-offs diagram, repeated from “typical hand-offs”]
data priorities
 •   Availability

 •   Integrity

 •   Discovery

 •   Modeling

 •   Communications
     Work closely with stakeholders so that insights gleaned from
     data+analysis are understood, and important to the business.

     Sum of learnings from this ongoing process represents
     our primary value.

 [figure: hand-offs diagram, repeated from “typical hand-offs”]
theory:
wrangle the data
CAP theorem
 [diagram: C = strong consistency, A = high availability,
  P = partition tolerance; eventual consistency is also labeled]
CAP theorem
 “You can have at most two of these properties for any shared-data
 system… the choice of which feature to discard determines the
 nature of your system.” – Eric Brewer, 2000 (Inktomi)

 • revenue transactions in ecommerce typically require
  strong consistency and partition tolerance

 • most analytics jobs for business use cases generally require
  availability and eventual consistency, but tend to
  not tolerate highly partitioned data

 • ETL becomes an Achilles heel for “agile”:
      ‣ agile/experiment-driven/scale-out, which leads to…
      ‣ provably-hard-to-detect metadata drift, leading to…
      ‣ high-risk technical debt
interpretation
 • purpose: theoretical limits for data access patterns
 • essence:
    ‣ consistency
    ‣ availability
    ‣ partition tolerance




 • best case scenario: you may pick two … or spend
    billions struggling to obtain all three at scale (GOOG)
 • translated: cost of doing business
  https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

  http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
data access patterns
 • the world is not made of data warehouses…
 • a handful of common data access patterns prevail
 • learn to recognize these for any given problem
 • typically expressed as trade-offs among:
     ‣ speed & volume (latency and throughput)

     ‣ reads & writes (access and storage)

     ‣ consistency / availability / partition tolerance


  as for roles on teams, some mixing is valuable;
  OTOH, too much blurring of boundaries causes
  stress
data access patterns
 • design patterns: originated in consensus negotiation
   for architecture, later used in software engineering

 • consider the corollaries in large-scale data work…
 • essential advice:
   select data frameworks based on
   your data access patterns

 • in other words, decouple use cases based on
   needs – to avoid “one size fits all” blockers

 • let’s review some examples…
access → frameworks → forfeits
  financial transactions              general ledger in RDBMS            CAx
  ad-hoc queries                      RDS (hosted MySQL)                 CAx
  reporting, dashboards               like Pentaho                       CAx
  log rotation/persistence            like Riak                          xxP
  search indexes                      like Lucene/Solr                   xAP
  static content, archives            S3 (durable storage)               xAP
  customer facts                      like Redis, Membase                xAP
  distributed counters, locks, sets   like Redis                         xAP*
  data objects CRUD                   key/value – like, NoSQL on MySQL   CxP
  authoritative metadata              like Zookeeper                     CxP
  data prep, modeling at scale        like Hadoop/Cascading + R          CxP
  graph analysis                      like Hadoop + Redis + Gephi        CxP
  data marts                          like Hadoop/HBase                  CxP
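
The table above can be read as a lookup from access pattern to framework and CAP forfeit. A minimal sketch in Python, assuming that an `x` in the three-letter code marks the property given up; the names `ACCESS_PATTERNS`, `CAP_NAMES`, and `forfeited` are hypothetical, and only a few rows are copied in:

```python
# Hypothetical lookup built from a few rows of the table above.
# Assumption: in each code, "x" replaces the CAP property that is forfeited.
ACCESS_PATTERNS = {
    "financial transactions": ("general ledger in RDBMS", "CAx"),
    "search indexes": ("Lucene/Solr", "xAP"),
    "authoritative metadata": ("Zookeeper", "CxP"),
    "data marts": ("Hadoop/HBase", "CxP"),
}

CAP_NAMES = ("consistency", "availability", "partition tolerance")


def forfeited(code):
    """Return the CAP properties that a three-letter code marks with 'x'."""
    return [name for name, letter in zip(CAP_NAMES, code) if letter == "x"]


for pattern, (framework, code) in ACCESS_PATTERNS.items():
    print(f"{pattern}: {framework} forfeits {', '.join(forfeited(code))}")
```

Keeping the mapping as data makes the "decouple use cases based on needs" advice concrete: each use case carries its own trade-off instead of inheriting one from a shared store.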
Amdahl’s law
 [figure; source: Wikipedia]
interpretation
 • purpose: theoretical limits for scalable computation
 • essence:
   task overhead and data independence
   define limits of parallelism for any given problem;
   however, these also suggest how well a problem
   can be scaled-out


 • translated: return on investment
  http://en.wikipedia.org/wiki/Amdahl's_law

  http://www.bu.edu/tech/research/training/tutorials/matlab-pct/scalability/
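
The limit described above can be worked through numerically. A minimal sketch of the standard Amdahl's law formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of work that can run in parallel and n is the number of workers; the function names and the 95% figure are illustrative assumptions, not taken from the slides:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Theoretical speedup when `parallel_fraction` of the work spreads over `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def speedup_limit(parallel_fraction):
    """Upper bound on speedup as the number of workers grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


# Illustrative assumption: 95% of the job is data-independent.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

With 95% parallel work, 10 workers give roughly a 6.9x speedup and no number of workers can exceed about 20x, which is why the serial portion drives the return on investment.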
parallel computation
 • parallelism allows for horizontal scale-out, which
  creates business “levers” in cost/performance at scale

 • NB: MapReduce provides a compute framework which
  is part-parallel and part-serial… that tends to
  complicate app development

 • most hard problems in industry have portions which
  do not allow data independence, or which require
  iteration

 • current efforts in massively parallel algorithms research
  may help to parallelize problems and reduce iteration –
  estimates are 3-5 years out for industry use
 • GPUs and other hardware architecture advancements
  will likely make Hadoop unrecognizable 3-5 years out
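
The part-parallel, part-serial shape of MapReduce can be seen in a word count, loosely mirroring the tokenize, scrub, stop-word, group, and count steps of the deck's title diagram. This is a single-process sketch in plain Python, not the Hadoop or Cascading API; `STOP_WORDS`, `map_tokens`, `reduce_counts`, and `word_count` are hypothetical names:

```python
import re
from collections import Counter
from functools import reduce

STOP_WORDS = {"a", "the", "of", "and", "to"}  # hypothetical stop word list


def map_tokens(doc):
    """Map step: tokenize, scrub, and filter one document.

    Each call depends only on its own document, so calls are data-independent
    and could run on separate workers.
    """
    tokens = re.findall(r"[a-z']+", doc.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts, grouping by token."""
    return left + right


def word_count(docs):
    partials = map(map_tokens, docs)  # the parallelizable portion
    return reduce(reduce_counts, partials, Counter())  # the merge portion


print(word_count(["the cat and the hat", "a cat"]))
```

The map calls scale out with the number of documents, while the merge has to see every partial result, which is where the serial cost shows up.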
theory:
manage the science
the science in data science

 • Estimate Probability!

 • Calculate Analytic Variance!!

 • Apply Learning Theory!!!

 • Manipulate Order Complexity!!!!

 [figure: chart labeled with user-interaction event names, e.g.,
  NUI:DressUpMode, Add Buddy, Website Login, Chat Now]
probability estimation
 “a random variable or stochastic variable is a
 variable whose value is subject to variations”
 “an estimator is a rule for calculating an
 estimate of a given quantity based on observed
 data”
 estimators and probability
 distributions provide the essential
 basis for our insights
 bayesian methods, shrinkage…
 these are our friends
 quantile estimation, empirical CDFs…
 …versus frequentist notions
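
Quantile estimation and empirical CDFs, as mentioned above, need very little machinery. A minimal sketch using only the Python standard library; the function names and the sample values are illustrative assumptions:

```python
import bisect
import math


def empirical_cdf(sample):
    """Return a function F where F(x) is the fraction of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many ordered values are <= x
        return bisect.bisect_right(ordered, x) / n

    return cdf


def quantile(sample, q):
    """Smallest observed value whose empirical CDF is at least q (0 <= q <= 1)."""
    ordered = sorted(sample)
    k = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[k]


latencies = [3, 1, 4, 2]  # hypothetical observations
F = empirical_cdf(latencies)
print(F(2), quantile(latencies, 0.5))
```

Both functions work directly from observed data and assume no particular distribution, which is the appeal of these estimators for messy production data.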
analytic variance
 our tools for automation leverage deep
 understanding of covariance
 cannot overstate the importance of
 sampling… insist on metrics described
 as confidence intervals, where valid
 bootstrapping, bagging…
 these are our friends
 Monte Carlo methods resolve “black box”
 problems
 point estimates may help prevent
 “uninformed” decisions
 do not skimp on this part, ever…
 a hard lesson learned from BI failures
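
Bootstrapping is one way to turn a single metric into a confidence interval. A minimal sketch of the percentile bootstrap using only the Python standard library; the function name, defaults, and sample data are assumptions for illustration:

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `sample`.

    Resamples with replacement, recomputes the statistic each time, and
    reads the interval off the sorted resampled estimates.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


data = list(range(1, 21))  # hypothetical observations
print(statistics.mean(data), bootstrap_ci(data))
```

Reporting the interval next to the mean shows how much the number would move under resampling, which a bare point estimate hides.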
learning theory
 in general, apps alternate between learning
 patterns/rules and retrieving similar things…
 statistical learning theory – rigorous,
 prevents you from making billion dollar
 mistakes, probably our future
 machine learning – scalable, enables
 you to make billion dollar mistakes, much
 commercial emphasis
 supervised vs. unsupervised
 arguably, optimization is a related area

 once Big Data projects get beyond merely
 digesting log files, optimization will likely
 become yet another buzzword :)
order complexity
 techniques for manipulating order complexity:
 dimensional reduction… with clustering
 as a common case
 e.g., you may have 100 million HTML docs,
 but there are only ~10K useful keywords
 low-dimensional structures, PCA
 linear algebra tricks: eigenvalues, matrix
 decomposition, etc.
 many hard problems resolved by “divide and
 conquer”
 this is an area ripe for much advancement in
 algorithms research near-term
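
The PCA and eigenvalue points above can be illustrated on a tiny data set. This sketch finds only the first principal component, using power iteration on the covariance matrix (a technique chosen here for brevity, not named in the slides); all function names and the sample rows are hypothetical:

```python
import math


def covariance_matrix(rows):
    """Sample covariance of the columns of `rows` (a list of equal-length lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iterations=200):
    """Power iteration: approximate the leading eigenvector and eigenvalue of `cov`.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return v, eigenvalue


def project(rows, v):
    """Reduce each centered row to one coordinate along direction `v`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [sum((r[j] - means[j]) * v[j] for j in range(d)) for r in rows]


rows = [[1, 2], [2, 4], [3, 6], [4, 8]]  # hypothetical points on the line y = 2x
direction, variance = top_component(covariance_matrix(rows))
print(direction, variance, project(rows, direction))
```

Because these points lie on a single line, one component captures all of the variance and the two columns reduce to one coordinate without loss.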
praxis
some great tools…
  reporting:            PowerPivot, Pentaho, Jaspersoft, SAS
  visualization:        ggplot2, D3, Gephi
  analytics/modeling:   R, Weka, Matlab, PMML, GLPK
  text:                 LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
  apps:                 Cascading, Scalding, Cascalog, R markdown, SWF
  scale-out:            Scalr, RightScale, CycleComputing, vFabric, Beanstalk
  graph:                Gremlin, GraphLab, Neo4J
  column:               Vertica, HBase, Drill, Dynamo
  key/val:              Redis, Membase, MySQL
  index:                Lucene/Solr, ElasticSearch
  relational:           usual suspects
  stream/iter:          Storm, Spark
  hadoop:               EMR, HW, MapR, EMC, Azure, Compute
  durable storage:      ASV, S3, Riak, Couch
a sample of great algorithms…
  time series analysis    seasonal variation    geospatial
  hidden markov models    ARIMA    bayesian point estimates    kriging    k-d trees
  funnel optimization    topics    lang id    anti-fraud    regression
  linear programming    cosine similarity    LDA    TextRank    LID    TF-IDF    random forest    GLM/GAM
  elasticity of demand    recommender    key phrase    doc similarity    classifier
  differential equations    k-medoids    PCA    LSH    k-means||    probabilistic hashing
  customer lifetime value    market segmentation    dimensional reduction    customer experiments
  connected components    markov random walk    association rules    multi-arm bandit
  sessionization    social graph    what if ?    sample variance
  affiliation networks    MCMC    bootstrapping
Intro to Data Science
Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

Copyright @2012, Concurrent, Inc.

More Related Content

What's hot

Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
Mohammed Barakat
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Edureka!
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Tharushi Ruwandika
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
Gang Tao
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
Simplilearn
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptx
SadhanaParameswaran
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data science
ShilpaKrishna6
 
Data science
Data scienceData science
Data science
Mohamed Loey
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Sampath Kumar
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
Jason Geng
 
Introduction to data science club
Introduction to data science clubIntroduction to data science club
Introduction to data science club
Data Science Club
 
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Edureka!
 
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Edureka!
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
ANOOP V S
 
Best Practices in Metadata Management
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata Management
DATAVERSITY
 
Improving Data Literacy Around Data Architecture
Improving Data Literacy Around Data ArchitectureImproving Data Literacy Around Data Architecture
Improving Data Literacy Around Data Architecture
DATAVERSITY
 
Full-stack Data Scientist
Full-stack Data ScientistFull-stack Data Scientist
Full-stack Data Scientist
Alexey Grigorev
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Vivek Aanand Ganesan
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital Transformation
DATAVERSITY
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Srishti44
 

What's hot (20)

Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptx
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data science
 
Data science
Data scienceData science
Data science
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 
Introduction to data science club
Introduction to data science clubIntroduction to data science club
Introduction to data science club
 
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
Data Science Training | Data Science Tutorial for Beginners | Data Science wi...
 
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial...
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Best Practices in Metadata Management
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata Management
 
Improving Data Literacy Around Data Architecture
Improving Data Literacy Around Data ArchitectureImproving Data Literacy Around Data Architecture
Improving Data Literacy Around Data Architecture
 
Full-stack Data Scientist
Full-stack Data ScientistFull-stack Data Scientist
Full-stack Data Scientist
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital Transformation
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 

Similar to Intro to Data Science for Enterprise Big Data

Enterprise Data Workflows with Cascading
Enterprise Data Workflows with CascadingEnterprise Data Workflows with Cascading
Enterprise Data Workflows with Cascading
Paco Nathan
 
Cascading for the Impatient
Cascading for the ImpatientCascading for the Impatient
Cascading for the Impatient
Paco Nathan
 
Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)
Paco Nathan
 
A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...
Paco Nathan
 
Chicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data WorkflowsChicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data Workflows
Paco Nathan
 
Using Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open DataUsing Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open Data
Paco Nathan
 
Building Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with CascadingBuilding Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with Cascading
Paco Nathan
 
Functional programming for optimization problems in Big Data
Functional programming for optimization problems in Big DataFunctional programming for optimization problems in Big Data
Functional programming for optimization problems in Big Data
Paco Nathan
 
Cascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKaiCascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKai
Paco Nathan
 
Pattern: an open source project for migrating predictive models onto Apache H...
Pattern: an open source project for migrating predictive models onto Apache H...Pattern: an open source project for migrating predictive models onto Apache H...
Pattern: an open source project for migrating predictive models onto Apache H...
Paco Nathan
 
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow Abstraction
Paco Nathan
 
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow Abstraction
OReillyStrata
 
SNIA 2012 - Creating an Enterprise Hadoop Platform
SNIA 2012 - Creating an Enterprise Hadoop PlatformSNIA 2012 - Creating an Enterprise Hadoop Platform
SNIA 2012 - Creating an Enterprise Hadoop Platform
Joey Jablonski
 
Advanced analytics with sap hana and r
Advanced analytics with sap hana and rAdvanced analytics with sap hana and r
Advanced analytics with sap hana and r
SAP Technology
 

Intro to Data Science for Enterprise Big Data

  • 1. Intro to Data Science. Paco Nathan, Concurrent, Inc. pnathan@concurrentinc.com, @pacoid. Copyright @2012, Concurrent, Inc. [diagram: word count workflow with nodes Document Collection, Tokenize, Scrub token, Stop Word List, HashJoin (Left, RHS), Regex token, GroupBy token, Count, Word Count]
  • 2. opportunity: Unstructured Data meets Enterprise Scale
  • 3. core values: Data Science teams develop actionable insights, building confidence for decisions. That work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords); probably somewhere in-between… solving for pattern, at scale. NB: projects require teams, not sole players
  • 4. Intro to Data Science: backstory [section title slide; repeats the word count workflow diagram]
  • 5. personal timeline [timeline figure, 1980s to 2010s; row labels: school, research, enterprise, start-up CTO, consult, lead data teams; names shown: Stanford, NASA, IBM, Moto, Bell Labs, BNTI, Symbiot, Adknowledge, ShareThis, IMVU, etc.]
  • 6. inflection point: demand side [timeline markers on slide: 1997, 1998, 2004]
      • huge Internet successes after 1997 holiday season… AMZN, EBAY, then GOOG, Inktomi (YHOO Search)
      • consider this metric: annual revenue per customer / operational data store size dropped more than 100x within a few years after 1997
      • storage and processing costs plummeted, now we must work much smarter to extract ROI from Big Data… our methods must adapt
      • “conventional wisdom” of RDBMS and BI tools became less viable; business cadre still focused on pivot tables and pie charts… which tends toward inertia
      • MapReduce and the Hadoop open source stack grew directly out of this context… but that only solves parts
      massive disruption in retail, advertising, etc.: “All of Fortune 500 is now on notice over the next 10-year period.” – Geoffrey Moore, 2012 (Mohr Davidow Ventures)
  • 7. inflection point: supply side [two charts; sources: DJ Patil, R-Bloggers]
  • 8. statistical thinking [diagram labels: Process, Variation, Data, Tools]
      a mode of thinking which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
      this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
      particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering… programmers typically don’t think this way
  • 9. most valuable skills [image: D3]
      • approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues
      • unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up
      • most valuable skills:
          ‣ learn to use programmable tools that prepare data
          ‣ learn to generate compelling data visualizations
          ‣ learn to estimate the confidence for reported results
          ‣ learn to automate work, making analysis repeatable
      the rest of the skills – modeling, algorithms, etc. – those are secondary
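One way to practice the "estimate the confidence for reported results" skill from slide 9 is a percentile bootstrap. The sketch below is only an illustration, not something taken from the deck: the function name `bootstrap_ci`, its parameters, and the `daily_conversions` numbers are all hypothetical, and the percentile bootstrap is just one of several reasonable interval methods.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for `stat` (hypothetical helper, not from the deck)."""
    rng = random.Random(seed)  # fixed seed so the analysis is repeatable
    n = len(sample)
    # Resample with replacement, recompute the statistic each time, then sort.
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up numbers, used only to exercise the function.
daily_conversions = [12, 15, 9, 14, 11, 18, 13, 10, 16, 12]
low, high = bootstrap_ci(daily_conversions)
print(f"mean = {statistics.mean(daily_conversions):.1f}, "
      f"95% interval = ({low:.1f}, {high:.1f})")
```

Reporting the interval next to the point estimate, and fixing the random seed, also touches the "making analysis repeatable" item on the same slide.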
  • 10. social caveats [image: xkcd]
      • the phrase “This data cannot be correct!” may be an early warning about the organization itself
      • much depends on how the people whom you work alongside tend to arrive at their decisions:
          ‣ probably good: Induction, Abduction, Circumscription
          ‣ probably poor: Deduction, Speculation, Justification
      in general, one good data visualization can put many ongoing verbal arguments to rest
  • 11. reference: Statistical Modeling: The Two Cultures, by Leo Breiman. Statistical Science, 2001. http://bit.ly/eUTh9L
  • 12. reference: Data Quality, by Jack Olson. Morgan Kaufmann, 2003. http://www.amazon.com/dp/1558608915
  • 13. reference: Building Data Science Teams, by DJ Patil. O’Reilly, 2011. http://www.amazon.com/dp/B005O4U3ZE
  • 14. reference: Data Jujitsu, by DJ Patil. O’Reilly, 2012. http://www.amazon.com/dp/B008HMN5BE
  • 15. reference: RStudio. Download and run it on your laptop. http://rstudio.org/
  • 16. Intro to Data Science: build: data science teams [section title slide; repeats the word count workflow diagram]
  • 17. process [image: Gephi]
      • discovery: help people ask the right questions
      • modeling: allow automation to place informed bets
      • integration: deliver products at scale to customers
      • apps: leverage smarts in product features
      • systems: keep infrastructure running, cost-effective
  • 18. matrix = needs × roles [grid: columns (needs) discovery, modeling, integration, apps, systems; rows (roles) stakeholder, scientist, developer, ops]
  • 19. matrix: usage [needs × roles grid]
      conceptual tool for managing Data Science teams
      overlay your project requirements (needs) with your team’s strengths (roles)
      that will show very quickly where to focus
      NB: bring in individuals who cover 2-3 needs, particularly for team leads
  • 20. matrix: needs [needs × roles grid]
      one dimension is “needs”: discovery, modeling, integration, apps, systems
      these are the primary phases of leveraging Big Data…
      stakeholders represent the domain: the key aspect to leverage
      analysts usually drive from discovery toward integration, while the engineers tend to drive from systems toward integration
      NB: effective, hands-on management in Data Science must live in the space of integration, not delegate it
  • 21. matrix: roles [needs × roles grid]
      one dimension is “roles”: stakeholder, scientist, developer, ops
      each role leverages different disciplines, opportunities, and risks…
      there’s great power in pairing people with complementary skills, in team environments where they can recognize each other’s priorities and perspectives
      blurring these roles is wonderful, when you find great people capable of doing so, e.g., DevOps… however, when businesses get into trouble, they will tend to “push down” these roles, blurring boundaries in ways which stress teams and limit scalability
  • 22. matrix: example team [needs × roles grid showing an example team]
  • 23. matrix: example team [needs × roles grid showing an example team]
      summary: this team seems heavy on systems, may need more overlap between modeling and integration, particularly among team leads
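The needs × roles overlay from slides 18 to 23 can be sketched as a small coverage count. The needs and roles come from the slides; everything else below (the function names, the team members, and the required head counts) is hypothetical, since the transcript does not preserve how the example team was placed on the grid.

```python
NEEDS = ["discovery", "modeling", "integration", "apps", "systems"]
ROLES = ["stakeholder", "scientist", "developer", "ops"]


def coverage(team):
    """Count how many team members cover each need."""
    counts = {need: 0 for need in NEEDS}
    for member in team:
        if member["role"] not in ROLES:
            raise ValueError(f"unknown role: {member['role']}")
        for need in member["needs"]:
            counts[need] += 1
    return counts


def gaps(team, required):
    """Needs where the team covers fewer people than the project requires."""
    counts = coverage(team)
    return [need for need in NEEDS if counts[need] < required.get(need, 0)]


# Hypothetical team and requirements, invented for illustration only.
team = [
    {"name": "A", "role": "stakeholder", "needs": ["discovery"]},
    {"name": "B", "role": "scientist", "needs": ["discovery", "modeling"]},
    {"name": "C", "role": "developer", "needs": ["apps", "systems"]},
    {"name": "D", "role": "developer", "needs": ["integration", "systems"]},
    {"name": "E", "role": "ops", "needs": ["systems"]},
]
required = {"discovery": 1, "modeling": 2, "integration": 2, "apps": 1, "systems": 1}

print(coverage(team))
print(gaps(team, required))
```

With these made-up inputs the team comes out heavy on systems and short on modeling and integration, which is the kind of reading the slide 23 summary describes.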
  • 24. typical hand-offs [diagram; priorities: integrity, availability, discovery, modeling, communications; people: engineers/automation, analysts, business stakeholders; components: vendor data sources, customer interactions, production cluster, data warehouse, query hosts, internal API, crons, etc., classifiers, recommenders, predictive analytics, analyze/visualize, BI & dashboards, reporting, presentations, decision support]
  • 25. data priorities [overlaid on the hand-offs diagram from slide 24]
      • Availability: Top priority, providing access to data as needed. Lack of availability causes large hidden costs to a business.
      • Integrity
      • Discovery
      • Modeling
      • Communications
  • 26. data priorities [overlaid on the hand-offs diagram from slide 24]
      • Availability
      • Integrity: Work within Engineering to ensure that customer data, internal metrics, third-party sources, etc., get collected and maintained in ways which are meaningful and consistent for required business use cases.
      • Discovery
      • Modeling
      • Communications
  • 27. data priorities [overlaid on the hand-offs diagram from slide 24]
      • Availability
      • Integrity
      • Discovery: Analyze and visualize data on behalf of business stakeholders. Leverage statistics so that we not only say “What” decisions to take, but can answer “Why?” and “How good are they?”
      • Modeling
      • Communications
  • 28. data priorities [overlaid on the hand-offs diagram from slide 24]
      • Availability
      • Integrity
      • Discovery
      • Modeling: Use business learnings in automated, scalable ways. For example, manage an automated bid system. Principally “algorithmic modeling”, not “data modeling”.
      • Communications
  • 29. data priorities [overlaid on the hand-offs diagram from slide 24]
      • Availability
      • Integrity
      • Discovery
      • Modeling
      • Communications: Work closely with stakeholders so that insights gleaned from data+analysis are understood, and important to the business. Sum of learnings from this ongoing process represents our primary value.
  • 30. Intro to Data Science: theory: wrangle the data [section title slide; repeats the word count workflow diagram]
  • 31. CAP theorem [diagram labels: C = strong consistency, A = high availability, P = partition tolerance; also: eventual consistency]
  • 32. CAP theorem
      “You can have at most two of these properties for any shared-data system… the choice of which feature to discard determines the nature of your system.” – Eric Brewer, 2000 (Inktomi)
      • revenue transactions in ecommerce typically require strong consistency and partition tolerance
      • most analytics jobs for business use cases generally require availability and eventual consistency, but tend to not tolerate highly partitioned data
      • ETL becomes an Achilles heel for “agile”:
          ‣ agile/experiment-driven/scale-out, which leads to…
          ‣ provably-hard-to-detect metadata drift, leading to…
          ‣ high-risk technical debt
  • 33. interpretation
      • purpose: theoretical limits for data access patterns
      • essence:
          ‣ consistency
          ‣ availability
          ‣ partition tolerance
      • best case scenario: you may pick two … or spend billions struggling to obtain all three at scale (GOOG)
      • translated: cost of doing business
      https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
      http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
  • 34. data access patterns
      • the world is not made of data warehouses…
      • a handful of common data access patterns prevail
      • learn to recognize these for any given problem
      • typically expressed as trade-offs among:
          ‣ speed & volume (latency and throughput)
          ‣ reads & writes (access and storage)
          ‣ consistency / availability / partition tolerance
      as for roles on teams, some mixing is valuable; OTOH, too much blurring of boundaries causes stress
  • 35. data access patterns
      • design patterns: originated in consensus negotiation for architecture, later used in software engineering
      • consider the corollaries in large-scale data work…
      • essential advice: select data frameworks based on your data access patterns
      • in other words, decouple use cases based on needs – to avoid “one size fits all” blockers
      • let’s review some examples…
  • 36. access → frameworks → forfeits
      financial transactions → general ledger in RDBMS → CAx
      ad-hoc queries → RDS (hosted MySQL) → CAx
      reporting, dashboards → like Pentaho → CAx
      log rotation/persistence → like Riak → xxP
      search indexes → like Lucene/Solr → xAP
      static content, archives → S3 (durable storage) → xAP
      customer facts → like Redis, Membase → xAP
      distributed counters, locks, sets → like Redis → xAP*
      data objects CRUD → key/value – like, NoSQL on MySQL → CxP
      authoritative metadata → like Zookeeper → CxP
      data prep, modeling at scale → like Hadoop/Cascading + R → CxP
      graph analysis → like Hadoop + Redis + Gephi → CxP
      data marts → like Hadoop/HBase → CxP
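The access → frameworks → forfeits table can be turned into a small lookup for triaging a new use case. The following is a minimal sketch, not part of the original deck: the names `ACCESS_PATTERNS`, `CAP_NAMES`, and `forfeits_for` are hypothetical, only a few rows of the table are copied in, and an "x" in a slot is read as "this CAP property is forfeited".

```python
# Hypothetical lookup built from a few rows of the slide's table:
# access pattern -> (example framework, CAP slots, where "x" marks a forfeited property).
ACCESS_PATTERNS = {
    "financial transactions": ("general ledger in RDBMS", "CAx"),
    "log rotation/persistence": ("Riak", "xxP"),
    "search indexes": ("Lucene/Solr", "xAP"),
    "authoritative metadata": ("Zookeeper", "CxP"),
}

CAP_NAMES = {"C": "consistency", "A": "availability", "P": "partition tolerance"}


def forfeits_for(access_pattern):
    """Return the CAP properties given up for a known access pattern."""
    _framework, slots = ACCESS_PATTERNS[access_pattern]
    # Walk the three positions C, A, P and collect those marked "x".
    return [CAP_NAMES[letter] for letter, slot in zip("CAP", slots) if slot == "x"]


if __name__ == "__main__":
    for pattern in ACCESS_PATTERNS:
        print(pattern, "forfeits", forfeits_for(pattern))
```

Keeping the table as data rather than prose makes it easy to ask, per use case, which trade-off is being accepted before a framework is chosen.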
  • 38. Amdahl’s law (figure; source: Wikipedia)
  • 39. interpretation • purpose: theoretical limits for scalable computation • essence: task overhead and data independence define limits of parallelism for any given problem; however, these also suggest how well a problem can be scaled-out • translated: return on investment http://en.wikipedia.org/wiki/Amdahl's_law http://www.bu.edu/tech/research/training/tutorials/matlab-pct/scalability/
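The "return on investment" reading of Amdahl's law can be made concrete with the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of workers. The sketch below is an illustration added here, not taken from the deck; the function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Theoretical speedup under Amdahl's law.

    parallel_fraction: share of the work that parallelizes (0.0 to 1.0).
    workers: number of parallel workers (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # With 95% parallel work, the speedup approaches but never reaches 1 / 0.05 = 20.
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

The serial fraction sets a hard ceiling: beyond some cluster size, additional nodes add cost without adding meaningful speedup.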
  • 40. parallel computation • parallelism allows for horizontal scale-out, which creates business “levers” in cost/performance at scale • NB: MapReduce provides a compute framework which is part-parallel and part-serial… that tends to complicate app development • most hard problems in industry have portions which do not allow data independence, or which require iteration • current efforts in massively parallel algorithms research may help to parallelize problems and reduce iteration – estimates are 3-5 years out for industry use • GPUs and other hardware architecture advancements will likely make Hadoop unrecognizable 3-5 years out
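The "part-parallel and part-serial" nature of MapReduce can be seen in a word count written as three explicit phases. This is a single-process sketch under stated assumptions, not Hadoop or Cascading code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and tokenizing is reduced to whitespace splitting.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit (token, 1) pairs. Each document is independent, so this parallelizes."""
    return [(token.lower(), 1) for token in document.split()]


def shuffle(pairs):
    """Shuffle: group values by key. This barrier needs the output of every mapper."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the counts per token. Independent per key once grouping is done."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # In a real cluster the map calls would run on separate workers;
    # here they run in a plain loop to keep the sketch self-contained.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["a rose is a rose", "Is it"]))
```

The map and reduce steps scale out, while the shuffle between them is the synchronization point that every job pays for.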
  • 41. Intro to Data Science • theory: manage the science
  • 42. the science in data science • Estimate Probability! • Calculate Analytic Variance!! • Apply Learning Theory!!! • Manipulate Order Complexity!!!!
  • 43. probability estimation • “a random variable or stochastic variable is a variable whose value is subject to variations” • “an estimator is a rule for calculating an estimate of a given quantity based on observed data” • estimators and probability distributions provide the essential basis for our insights • bayesian methods, shrinkage… these are our friends • quantile estimation, empirical CDFs… • …versus frequentist notions
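Quantile estimation and empirical CDFs are simple enough to sketch directly. The example below is an illustration added here, using only the Python standard library; the function names `empirical_cdf` and `empirical_quantile` are hypothetical, and the quantile rule used (smallest observed value whose ECDF reaches q) is one common convention among several.

```python
import bisect
import math


def empirical_cdf(sample):
    """Return a function x -> fraction of observations that are <= x."""
    if not sample:
        raise ValueError("sample must not be empty")
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many sorted observations are <= x.
        return bisect.bisect_right(ordered, x) / n

    return cdf


def empirical_quantile(sample, q):
    """Smallest observed value whose empirical CDF is at least q, for 0 < q <= 1."""
    if not sample:
        raise ValueError("sample must not be empty")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(sample)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


if __name__ == "__main__":
    latencies = [3, 1, 4, 1, 5, 9, 2, 6]
    cdf = empirical_cdf(latencies)
    print(cdf(4), empirical_quantile(latencies, 0.5))
```

Both functions are estimators in the sense quoted on the slide: rules for calculating an estimate from observed data, with no distributional assumption.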
  • 44. analytic variance • our tools for automation leverage deep understanding of covariance • cannot overstate the importance of sampling… • insist on metrics described as confidence intervals, where valid • bootstrapping, bagging… these are our friends • Monte Carlo methods resolve “black box” problems • point estimates may help prevent “uninformed” decisions • do not skimp on this part, ever… a hard lesson learned from BI failures
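Reporting a metric as a confidence interval rather than a bare number can be done with a percentile bootstrap. The following is a minimal sketch under stated assumptions, not the deck's own method: the function name `bootstrap_ci` and its parameters are hypothetical, the interval is the plain percentile variant, and the fixed seed only makes the example reproducible.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic of the sample."""
    if not sample:
        raise ValueError("sample must not be empty")
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, recompute the statistic each time, then sort.
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)]) for _ in range(resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * resamples)]
    upper = estimates[min(resamples - 1, int((1.0 - tail) * resamples))]
    return lower, upper


if __name__ == "__main__":
    daily_conversions = list(range(1, 21))
    print(bootstrap_ci(daily_conversions))
```

Passing `stat=statistics.median` (or any function of a list) gives an interval for a different metric without changing the resampling logic.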
  • 45. learning theory • in general, apps alternate between learning patterns/rules and retrieving similar things… • statistical learning theory – rigorous, prevents you from making billion dollar mistakes, probably our future • machine learning – scalable, enables you to make billion dollar mistakes, much commercial emphasis • supervised vs. unsupervised • arguably, optimization is a related area • once Big Data projects get beyond merely digesting log files, optimization will likely become yet another buzzword :)
  • 46. order complexity • techniques for manipulating order complexity: • dimensional reduction… with clustering as a common case • e.g., you may have 100 million HTML docs, but there are only ~10K useful keywords • low-dimensional structures, PCA • linear algebra tricks: eigenvalues, matrix decomposition, etc. • many hard problems resolved by “divide and conquer” • this is an area ripe for much advancement in algorithms research near-term
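The link between PCA and the "eigenvalues" trick can be sketched with power iteration on a covariance matrix, which approximates the first principal component. This is an illustration added here in pure Python, not the deck's own code: the function names `covariance_matrix` and `leading_component` are hypothetical, and a production workload would use a linear algebra library instead.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of rows, each a sequence of equal length."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def leading_component(matrix, iterations=200):
    """Power iteration: approximate the largest eigenvalue and its eigenvector.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    dims = len(matrix)
    vector = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        product = [
            sum(matrix[i][j] * vector[j] for j in range(dims)) for i in range(dims)
        ]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    # Rayleigh quotient of the unit vector gives the eigenvalue estimate.
    eigenvalue = sum(
        vector[i] * sum(matrix[i][j] * vector[j] for j in range(dims))
        for i in range(dims)
    )
    return eigenvalue, vector


if __name__ == "__main__":
    points = [(1, 2), (2, 4), (3, 6), (4, 8)]  # all variance lies along y = 2x
    print(leading_component(covariance_matrix(points)))
```

For data that really lies near a low-dimensional structure, the leading eigenvalue carries almost all of the variance, which is what justifies dropping the remaining dimensions.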
  • 47. Intro to Data Science • praxis
  • 48. some great tools… • reporting: PowerPivot, Pentaho, Jaspersoft, SAS • visualization: ggplot2, D3, Gephi • analytics/modeling: R, Weka, Matlab, PMML, GLPK • text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK • apps: Cascading, Scalding, Cascalog, R markdown, SWF • scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk • graph: Gremlin, GraphLab, Neo4J • column: Vertica, HBase, Drill, Dynamo • key/val: Redis, Membase • index: Lucene/Solr, ElasticSearch • relational: usual suspects, MySQL • stream/iter: Storm, Spark • hadoop: EMR, HW, MapR, EMC, Azure, Compute • durable storage: ASV, S3, Riak, Couch
  • 49. a sample of great algorithms… time series analysis seasonal variation geospatial hidden markov models ARIMA bayesian point estimates kriging k-d trees funnel optimization topics lang id anti-fraud regression linear programming cosine similarity LDA TextRank LID TF-IDF random forest GLM/GAM elasticity of demand recommender key phrase doc similarity classifier differential equations k-medoids PCA LSH k-means|| probabilistic hashing customer lifetime value market segmentation dimensional reduction customer experiments connected components markov random walk association rules multi-arm bandit sessionization social graph what if ? sample variance affiliation networks MCMC bootstrapping
  • 50. Intro to Data Science • Paco Nathan • Concurrent, Inc. • pnathan@concurrentinc.com • @pacoid • Copyright @2012, Concurrent, Inc.

Editor's Notes

  1. responsible for net lift, or we work on something else