Crowd Sourcing Reflected
Intelligence Using Search and Big
Data
Ted Dunning
Grant Ingersoll

©MapR Technologies - Confidential   1
Grant’s Background

     Co-founder:
       –   LucidWorks – Chief Scientist
       –   Apache Mahout
     Long time Lucene/Solr committer
     Author: Taming Text
     Background in IR and NLP
       –   Built CLIR, QA and a variety of other search-based apps




©MapR Technologies - Confidential             2
Ted’s Background

     Academia, Startups
       –   Aptex, MusicMatch, ID Analytics, Veoh
       –   Big data since before big
     Open source
       –   since the dark ages before the internet
       –   Mahout, Zookeeper, Drill
       –   bought the beer at first HUG
     MapR
       –   Chief Application Architect
     Founding member of Apache Drill




©MapR Technologies - Confidential             3
Agenda

     Intro
     Search Evolution and Search Revolution
     Reflected Intelligence Use Cases
     Building a Next Generation Search and Discovery Platform
       –   MapR
       –   LucidWorks
     1+1=3




©MapR Technologies - Confidential        4
Search is Dead, Long Live Search


     Search is a system building block                       Content
       –   text is only a part of the story


     If the algorithms fit,
                             use them!          Content                    User
                                              Relationships             Interaction


     Embrace fuzziness!


     Scoring features are everywhere                         Access




©MapR Technologies - Confidential                  5
Search (R)evolution

     Search use leads to search abuse
       –   denormalization frees your mind
       –   scoring is just a sparse matrix multiply

     Lucene/Solr evolution
       –   non free text usages abound
       –   many DB-like features
       –   noSQL before NoSQL was cool
       –   flexible indexing
       –   finite State Transducers FTW!

     Scale

     “This ain’t your father’s relevance anymore”


©MapR Technologies - Confidential                 6
Add (Lots of) Water

     Large-scale analysis is key to reflected intelligence
       –   correlation analysis
            • based on queries, clicks, mouse tracks,
                   even explicit feedback
            • produce clusters, trends, topics, SIP’s              Search
       –   start with engineered knowledge,
                 refine with user feedback
     Large-scale discovery features
        encourage experimentation

     Always test, always enrich!                      Analytics            Discovery




©MapR Technologies - Confidential                  7
Social Media Analysis in Telecom

     Correlate mobile traffic analysis with social media analysis
       –   events cause traffic micro-bursts
       –   participants tweet the events ahead of time
     Deploy operations faster to predict outages and better handle
      emergency situations
       –   high cost bandwidth augmentation can be marshaled as the traffic appears
       –   anticipation beats reaction




©MapR Technologies - Confidential            8
Provenance is 80% of value

     Analysis of social media to determine advertising reach and
      response


     In one case the same untargeted advertising was worth 5x if sold
      with supporting data.




©MapR Technologies - Confidential     9
Claims Analysis

     Goal
       –   Insurance claims processing and analysis
       –   fraud analysis
     Method
       –   Combine free text search with metadata analysis to identify high risk
           activities across the country
       –   Integrate with corporate workflows to detect and fix outliers in customer
           relations
     Results
       –   Questions that took 24-48 hours now take seconds to answer




©MapR Technologies - Confidential            10
Virginia Tech - Help the World

     Grab data around crisis
     Search immediately
     Large-scale analysis enriches data to find
      ways to improve responses and
      understanding
     http://www.ctrnet.net




©MapR Technologies - Confidential      11
Bright Planet - Catch the Bad Guys

     Online Drug Counterfeit detection
     Identify commonly used language indicating counterfeits
       –   you know it when you see it
       –   and you know you have seen it
     Feed to analyst via search-driven application
       –   enrich based on analysts feedback




©MapR Technologies - Confidential              12
Veoh - Cross Recommendations

     Cross recommendation as search
       –   with search used to build cross recommendation!
     Recommend content to people who exhibit certain behaviors
      (clicks, query terms, other)
     (Ab)use of a search engine
       –   but not as a search engine for content
       –   more like a search engine for behavior




©MapR Technologies - Confidential            13
What Platform Do You Need?

     Fast, efficient, scalable search
       –   bulk and near real-time indexing
       –   handle billions of records with sub-second search and faceting


     Large scale, cost effective storage and processing capabilities

     NLP and machine learning tools that scale to enhance discovery
      and analysis


     Integrated log analysis workflows that close the loop between the
      raw data and user interactions


©MapR Technologies - Confidential            14
Reference Architecture
                                                                      Access APIs
                                                                                      •View into
                                               Search View               Analytic      numeric/histo   Personalization &
                                                                                       ric data
                               1                                         Services                      Machine Learning
                                     2                                                                     Services
                          Shards           3                  N
                                                                                                              •Classification
                                                                                                              •Recommendation

                                                                           Document       •Documents   Classification Models
                                    Discovery &                                           •Users
                                    Enrichment                               Store                        In memory
                                                                                          •Logs           Replicated
                                    Clustering, classificat
                                    ion, NLP, topic                                                       Multi-tenant
                                    identification, searc
                                    h log analysis, user
                                    behavior
                                                                  Content Acquisition
                                                                     ETL, batch or near
                                                                     real-time


                                    Data
                    • LucidWorks Search
                      connectors
                    • Push




©MapR Technologies - Confidential                                             15
MapR

     MapR provides the technology leading Hadoop distribution
       –   full eco-system distribution
       –   integrated data platform
       –   complete solution for data integrity
     MapR clusters also provide tight integration with search
      technologies like LucidWorks
       –   integration is key for effective ops




©MapR Technologies - Confidential                 16
LucidWorks

     LucidWorks provides the leading packaging of Apache Lucene and
      Solr
       –   build your own, we support
       –   founded by the most prominent Lucene/Solr experts
     LucidWorks Search
       –   “Solr++”
            •    UI, REST API, MapR connectors, relevance tools, much more
     LucidWorks Big Data
       –   Big Data as a Service
       –   Integrated LucidWorks Search, Hadoop, machine learning with prebuilt
           workflows for many of these tasks




©MapR Technologies - Confidential                  17
LucidWorks Big Data Architecture

                                              Uniform ReST API

                     Content        Search – Discovery – Analytics                                System
                                      • LucidWorks Search
                    Acquisition       • Machine Learning                                        Management
                                        (classification, clustering, recommendations)
                                                                                             • Administration
                                      • Natural Language Processing
            • Enterprise
                                      • SQL (Hive) Interface
              Repository                                                                     • Provisioning
                                      • Data Workflows (ETL, log analysis, common metrics)
                                      • Extensible
            • Social Media                                                                   • Monitoring

            • Databases                  Big Data Operating System                           • Configuration

            • HDFS                                                                           • Service Management

            • Cloud (S3)                                                                     • Data Management

            • Push                                                                           • Security
                                     Hadoop/HBase        Search            Search
                                                          Logs            Indexes




©MapR Technologies - Confidential                          18
Easy Wins

     Analyze logs from application stored in MapR
     Seamlessly store search indexes in MapR
       –   and feed to Pig, Mahout and others
       –   use mirrors + NFS to directly deploy indexes
     Snapshots make backups a snap
     LucidWorks 2.5 (2013 Q1) easily connects with MapR




©MapR Technologies - Confidential             19
1+1=3



©MapR Technologies - Confidential     20
Learn More

     More information
       http://www.mapr.com/company/events/lucidworks-12-13-2012
     Vote for this topic for Hadoop Summit EU:
       http://bit.ly/128tLQe
     Talk to Ted
       @ted_dunning
       tdunning@maprtech.com
     Talk to Grant
       @gsingers
     MapR and Lucid Works
       http://www.mapr.com
       http://www.lucidworks.com


©MapR Technologies - Confidential      21

MapR lucidworks joint webinar

  • 1.
    Crowd Sourcing Reflected IntelligenceUsing Search and Big Data Ted Dunning Grant Ingersoll ©MapR Technologies - Confidential 1
  • 2.
    Grant’s Background  Co-founder: – LucidWorks – Chief Scientist – Apache Mahout  Long time Lucene/Solr committer  Author: Taming Text  Background in IR and NLP – Built CLIR, QA and a variety of other search-based apps ©MapR Technologies - Confidential 2
  • 3.
    Ted’s Background  Academia, Startups – Aptex, MusicMatch, ID Analytics, Veoh – Big data since before big  Open source – since the dark ages before the internet – Mahout, Zookeeper, Drill – bought the beer at first HUG  MapR – Chief Application Architect  Founding member of Apache Drill ©MapR Technologies - Confidential 3
  • 4.
    Agenda  Intro  Search Evolution and Search Revolution  Reflected Intelligence Use Cases  Building a Next Generation Search and Discovery Platform – MapR – LucidWorks  1+1=3 ©MapR Technologies - Confidential 4
  • 5.
    Search is Dead,Long Live Search  Search is a system building block Content – text is only a part of the story  If the algorithms fit, use them! Content User Relationships Interaction  Embrace fuzziness!  Scoring features are everywhere Access ©MapR Technologies - Confidential 5
  • 6.
    Search (R)evolution  Search use leads to search abuse – denormalization frees your mind – scoring is just a sparse matrix multiply  Lucene/Solr evolution – non free text usages abound – many DB-like features – noSQL before NoSQL was cool – flexible indexing – finite State Transducers FTW!  Scale  “This ain’t your father’s relevance anymore” ©MapR Technologies - Confidential 6
  • 7.
    Add (Lots of)Water  Large-scale analysis is key to reflected intelligence – correlation analysis • based on queries, clicks, mouse tracks, even explicit feedback • produce clusters, trends, topics, SIP’s Search – start with engineered knowledge, refine with user feedback  Large-scale discovery features encourage experimentation  Always test, always enrich! Analytics Discovery ©MapR Technologies - Confidential 7
  • 8.
    Social Media Analysisin Telecom  Correlate mobile traffic analysis with social media analysis – events cause traffic micro-bursts – participants tweet the events ahead of time  Deploy operations faster to predict outages and better handle emergency situations – high cost bandwidth augmentation can be marshaled as the traffic appears – anticipation beats reaction ©MapR Technologies - Confidential 8
  • 9.
    Provenance is 80%of value  Analysis of social media to determine advertising reach and response  In one case the same untargeted advertising was worth 5x if sold with supporting data. ©MapR Technologies - Confidential 9
  • 10.
    Claims Analysis  Goal – Insurance claims processing and analysis – fraud analysis  Method – Combine free text search with metadata analysis to identify high risk activities across the country – Integrate with corporate workflows to detect and fix outliers in customer relations  Results – Questions that took 24-48 hours now take seconds to answer ©MapR Technologies - Confidential 10
  • 11.
    Virginia Tech -Help the World  Grab data around crisis  Search immediately  Large-scale analysis enriches data to find ways to improve responses and understanding  http://www.ctrnet.net ©MapR Technologies - Confidential 11
  • 12.
    Bright Planet -Catch the Bad Guys  Online Drug Counterfeit detection  Identify commonly used language indicating counterfeits – you know it when you see it – and you know you have seen it  Feed to analyst via search-driven application – enrich based on analysts feedback ©MapR Technologies - Confidential 12
  • 13.
    Veoh - CrossRecommendations  Cross recommendation as search – with search used to build cross recommendation!  Recommend content to people who exhibit certain behaviors (clicks, query terms, other)  (Ab)use of a search engine – but not as a search engine for content – more like a search engine for behavior ©MapR Technologies - Confidential 13
  • 14.
    What Platform DoYou Need?  Fast, efficient, scalable search – bulk and near real-time indexing – handle billions of records with sub-second search and faceting  Large scale, cost effective storage and processing capabilities  NLP and machine learning tools that scale to enhance discovery and analysis  Integrated log analysis workflows that close the loop between the raw data and user interactions ©MapR Technologies - Confidential 14
  • 15.
    Reference Architecture Access APIs •View into Search View Analytic numeric/histo Personalization & ric data 1 Services Machine Learning 2 Services Shards 3 N •Classification •Recommendation Document •Documents Classification Models Discovery & •Users Enrichment Store In memory •Logs Replicated Clustering, classificat ion, NLP, topic Multi-tenant identification, searc h log analysis, user behavior Content Acquisition ETL, batch or near real-time Data • LucidWorks Search connectors • Push ©MapR Technologies - Confidential 15
  • 16.
    MapR  MapR provides the technology leading Hadoop distribution – full eco-system distribution – integrated data platform – complete solution for data integrity  MapR clusters also provide tight integration with search technologies like LucidWorks – integration is key for effective ops ©MapR Technologies - Confidential 16
  • 17.
    LucidWorks  LucidWorks provides the leading packaging of Apache Lucene and Solr – build your own, we support – founded by the most prominent Lucene/Solr experts  LucidWorks Search – “Solr++” • UI, REST API, MapR connectors, relevance tools, much more  LucidWorks Big Data – Big Data as a Service – Integrated LucidWorks Search, Hadoop, machine learning with prebuilt workflows for many of these tasks ©MapR Technologies - Confidential 17
  • 18.
    LucidWorks Big DataArchitecture Uniform ReST API Content Search – Discovery – Analytics System • LucidWorks Search Acquisition • Machine Learning Management (classification, clustering, recommendations) • Administration • Natural Language Processing • Enterprise • SQL (Hive) Interface Repository • Provisioning • Data Workflows (ETL, log analysis, common metrics) • Extensible • Social Media • Monitoring • Databases Big Data Operating System • Configuration • HDFS • Service Management • Cloud (S3) • Data Management • Push • Security Hadoop/HBase Search Search Logs Indexes ©MapR Technologies - Confidential 18
  • 19.
    Easy Wins  Analyze logs from application stored in MapR  Seamlessly store search indexes in MapR – and feed to Pig, Mahout and others – use mirrors + NFS to directly deploy indexes  Snapshots make backups a snap  LucidWorks 2.5 (2013 Q1) easily connects with MapR ©MapR Technologies - Confidential 19
  • 20.
  • 21.
    Learn More  More information http://www.mapr.com/company/events/lucidworks-12-13-2012  Vote for this topic for Hadoop Summit EU: http://bit.ly/128tLQe  Talk to Ted @ted_dunning tdunning@maprtech.com  Talk to Grant @gsingers  MapR and Lucid Works http://www.mapr.com http://www.lucidworks.com ©MapR Technologies - Confidential 21

Editor's Notes

  • #4 TED: We can tighten or loosen as necessary.
  • #5 TED: I think that the agenda needs to go here because it otherwise breaks up some key flow
  • #6 TED: This is a money slide where people should say “Wow man”. They shouldn’t understand the implications of this, but they should be very, very aware that something big just slide into the room.Tech Building Block: Not just textNot just users + queriesEmbrace Fuzziness: Esp. in Big Data, it is the only way you are going to survive.TED: I think that this should make the case for advanced that is still search at its heart. The idea that search can be radically changed should be on the next slide.
  • #7 Search Abuse Can discuss how I started just doing free text, but then a curious thing happened, started to see people using the engine for things like: key/value, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage and much, much moreSearch has added more DB features over the yearsTED: We need to introduce the idea of *REVOLUTION* somewhere in here.
  • #8 All that revolution is good, but what the heck does this have to do w/ Big Data?
  • #14 GSI: needs a bit more meat
  • #16 Service-Oriented ArchitectureStatelessFailover/Fault TolerantLightweight Coordination and MessagingSmart about UpdatesDocument store isDistributedScalableAnalysisBatchNear Real-Time