Crowd-sourced intelligence built into search over Hadoop


Presented by Ted Dunning, Chief Application Architect, MapR
& Grant Ingersoll, Chief Technology Officer, LucidWorks

Search has quickly evolved from being an extension of the data warehouse to being run as a real-time decision processing system. Search is increasingly used to gather intelligence on multi-structured data, with distributed platforms such as Hadoop working in the background. This session will detail how search engines can be abused to index not text but mathematically derived tokens, building models that implement reflected intelligence. In such a system, the intelligent or trend-setting behavior of some users is reflected back at other users. More importantly, the mathematics of evaluating these models can be hidden inside a conventional search engine like Solr, making the system easy to build and deploy. The session will describe how to integrate Apache Solr/Lucene with Hadoop. We will then show how crowd-sourced search behavior can be looped back into analysis and how constantly self-correcting models can be created and deployed. Finally, we will show how these models can respond with intelligent behavior in real time.

Published in: Education, Technology

Crowd-sourced intelligence built into search over Hadoop

  1. Crowd Sourcing Reflected Intelligence Using Search and Big Data
     Ted Dunning, MapR
     Grant Ingersoll, LucidWorks
     © MapR Technologies - Confidential
  2. Is Search Enough?
     • Keyword search is a commodity
     • A holistic view of the data and of user interactions with that data is critical
     • Search, Discovery and Analytics are the key to unlocking this view of users and data
     [Figure labels: Search; MapR and LucidWorks]
  3. Grant's Background
     • Co-founder:
       – LucidWorks – CTO
       – Apache Mahout
     • Long time Lucene/Solr committer
     • Author: Taming Text
     • Background in IR and NLP
       – Built CLIR, QA and a variety of other search-based apps
  4. Ted's Background
     • Academia, Startups
       – MapR, MusicMatch, ID Analytics, Veoh Networks
       – Big data since before big
     • Open source
       – since the dark ages before the internet
       – Mahout, Zookeeper, Drill
       – bought the beer at first HUG
     • Co-Author
       – Mahout in Action
     • MapR
       – Chief Application Architect
     • Founding member of Apache Drill
  5. Agenda
     • Intro
     • Search (R)evolution
     • Reflected Intelligence Use Cases
     • Building a Next Generation Search and Discovery Platform
       – MapR
       – LucidWorks
     • 1+1=3
  6. User Interactions With Big Data
     [Figure labels, by layer. Data: Data, Data, Data. Storage: DFS, Key Value Store, Index. Interface: Command Line, Query Language, Keyword Search. User: System Administrator, Engineer, End User.]
  7. User Interactions With Big Data
     [Same figure as the previous slide, with one added label: Reflected Intelligence.]
  8. Search (R)evolution
     • Search use leads to search abuse
       – denormalization frees your mind
       – scoring is just a sparse matrix multiply
     • Lucene/Solr evolution
       – non-free-text usages abound
       – many DB-like features
       – noSQL before NoSQL was cool
       – flexible indexing
       – Finite State Transducers FTW!
     • Scale
     • "This ain't your father's relevance anymore"
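The remark that scoring is just a sparse matrix multiply can be illustrated with a small sketch. It assumes raw term-frequency weights rather than any real Lucene similarity, and the corpus, document names and the `score` helper are all invented for illustration.

```python
from collections import defaultdict

# Toy corpus; document ids and contents are hypothetical.
docs = {
    "d1": ["search", "hadoop", "search"],
    "d2": ["hadoop", "cluster"],
    "d3": ["search", "relevance"],
}

# Sparse term-document matrix stored as an inverted index:
# term -> {doc_id: term frequency}. Only non-zero cells are kept.
index = defaultdict(dict)
for doc_id, terms in docs.items():
    for term in terms:
        index[term][doc_id] = index[term].get(doc_id, 0) + 1


def score(query_weights):
    """Multiply the sparse matrix by a sparse query vector.

    Each query term selects one row of the matrix; the document scores
    are the weighted sum of the selected rows.
    """
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += q_weight * tf
    return dict(scores)


result = score({"search": 1.0, "hadoop": 0.5})
print(sorted(result.items(), key=lambda kv: -kv[1]))
```

Only the rows for the query terms are touched, which is why an inverted index can evaluate this product quickly even when the full matrix is very large.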
  9. Search, Discovery and Analytics
     • Large-scale analysis is key to reflected intelligence
       – correlation analysis
         • based on queries, clicks, mouse tracks, even explicit feedback
         • produce clusters, trends, topics, SIPs
       – start with engineered knowledge, refine with user feedback
     • Large-scale discovery features encourage experimentation
     • Always test, always enrich!
     [Figure labels: Search, Discovery, Analytics]
  10. Social Media Analysis in Telecom
     • Goal
       – Detect flash-mob traffic events
       – Provision additional resources before failures
     • Method: Correlate mobile traffic analysis with social media analysis
       – events cause traffic micro-bursts
       – participants tweet the events ahead of time
       – tweet locations converge on burst location
     • Deploy operations faster to predict outages and better handle emergency situations
       – high cost bandwidth augmentation can be marshaled as the traffic appears
       – anticipation beats reaction
  11. Provenance is 80% of Value
     • Problem
       – Broadcasters don't know what audiences really like at a micro level
     • Method:
       – Analysis of social media to determine advertising reach and response
       – Time resolution of social traffic can provide detailed response metrics
     • Results:
       – In one case the untargeted advertising was worth 5x more when accompanied by supporting response data
  12. Claims Analysis
     • Goal
       – Insurance claims processing and analysis
       – fraud analysis
     • Method
       – Combine free text search with metadata analysis to identify high risk activities across the country
       – Integrate with corporate workflows to detect and fix outliers in customer relations
     • Results
       – Questions that took 24-48 hours now take seconds to answer
  13. Virginia Tech - Help the World
     • Grab data around crisis
     • Search immediately
     • Large-scale analysis enriches data to find ways to improve responses and understanding
  14. Can Search Catch the Bad Guys?
     • Online drug counterfeit detection
     • Identify commonly used language indicating counterfeits
       – you know it when you see it
       – and you know you have seen it
     • Feed to analyst via search-driven application
       – enrich based on analysts' feedback
  15. Veoh - Cross Recommendations
     • Cross recommendation as search
       – with search used to build cross recommendation!
     • Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
     • (Ab)use of a search engine
       – but not as a search engine for content
       – more like a search engine for behavior
  16. Recommendation Basics
     • History:

       User  Thing
       1     3
       2     4
       3     4
       2     3
       3     2
       1     1
       2     1
  17. Recommendation Basics
     • History as matrix:

            t1  t2  t3  t4
       u1   1   0   1   0
       u2   1   0   1   1
       u3   0   1   0   1

     • t1+t3 cooccur 2 times, t1+t4 once, t2+t4 once, t3+t4 once
  18. Recommendation Basics
     • Cooccurrence

            t1  t2  t3  t4
       t1   2   0   2   1
       t2   0   1   0   1
       t3   2   0   1   1
       t4   1   1   1   2

     • Contingency table for t1 and t3:

                t3  not t3
       t1       2   1
       not t1   1   1
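A minimal sketch of how the cooccurrence matrix follows from the history: group the log by user, then count every pair of things seen by the same user. The variable names are illustrative. Computed this way, the diagonal entry for t3 comes out as 2 (both u1 and u2 touched t3), where the slide shows 1; the other cells agree with the slide.

```python
from collections import defaultdict
from itertools import product

# (user, thing) pairs exactly as listed on the History slide.
history = [(1, 3), (2, 4), (3, 4), (2, 3), (3, 2), (1, 1), (2, 1)]

# Group by user to get the rows of the occurrence matrix.
items_by_user = defaultdict(set)
for user, thing in history:
    items_by_user[user].add(thing)

things = sorted({thing for _, thing in history})

# Self-join each user's items to count cooccurrences. This is the same
# as computing A^T A for the binary user-by-thing matrix A.
cooccurrence = {pair: 0 for pair in product(things, repeat=2)}
for items in items_by_user.values():
    for a, b in product(items, repeat=2):
        cooccurrence[(a, b)] += 1

matrix = [[cooccurrence[(a, b)] for b in things] for a in things]
for thing, row in zip(things, matrix):
    print(f"t{thing}", row)
```

At scale the same grouping and self-join would run as a batch job over the logs; the in-memory version only shows the arithmetic.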
  19. Spot the Anomaly
     • Root LLR is roughly like standard deviations
     • Four contingency tables and their scores, in slide order:

       A and B  not A and B  A and not B  neither   score
       13       1000         1000         100,000   0.44
       1        0            0            2         0.98
       1        0            0            10,000    2.26
       10       0            0            100,000   7.15
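A hedged sketch of the root log-likelihood ratio for a 2x2 contingency table, using the common entropy-based form of the G² statistic. The function names are illustrative, and this may not be the exact scaling used on the slide: with this form the four tables score roughly 0.90, 1.95, 4.52 and 14.3, about twice the slide's figures, while the ordering of the tables is the same.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    """Unsigned square root of the LLR, on a standard-deviation-like scale."""
    return math.sqrt(llr(k11, k12, k21, k22))


tables = [(13, 1000, 1000, 100000), (1, 0, 0, 2), (1, 0, 0, 10000), (10, 0, 0, 100000)]
for table in tables:
    print(table, round(root_llr(*table), 2))
```

The first table has large counts but little surprise, so it scores lowest; the last has a small but perfectly exclusive overlap in a large population, so it scores highest.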
  20. Threshold by Score
     • Cooccurrence

            t1  t2  t3  t4
       t1   2   0   2   1
       t2   0   1   0   1
       t3   2   0   1   1
       t4   1   1   1   2
  21. Threshold by Score
     • Indicators

            t1  t2  t3  t4
       t1   1   0   0   1
       t2   0   1   0   1
       t3   0   0   1   1
       t4   1   0   0   1

     • indicator: t2, t4
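Thresholding by score can be sketched as follows: keep only the item pairs whose anomaly score clears a cutoff and record each surviving partner as an indicator. The scores and the cutoff below are invented for illustration and are not the values behind the slide's indicator matrix.

```python
# Hypothetical anomaly scores for item pairs (for example root LLR values).
pair_scores = {
    ("t1", "t3"): 2.9,
    ("t1", "t4"): 0.4,
    ("t2", "t4"): 3.1,
    ("t3", "t4"): 0.7,
}


def indicators(scores, threshold):
    """Map each item to the sorted list of items that indicate it."""
    result = {}
    for (a, b), value in scores.items():
        if value >= threshold:
            # Cooccurrence is symmetric, so record both directions.
            result.setdefault(a, set()).add(b)
            result.setdefault(b, set()).add(a)
    return {item: sorted(others) for item, others in result.items()}


kept = indicators(pair_scores, threshold=2.0)
print(kept)
```

Raising the cutoff trades recall for precision: fewer indicators survive, but the ones that do are less likely to be coincidental cooccurrences.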
  22. Search Engine for Reflected Intelligence
     • Map-reduce "big data" part
       – Logs record user + item occurrence
       – Group by user to get rows of occurrence matrix
       – Self-join to get cooccurrence
       – Log-likelihood test to find anomalies
     • Search part
       – Anomalous cooccurrences are indicators (or use statistical scores to provide fancy boosts)
       – Indicator fields and other meta-data are indexed
       – Recommendation implemented using a single search
       – Boosts, functions, similarity also can reflect learned behavior
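The search part can be sketched with a tiny in-memory stand-in for a search engine. Each item is a document with an indicator field, and a recommendation is one query whose terms are the user's recent history. The catalog, field names and `recommend` function are hypothetical, and scoring here is a plain overlap count, not the similarity a real Solr deployment would apply.

```python
from collections import Counter

# Hypothetical item documents; 'indicators' holds the ids of items whose
# cooccurrence with this item was found to be anomalous.
catalog = {
    "t1": {"title": "Thing one", "indicators": ["t3"]},
    "t2": {"title": "Thing two", "indicators": ["t4"]},
    "t3": {"title": "Thing three", "indicators": ["t1"]},
    "t4": {"title": "Thing four", "indicators": ["t2"]},
}

# Index the indicator field: token -> ids of items carrying that token.
inverted = {}
for item_id, doc in catalog.items():
    for token in doc["indicators"]:
        inverted.setdefault(token, set()).add(item_id)


def recommend(recent_history, limit=10):
    """Run a single 'search' of the indicator field with the history as query."""
    seen = set(recent_history)
    hits = Counter()
    for token in seen:
        for item_id in inverted.get(token, ()):
            if item_id not in seen:  # do not recommend what the user already has
                hits[item_id] += 1
    return [item_id for item_id, _ in hits.most_common(limit)]


print(recommend(["t1"]))
print(recommend(["t2", "t3"]))
```

Because the query is ordinary search input, other fields and boosts could in principle be combined with it in the same request, which is the appeal of hiding the model inside the index.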
  23. What Platform Do You Need?
     • Fast, efficient, scalable search
       – bulk and near real-time indexing
       – handle billions of records with sub-second search and faceting
     • Large scale, cost effective storage and processing capabilities
     • NLP and machine learning tools that scale to enhance discovery and analysis
     • Integrated log analysis workflows that close the loop between the raw data and user interactions
  24. Reference Architecture
     [Figure labels, in extraction order:]
     • Shards 1, 2, 3 ... N
     • Search View: Documents, Users, Logs
     • Document Store
     • Analytic Services: view into numeric/historic data
     • Classification, Recommendation
     • Personalization & Machine Learning Services
     • Classification Models: in memory, replicated, multi-tenant
     • Discovery & Enrichment: clustering, classification, NLP, topic identification, search log analysis, user behavior
     • Content Acquisition: ETL, batch or near real-time
     • Access APIs
     • Data: LucidWorks Search connectors, Push
  25. MapR
     • MapR provides the technology-leading Hadoop distribution
       – full eco-system distribution
       – integrated data platform
       – complete solution for data integrity
     • MapR clusters also provide tight integration with search technologies like LucidWorks
       – integration is key for effective ops
  26. LucidWorks
     • LucidWorks provides the leading packaging of Apache Lucene and Solr
       – build your own, we support
       – founded by the most prominent Lucene/Solr experts
     • LucidWorks Search
       – "Solr++"
         • UI, REST API, MapR connectors, relevance tools, much more
     • LucidWorks Big Data
       – Big Data as a Service
       – Integrated LucidWorks Search, Hadoop, machine learning with prebuilt workflows for many of these tasks
  27. LucidWorks Big Data
     [Figure labels, in extraction order: Inputs; API; Mgmt; Search, Discovery, Analytics; Processing & Storage; Analytics Service; Document Service; Big Data; LucidWorks; Web HDFS; Admin Service; Mgmt; Data Mgmt; Provisioning, Monitoring & Configuration]
  28. Easy Wins
     • Analyze logs from application stored in MapR
     • Seamlessly store search indexes in MapR
       – and feed to Pig, Mahout and others
       – use mirrors + NFS to directly deploy indexes
     • Snapshots make backups a snap
       – A lot like version control for data
     • LucidWorks 2.5.2 easily connects with MapR
  29. 1 + 1 = 3
  30. Learn More
     • Talk to MapR and LucidWorks
       – http://www.mapr.com
       – http://www.lucidworks.com
     • Hash Tags
       – #mapr #lucenesolr #lucidworks