DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION

Session Description: Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond the basics of simple batch processing jobs. In many cases, one needs both ad hoc, real-time access to the content as well as the ability to discover interesting information based on a variety of features such as recommendations, summaries and other insights. In this talk, we'll discuss real world use cases across several industries as well as how to effectively leverage open source tools like Hadoop, Solr, Mahout and others to better enable user access to big data.

  • How do you gain insight?
      • The search box is the UI for data these days
      • Feed back improvements into the system for users
      • Extract key metrics for business understanding
  • Challenges
      • Many of these are intense calculations or iterative
      • Many are subjective and require a lot of experimentation
  • Single nucleotide polymorphisms (SNPs) are used as markers in linkage and association studies to detect which regions in the human genome may be involved in disease. SNP studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions. Each substitution has the potential to affect protein function. SIFT (Sorting Intolerant From Tolerant) is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study.
  • Make into images?
  • Search
  • Storage and processing
  • Experiment management
  • Tools: NLP, statistical analysis
  • Scalable
  • Low cost
  • Production monitoring
  • Provisioning
  • Bulk and near real-time
  • Handle volume in sub-second processing
  • Solr takes care of leader election, etc., so no more master/slave
      • 1 second (default) soft commits for NRT updates
      • 1 minute (default) hard commits (no searcher reopen)
      • Transaction logs for recovery
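
The soft commit, hard commit and transaction log behaviour in the note above can be pictured with a small toy model. This is only a sketch in plain Python, not Solr code: the `ToyIndex` class and all of its method names are hypothetical, and a real index handles segments, searchers and log rotation very differently.

```python
class ToyIndex:
    """Toy model of soft vs. hard commits. Hypothetical, not Solr code."""

    def __init__(self):
        self.durable = {}   # documents flushed by a hard commit
        self.pending = {}   # documents added since the last hard commit
        self.visible = {}   # the view that searches currently see
        self.tlog = []      # transaction log of updates not yet flushed

    def add(self, doc_id, text):
        # Log the update first so it can be replayed after a crash.
        self.tlog.append((doc_id, text))
        self.pending[doc_id] = text

    def soft_commit(self):
        # Cheap: refresh the search view so pending documents become visible.
        self.visible = {**self.durable, **self.pending}

    def hard_commit(self):
        # Durable flush; the search view is deliberately left untouched.
        self.durable.update(self.pending)
        self.pending.clear()
        self.tlog.clear()  # in this toy, flushed updates need no replay

    def search(self, term):
        return sorted(d for d, t in self.visible.items() if term in t.split())

    @classmethod
    def recover(cls, durable, tlog):
        # Rebuild from the last durable state plus the logged updates.
        index = cls()
        index.durable = dict(durable)
        for doc_id, text in tlog:
            index.add(doc_id, text)
        index.soft_commit()
        return index


idx = ToyIndex()
idx.add("1", "shuttle endeavour")
before = idx.search("shuttle")      # nothing visible before a soft commit
idx.soft_commit()
after = idx.search("shuttle")       # visible without any durable flush
idx.add("2", "shuttle bay area")
recovered = ToyIndex.recover(idx.durable, idx.tlog)
```

The point of the split is that visibility (soft commit) and durability (hard commit) are paid for separately, while the log covers the gap between them.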

Presentation Transcript

  • Large Scale Search, Discovery and Analysis in Action
      Ivan Provalov, Research Engineer, Office of the Chief Scientist
      September 25, 2012
      Confidential © Copyright 2012
  • User Interactions With Big Data
      [Diagram labels, by row: System Administrator / Command Line / Data / DFS; Engineer / Query Language / Data / Key Value Store; End User / Keyword Search / Data / Index]
      Confidential and Proprietary © 2012 LucidWorks
  • Is Search Enough?
      • Keyword search is a commodity (example query on the slide: "endeavour shuttle bay area")
      • Holistic view of the data and the user interactions with that data
      • Search, Discovery and Analytics are the key to unlocking this view of users and data
  • Why Search, Discovery and Analytics?
      • User Needs
          - real-time, ad hoc access to content
          - aggressive prioritization based on importance
          - serendipity
          - feedback/learning from past
      • Business Needs
          - deeper insight into users
          - leverage existing internal knowledge
          - cost effective
  • Topics
      • Background and needs
      • Architecture
      • Search, Discovery and Analytics in action
      • Road map
      • Wrap up
  • Search
      • Performance
      • Real time
      • Relevance and importance
      • Presenting results
      • Experiment management
  • Discovery
      • Content clustering
      • Discovering near duplicate documents
      • Finding 'dark data'
      • Making recommendations
      • Uncovering trends
      • Recognizing topics
      • More like this
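
As one illustration of the discovery tasks listed above, near duplicate documents can be found by comparing sets of word shingles with Jaccard similarity. This is a minimal single-machine sketch, not the method used in the talk: the function names, the shingle size and the threshold are all assumptions, and the all-pairs loop would not scale to large collections.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Compare every pair of documents and keep pairs above the threshold."""
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sets)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            score = jaccard(sets[left], sets[right])
            if score >= threshold:
                pairs.append((left, right, round(score, 3)))
    return pairs


# Illustrative documents only.
docs = {
    "a": "the shuttle endeavour flew over the bay area today",
    "b": "the shuttle endeavour flew over the bay area yesterday",
    "c": "hadoop runs batch jobs over large log files",
}
pairs = near_duplicates(docs)  # [("a", "b", 0.75)]
```

Documents "a" and "b" share six of their eight distinct shingles, so they score 0.75; "c" shares none and is not reported.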
  • Analytics
      • Term frequency
      • Facets
      • Click analysis
      • Relevancy metrics
      • Zero results queries
      • Hot spots
      • Statistically interesting phrases
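
Several of the analytics listed above can be computed from a query log. The sketch below assumes a hypothetical log format (a list of dicts with `query`, `hits` and `clicked_rank` keys) and uses mean reciprocal rank as one example of a relevancy metric; neither the format nor the metric choice comes from the talk.

```python
from collections import Counter

# Hypothetical query log: one entry per search request.
query_log = [
    {"query": "solr cloud", "hits": 120, "clicked_rank": 1},
    {"query": "mahout kmeans", "hits": 45, "clicked_rank": 4},
    {"query": "hadop tuning", "hits": 0, "clicked_rank": None},
    {"query": "solr cloud", "hits": 120, "clicked_rank": 2},
    {"query": "hadop tuning", "hits": 0, "clicked_rank": None},
]


def term_frequency(log):
    """Count how often each query term appears across the log."""
    return Counter(term for entry in log for term in entry["query"].split())


def zero_result_queries(log):
    """Count queries that returned nothing, e.g. misspellings or content gaps."""
    return Counter(entry["query"] for entry in log if entry["hits"] == 0)


def mean_reciprocal_rank(log):
    """Average of 1/rank of the clicked result; requests with no click score 0."""
    if not log:
        return 0.0
    total = sum(1.0 / e["clicked_rank"] for e in log if e["clicked_rank"])
    return total / len(log)


terms = term_frequency(query_log)
zero = zero_result_queries(query_log)
mrr = mean_reciprocal_rank(query_log)  # (1 + 1/4 + 1/2) / 5 = 0.35
```

Here the zero-result counter surfaces the misspelled "hadop tuning" twice, which is the kind of signal that can be fed back into spelling suggestions or content acquisition.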
  • Some Use Cases
      • Video streaming
          - classification
          - recommendations
      • Financial, transportation, telecommunications
          - fraud detection
      • Social media
          - trend monitoring
      • Information technology
          - logs monitoring
      • Healthcare
          - identifying patients for clinical studies
  • In Focus: Personalized Medicine
      [Diagram labels: Patient DNA; Alignment and other analysis; Genetic Variations; Standard Therapies; Alternative Therapies; Search and Faceting]
  • In Focus: Log Processing in Telecommunications
      • Each year, large sums of money are lost due to fraudulent calls and poor service
      • Logs are usually semi-structured and contain vital information about errors and fraud
      • Deeper batch analytics can provide insight into patterns across vast amounts of data
      • Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities
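
To make the log processing case above concrete, here is a minimal sketch that parses semi-structured call records and flags callers with many very short calls to distinct numbers. The log format, the field names and the heuristic itself are all hypothetical, chosen only for illustration; a real fraud model would be far richer.

```python
import re
from collections import defaultdict

# Assumed format: "<timestamp> key=value key=value ..." (hypothetical).
FIELD = re.compile(r"(\w+)=(\S+)")


def parse(line):
    """Split a log line into a timestamp plus its key=value fields."""
    timestamp, _, rest = line.partition(" ")
    record = dict(FIELD.findall(rest))
    record["timestamp"] = timestamp
    return record


def suspicious_callers(lines, max_duration=5, min_calls=3):
    """Flag callers with short calls to at least min_calls distinct callees."""
    short_calls = defaultdict(set)
    for line in lines:
        rec = parse(line)
        duration = rec.get("duration", "")
        if "caller" not in rec or not duration.isdigit():
            continue  # skip lines that do not look like call records
        if int(duration) <= max_duration:
            short_calls[rec["caller"]].add(rec.get("callee"))
    return sorted(c for c, callees in short_calls.items()
                  if len(callees) >= min_calls)


# Made-up records using fictitious 555 numbers.
lines = [
    "2012-09-25T10:00:01 caller=4155550100 callee=2125550111 duration=3 status=OK",
    "2012-09-25T10:00:09 caller=4155550100 callee=2125550112 duration=2 status=OK",
    "2012-09-25T10:00:15 caller=4155550100 callee=2125550113 duration=4 status=DROPPED",
    "2012-09-25T10:01:00 caller=4155550200 callee=2125550111 duration=340 status=OK",
    "garbled line without fields",
]
flagged = suspicious_callers(lines)
```

The same parsed records could be indexed for search, so that an analyst can pull up every call for a flagged number.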
  • What Does a Search, Discovery and Analytics Platform Need?
      • Fast, efficient, scalable search
          - bulk and near real time indexing
          - handle billions of records with sub-second search and faceting
      • Large scale, cost effective storage and processing capabilities
          - need whole data consumption and analysis
          - experimentation/sampling tools
      • NLP and machine learning tools that scale to enhance discovery and analysis
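
Search with faceting, as named in the requirements above, can be sketched with a tiny in-memory inverted index. The `TinyIndex` class, its methods and the `industry` field are hypothetical and exist only to show the idea; this is not how Lucene/Solr store or count facets.

```python
from collections import Counter, defaultdict


class TinyIndex:
    """Minimal inverted index with facet counts. Hypothetical, in-memory only."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.fields = {}                  # doc id -> stored field values

    def add(self, doc_id, text, **fields):
        self.fields[doc_id] = fields
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query, facet_field):
        """Return ids matching all query terms and counts per facet value."""
        terms = query.lower().split()
        if not terms:
            return [], Counter()
        matches = set.intersection(
            *(self.postings.get(term, set()) for term in terms)
        )
        facets = Counter(
            self.fields[d][facet_field]
            for d in matches
            if facet_field in self.fields[d]
        )
        return sorted(matches), facets


# Illustrative documents only.
index = TinyIndex()
index.add(1, "fraud detection in call logs", industry="telecom")
index.add(2, "fraud detection for card payments", industry="financial")
index.add(3, "trend monitoring for brands", industry="social media")
ids, facets = index.search("fraud detection", "industry")
```

The facet counts summarize the matching set (here one telecom and one financial document), which is what lets a user narrow results without issuing a new free-text query.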
  • Building a Search, Discovery and Analytics Platform
      [Diagram labels: API; Inputs (Bulk & Real Time); Search, Discovery, Analytics; Management; Processing & Storage; Provisioning, Monitoring & Configuration]
  • LucidWorks Big Data
      [Diagram labels, built up over several slides: API; Big Data; LucidWorks; WebHDFS; Inputs; Search, Discovery, Analytics; Analytics Service; Document Service; Admin Service; Mgmt; Data Mgmt; Processing & Storage; Provisioning, Monitoring & Configuration]
  • Components - LucidWorks Search (Component: Benefit)
      • LucidWorks Search (2.1.1): Lucene/Solr 4.0-dev, sharded with SolrCloud, near-real time indexing, transaction logs for recovery.
          - connector framework
          - security
          - user click framework
          - business process integration
          - administration
  • Components - Hadoop (Component: Benefit)
      • Apache Hadoop (1.0.3): Distributed computing and processing for ETL and analytics jobs.
      • Apache HBase (0.92): Key-value store allowing fast access to the data.
      • Apache Oozie (modified 3.2): Workflow orchestration.
  • Components - Analysis/ML/NLP (Component: Benefit)
      • Apache Mahout (trunk): Distributed machine learning processing framework.
          - k-means clustering
          - statistically interesting phrases
          - similar documents
          - classification
      • Apache UIMA (2.4.0): Text processing and annotations.
      • Apache OpenNLP (1.5.2): Machine learning toolkit for natural language processing.
          - named entity extraction
      • Behemoth (modified trunk): Makes M/R data extraction easier and abstracts annotation frameworks.
      • Apache Pig (0.9.2): Helps with writing analytics M/R programs.
          - ETL
          - log analysis
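
For readers unfamiliar with the k-means clustering listed under Mahout above, the core loop is small. This is a single-machine sketch of the standard algorithm in plain Python, not Mahout's distributed implementation; the function names, the choice of the first `k` points as starting centroids and the sample points are assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    # Assumption: seed with the first k points to keep the example deterministic.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Two obvious groups, made up for the example.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
assignment, centroids = kmeans(points, k=2)
```

Each iteration is a full pass over the data, which is why the algorithm maps naturally onto repeated M/R jobs when the collection is too large for one machine.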
  • Components - Middleware (Component: Benefit)
      • Apache ZooKeeper (3.4.3): Service discovery.
          - Netflix Curator
      • Apache Kafka (0.7): Logs consumption and event-based real-time document processing framework.
  • Components - SDA Engine
      • RESTful services (Restlet 2.1)
      • ZooKeeper + Netflix Curator
      • Authentication and authorization
      • Proxies for LucidWorks and WebHDFS API
      • Workflow engine
  • Road Map
      • Analytics themes
          - relevance
          - data quality
          - discovery
          - integration with other packages (R)
      • Machine learning
          - NLP
          - recommendations
      • Experiment management
  • Conclusions
      • Search, Discovery and Analytics, when combined into a single, integrated system, provide powerful insight into both your content and your users
      • LucidWorks has combined many of these things into LucidWorks Big Data
  • LucidWorks Big Data
      • Unified development platform for Big Data applications
      • Integrated open source stack: Lucene/Solr, Hadoop, Mahout, NLP
      • Single, uniform REST API
      • Pre-tuned by open source industry experts
      • Out of the box provisioning
          - hosted or on premise
  • Search | Discover | Analyze
      www.lucidworks.com/bigdata
      ivan.provalov@lucidworks.com
      @iprovalov