Open Source for Enterprise Search: Breaking Down the Barriers to Information
 

    Presentation Transcript

    • Welcome to KMWorld Magazine Sponsored Event © 2008‐2009 Lucid Imagination, Inc.
    • Moderator: Andy Moore, Publisher, KMWorld
    • Open Source for Enterprise Search: Breaking Down the Barriers to Information
    • Speakers
    • Going into enterprise search with your eyes open. Susan Feldman, Vice President, Search and Discovery Technologies, IDC. Webcast, June 23, 2009, sponsored by Lucid Imagination. Copyright 2009 IDC. Reproduction is forbidden unless authorized. All rights reserved.
    • Outline • Search defined • The searching process • Today’s search platforms • Types of search applications • Features to look for • What’s next? • Grand challenges © 2009 IDC
    • Search: The Status Quo. Is luck enough?
    • Uses of Search Today • Intranet search • Web search • Call centers • Enterprise applications like BI, ERP and CRM • eDiscovery and litigation support applications • Compliance applications • Predictive analytics • Product early warning applications • Ecommerce applications • Publishing applications • Rich media search • Web advertising platforms • Recommendation engines • Reputation and opinion monitoring applications • Social media applications • Fraud detection applications • Border security applications • Spam detection applications
    • Information Access Technology Map [Figure: information access technologies and applications, from keyword search and relevance ranking up to conversational systems and inference engines, plotted against two scales: accuracy required, and number and complexity of technologies]
    • Characteristics: Language is the right vehicle for human interaction, but it is imprecise. • Fuzzy matching • Dialogue and interaction to define the information need • Disambiguation of text using context • Linguistic patterns are predictable and computable: – Syntax for context – Dictionaries for meaning, semantics • Relevance ranking to help manage large result sets • Ad hoc searching
    • Today’s Search/Discovery Platform [Figure: platform diagram with the labels Search Engine, Query, Interface, Document, BI/Data, Language Analysis, Apps, Disambiguate, Visualize, Enrich, Cluster, Filter, Categorize, Extract]
    • SPSS Concepts and Categories
    • Types of Search Products [Figure: map of search product types. Scale labels: Single Purpose to Multipurpose, Out of the Box to Customizable, Important to Strategic. Product groups: Site Search, Email Search, Desktop Search; Search-Based Applications (Voice of Customer, eDiscovery, Reputation Monitoring); Integrated work environments; Integrated Platforms (intranet search, call centers, ecommerce; apps: BI, CRM, ERP, Finance, Inventory, Security). Feature labels: Volume, Multiple Sources, Multiple Languages, Navigation, Relevance tuning, UI Features, Text analytics, Analysis and reporting, Intelligence]
    • Features that first-time buyers look for: search features ranked by priority from our 2008 survey. 1. Relevance based search 2. Browsing and navigation (categorization) 3. Taxonomies/ontologies 4. Parametric search 5. Concept search 6. Auto tagging 7. Visualization by clustering. Source: IDC 2008
    • Experienced search buyers differ. Same ranked list: 1. Relevance based search 2. Browsing and navigation (categorization) 3. Taxonomies/ontologies 4. Parametric search 5. Concept search 6. Auto tagging 7. Visualization by clustering. But, after experience, add: • Customer service • Ease of implementation • Unified access • Usability • Auto tagging • Better search features like stemming and best bets • Security • Entity extraction • Rights management. Source: IDC 2008
    • Directions for NextGen Information Access • Integration of multiple technologies required • Integrated platforms for diverse, multiple information access requirements • Search-based apps to address specialized workflows and tasks like eDiscovery • Web scale processing • Rich media and social media add new challenges for search • Mobile search applications will explode
    • Contact Information: Susan Feldman, VP, Search and Discovery Technologies, sfeldman@idc.com
    • Ranga Muvavarirwa, Director, Product Planning & Development, Comcast Interactive Media. Search for New Business Models: Setting the requirements and choosing the technology
    • Comcast Interactive Media • Division of Comcast • Dedicated to online/cross‐platform entertainment and media businesses • Develop and grow Internet businesses with compelling technology and product innovations • Targeting broadband users, customers and non‐subscribers alike
    • Fancast.com Search: Business‐critical. Need: • Customizable • Scalable for volume, both traffic and content: 5‐6 million unique monthly users; 4 million+ records; 200,000+ assets; 9K+ hours online video; 55K+ videos; 10K+ full‐length shows; ~150K other assets (photos, tidbits, etc.); 100+ content providers • Economics: new business model, sensitive to fixed and operating costs
    • Search Use Cases • Comprehensive, relevant, up‐to‐date and authoritative – Movies, TV shows, clips, celebrities and other media info • Seamless merge of multiple, heterogeneous sources – Metadata, each with its own format and content refresh timing – Spider‐Man vs. “spiderman” • Must Have: – Accurate results in the mind of the user (example query “simpson”: Jessica ≠ Homer)
    • Architecting for Scale • Scaling metric: operationally simple, scalable and stable • Search must be as fast as anything else on the site: <20ms at peak per server instance • Data‐center operations gets simple rules for sizing: “x” users to “y” application servers • Provided linear scalability with traffic growth from 50K to ~1M peak uniques/day over 16 months • Users now visit site >1x week on average
    • Performance Test: Solr vs “X”. Open source Solr vs. leading commercial search vendor (Brand “X”) • Measured/compared query response rates at multiple load levels (range: 100 to 1500 requests/second) • Tested stress failure points: peak queries per second, failure characteristics • Vendor “X” & Lucid Imagination (Solr): tuned test bed, validated results • Solr meaningfully outperformed “X”: response rates (Solr vs. “X”), failure‐handling characteristics. TEST BED • Avalanche load generators • Multiple instances of Sun x64 multi‐core 1~2RU servers • Red Hat Linux. INDEXES • 2 million documents • 4 million documents • Each deployed on each of the competing servers
    • Do’s and Don’ts of Open Source 1. DON’T abandon structured analysis of business or technical requirements – Open source must still fit business needs – DO a bake‐off to drive your decision 2. DO ensure a fit between development culture and business objectives: – Do you integrate or develop? – Do you develop or innovate based on your data? – Do you have a source for expertise you lack?
    • Open Source: Risks & Mitigation (risk, then the Lucene/Solr answer) • (1) Sufficient community of developers? Lucene/Solr: highly active community; ad‐hoc support and answers from the online community; SLA based support from Lucid Imagination • (2) Commercial organizations “bet the company” on this? Lucene/Solr: similar businesses: CNET, Netflix; dissimilar businesses: MySpace, Orbitz • (3) Alignment with internal resources? Enables product development agility? Lucene/Solr: premium at CIM on software engineering talent; flexibility to support innovation without steep learning curves; mutually reinforcing benefits of product development culture and highly engaged human capital
    • Tom Morton, Search Architect, Comcast Interactive Media. Improving Search with Solr/Lucene: How you can use this even if you’re not in the entertainment business
    • Document Boost • Allows pages to be assigned inherent results relevancy • Boost is computed using related data, e.g., box office receipts, recency • Boost value set when indexing • Similar concept to PageRank, but set based on business rules, not just popularity
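As a rough sketch of what "boost value set when indexing" could look like in the Solr XML update format used by the 1.x releases: the `boost` attribute on `<doc>` is part of that update syntax, while the document id, field names and the value 2.5 below are hypothetical. The business rule that turns box office receipts or recency into a number is assumed to run in the indexing pipeline before this message is posted. Index-time boosts were dropped in much later Solr releases, so this applies to the Solr versions of the period.

```xml
<!-- Hypothetical update message: ids, field names and the boost value are illustrative only -->
<add>
  <!-- boost computed upstream from business data (e.g. box office receipts, recency) -->
  <doc boost="2.5">
    <field name="id">movie-0001</field>
    <field name="name">Example Movie Title</field>
    <field name="description">Example description text.</field>
  </doc>
</add>
```

Because the boost is stored with the document, changing the business rule means re-indexing the affected documents.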
    • Indexing Related Content • Allows related terms to match a query even if terms don’t need to be surfaced on a page. – Add fields and weights to XML: <str name="qf">nameExact^6.5 name^2.0 alias^1.1 related^0.5 description^0.1</str> – Similar to how web‐search indexes link terms.
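The `qf` string on this slide only takes effect inside a request handler that uses the DisMax query parser. A minimal sketch of how it might sit in solrconfig.xml follows. The handler name `/search` is hypothetical, the field names are the ones from the slide, and `related` is assumed to be a schema field filled at index time with terms from related entities.

```xml
<!-- solrconfig.xml fragment; the handler name is an assumption -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- use the DisMax parser so that qf is honored -->
    <str name="defType">dismax</str>
    <!-- field^weight pairs: exact name matches count most, related terms only a little -->
    <str name="qf">
      nameExact^6.5 name^2.0 alias^1.1 related^0.5 description^0.1
    </str>
  </lst>
</requestHandler>
```

The low weight on `related` lets a document match on a related term without outranking documents that match on their own name.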
    • Type‐ahead • A few small XML changes turn on the “type ahead” feature:
      <fieldtype name="ngramUntokenized" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory"/>
          <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="20"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory"/>
        </analyzer>
      </fieldtype>
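To use the `ngramUntokenized` type, some field in schema.xml has to be declared with it. One possible arrangement, sketched below, copies a display field into a dedicated type-ahead field. The field names `name` and `name_typeahead` are hypothetical, and a `string` field type is assumed to be defined elsewhere in the schema.

```xml
<!-- schema.xml fragment; field names are assumptions -->
<field name="name" type="string" indexed="true" stored="true"/>
<!-- searched by the type-ahead box; not stored because only "name" is displayed -->
<field name="name_typeahead" type="ngramUntokenized" indexed="true" stored="false"/>
<!-- fill the type-ahead field from "name" at index time -->
<copyField source="name" dest="name_typeahead"/>
```

At index time a value such as "Spider-Man" is kept as one token, lowercased, and expanded into its leading prefixes ("sp", "spi", "spid", and so on up to 20 characters). The query analyzer applies no n-gram filter, so a partially typed "spid" is matched as-is against those stored prefixes.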
    • Generating Content from Relationships • Using relationships to generate descriptions of search entities. – Allows description results to be displayed even if data is unavailable.
    • More Like This • Using more‐like‐this functionality to produce recommendations. – Based on relationships: movie, TV series, actor, and tag – Specify fields to use and weights in XML.
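A sketch of how "fields to use and weights in XML" might be expressed with Solr's MoreLikeThis handler. The `mlt.*` parameter names are standard Solr parameters. The handler path, the field names (`movie`, `series`, `actor`, `tag`) and the weights are hypothetical stand-ins for the relationships named on the slide.

```xml
<!-- solrconfig.xml fragment; handler path, field names and weights are assumptions -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <!-- fields whose terms are compared to find similar documents -->
    <str name="mlt.fl">movie,series,actor,tag</str>
    <!-- per-field weights, same syntax as DisMax qf; fields must also appear in mlt.fl -->
    <str name="mlt.qf">movie^2.0 series^2.0 actor^1.5 tag^1.0</str>
    <!-- boost the generated query by the relevance of each interesting term -->
    <bool name="mlt.boost">true</bool>
    <!-- relationship values typically occur once per document, so keep thresholds low -->
    <int name="mlt.mintf">1</int>
    <int name="mlt.mindf">1</int>
  </lst>
</requestHandler>
```

The fields listed in `mlt.fl` generally need to be stored or to have term vectors enabled so the handler can read their terms from the source document.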
    • Key Solr Search Strategies • Metadata holds great value for both: – Improved Relevancy • Take a broad view of “content”, not just text – Better Search Experience • Search is only as good as the users think it is • Solr/Lucene can accomplish much of this with just a dab of XML – Little real programming required
    • Q & A • Question and Answer Session • (please submit questions)
    • Archive: Please use the same URL you used to view today’s live event for the archive event. We will also be sending you a follow‐up email with that URL once the archive is posted!
    • Thank You • Thank you for participating in today’s web event. Just by attending this event you could win this TomTom GPS car navigation system. Winner to be announced June 30th.
    • Thank you