Mining Solutions: A New Approach to Making the Most of Your Research Time. SLA, Strategic Technology Alliance, Seattle, 2008. Joe Buzzanga, Product Manager, Elsevier Science and Technology. June 17, 2008
Agenda Challenges and Framework for Information Retrieval (IR) Using Natural Language Processing (NLP) in IR (illumin8) Product Demo
Digital Universe: 10x bigger in 5 years. “Searching for meaning in the content of unstructured data like images, video clips, documents, and the numbers and characters in databases is the rocket science of the digital universe.” Source: IDC Whitepaper, The Diverse and Exploding Digital Universe, March 2008
Today’s Researcher? Search for Meaning?
What’s at Stake? Business Week Innovation Scorecard Amazon “Kindle”
Impact on Information Retrieval Separate the Signal from Noise Signal processing
Our Goal Make you successful through superior information retrieval tools
Framework for Information Retrieval Traditional: card catalog, periodical index… Human Index Search Simple Model Human Index Search Print Collections Surrogate Record Simple Model: single book Meta Data Content Content
Framework for Information Retrieval Human Index Search Digital Bibliographic A&I Surrogate Record Digital Index Hybrid Index Meta Data Digital bibliographic A&I Semi-structured records Content under editorial control Application of controlled terms Application of digital indexing Results need to be organized and ranked Additional access points (e.g., facets, tags) Results Content
Framework for Information Retrieval No Human Intervention Content unstructured, uncontrolled and unmeasurable Crawling is inherently imperfect Typically Keyword indexing Ranking of results becomes critical Web Search Crawl Digital Index Content Results
Content: How Big is the Web? Today: 170 million websites across all domains. 2 years ago: 80 million websites across all domains. Source: Netcraft
Content: Plumbing the Depths Source: Mills Davis, Project 10X
Content: How Big is the Web? ~10 Billion pages (2003 estimate) http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/
Crawling in the Dark
The Key in Keyword? “Keyword” is a misnomer in the context of an index. The keyword is in the mind of the searcher. Every word is indexed, since the computer is not smart enough to know which words are significant (i.e., the “key” in “keyword”). Brute-force approach, feasible with enough compute power.
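The brute-force approach on this slide can be sketched as a tiny inverted index. This is a minimal illustration, not how any particular search engine is built: the function names `build_index` and `search` and the two sample documents are assumptions made up for the example. Every word is indexed, including words such as "for" and "a" that no searcher would consider a key word.

```python
from collections import defaultdict

def build_index(docs):
    """Map every lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # Strip surrounding punctuation; no attempt to judge significance.
            index[word.strip('.,;:!?"()')].add(doc_id)
    index.pop("", None)  # drop the empty token left by bare punctuation
    return index

def search(index, query):
    """Return ids of documents containing every query word (boolean AND)."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Hypothetical sample documents, invented for this sketch.
docs = {
    1: "Compostable film for food packaging.",
    2: "A film about packaging design.",
}
index = build_index(docs)
print(search(index, "compostable film"))
print(search(index, "film packaging"))
```

The second query matches both documents even though only one is about packaging film, which is the loss of meaning the later slides address.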
Results: Mystery Equation mystery clip
Results: Facets
Research and its Discontents. 5.5 hours / week: searching and gathering information*. 4.7 hours / week: organizing, analyzing, and applying information*. *Source: 2007 survey of 6,300 knowledge workers, Outsell, Inc.
Introducing illumin8 Cut through the noise Rapid summary/overview Cross domain view Integrated content Web-based Sharing results Applies Natural Language Processing at Internet Scale!
Typical Search Current general search Get millions of documents to sift through Page 1 Page 2 Page 180,000 … compostable film There is just no way any researcher can read through all this information. It just takes too long!
Illumin8 Uses Natural Language Processing to “read” text Enter search terms Generate Organized Result Set Products Companies/Organizations Technical Approaches Results grouped into meaningful classes System generates list of solutions, not records Quickly see interesting and useful areas for investigation
Our Approach Premium Scientific Patent Web Search -Crawl -Load Semantic Index Results NLP Applied Problems, Solutions, Benefits NLP Applied Fuse, Classify, Summarize NLP Applied NLP applied throughout the system: index, query, result set Content
How does illumin8 work? Full Text Abstracts illumin8 searches on solutions. The solutions are extracted from full-text sources, abstracts, web, and patents. Internet Patents illumin8 Solution Database 1.1 billion 5 billion web pages, blogs and forums 3 million full-text scientific and technical articles from 1,800 Elsevier journals 33 million scientific records from 15,000 peer-reviewed journals & more than 4,000 publishers 21 million patents from 5 worldwide patent offices Extract and Summarize Solutions Search
A Uniform Lens (index) Across Content Sources WEB JOURNAL PATENT Summarizing information about Companies, Products, etc., for technologies that researchers care about. Organizing results from the world's most trusted scientific content and billions of web pages
Taking Search Beyond Keyword Indexing Keyword Indexing: Meaning is lost Sentence processing: Meaning is maintained Identify & classify problems, solutions and benefits Neural Network used in handwriting recognition Solution Problem
Natural Language Parsing. illumin8 Rules: Help_patterns, Succeed2, Correct_problem, treatPerson_SAVS, positively_influence, have_positive_influence, protect_sb_against_sth, Product_would_do_good, provide_sb_with_sth, Product_is_shown_to, talented_at, use_sth_to_do_sth, approve_sth, rely_on_product_to, application_is, Product_allows_sb_toVG2, ensure_protagonist, A_makes_B_good, benefit_of ... Thousands of rules, plus statistical models. Example sentence: “Capacitive deionization with carbon aerogel electrodes provides an economical and efficient method for removing salt and impurities from water.” Grammatical Role: Subject = “Capacitive deionization”; Verb = “provides”; Object = “an economical and efficient method for removing salt and impurities from water”. Role Test: Modal? Check that Verb polarity is positive; this rule would not match if the Verb were modal (i.e., only in certain cases), for example if it said “should provide … but”. Negated? Check that Subject is not negated; this rule would not match if Subject were not positive, for example if it said “no process provides an economical and efficient …”. Antagonistic? Check that Object is not antagonistic; this rule would not match if Object were, for example, “provides a costly and complicated method”. If every test answers no, continue to Role Assignment: Subject = Solution, Object = Benefit.
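The three role tests on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not illumin8's actual rule engine: it assumes a parser has already split the sentence into subject, verb group, and object, and the `Clause` type, the `apply_provides_rule` function, and the three small word lists are hypothetical stand-ins for what the slide describes as thousands of rules plus statistical models.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical word lists; a real system would need far richer resources.
MODALS = {"should", "could", "might", "may", "would"}
NEGATORS = {"no", "not", "never", "none"}
ANTAGONISTIC = {"costly", "complicated", "expensive", "inefficient"}

@dataclass
class Clause:
    subject: str
    verb: str  # verb group, e.g. "provides" or "should provide"
    obj: str

def apply_provides_rule(clause: Clause) -> Optional[dict]:
    """Assign Solution/Benefit roles if all three tests answer 'no'."""
    verb_words = clause.verb.lower().split()
    if not verb_words or verb_words[-1] not in {"provide", "provides"}:
        return None  # this rule only covers the verb "provide"
    if MODALS & set(verb_words):
        return None  # Modal? yes, so the rule does not match
    if NEGATORS & set(clause.subject.lower().split()):
        return None  # Negated? yes
    if ANTAGONISTIC & set(clause.obj.lower().split()):
        return None  # Antagonistic? yes
    return {"Solution": clause.subject, "Benefit": clause.obj}

match = apply_provides_rule(Clause(
    "Capacitive deionization with carbon aerogel electrodes",
    "provides",
    "an economical and efficient method for removing salt and impurities from water",
))
print(match)
```

Changing the verb to "should provide", the subject to "no process", or the object to "a costly and complicated method" makes the function return `None`, mirroring the three counterexamples on the slide.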
Analyzing A Sentence. “Carrier’s Infinity™ Air Purifier uses ultraviolet light to eliminate germs such as viruses, molds, bacteria, mildew and mold spores from the indoor air of homes and offices, ensuring a higher indoor air quality.” Germ [Problem] Indoor air quality [Benefit] Carrier [Organization] Infinity Air Purifier [Product] Ultraviolet light [Technology] Virus Mold Bacteria Mildew Mold spore Makes Uses Solves Provides Kind of Concepts, ideas and entities extracted from a single sentence.
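One way to hold what was extracted from this sentence is a list of subject-relation-object triples. The sketch below is an assumption about representation, not the product's data model: the entity types and relation names come from the slide, but the diagram's arrows did not survive in this transcript, so which entity each relation connects is inferred from the sentence itself. The names `TYPES`, `TRIPLES`, and `by_relation` are hypothetical.

```python
from collections import defaultdict

# Entity types as labeled on the slide.
TYPES = {
    "Carrier": "Organization",
    "Infinity Air Purifier": "Product",
    "Ultraviolet light": "Technology",
    "Germ": "Problem",
    "Indoor air quality": "Benefit",
}

# Relation endpoints are inferred from the sentence (assumption).
TRIPLES = [
    ("Carrier", "Makes", "Infinity Air Purifier"),
    ("Infinity Air Purifier", "Uses", "Ultraviolet light"),
    ("Infinity Air Purifier", "Solves", "Germ"),
    ("Infinity Air Purifier", "Provides", "Indoor air quality"),
] + [
    (kind, "Kind of", "Germ")
    for kind in ("Virus", "Mold", "Bacteria", "Mildew", "Mold spore")
]

def by_relation(triples):
    """Group (subject, object) pairs under each relation name."""
    grouped = defaultdict(list)
    for subject, relation, obj in triples:
        grouped[relation].append((subject, obj))
    return dict(grouped)

grouped = by_relation(TRIPLES)
for relation, pairs in grouped.items():
    print(relation, pairs)
```

Storing relations rather than bare words is what lets a result list be organized by products, organizations, or technical approaches instead of by matching documents.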
DEMO

SLA Summer 2008
