SLA Summer 2008


My presentation to SLA, summer 2008

Published in: Technology, Education

  1. Mining Solutions: A New Approach to Making the Most of Your Research Time. SLA, Strategic Technology Alliance, Seattle, 2008. Joe Buzzanga, Product Manager, Elsevier Science and Technology. June 17, 2008.
  2. Agenda
       • Challenges and Framework for Information Retrieval (IR)
       • Using Natural Language Processing (NLP) in IR (illumin8)
       • Product Demo
  3. Digital Universe: 10x bigger in 5 years. “Searching for meaning in the content of unstructured data like images, video clips, documents, and the numbers and characters in databases is the rocket science of the digital universe.” Source: IDC whitepaper, The Diverse and Exploding Digital Universe, March 2008.
  4. Today’s Researcher? Search for Meaning?
  5. What’s at Stake? Business Week Innovation Scorecard; Amazon “Kindle”
  6. Impact on Information Retrieval
       • Separate the Signal from Noise
       • Signal processing
  7. Our Goal
       • Make you successful through superior information retrieval tools
  8. Framework for Information Retrieval
       • Traditional: card catalog, periodical index…
       • Simple Model: single book
       [Diagram labels: Simple Model; Print Collections; Content; Meta Data; Surrogate Record; Human Index; Search]
  9. Framework for Information Retrieval
       • Digital bibliographic A&I
           • Semi-structured records
           • Content under editorial control
           • Application of controlled terms
           • Application of digital indexing
           • Results need to be organized and ranked
               • Additional access points (e.g., facets, tags…)
       [Diagram labels: Digital Bibliographic A&I; Content; Meta Data; Surrogate Record; Human Index; Digital Index; Hybrid Index; Search; Results]
  10. Framework for Information Retrieval
       • No Human Intervention
           • Content unstructured, uncontrolled and unmeasurable
           • Crawling is inherently imperfect
           • Typically keyword indexing
           • Ranking of results becomes critical
       [Diagram labels: Web; Content; Crawl; Digital Index; Search; Results]
  11. Content: How Big is the Web? Today: 170 million websites across all domains. 2 years ago: 80 million websites across all domains. Source: Netcraft.
  12. Content: Plumbing the Depths. Source: Mills Davis, Project 10X.
  13. Content: How Big is the Web?
       • ~10 billion pages (2003 estimate)
  14. Crawling in the Dark
  15. The Key in Keyword?
       • Keyword is a misnomer in the context of an index
       • Keyword is in the mind of the searcher
       • Every word is indexed, since the computer is not smart enough to know significant words (i.e., the “key” in “keyword”)
           • Brute force approach, feasible with compute power
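The brute-force point on this slide can be illustrated with a minimal sketch of an inverted index. This is not illumin8 code or any particular search engine's implementation; the function names `build_index` and `search` and the three sample documents are hypothetical, chosen only to show that every word, significant or not, ends up in the index.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"

def build_index(docs):
    """Map every lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(PUNCTUATION)].add(doc_id)
    index.pop("", None)  # drop the empty key left by punctuation-only tokens
    return index

def search(index, query):
    """Return ids of documents that contain every query word (boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, for illustration only.
docs = {
    1: "Compostable film for food packaging",
    2: "The film was reviewed by critics",
    3: "A compostable bag",
}
index = build_index(docs)
hits = search(index, "compostable film")
```

Note that words such as "the" and "a" are indexed alongside "compostable": the index has no notion of which words are "key", so that judgment stays with the searcher.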
  16. Results: Mystery Equation (mystery clip)
  17. Results: Facets
  18. Research and its Discontents
       • 5.5 hours / week: searching and gathering information*
       • 4.7 hours / week: organizing, analyzing and applying information*
       * Source: 2007 survey of 6,300 knowledge workers, Outsell, Inc.
  19. Introducing illumin8
       • Cut through the noise
       • Rapid summary/overview
       • Cross domain view
       • Integrated content
       • Web-based
       • Sharing results
       Applies Natural Language Processing at Internet Scale!
  20. Typical Search. Current general search: get millions of documents to sift through (Page 1, Page 2, … Page 180,000). Search terms shown: compostable film. There is just no way any researcher can read through all this information. It just takes too long!
  21. illumin8 uses Natural Language Processing to “read” text. Enter search terms, then generate an organized result set: Products, Companies/Organizations, Technical Approaches.
       • Results grouped into meaningful classes
       • System generates list of solutions, not records
       • Quickly see interesting and useful areas for investigation
  22. Our Approach
       Content:
       • Premium Scientific
       • Patent
       • Web
       NLP applied throughout the system: index, query, result set.
       [Diagram labels: Content; Crawl; Load; Semantic Index; Search; Results; NLP Applied (shown three times); Problems, Solutions, Benefits; Fuse, Classify, Summarize]
  23. How does illumin8 work? illumin8 searches on solutions. The solutions are extracted from full text sources, abstracts, web, and patents.
       • Internet: 5 billion web pages, blogs and forums
       • Full Text: 3 million full-text scientific and technical articles from 1,800 Elsevier journals
       • Abstracts: 33 million scientific records from 15,000 peer-reviewed journals and more than 4,000 publishers
       • Patents: 21 million patents from 5 worldwide patent offices
       illumin8 Solution Database: 1.1 billion
       [Diagram labels: Extract and Summarize Solutions; Search]
  24. A Uniform Lens (index) Across Content Sources: WEB, JOURNAL, PATENT
       • Summarizing information about Companies, Products, etc., for technologies that researchers care about
       • Organizing results from the world’s most trusted scientific content and billions of web pages
  25. Taking Search Beyond Keyword Indexing
       • Keyword Indexing
           • Meaning is lost
       • Sentence processing
           • Meaning is maintained
           • Identify & classify problems, solutions and benefits
       Example: Neural Network [Solution] used in handwriting recognition [Problem]
  26. Natural Language Parsing
       illumin8 Rules (thousands of rules, plus statistical models): Help_patterns, Succeed2, Correct_problem, treatPerson_SAVS, positively_influence, have_positive_influence, protect_sb_against_sth, Product_would_do_good, provide_sb_with_sth, Product_is_shown_to, talented_at, use_sth_to_do_sth, approve_sth, rely_on_product_to, application_is, Product_allows_sb_toVG2, ensure_protagonist, A_makes_B_good, benefit_of, ...
       Example sentence: “Capacitive deionization with carbon aerogel electrodes provides an economical and efficient method for removing salt and impurities from water.”
       Each grammatical role passes a role test (yes/no) before a role is assigned; on “no” the rule continues.
       • Verb: “provides”. Role test: Modal? Check that Verb polarity is positive; this rule would not match if the Verb were modal (i.e. only in certain cases), for example if it said “should provide … but”.
       • Subject: “Capacitive deionization”. Role test: Negated? Check that Subject is not negated; this rule would not match if Subject were not positive, for example if it said “no process provides an economical and efficient …”. Role assignment: Solution.
       • Object: “an economical and efficient method for removing salt and impurities from water”. Role test: Antagonistic? Check that Object is not antagonistic; this rule would not match if Object were, for example, “provides a costly and complicated method”. Role assignment: Benefit.
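The three role tests on this slide can be sketched as a single rule function. This is a rough illustration under stated assumptions, not illumin8's actual rule engine: the clause is assumed to be already split into subject, verb and object by a parser that is out of scope here, and the names `Clause`, `apply_provide_rule` and the three small word lists are hypothetical stand-ins for what the slide describes as thousands of rules plus statistical models.

```python
from dataclasses import dataclass

# Hypothetical word lists; a real system would use far richer resources.
MODALS = {"should", "could", "might", "may", "would"}
NEGATORS = {"no", "not", "never", "none"}
ANTAGONISTIC = {"costly", "complicated", "inefficient"}
PROVIDE_FORMS = {"provide", "provides", "provided"}

@dataclass
class Clause:
    """A clause already split into grammatical roles (parsing is assumed)."""
    subject: str
    verb: str
    obj: str

def tokens(text):
    """Lowercased whitespace tokens of a phrase."""
    return set(text.lower().split())

def apply_provide_rule(clause):
    """Return role assignments, or None if the rule does not match."""
    verb = tokens(clause.verb)
    if not verb & PROVIDE_FORMS:
        return None  # a different rule would handle other verbs
    if verb & MODALS:
        return None  # Modal? yes: only holds in certain cases
    if tokens(clause.subject) & NEGATORS:
        return None  # Negated? yes: e.g. "no process provides ..."
    if tokens(clause.obj) & ANTAGONISTIC:
        return None  # Antagonistic? yes: e.g. "a costly and complicated method"
    return {"Solution": clause.subject, "Benefit": clause.obj}

benefit = "an economical and efficient method for removing salt and impurities from water"
matched = apply_provide_rule(Clause("Capacitive deionization", "provides", benefit))
modal = apply_provide_rule(Clause("Capacitive deionization", "should provide", benefit))
negated = apply_provide_rule(Clause("no process", "provides", benefit))
antagonistic = apply_provide_rule(
    Clause("Capacitive deionization", "provides", "a costly and complicated method")
)
```

Only the first clause yields a Solution/Benefit assignment; the other three fail the modal, negation and antagonism tests respectively and return `None`.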
  27. Analyzing A Sentence
       “Carrier’s Infinity™ Air Purifier uses ultraviolet light to eliminate germs such as viruses, molds, bacteria, mildew and mold spores from the indoor air of homes and offices, ensuring a higher indoor air quality.”
       Entities: Carrier [Organization]; Infinity Air Purifier [Product]; Ultraviolet light [Technology]; Germ [Problem]; Indoor air quality [Benefit]
       Kinds of germ: Virus, Mold, Bacteria, Mildew, Mold spore
       Relations: Makes, Uses, Solves, Provides, Kind of
       Concepts, ideas and entities extracted from a single sentence.
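One way to picture the output of this slide is as typed entities plus subject-relation-object triples. The sketch below is an assumption about how such output could be represented, not illumin8's data model: the transcript lists the relation labels but not which entities each one connects, so the specific edges here are one plausible reading of the sentence, and `Triple` and `subjects_of` are hypothetical names.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "relation", "obj"])

# Entity types as labeled on the slide.
entities = {
    "Carrier": "Organization",
    "Infinity Air Purifier": "Product",
    "Ultraviolet light": "Technology",
    "Germ": "Problem",
    "Indoor air quality": "Benefit",
}

# Assumed edges: the slide names the relations, the pairing is our reading.
triples = [
    Triple("Carrier", "Makes", "Infinity Air Purifier"),
    Triple("Infinity Air Purifier", "Uses", "Ultraviolet light"),
    Triple("Infinity Air Purifier", "Solves", "Germ"),
    Triple("Infinity Air Purifier", "Provides", "Indoor air quality"),
    Triple("Virus", "Kind of", "Germ"),
    Triple("Mold", "Kind of", "Germ"),
    Triple("Bacteria", "Kind of", "Germ"),
    Triple("Mildew", "Kind of", "Germ"),
    Triple("Mold spore", "Kind of", "Germ"),
]

def subjects_of(triples, relation, obj):
    """List subjects linked to `obj` by `relation`, in insertion order."""
    return [t.subject for t in triples if t.relation == relation and t.obj == obj]

germ_kinds = subjects_of(triples, "Kind of", "Germ")
makers = subjects_of(triples, "Makes", "Infinity Air Purifier")
```

Storing relations this way lets a query for a problem such as "Germ" return the products and technologies that address it, rather than the documents that merely mention the word.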
  28. DEMO