SLA Summer 2008

Transcript

  • 1. Mining Solutions: A New Approach to Making the Most of Your Research Time. SLA, Strategic Technology Alliance, Seattle, 2008. Joe Buzzanga, Product Manager, Elsevier Science and Technology. June 17, 2008
  • 2. Agenda
    • Challenges and Framework for Information Retrieval (IR)
    • Using Natural Language Processing (NLP) in IR (illumin8)
    • Product Demo
  • 3. Digital Universe: 10x bigger in 5 years. “Searching for meaning in the content of unstructured data like images, video clips, documents, and the numbers and characters in databases is the rocket science of the digital universe.” (IDC) Source: IDC Whitepaper, The Diverse and Exploding Digital Universe, March 2008
  • 4. Today’s Researcher? Search for Meaning?
  • 5. What’s at Stake? Business Week Innovation Scorecard Amazon “Kindle”
  • 6. Impact on Information Retrieval
    • Separate the Signal from Noise
    • Signal processing
  • 7. Our Goal
    • Make you successful through superior information retrieval tools
  • 8. Framework for Information Retrieval
    • Traditional: card catalog, periodical index…
    [Diagram labels: Print Collections, Human Index, Surrogate Record, Search]
    • Simple Model: single book
    [Diagram labels: Meta Data, Content]
  • 9. Framework for Information Retrieval
    [Diagram labels: Human Index, Digital Index, Hybrid Index, Meta Data, Surrogate Record, Search]
    • Digital bibliographic A&I
      • Semi-structured records
      • Content under editorial control
      • Application of controlled terms
      • Application of digital indexing
      • Results need to be organized and ranked
        • additional access points (e.g., facets, tags)
    [Diagram labels: Results, Content]
  • 10. Framework for Information Retrieval
    • No Human Intervention
      • Content unstructured, uncontrolled and unmeasurable
      • Crawling is inherently imperfect
      • Typically keyword indexing
      • Ranking of results becomes critical
    [Diagram labels: Web Search, Content, Crawl, Digital Index, Results]
  • 11. Content: How Big is the Web?
    • Today: 170 million websites across all domains
    • 2 years ago: 80 million websites across all domains
    Source: Netcraft
  • 12. Content: Plumbing the Depths Source: Mills Davis, Project 10X
  • 13. Content: How Big is the Web?
    • ~10 Billion pages (2003 estimate)
    http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/
  • 14. Crawling in the Dark
  • 15. The Key in Keyword?
    • “Keyword” is a misnomer in the context of an index
    • The keyword is in the mind of the searcher
    • Every word is indexed, because the computer is not smart enough to know which words are significant (i.e., the “key” in “keyword”)
      • Brute-force approach, feasible with enough compute power
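The brute-force idea on this slide can be sketched as a tiny inverted index. This is only an illustration, not how any particular search engine is built: the function names, the tokenizer, and the three sample documents are all assumptions made for the example. Every word is indexed, including words like "the" that no searcher would treat as "key".

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Index every word of every document: word -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query (boolean AND)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, invented for this sketch.
docs = {
    1: "Compostable film for food packaging",
    2: "A film about the history of packaging",
    3: "Compostable bags and compostable film",
}
index = build_inverted_index(docs)
print(search(index, "compostable film"))
```

The index has no notion of which words matter. It matches on presence alone, which is why ranking becomes critical once the collection is large.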
  • 16. Results: Mystery Equation mystery clip
  • 17. Results: Facets
  • 18. Research and its Discontents
    • 5.5 hours / week: searching and gathering information *
    • 4.7 hours / week: organizing, analyzing and applying information *
    * Source: 2007 survey of 6,300 knowledge workers, Outsell, Inc.
  • 19. Introducing illumin8
    • Cut through the noise
    • Rapid summary/overview
    • Cross domain view
    • Integrated content
    • Web-based
    • Sharing results
    Applies Natural Language Processing at Internet Scale!
  • 20. Typical Search
    • Current general search (search terms: compostable film)
    • Get millions of documents to sift through: Page 1, Page 2, … Page 180,000
    • There is just no way any researcher can read through all this information. It just takes too long!
  • 21. illumin8 Uses Natural Language Processing to “read” text
    • Enter search terms
    • Generate organized result set: Products, Companies/Organizations, Technical Approaches
    • Results grouped into meaningful classes
    • System generates list of solutions, not records
    • Quickly see interesting and useful areas for investigation
  • 22. Our Approach
    • Premium Scientific
    • Patent
    • Web
    [Diagram labels: Content (crawl, load), Semantic Index, Search, Results]
    • NLP applied: problems, solutions, benefits
    • NLP applied: fuse, classify, summarize
    • NLP applied throughout the system: index, query, result set
  • 23. How does illumin8 work?
    • illumin8 searches on solutions. The solutions are extracted from full-text sources, abstracts, web, and patents
    • Internet: 5 billion web pages, blogs and forums
    • Full text: 3 million full-text scientific and technical articles from 1,800 Elsevier journals
    • Abstracts: 33 million scientific records from 15,000 peer-reviewed journals and more than 4,000 publishers
    • Patents: 21 million patents from 5 worldwide patent offices
    • illumin8 Solution Database: 1.1 billion
    • Extract and summarize solutions, then search
  • 24. A Uniform Lens (Index) Across Content Sources: Web, Journal, Patent
    • Summarizing information about Companies, Products, etc., for technologies that researchers care about
    • Organizing results from the world's most trusted scientific content and billions of web pages
  • 25. Taking Search Beyond Keyword Indexing
    • Keyword indexing
      • Meaning is lost
    • Sentence processing
      • Meaning is maintained
      • Identify & classify problems, solutions and benefits
    Example: Neural Network [Solution] used in handwriting recognition [Problem]
  • 26. Natural Language Parsing
    • illumin8 rules: thousands of rules, plus statistical models
      • Help_patterns, Succeed2, Correct_problem, treatPerson_SAVS, positively_influence, have_positive_influence, protect_sb_against_sth, Product_would_do_good, provide_sb_with_sth, Product_is_shown_to, talented_at, use_sth_to_do_sth, approve_sth, rely_on_product_to, application_is, Product_allows_sb_toVG2, ensure_protagonist, A_makes_B_good, benefit_of, ...
    • Example sentence: “Capacitive deionization with carbon aerogel electrodes provides an economical and efficient method for removing salt and impurities from water.”
    • Grammatical role and role assignment
      • Subject: Capacitive deionization [Solution]
      • Verb: provides
      • Object: an economical and efficient method for removing salt and impurities from water [Benefit]
    • Role tests (each with a yes/no outcome)
      • Modal? Check that Verb polarity is positive; this rule would not match if the Verb were modal (i.e. only in certain cases), for example if it said “should provide … but”
      • Negated? Check that Subject is not negated; this rule would not match if Subject were not positive, for example if it said “no process provides an economical and efficient …”
      • Antagonistic? Check that Object is not antagonistic; this rule would not match if Object were, for example, “provides a costly and complicated method”
    • Continue …
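The three role tests on slide 26 (Modal? Negated? Antagonistic?) can be sketched as a single rule function. This is a minimal illustration under strong assumptions, not illumin8's actual rule engine. The clause is assumed to be already parsed into subject, verb and object, the `Clause` type and `apply_provide_rule` name are hypothetical, and the antagonistic word list is invented from the slide's one counterexample.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Clause:
    """A pre-parsed clause. Producing this from raw text is out of scope here."""
    subject: str
    verb: str
    obj: str
    verb_is_modal: bool = False       # e.g. "should provide ... but"
    subject_is_negated: bool = False  # e.g. "no process provides ..."


# Hypothetical word list, taken from the slide's "costly and complicated" example.
ANTAGONISTIC_WORDS = {"costly", "complicated"}


def apply_provide_rule(clause: Clause) -> Optional[Dict[str, str]]:
    """Assign Solution/Benefit roles only if all three role tests pass."""
    if clause.verb != "provides":
        return None  # this sketch covers a single verb pattern only
    if clause.verb_is_modal:
        return None  # Modal? yes -> rule does not match
    if clause.subject_is_negated:
        return None  # Negated? yes -> rule does not match
    if any(word in ANTAGONISTIC_WORDS for word in clause.obj.lower().split()):
        return None  # Antagonistic? yes -> rule does not match
    return {"solution": clause.subject, "benefit": clause.obj}


match = apply_provide_rule(Clause(
    subject="Capacitive deionization",
    verb="provides",
    obj="an economical and efficient method for removing salt and impurities from water",
))
print(match)
```

A real system would need a parser to fill in the `Clause` fields and, per the slide, thousands of such rules plus statistical models. The sketch only shows how one rule gates its role assignment.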
  • 27. Analyzing a Sentence
    • Sentence: “Carrier’s Infinity™ Air Purifier uses ultraviolet light to eliminate germs such as viruses, molds, bacteria, mildew and mold spores from the indoor air of homes and offices, ensuring a higher indoor air quality.”
    • Extracted entities: Carrier [Organization], Infinity Air Purifier [Product], Ultraviolet light [Technology], Germ [Problem], Indoor air quality [Benefit]
    • Related terms: Virus, Mold, Bacteria, Mildew, Mold spore
    • Relation labels: Makes, Uses, Solves, Provides, Kind of
    • Concepts, ideas and entities extracted from a single sentence.
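One way to hold what slide 27 extracts from a single sentence is a list of subject-relation-object triples. The sketch below is an assumption about how the diagram's labels connect (for instance, that "Makes" links Carrier to the purifier and that "Kind of" links each germ type to Germ). The transcript lists the labels but not the edges, so the `Triple` type and helper functions are purely illustrative.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "relation", "object"])

# Assumed reading of the slide's diagram; the edges are not stated in the transcript.
triples = [
    Triple("Carrier", "Makes", "Infinity Air Purifier"),
    Triple("Infinity Air Purifier", "Uses", "Ultraviolet light"),
    Triple("Infinity Air Purifier", "Solves", "Germ"),
    Triple("Infinity Air Purifier", "Provides", "Indoor air quality"),
    Triple("Virus", "Kind of", "Germ"),
    Triple("Mold", "Kind of", "Germ"),
    Triple("Bacteria", "Kind of", "Germ"),
    Triple("Mildew", "Kind of", "Germ"),
    Triple("Mold spore", "Kind of", "Germ"),
]


def objects_of(triples, subject, relation):
    """List the objects linked from `subject` by `relation`, in insertion order."""
    return [t.object for t in triples if t.subject == subject and t.relation == relation]


def subjects_of(triples, relation, obj):
    """List the subjects linked to `obj` by `relation`, in insertion order."""
    return [t.subject for t in triples if t.relation == relation and t.object == obj]


print(objects_of(triples, "Infinity Air Purifier", "Uses"))
print(subjects_of(triples, "Kind of", "Germ"))
```

Storing relations rather than bare keywords is what lets a result set be grouped by products, organizations or problems instead of by matching documents.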
  • 28. DEMO