Enabling Exploration Through Text Analytics


    Enabling Exploration Through Text Analytics - Presentation Transcript

    1. Enabling Exploration through Text Analytics Daniel Tunkelang Chief Scientist, Endeca
    2. overview
      • information seeking tools
      • need to support exploration
      • text analytics can help
      • you can do this here and now
    3. real-world information seeking examples
      • looking for health information
      • looking for work-related information
      • reminder: search and text analytics are a means, not an end
    4. example 1: looking for health information
      • six months into my wife’s pregnancy, we discovered that she had gestational diabetes
      • how to learn more?
    5. google: the default option for most
    6. in government we trust: fda.gov
    7. maybe the private sector knows best: webmd
    8. success – and a sticky site
    9. example 2: looking for work-related information
      • need to ramp up summer interns on text mining
      • how to find a good book?
    10. let’s try google again
    11. google: the gateway to wikipedia?
    12. the library of congress (loc.gov)
    13. triangle research libraries: next-gen catalog
    14. faceted search enables query refinement
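
As a rough illustration of how faceted search supports query refinement, here is a minimal sketch. The records, the field names (`subject`, `format`), and the helpers `facet_counts` and `refine` are all hypothetical; this is not the catalog's actual implementation.

```python
from collections import Counter

# Hypothetical catalog records; titles and facet values are illustrative only.
RECORDS = [
    {"title": "A", "subject": "text mining", "format": "book"},
    {"title": "B", "subject": "text mining", "format": "ebook"},
    {"title": "C", "subject": "data mining", "format": "book"},
    {"title": "D", "subject": "text mining", "format": "book"},
]

def facet_counts(records, facet):
    """Count how many records carry each value of one facet."""
    return Counter(r[facet] for r in records if facet in r)

def refine(records, **selections):
    """Keep only the records that match every selected facet value."""
    return [r for r in records
            if all(r.get(facet) == value for facet, value in selections.items())]

# Show the user the available subjects, then narrow by one of them.
subjects = facet_counts(RECORDS, "subject")
narrowed = refine(RECORDS, subject="text mining")
formats = facet_counts(narrowed, "format")
```

After each refinement the counts are recomputed over the narrowed result set, so the user always sees which further choices still lead somewhere.
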
    15. take-away #1
      • exploratory search support: a must-have for many information needs
    16. text analytics
      • categorization
      • named entity detection
      • term extraction
      • sentiment analysis
      • vague term, lots of see-alsos
        • text mining
        • information extraction
        • content enrichment
    17. newssift: text analytics enabling exploration
      • categorization
      • named entity detection
      • term extraction
      • sentiment analysis
    18. exploring the news about facebook
    19. facebook: the good
      • Social Utility
      • Iphone Application
    20. facebook: the bad
      • Criminal Behavior
      • Litigation And Settlement
    21. take-away #2
      • text analytics enable exploratory search
    22. text analytics is here and now
    23. lots of off-the-shelf options and more!
    24. caveats
      • rule-based techniques are domain-specific
      • statistical techniques rely on trained models
      • plan for errors, inconsistency
      • document vs. corpus analysis
    25. problems with entity extraction
      • moderate precision, but low recall
      • not just noisy, but inconsistent
      • corpus analysis can help!
      repeated on the slide: Arrest (1), Asia (1), ALTOONA, PA (1), Abe Lincoln (1), Bob Dole (1), Boston Tea Party (1), Abraham Lincoln (1)

      Organization                           Location            Person
      Budweiser (1)                          Australia (1)       Adlai Stephenson (1)
      Boston Tea Party (1)                   Austin, Texas (1)   Abraham Lincoln (1)
      Boston Globe (1)                       Austin (1)          Abe Weiss (1)
      Bocuse d’Or World Cuisine Contest (1)  Atlanta (2)         Abe Lincoln (1)
      Bob Dole (1)                           Asia (1)            Abbie Hoffman (1)
      Bloomberg LP (3)                       Arrest (1)          Aaron Sorkin (1)
      BioDiversity Research Institute (1)    Arlington, Va. (2)  ARYE BARAK (1)
      Big Apple Companies (1)                Arkansas (7)        ANTONIN SCALIA (1)
      Bear Stearns (2)                       Arizona (11)        ANTHONY MWANGI (1)
      Bad News Bears (1)                     Argentina (1)       ANDREW LLOYD WEBBER (1)
      Australian Liberal Party (1)           Appalachia (1)      ANDERS ERICSSON (1)
      Arianna Huffington (1)                 Americas (17)       AMY WINEHOUSE (1)
      Arctic National Wildlife Refuge (1)    Allegheny (1)       AMANDA MARCOTTE (1)
      Apple (1)                              Alaska (3)          ALI HASSAN AL (1)
      American Airlines Inc. (1)             Akihabara (1)       ALEX TREBEK (1)
      Amazon.com Inc. (1)                    Africa (5)          AL GORE (1)
      Air Force (1)                          Afghanistan (7)     ABDULRAHMAN ABDULLAH (1)
      ABC News Inc. (1)                      ALTOONA, PA (1)     ABDUL-KARIM KHALAF (1)
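
One way corpus analysis can help with inconsistent per-document extraction is to pool mentions across documents and let the majority decide each entity's type. The sketch below assumes mentions arrive as `(name, type)` pairs; the sample mentions and the helpers `normalize` and `consolidate` are hypothetical, and merging aliases such as "Abe Lincoln" and "Abraham Lincoln" would need extra resources (for example an authority file) that are not shown here.

```python
from collections import Counter, defaultdict

def normalize(name):
    """Fold case and collapse whitespace so 'AL GORE' and 'Al Gore' share one key."""
    return " ".join(name.lower().split())

def consolidate(mentions):
    """Merge per-document (name, type) mentions into one corpus-level entry per name.

    Each entity receives the type assigned to it most often across the corpus,
    together with its total number of mentions under any type.
    """
    votes = defaultdict(Counter)
    for name, entity_type in mentions:
        votes[normalize(name)][entity_type] += 1
    return {
        key: (types.most_common(1)[0][0], sum(types.values()))
        for key, types in votes.items()
    }

# Hypothetical extractor output: one document mislabels a person as an organization.
mentions = [
    ("Bob Dole", "Organization"),
    ("BOB DOLE", "Person"),
    ("Bob Dole", "Person"),
    ("Austin", "Location"),
]
entities = consolidate(mentions)
```

Here the single "Organization" label for Bob Dole is outvoted, and the case variants collapse into one entry with a combined count.
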
    26. look for ways to cheat! (figure: recall vs. precision)
    27. division of labor
      • people supply vocabulary
      • machine annotates documents
      • image: http://www.precolumbianwomen.com/images/inca-labor.10.gif
    28. example: ACM digital library
      • opportunity
        • repository of (sometimes) author-tagged documents
        • high-precision tags: very few false positives
      • challenge
        • poor reuse of vocabulary: most tags unique
        • low-recall tags: 90% false negatives
      • as is, tags were not useful for exploration
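
To make the precision and recall figures concrete, here is a small sketch using the standard definitions (precision = true positives / assigned tags, recall = true positives / applicable tags). The tag sets are invented for illustration and are not taken from the ACM data.

```python
def precision_recall(assigned, relevant):
    """Return (precision, recall) for a set of assigned tags against the relevant tags."""
    assigned, relevant = set(assigned), set(relevant)
    true_positives = len(assigned & relevant)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document: ten tags would apply, but the author supplied only one.
relevant_tags = {f"tag{i}" for i in range(10)}
author_tags = {"tag0"}
precision, recall = precision_recall(author_tags, relevant_tags)
```

The one supplied tag is correct, so precision is perfect, yet nine of the ten applicable tags are missing: the 90% false-negative situation described on the slide.
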
    29. solution
      • bootstrap on author-supplied tags
      • prune 600K+ tags to 10K by
        • imposing frequency threshold
        • normalizing by case and singular/plural
        • eliminating infrequent subphrases
      • mine documents using resulting vocabulary
      • manually validate most frequently assigned tags
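
The pruning steps above can be sketched roughly as follows. The slides do not give the actual rules, so everything here is an assumption: the singularization is a crude trailing-"s" strip, the threshold `min_count` is arbitrary, and "eliminating infrequent subphrases" is interpreted as dropping a tag that is contained in a longer kept tag occurring at least as often.

```python
from collections import Counter

def normalize(tag):
    """Lowercase a tag and crudely singularize its last word (assumption: strip one trailing 's')."""
    words = tag.lower().split()
    if not words:
        return ""
    last = words[-1]
    if len(last) > 3 and last.endswith("s") and not last.endswith("ss"):
        words[-1] = last[:-1]
    return " ".join(words)

def is_subphrase(short, long):
    """True if the words of `short` appear contiguously inside the longer phrase `long`."""
    s, l = short.split(), long.split()
    return len(s) < len(l) and any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

def prune(raw_tags, min_count=2):
    """Normalize tags, apply a frequency threshold, then drop infrequent subphrases."""
    counts = Counter(normalize(t) for t in raw_tags)
    counts.pop("", None)  # ignore empty tags
    frequent = {t: c for t, c in counts.items() if c >= min_count}
    return {
        t: c for t, c in frequent.items()
        if not any(is_subphrase(t, other) and c <= other_count
                   for other, other_count in frequent.items())
    }

# Hypothetical author-supplied tags.
raw = ["Neural Networks", "neural network", "Neural network",
       "network", "network", "Graph Drawing", "graph drawing", "faceted search"]
vocabulary = prune(raw)
```

A real pipeline would likely use a proper stemmer or lemmatizer for the singular/plural step; the strip-an-"s" rule is only a stand-in.
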
    30. example: a search for boeing
    31. it’s a HITS!
    32. if you prefer sports to computer science
      • no author-supplied tags
      • use search logs instead
      • supplement with authority files
        • team names
        • player names
      • mine documents using resulting vocabulary
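
A minimal sketch of this recipe, assuming the search log is a list of query strings and the authority files are lists of names. The helpers `build_vocabulary` and `annotate`, the threshold `min_queries`, and all sample data are hypothetical.

```python
import re
from collections import Counter

def build_vocabulary(queries, authority_terms, min_queries=2):
    """Union of frequently issued queries and curated authority terms, lowercased."""
    counts = Counter(" ".join(q.lower().split()) for q in queries)
    frequent = {q for q, c in counts.items() if q and c >= min_queries}
    return frequent | {t.lower() for t in authority_terms}

def annotate(text, vocabulary):
    """Return the vocabulary terms that occur in `text` as whole words or phrases."""
    lowered = text.lower()
    return sorted(
        term for term in vocabulary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )

# Hypothetical search log and authority file.
queries = ["Roger Clemens", "roger clemens", "steroids", "yankees tickets"]
teams = ["Red Sox", "Yankees"]
vocabulary = build_vocabulary(queries, teams)
tags = annotate("Roger Clemens pitched for the Red Sox and later the Yankees.", vocabulary)
```

People supply the vocabulary (through their queries and the curated lists) and the machine only does the matching, which keeps the annotation step simple and consistent.
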
    33. roger clemens, then and now
    34. pivoting to a different view
    35. take-away #3
      • this is not vaporware; text analytics to enable exploration is available here and now
    36. looking forward
      • better tags are the beginning, not the end
      • improve with manual and automatic processing
      • give users control over precision / recall trade-off
      • help users and content creators help you
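
One possible way to give users control over the precision / recall trade-off is to expose a confidence threshold. This sketch assumes the annotator produces a confidence score per tag, which the slides do not state; the scores and the helper `filter_tags` are hypothetical.

```python
def filter_tags(scored_tags, min_confidence):
    """Keep the tags whose confidence meets the user-chosen threshold."""
    return [tag for tag, score in scored_tags if score >= min_confidence]

# Hypothetical annotator output: (tag, confidence) pairs for one document.
scored = [("information retrieval", 0.95), ("faceted search", 0.7), ("boeing", 0.3)]

strict = filter_tags(scored, 0.9)   # favors precision: fewer, safer tags
relaxed = filter_tags(scored, 0.5)  # favors recall: more tags, more noise
```

Raising the threshold trades recall for precision, and lowering it does the opposite, so a slider in the interface could map directly onto `min_confidence`.
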
    37. in closing
      • exploratory search = must-have, not nice-to-have
      • text analytics are a key enabler
      • the technology is real, here, and now
    38. thank you…and come to SIGIR!
      • communication 1.0
      • email: [email_address]
      • communication 2.0
      • blog: http://thenoisychannel.com
      • twitter: http://twitter.com/dtunkelang
      • SIGIR: July 19-23 in Boston
      • Industry Track on July 22nd!

    Daniel Tunkelang, 5 months ago

    Enterprises are awash in textual documents that rep…

    © All Rights Reserved