Enabling Exploration Through Text Analytics

Enterprises are awash in textual documents that represent valuable information assets. The limited access afforded by conventional search interfaces, however, prevents enterprises from unlocking this value.

* An expert guide to how richer interfaces enable exploration and discovery, and how these interfaces typically rely on content enrichment techniques that can be unreliable, labor-intensive, or both. It is essential to maximize the effectiveness of content enrichment, not only to achieve the desired value but also to give organizations an incentive to make the necessary investment.
* Useful insight into content enrichment approaches that have demonstrated success in supporting exploration and discovery.
* Insight into both the enrichment techniques and the ways they are used to enable exploratory search.

Daniel Tunkelang, Chief Scientist, Endeca

Published in: Technology, Education

Transcript of "Enabling Exploration Through Text Analytics"

  1. Enabling Exploration through Text Analytics. Daniel Tunkelang, Chief Scientist, Endeca
  2. overview
      * information seeking tools need to support exploration
      * text analytics can help
      * you can do this here and now
  3. real-world information seeking examples
      * looking for health information
      * looking for work-related information
      * reminder: search and text analytics are a means, not an end
  4. example 1: looking for health information
      * six months into my wife’s pregnancy, we discovered that she had gestational diabetes
      * how to learn more?
  5. google: the default option for most
  6. in government we trust: fda.gov
  7. maybe the private sector knows best: webmd (powered by [logo])
  8. success – and a sticky site (powered by [logo])
  9. example 2: looking for work-related information
      * need to ramp up summer interns on text mining
      * how to find a good book?
  10. let’s try google again
  11. google: the gateway to wikipedia?
  12. the library of congress (loc.gov)
  13. triangle research libraries: next-gen catalog (powered by [logo])
  14. faceted search enables query refinement (powered by [logo])
  15. take-away #1
      * exploratory search support: a must-have for many information needs
  16. text analytics
      * categorization
      * named entity detection
      * term extraction
      * sentiment analysis
      * vague term, lots of see-alsos
      * text mining
      * information extraction
      * content enrichment
  17. newssift: text analytics enabling exploration (powered by [logo])
      * categorization
      * named entity detection
      * term extraction
      * sentiment analysis
  18. exploring the news about facebook (powered by [logo])
  19. facebook: the good (powered by [logo])
      * Social Utility
      * Iphone Application
  20. facebook: the bad (powered by [logo])
      * Criminal Behavior
      * Litigation And Settlement
  21. take-away #2
      * text analytics enable exploratory search
  22. text analytics is here and now ???
  23. lots of off-the-shelf options and more!
  24. caveats
      * rule-based techniques are domain-specific
      * statistical techniques rely on trained models
      * plan for errors, inconsistency
      * document vs. corpus analysis
  25. problems with entity extraction
      * moderate precision, but low recall
      * not just noisy, but inconsistent
      * corpus analysis can help!
      * sample of extracted entities with counts, shown on the slide under the column headings Organization, Location, Person (the flattened transcript does not preserve which column each entry belongs to): Arrest (1); Asia (1); ALTOONA, PA (1); Abe Lincoln (1); Bob Dole (1); Boston Tea Party (1); Abraham Lincoln (1); Budweiser (1); Australia (1); Adlai Stephenson (1); Boston Tea Party (1); Austin, Texas (1); Abraham Lincoln (1); Boston Globe (1); Austin (1); Abe Weiss (1); Bocuse d’Or World Cuisine Contest (1); Atlanta (2); Abe Lincoln (1); Bob Dole (1); Asia (1); Abbie Hoffman (1); Bloomberg LP (3); Arrest (1); Aaron Sorkin (1); BioDiversity Research Institute (1); Arlington, Va. (2); ARYE BARAK (1); Big Apple Companies (1); Arkansas (7); ANTONIN SCALIA (1); Bear Stearns (2); Arizona (11); ANTHONY MWANGI (1); Bad News Bears (1); Argentina (1); ANDREW LLOYD WEBBER (1); Australian Liberal Party (1); Appalachia (1); ANDERS ERICSSON (1); Arianna Huffington (1); Americas (17); AMY WINEHOUSE (1); Arctic National Wildlife Refuge (1); Allegheny (1); AMANDA MARCOTTE (1); Apple (1); Alaska (3); ALI HASSAN AL (1); American Airlines Inc. (1); Akihabara (1); ALEX TREBEK (1); Amazon.com Inc. (1); Africa (5); AL GORE (1); Air Force (1); Afghanistan (7); ABDULRAHMAN ABDULLAH (1); ABC News Inc. (1); ALTOONA, PA (1); ABDUL-KARIM KHALAF (1)
  26. look for ways to cheat! (labels: recall, precision)
  27. division of labor
      * people supply vocabulary
      * machine annotates documents
      * image: http://www.precolumbianwomen.com/images/inca-labor.10.gif
  28. example: ACM digital library
      * opportunity
        * repository of (sometimes) author-tagged documents
        * high-precision tags: very few false positives
      * challenge
        * poor reuse of vocabulary: most tags unique
        * low-recall tags: 90% false negatives
      * as is, tags were not useful for exploration
  29. solution
      * bootstrap on author-supplied tags
      * prune 600K+ tags to 10K by
        * imposing frequency threshold
        * normalizing by case and singular/plural
        * eliminating infrequent subphrases
      * mine documents using resulting vocabulary
      * manually validate most frequently assigned tags
  30. example: a search for boeing (powered by [logo])
  31. it’s a HITS!
  32. if you prefer sports to computer science
      * no author-supplied tags
      * use search logs instead
      * supplement with authority files
        * team names
        * player names
      * mine documents using resulting vocabulary
  33. roger clemens, then and now (powered by [logo])
  34. pivoting to a different view (powered by [logo])
  35. take-away #3
      * this is not vaporware; text analytics to enable exploration is available here and now
  36. looking forward
      * better tags are the beginning, not the end
      * improve with manual and automatic processing
      * give users control over precision / recall trade-off
      * help users and content creators help you
  37. in closing
      * exploratory search = must-have, not nice-to-have
      * text analytics are a key enabler
      * the technology is real, here, and now
  38. thank you…and come to SIGIR!
      * communication 1.0
      * email: [email_address]
      * communication 2.0
      * blog: http://thenoisychannel.com
      * twitter: http://twitter.com/dtunkelang
      * SIGIR: July 19-23 in Boston
      * Industry Track on July 22nd!
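
The vocabulary-building steps listed on slide 29 (frequency threshold, case and singular/plural normalization, then mining documents with the resulting vocabulary) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method actually used for the ACM Digital Library: the function names (`normalize_tag`, `build_vocabulary`, `annotate`), the naive trailing-"s" singularization, the `min_count` threshold, and the toy data are all hypothetical. The slide's "eliminating infrequent subphrases" step is left out because the transcript does not define it.

```python
import re
from collections import Counter


def _singular(word):
    """Naive singularization: drop one trailing 's' (assumed rule, not a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize_tag(text):
    """Lowercase, split into alphanumeric words, and singularize each word."""
    return " ".join(_singular(w) for w in re.findall(r"[a-z0-9]+", text.lower()))


def build_vocabulary(author_tags, min_count=2):
    """Keep normalized tags that appear on at least `min_count` documents."""
    counts = Counter()
    for tags in author_tags:
        # A set counts each normalized tag at most once per document.
        counts.update({normalize_tag(t) for t in tags})
    return {tag for tag, n in counts.items() if tag and n >= min_count}


def annotate(text, vocabulary):
    """Return the vocabulary tags that occur as whole phrases in `text`."""
    padded = f" {normalize_tag(text)} "
    return sorted(tag for tag in vocabulary if f" {tag} " in padded)


# Toy author-supplied tags for three hypothetical documents.
author_tags = [
    ["Search Engines", "information retrieval"],
    ["search engine", "Faceted Search"],
    ["faceted search", "Information Retrieval", "HITS"],
]
vocabulary = build_vocabulary(author_tags, min_count=2)
print(sorted(vocabulary))
print(annotate("Faceted search improves on plain search engines.", vocabulary))
```

Applying the same normalization to both tags and document text keeps matching consistent, at the cost of a crude plural rule; a production system would likely use a proper stemmer or lemmatizer and, as the slide notes, manual validation of the most frequently assigned tags.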
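
Slides 25 and 28 describe tag quality in terms of precision and recall. As a reminder of how those two figures are computed, here is a small Python sketch. The function name `precision_recall` and the example tag sets are hypothetical; the numbers are chosen so that recall comes out at 10%, which is one reading of the slide's "90% false negatives".

```python
def precision_recall(assigned, relevant):
    """Compute precision and recall of assigned tags against the relevant tags."""
    assigned, relevant = set(assigned), set(relevant)
    true_positives = len(assigned & relevant)
    # Precision: share of assigned tags that are correct.
    precision = true_positives / len(assigned) if assigned else 0.0
    # Recall: share of relevant tags that were actually assigned.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document: ten tags would apply, the author supplied only one.
relevant = {f"topic-{i}" for i in range(10)}
assigned = {"topic-0"}
precision, recall = precision_recall(assigned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every assigned tag is correct (high precision) but most applicable tags are missing (low recall), which is the situation slide 28 describes for author-supplied tags.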