Enabling Exploration Through Text Analytics - Presentation Transcript
Enabling Exploration through Text Analytics
Daniel Tunkelang, Chief Scientist, Endeca
overview
information seeking tools
need to support exploration
text analytics can help
you can do this here and now
real-world information seeking examples
looking for health information
looking for work-related information
reminder
search and text analytics
are a means, not an end
example 1: looking for health information
six months into my wife’s pregnancy, we
discovered that she had gestational diabetes
how to learn more?
google: the default option for most
in government we trust: fda.gov
maybe the private sector knows best: webmd
success – and a sticky site
example 2: looking for work-related information
need to ramp up summer
interns on text mining
how to find a good book?
let’s try google again
google: the gateway to wikipedia?
the library of congress (loc.gov)
triangle research libraries: next-gen catalog
faceted search enables query refinement
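To make "faceted search enables query refinement" concrete, here is a minimal sketch of the idea: count the values of a facet across the current results, let the user pick one, and recount over the narrowed set. The catalog records, field names, and function names are all hypothetical, and this is not how any particular library catalog is implemented.

```python
from collections import Counter

# Hypothetical catalog records; titles, fields, and values are made up for illustration.
CATALOG = [
    {"title": "Book A", "subject": "data mining", "format": "print"},
    {"title": "Book B", "subject": "data mining", "format": "ebook"},
    {"title": "Book C", "subject": "computational linguistics", "format": "print"},
    {"title": "Book D", "subject": "data mining", "format": "print"},
]


def facet_counts(records, field):
    """Count how many records carry each value of the given facet field."""
    return Counter(record[field] for record in records)


def refine(records, **selected):
    """Keep only the records that match every selected facet value."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in selected.items())
    ]


# Show the subject facet, pick one value, then show the remaining format facet.
print(facet_counts(CATALOG, "subject"))
narrowed = refine(CATALOG, subject="data mining")
print(facet_counts(narrowed, "format"))
```

Each refinement step only ever offers values that exist in the current result set, which is what keeps the user from refining into an empty result.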
take-away #1
exploratory search support:
a must-have for many information needs
text analytics
categorization
named entity detection
term extraction
sentiment analysis
vague term, lots of see-alsos
text mining
information extraction
content enrichment
newssift: text analytics enabling exploration (categorization, named entity detection, term extraction, sentiment analysis)
exploring the news about facebook
facebook: the good (Social Utility, Iphone Application)
facebook: the bad (Criminal Behavior, Litigation And Settlement)
take-away #2
text analytics enable
exploratory search
text analytics is here and now
lots of off-the-shelf options and more!
caveats
rule-based techniques are domain-specific
statistical techniques rely on trained models
plan for errors, inconsistency
document vs. corpus analysis
problems with entity extraction
moderate precision, but low recall
not just noisy, but inconsistent
corpus analysis can help!
extracted entities with counts, by type:
Organization: ABC News Inc. (1); Air Force (1); Amazon.com Inc. (1); American Airlines Inc. (1); Apple (1); Arctic National Wildlife Refuge (1); Arianna Huffington (1); Australian Liberal Party (1); Bad News Bears (1); Bear Stearns (2); Big Apple Companies (1); BioDiversity Research Institute (1); Bloomberg LP (3); Bob Dole (1); Bocuse d’Or World Cuisine Contest (1); Boston Globe (1); Boston Tea Party (1); Budweiser (1)
Location: ALTOONA, PA (1); Afghanistan (7); Africa (5); Akihabara (1); Alaska (3); Allegheny (1); Americas (17); Appalachia (1); Argentina (1); Arizona (11); Arkansas (7); Arlington, Va. (2); Arrest (1); Asia (1); Atlanta (2); Austin (1); Austin, Texas (1); Australia (1)
Person: ABDUL-KARIM KHALAF (1); ABDULRAHMAN ABDULLAH (1); AL GORE (1); ALEX TREBEK (1); ALI HASSAN AL (1); AMANDA MARCOTTE (1); AMY WINEHOUSE (1); ANDERS ERICSSON (1); ANDREW LLOYD WEBBER (1); ANTHONY MWANGI (1); ANTONIN SCALIA (1); ARYE BARAK (1); Aaron Sorkin (1); Abbie Hoffman (1); Abe Lincoln (1); Abe Weiss (1); Abraham Lincoln (1); Adlai Stephenson (1)
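The slides do not say how corpus analysis helps, so the following is only one possible illustration: pool the per-document extractions, merge case variants of the same name, and let the corpus vote on each entity's type. The mention data and the `consolidate` function are hypothetical, and merging aliases such as a nickname and a full name would need more than this.

```python
from collections import Counter, defaultdict

# Hypothetical per-document extractor output: (surface form, predicted type).
MENTIONS = [
    ("Bob Dole", "Person"),
    ("Bob Dole", "Organization"),
    ("BOB DOLE", "Person"),
    ("Arizona", "Location"),
    ("Arizona", "Location"),
    ("ARIZONA", "Person"),
]


def consolidate(mentions):
    """Merge case variants and keep the most frequent type for each entity."""
    votes = defaultdict(Counter)
    for surface, entity_type in mentions:
        votes[surface.casefold()][entity_type] += 1
    # most_common(1) returns the single (type, count) pair with the highest count.
    return {name: counts.most_common(1)[0][0] for name, counts in votes.items()}


print(consolidate(MENTIONS))
```

A single document cannot reveal that one of its labels disagrees with the rest of the collection; the pooled counts can.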
look for ways to cheat!
(chart axes: recall, precision)
division of labor
people supply vocabulary
machine annotates documents
image: http://www.precolumbianwomen.com/images/inca-labor.10.gif
example: ACM digital library
opportunity
repository of (sometimes) author-tagged documents
high-precision tags: very few false positives
challenge
poor reuse of vocabulary: most tags unique
low-recall tags: 90% false negatives
as is, tags were not useful for exploration
solution
bootstrap on author-supplied tags
prune 600K+ tags to 10K by
imposing frequency threshold
normalizing by case and singular/plural
eliminating infrequent subphrases
mine documents using resulting vocabulary
manually validate most frequently assigned tags
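The pruning and mining steps above can be sketched roughly as follows. This is a toy illustration, not the pipeline actually used on the ACM digital library: the sample tags, the crude singularization rule, the `min_count` value, and the reading of "infrequent subphrases" as "a tag contained in a longer, more frequent tag" are all assumptions.

```python
import re
from collections import Counter


def singularize(word):
    """Very crude plural stripping; a real system would use a proper stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith(("ss", "is", "us")) and len(word) > 3:
        return word[:-1]
    return word


def normalize(tag):
    """Lowercase, collapse whitespace, and singularize the last word."""
    words = tag.lower().split()
    if words:
        words[-1] = singularize(words[-1])
    return " ".join(words)


def build_vocabulary(doc_tags, min_count=2):
    """Normalize tags, apply a frequency threshold, then drop weak subphrases."""
    counts = Counter(normalize(tag) for tags in doc_tags for tag in tags)
    frequent = {tag for tag, n in counts.items() if n >= min_count}
    vocabulary = set()
    for tag in frequent:
        # Assumed rule: drop a tag that sits inside a longer, more frequent tag.
        weak = any(
            tag != other and f" {tag} " in f" {other} " and counts[tag] < counts[other]
            for other in frequent
        )
        if not weak:
            vocabulary.add(tag)
    return vocabulary


def annotate(text, vocabulary):
    """Return the vocabulary tags that occur in the text (allowing a plural s/es)."""
    lowered = text.lower()
    return sorted(
        tag for tag in vocabulary
        if re.search(r"\b" + re.escape(tag) + r"(?:s|es)?\b", lowered)
    )


# Hypothetical author-supplied tags, one list per document.
DOC_TAGS = [
    ["Information Retrieval", "Search Engines"],
    ["information retrieval", "search engine", "retrieval"],
    ["Search engines", "faceted search", "Retrieval"],
    ["Information retrieval", "Faceted Search", "my one-off tag"],
]

VOCABULARY = build_vocabulary(DOC_TAGS)
print(annotate("We study faceted search over search engines.", VOCABULARY))
```

Because `annotate` runs over every document, a tag that only a few authors supplied can still be assigned wherever the phrase appears, which is how the vocabulary addresses the low recall of the original tags.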
example: a search for boeing
it’s a HITS!
if you prefer sports to computer science
no author-supplied tags
use search logs instead
supplement with authority files
team names
player names
mine documents using resulting vocabulary
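When there are no author-supplied tags, the same idea can start from search logs and authority files instead. The sketch below is a guess at the general shape, not the system shown in the slides: the query log, the authority lists, the facet labels, and the `min_queries` threshold are hypothetical.

```python
import re
from collections import Counter


def build_vocabulary(query_log, authority, min_queries=2):
    """Map each vocabulary term to a facet label."""
    vocabulary = {}
    counts = Counter(query.strip().lower() for query in query_log)
    for query, n in counts.items():
        if n >= min_queries:
            vocabulary[query] = "topic"
    # Authority-file entries override the generic label derived from the logs.
    for facet, names in authority.items():
        for name in names:
            vocabulary[name.lower()] = facet
    return vocabulary


def annotate(text, vocabulary):
    """Return {term: facet} for every vocabulary term found in the text."""
    lowered = text.lower()
    return {
        term: facet for term, facet in vocabulary.items()
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }


# Hypothetical inputs for illustration.
QUERY_LOG = ["roger clemens", "Roger Clemens", "world series", "World Series ", "box score"]
AUTHORITY = {"team": ["Red Sox", "Yankees"], "player": ["Roger Clemens"]}

VOCABULARY = build_vocabulary(QUERY_LOG, AUTHORITY)
print(annotate(
    "Roger Clemens pitched for the Red Sox and later the Yankees in the World Series.",
    VOCABULARY,
))
```

The logs supply the terms users actually search for, while the authority files supply clean names and tell the system which facet each name belongs to.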
roger clemens, then and now
pivoting to a different view
take-away #3
this is not vaporware;
text analytics to enable exploration
is available here and now
looking forward
better tags are the beginning, not the end
improve with manual and automatic processing
give users control over precision / recall trade-off
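One way to read "give users control over precision / recall trade-off" is to expose a confidence threshold on machine-assigned tags. The sketch below assumes each candidate tag comes with a score, which the slides do not state; the scores, the tags, and the set of relevant tags are hypothetical.

```python
def tags_above(scored_tags, threshold):
    """Keep the tags whose confidence score meets the user-chosen threshold."""
    return {tag for tag, score in scored_tags.items() if score >= threshold}


def precision_recall(predicted, relevant):
    """Precision = correct / predicted; recall = correct / relevant."""
    correct = len(predicted & relevant)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical scored tags for one document, and the tags a person would accept.
SCORED = {"boeing": 0.95, "aircraft": 0.80, "bowling": 0.60, "aerospace": 0.40}
RELEVANT = {"boeing", "aircraft", "aerospace"}

for threshold in (0.9, 0.5, 0.3):
    precision, recall = precision_recall(tags_above(SCORED, threshold), RELEVANT)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold admits more tags, so recall rises while precision tends to fall; the user picks the point that suits the task.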
Enterprises are awash in textual documents that represent valuable information assets. The limited access of conventional search interfaces, however, prevents enterprises from unlocking this value.
* An expert guide to how richer interfaces enable exploration and discovery, and how these typically rely on content enrichment techniques that can be unreliable, labor-intensive, or both. It is essential to maximize the effectiveness of content enrichment, not only to achieve the desired value, but also to incent organizations to make the necessary investment.
* Useful insight about content enrichment approaches that have demonstrated success in supporting exploration and discovery.
* Insight into both the enrichment techniques and the ways they are used to enable exploratory search.