ESSIR LivingKnowledge DiversityEngine tutorial
Mike's testbed tutorial

Transcript

  • 1. SYMPOSIUM ON BIAS AND DIVERSITY IN IR
    A TESTBED FOR DIVERSIFICATION IN SEARCH
    Koblenz, August 31, 2011
    Michael Matthews, Barcelona Media/Yahoo! Research
  • 2. OVERVIEW
    Introduction to LivingKnowledge Testbed – The Diversity Engine
    Getting started – Our first application!
    Adding text analysis
    Adding multimedia analysis
    Evaluation
    Indexing and search
    Developing applications
    Future work
  • 3. DIVERSITY ENGINE
    Provide collections, annotation tools and an evaluation framework to allow for collaborative and comparable research
    Supports indexing and searching on a wide variety of document annotations including entities, bias, trust, polarity, and multimedia features
    Support development of bias and diversity aware applications
  • 4. ARCHITECTURE
    Document Collections (NYT, Yahoo! News, ARC Crawls)
    Analysis Pipeline
    Index/Search
    Application Development
    Evaluation Framework
  • 5. DESIGN DECISIONS
    Use Open Source tools when available
    Programming Language - Java 1.6
    Data format – LK XML
    Analysis tools – Operating System: Linux (any programming language)
    Indexing/Search - Solr
    GUI – JSP, HTML, JavaScript, CSS
  • 6. LK-XML format.
  • 7. DOCUMENT COLLECTIONS
    Supported Formats: ARC (Internet Memory Crawls), Text, HTML, Kyoto, BBN, NYT
    Collections
    Testing Examples included with Diversity Engine
    Large ARCs available from Internet Memory
    Converters provided for other collections (MPQA, BBN, NYT) that have licensing restrictions
  • 8. ANALYSIS MODULES
  • 9. INDEXING/SEARCH
    Solr
    Enterprise search platform built on top of Lucene
    XML input and output allows for easy integration with Diversity Engine
    Plug-in framework allows customization
    Built-in facet capabilities support indexing and searching on annotations
    Integration
    Converter from LK XML to Solr XML
    Plug-in for facet ranking and speed improvements
  • 10. APPLICATION DEVELOPMENT
    • Basis for LivingKnowledge Applications
    • Future Predictor
    • Media Content Analysis
    • Support development – coding required!
    • Real World Problems
    • HTML Extraction
    • Scaling to Large Collections
    • Provenance
    • Some pluggable GUI components
    • Examples to ease learning curve
  • 20. APPLICATION DEVELOPMENT
  • 21. APPLICATION DEVELOPMENT
  • 22. EVALUATION FRAMEWORK
    • Framework for the evaluation of analysis tools
    • Evaluates any possible annotation pipeline
    • Measures correctness and quality
    • Outputs Precision + Recall
    • Compares annotation output of pipeline with ground truth data
  • 27. OUR FIRST APPLICATION
    Download Diversity Engine release from SourceForge
    tar xzvf [release file]
    cd testbed
    ant build
    apps/testbed conf/testbed/tutorial-application.xml
    What happened?
    197 text files and 127 image files converted from ARC format to LK XML and stored in devapps/example/data/lkxml
    2 annotators were run over the collection
    OpenNLP for tokenization, sentence splitting, POS tags
    SST named entity recognizer
    Results stored in devapps/example/data/lkxml
    Files were converted to Solr XML format and indexed using Solr
    Solr XML stored to devapps/example/data/solr
    HTML Visualization Files stored in devapps/example/data/html
    ant deploy-testbed
    Solr running at http://localhost:8983/solr/
    Example app running at http://localhost:8983/testbed/
  • 28. EXAMPLE SOLR OUTPUT
    http://localhost:8983/solr/select/?q=putin
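
As a rough sketch of how such a query URL could be assembled from Java, the helper below appends Solr's standard `facet=true` and `facet.field` request parameters to the select URL shown on this slide. The class name `SolrQueryUrl` is made up for illustration; the facet fields `per` and `loc` are the ones declared in the tutorial configuration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SolrQueryUrl {
    /** Builds a Solr select URL; facet fields are optional. */
    static String build(String baseUrl, String query, String... facetFields) {
        try {
            StringBuilder url = new StringBuilder(baseUrl);
            url.append("select/?q=").append(URLEncoder.encode(query, "UTF-8"));
            if (facetFields.length > 0) {
                // Standard Solr faceting parameters: one facet.field per field.
                url.append("&facet=true");
                for (String field : facetFields) {
                    url.append("&facet.field=").append(URLEncoder.encode(field, "UTF-8"));
                }
            }
            return url.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://localhost:8983/solr/select/?q=putin&facet=true&facet.field=per&facet.field=loc
        System.out.println(build("http://localhost:8983/solr/", "putin", "per", "loc"));
    }
}
```

Called without facet fields, the helper reproduces the plain query URL from the slide.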
  • 29. EXAMPLE APPLICATION
    http://localhost:8983/testbed/results.jsp?query=putin
  • 30. EXAMPLE DOCUMENT
  • 31. CONFIGURATION FILE
    <lk-application logDir="log" appDir="devapps/example">
      <corpus dir="corpora/examples/smallarc" format="arc"/>
      <image-pipeline>
        <annotators>
        </annotators>
      </image-pipeline>
      <pipeline>
        <annotators>
          <annotator exec="./opennlp"/>
          <annotator exec="./sst"/>
        </annotators>
      </pipeline>
      <visualize/>
      <indexer solrHomeDir="solr/solr"
               solrDataDir="solr/solr/data"
               converter="conf/testbed/tutorial-lk2solr.xml"/>
      <searcher appTitle="LivingKnowledge - Example Application"
                appShortTitle="Example Application"
                appUrl="http://localhost:8983/solr/">
        <facets>
          <facet field="per" description="Person"/>
          <facet field="loc" description="Location"/>
        </facets>
      </searcher>
    </lk-application>
  • 32. TEXT ANALYSIS
    <pipeline>
      <annotators>
        <annotator exec="./opennlp"/>
        <annotator exec="./sst"/>
      </annotators>
    </pipeline>
    <pipeline>
      <annotators>
        <annotator exec="./opennlp"/>
        <annotator exec="./sst"/>
        <annotator exec="./facts"/>
        <annotator exec="./unitn_tagger"/>
        <annotator exec="./unitn_subjexpr"/>
      </annotators>
    </pipeline>
    apps/testbed –run pipeline conf/testbed/tutorial-application.xml
    apps/testbed –run visualization conf/testbed/tutorial-application.xml
  • 33. TEXT ANALYSIS - FACTS
    devapps/example/data/lkxml/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.facts.xml
  • 34. TEXT ANALYSIS - FACTS
    devapps/example/data/html/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.html
  • 35. IMAGE ANALYSIS
    <image-pipeline>
      <annotators>
        <annotator exec="./soton_haarfacedetector"/>
      </annotators>
    </image-pipeline>
    <pipeline>
      <annotators>
        <annotator exec="./opennlp"/>
        <annotator exec="./sst"/>
        <annotator exec="./facts"/>
        <annotator exec="./unitn_tagger"/>
        <annotator exec="./unitn_subjexpr"/>
        <annotator exec="./imageannots"/>
      </annotators>
    </pipeline>
    apps/testbed –run pipeline,image-pipeline –pipeline imageannots conf/testbed/tutorial-application.xml
    ls devapps/example/data/lkxml/img/*
  • 36. ANALYSIS API
    Documents in LK XML format
    Annotators are passed a single document directory. They should add annotations for each document in the directory
    Files follow a consistent naming convention
    LkText file = id + ".lktext.xml"
    LkMedia = id + ".lkmedia.xml"
    LkAnnotation = id + "." + annotatorId + ".xml"
    Annotators will be processed sequentially in the order listed in the XML file
    Annotators can be written in any language but must run on Linux – Helper classes will exist for Java, but there is no obligation to use them.
    Add application calling your new annotator to apps directory
    Add your application to the configuration file as before
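
The naming convention on this slide can be written down as three small helpers. This is only a sketch: the class `LkFileNames` and its method names are invented for illustration and are not part of the Diversity Engine API.

```java
class LkFileNames {
    /** LkText file = id + ".lktext.xml" */
    static String textFile(String id) {
        return id + ".lktext.xml";
    }

    /** LkMedia file = id + ".lkmedia.xml" */
    static String mediaFile(String id) {
        return id + ".lkmedia.xml";
    }

    /** LkAnnotation file = id + "." + annotatorId + ".xml" */
    static String annotationFile(String id, String annotatorId) {
        return id + "." + annotatorId + ".xml";
    }

    public static void main(String[] args) {
        String id = "EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713";
        // Matches the facts file name shown on the TEXT ANALYSIS - FACTS slide.
        System.out.println(annotationFile(id, "facts"));
    }
}
```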
  • 37. ANALYSIS API – JAVA
    Extend class org.diversityengine.annotator.AbstractAnnotator
    Implement Methods
    getName()
    getType() - TEXT OR IMAGE
    For Image Analysis implement
    LkAnnotation getLkAnnotation(ImageDocument document)
    For Text Analysis implement
    LkAnnotation getLkAnnotation(TextDocument document)
    In main, instantiate and call annotator
    NewAnnotator annotator = new NewAnnotator();
    annotator.processDirectory(args[0]);
    Add application calling your new annotator to apps directory
    Add your application to the configuration file as before
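
To make the shape of a text annotator concrete, here is a minimal sketch. The slides give only the class and method names, so `TextDocument`, `LkAnnotation` and `AbstractAnnotator` below are simplified stand-ins written for this example. Their fields, constructors and the `String` return type of `getType()` are assumptions, and the real classes in `org.diversityengine.annotator` may differ. `WordCountAnnotator` is a hypothetical annotator.

```java
// Stand-ins so the sketch compiles on its own (assumed shapes, not the real API).
class TextDocument {
    public final String id;
    public final String text;

    public TextDocument(String id, String text) {
        this.id = id;
        this.text = text;
    }
}

class LkAnnotation {
    public final String annotatorId;
    public final String value;

    public LkAnnotation(String annotatorId, String value) {
        this.annotatorId = annotatorId;
        this.value = value;
    }
}

abstract class AbstractAnnotator {
    public abstract String getName();

    public abstract String getType(); // TEXT or IMAGE, per the slide

    public abstract LkAnnotation getLkAnnotation(TextDocument document);
}

// A hypothetical text annotator that records the number of whitespace-separated words.
class WordCountAnnotator extends AbstractAnnotator {
    public String getName() {
        return "wordcount";
    }

    public String getType() {
        return "TEXT";
    }

    public LkAnnotation getLkAnnotation(TextDocument document) {
        String trimmed = document.text.trim();
        int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        return new LkAnnotation(getName(), Integer.toString(words));
    }

    public static void main(String[] args) {
        // With the real base class, main would call annotator.processDirectory(args[0]).
        WordCountAnnotator annotator = new WordCountAnnotator();
        LkAnnotation result = annotator.getLkAnnotation(new TextDocument("d1", "Putin visited Brussels"));
        System.out.println(result.annotatorId + " = " + result.value); // prints wordcount = 3
    }
}
```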
  • 38. EVALUATION
    Evaluation works with the same configuration file. Simply add an evaluation element
    <lk-application logDir="log" appDir="devapps/evaluation">
      <corpus dir="corpora/evaluation/sst/text/" format="bbn"/>
      <pipeline>
        <annotators>
          <annotator exec="./sst"/>
        </annotators>
      </pipeline>
      <evaluation evalDir="evaluation/sst/">
        <evaluator provides="ENTITIES"
                   goldDir="corpora/evaluation/sst/gold/"
                   goldAnnotator="sstgold"
                   annotator="sst" />
      </evaluation>
    </lk-application>
    apps/testbed conf/evaluation/sst.xml
  • 39. EVALUATION RESULTS
    <evaluation goldDir="/home/mikemat/code/livingknowledge/WP6/testbed/corpora/evaluation/sst/gold/"
                lkDir="/home/mikemat/code/livingknowledge/WP6/testbed/devapps/evaluation/data/lkxml"
                annotation="sst" goldAnnotation="sstgold" provides="ENTITIES">
      <docs>
        <doc id="WSJ0375" N="19" tp="18" fp="1" fn="1" />
        <doc id="WSJ0380" N="19" tp="15" fp="4" fn="1" />
        <doc id="WSJ0376" N="72" tp="61" fp="11" fn="7" />
        <doc id="WSJ0377" N="26" tp="17" fp="9" fn="6" />
        <doc id="WSJ0378" N="10" tp="10" fp="0" fn="0" />
        <doc id="WSJ0379" N="24" tp="19" fp="5" fn="2" />
      </docs>
      <totals N="170" tp="140" fp="30" fn="17" p="0.8235294117647058" r="0.89171974522293" f="0.8562691131498471" />
    </evaluation>
    cat evaluation/sst/sst.ENTITIES.xml
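
The `p`, `r` and `f` values in the totals line follow from the `tp`, `fp` and `fn` counts using the usual definitions of precision, recall and F1. The sketch below recomputes them; the class name `PrfScores` is invented for this example, and the formulas are the standard ones rather than code taken from the evaluation framework.

```java
class PrfScores {
    /** Precision = tp / (tp + fp); 0 when nothing was predicted. */
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    /** Recall = tp / (tp + fn); 0 when the gold standard is empty. */
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    /** F1 = 2*tp / (2*tp + fp + fn), the harmonic mean of precision and recall. */
    static double f1(int tp, int fp, int fn) {
        int denominator = 2 * tp + fp + fn;
        return denominator == 0 ? 0.0 : 2.0 * tp / denominator;
    }

    public static void main(String[] args) {
        // Totals from the evaluation output: tp=140, fp=30, fn=17.
        System.out.println("p=" + precision(140, 30));
        System.out.println("r=" + recall(140, 17));
        System.out.println("f=" + f1(140, 30, 17));
    }
}
```

With the totals above this gives 140/170 for precision, 140/157 for recall and 280/327 for F1, which agrees with the reported values.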
  • 40. INDEXING AND SEARCH
    Search Engines - Traditional
    Bag-of-words representation
    Inverted index (words -> documents) for efficiency
    10 docs ranked according to tf-idf similarity with the query
    Search Engines – Today
    Much metadata associated with documents
    Ranking based on 100s of features (date, location, PageRank, click data, personalization, etc.)
    Richer display
    Facets for exploratory search
    Answers when appropriate
    etc.
    Many open source options - Lucene/Solr most widely used
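
The inverted index mentioned under traditional search engines can be illustrated in a few lines. This is a toy sketch for intuition only and says nothing about how Lucene stores its index; the class name `InvertedIndex` and the simple lowercase, split-on-non-word tokenization are choices made for the example.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndex {
    // word -> sorted set of ids of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<String, SortedSet<Integer>>();

    /** Tokenizes the text (bag of words) and records the document id under each word. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            SortedSet<Integer> docs = postings.get(token);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(token, docs);
            }
            docs.add(docId);
        }
    }

    /** Returns the ids of documents containing the word, or an empty set. */
    SortedSet<Integer> lookup(String word) {
        SortedSet<Integer> docs = postings.get(word.toLowerCase());
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, "Putin visits Brussels");
        index.add(2, "Brussels summit");
        System.out.println(index.lookup("brussels")); // prints [1, 2]
    }
}
```

A lookup touches only the posting list for the query word instead of scanning every document, which is the efficiency gain the slide refers to.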
  • 41. APACHE LUCENE/SOLR
    Lucene/Solr
  • 42. FACETED SEARCH
    Diagram by Yonik Seeley
  • 43. FACETED SEARCH
    • Summarize query results by aggregating properties of the returned pages
    • Price ranges for a product query
    • Related people or locations for a news query
    • Exploratory Search
    • Show documents matching the query term and a selected facet
    • Make inferences not clear from a simple document list
    • LivingKnowledge analysis is modeled very well by facets
    • Topics as determined by entity and fact extraction
    • Location and Time diversity dimensions
    • Opinions as determined by opinion extraction
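
Conceptually, a facet is a count of how many returned documents carry each value of an annotation field. The sketch below shows that aggregation for a person facet. It is a simplified illustration, not Solr's implementation, and the class name `FacetCounter` and the sample values are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FacetCounter {
    /**
     * Counts, for each facet value, the number of result documents that contain it.
     * Each inner list holds the values of one field (for example "per") for one document.
     */
    static Map<String, Integer> count(List<List<String>> valuesPerDocument) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (List<String> values : valuesPerDocument) {
            // A document counts once per value, even if the value repeats within it.
            for (String value : new HashSet<String>(values)) {
                Integer current = counts.get(value);
                counts.put(value, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> results = Arrays.asList(
                Arrays.asList("Putin", "Merkel"),
                Arrays.asList("Putin", "Putin"),
                Arrays.asList("Barroso"));
        System.out.println(count(results)); // prints {Barroso=1, Merkel=1, Putin=2}
    }
}
```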
  • 53. LK XML TO SOLR
    • Solr has a well-defined XML input format for adding new documents
    • Diversity Engine provides a simple language to map LK XML to Solr XML
  • 55. LK2SOLR CONVERSION
    <indexer solrHomeDir="solr/solr"
             solrDataDir="solr/solr/data"
             converter="conf/testbed/tutorial-lk2solr.xml"/>
    <lktosolr>
      <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
             filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
      <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
             filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
      <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
      <field solr="pubdate" annotation="metainfo:lktext" value="date"
             type="date"/>
    </lktosolr>
    solr – Name of the field in solr
    annotation – Name of the LKXML Annotation
    value – Value of annotation
    filter – Allows post processing on annotation
    type – Only Date supported currently
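
For orientation, the target of the conversion is Solr's XML update message, in which each document is a `<doc>` element inside `<add>` and each value is a `<field name="...">` element. The sketch below builds such a message from a map of field values. It is not the Diversity Engine converter; the class name `SolrAddXml` and the sample values are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SolrAddXml {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Renders one document as a Solr add message; multi-valued fields repeat the element. */
    static String toAddXml(Map<String, List<String>> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, List<String>> entry : fields.entrySet()) {
            for (String value : entry.getValue()) {
                xml.append("<field name=\"").append(escape(entry.getKey())).append("\">")
                   .append(escape(value)).append("</field>");
            }
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new LinkedHashMap<String, List<String>>();
        fields.put("id", Arrays.asList("doc1"));
        fields.put("per", Arrays.asList("Vladimir Putin", "Angela Merkel"));
        fields.put("loc", Arrays.asList("Brussels"));
        System.out.println(toAddXml(fields));
    }
}
```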
  • 56. ADDING FACTS TO INDEX
    <lktosolr>
      <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
             filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
      <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
             filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
      <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
      <field solr="pubdate" annotation="metainfo:lktext" value="date"
             type="date"/>
      <field solr="yago" annotation="yago-entities" value="$text" />
      <field solr="yago-country" annotation="facts"
             value="xpath:/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()" />
    </lktosolr>
    apps/testbed –run convert-solr conf/testbed/tutorial-application.xml
    ls devapps/example/data/solr/*
    apps/testbed –run index conf/testbed/tutorial-application.xml
  • 57. FACTS TO SOLR
    <field solr="yago" annotation="yago-entities" value="$text" />
  • 58. FACTS TO SOLR
    <field solr="yago-country" annotation="facts"
           value="xpath:/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()" />
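
To see what this XPath expression selects, it can be run with the JDK's `javax.xml.xpath` API against a small document. The two sample documents below are hypothetical: the slides do not show the facts annotation itself, so the element layout is inferred from the expression alone. `FactsXPathDemo` is likewise a made-up name.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class FactsXPathDemo {
    // The expression from the slide, without the "xpath:" prefix used in the converter file.
    static final String COUNTRY_XPATH =
            "/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()";

    /** Returns the entity id if the entity has the country type, otherwise an empty string. */
    static String countryId(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(COUNTRY_XPATH, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical annotation content; only the element names come from the expression.
        String country = "<entity-information><id>Russia</id>"
                + "<facts><type>wordnet_country_108544813</type></facts></entity-information>";
        String person = "<entity-information><id>Vladimir_Putin</id>"
                + "<facts><type>wordnet_person_100007846</type></facts></entity-information>";
        System.out.println("[" + countryId(country) + "]"); // prints [Russia]
        System.out.println("[" + countryId(person) + "]");  // prints []
    }
}
```

The predicate keeps only entities whose facts include the country type, so the `yago-country` field receives ids of countries and nothing else.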
  • 59. ADDING IMAGES TO INDEX
    <lktosolr>
      <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
             filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
      <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
             filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
      <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
      <field solr="yago" annotation="yago-entities" value="$text" />
      <field solr="yago-country" annotation="facts"
             value="xpath:/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()" />
      <field solr="pubdate" annotation="metainfo:lktext" value="date"
             type="date"/>
      <field solr="image" annotation="IMAGE_ANNOTS" value="$text" />
      <field solr="bestimage" annotation="BEST_IMAGES" value="$text" />
    </lktosolr>
    apps/testbed –run convert-solr conf/testbed/tutorial-application.xml
    ls devapps/example/data/solr/*
    apps/testbed –run index conf/testbed/tutorial-application.xml
  • 60. APPLICATION DEVELOPMENT
    Examples
    HTML Extraction
    Scaling to Large Collections
    Provenance
    Some pluggable GUI components
  • 61. FACT/IMAGE APPLICATION
    <searcher appTitle="LivingKnowledge - Example Application"
              appShortTitle="Example Application"
              appUrl="http://localhost:8983/solr/">
      <facets>
        <facet field="yago" description="Yago"/>
        <facet field="yago-country" description="Country"/>
        <facet field="per" description="Person"/>
        <facet field="loc" description="Location"/>
        <facet field="image" description="Images"/>
      </facets>
    </searcher>
    ant deploy-testbed
  • 62. FACT/IMAGE APPLICATION
    http://localhost:8983/testbed/results.jsp?query=putin
  • 63. OPINION APPLICATION
    Opinions are at sentence level, not document level – same analysis, but different indexing
    cat conf/testbed/tutorial-lk2solr-sentence.xml
    <lktosolr solrDoc="SENTENCES" contextSize="1">
      <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
             filter="org.diversityengine.solr.converter.filters.PerValueFilter"
             source="solrdoc" />
      <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
             filter="org.diversityengine.solr.converter.filters.LocValueFilter"
             source="solrdoc" />
      <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
      <field solr="yago" annotation="yago-entities" value="$text"
             source="solrdoc" />
      <field solr="image" annotation="IMAGE_ANNOTS" value="$text" />
      <field solr="bestimage" annotation="BEST_IMAGES" value="$text" />
      <field solr="pubdate" annotation="metainfo:lktext" value="date"
             type="date"/>
      <field solr="polarity"
             annotation="MPQA-expressive-subjectivity,MPQA-direct-subjective"
             value="xpath:/node()[@pol]/@pol" source="solrdoc"
             filter="org.diversityengine.solr.converter.filters.PolarityValueFilter"/>
      <field solr="pol-int"
             annotation="MPQA-expressive-subjectivity,MPQA-direct-subjective"
             value="xpath:concat(/node()[@pol and @int]/@pol,/node()[@int and @pol]/@int)"
             source="solrdoc"/>
    </lktosolr>
    apps/testbed –run convert-solr,index conf/testbed/tutorial-application-sentence.xml
    ls devapps/example/data/solr/*
  • 64. SOLR XML – SENTENCE
  • 65. OPINION APPLICATION
    Modify webapp/WEB-INF/web.xml
    <web-app xmlns="http://java.sun.com/xml/ns/javaee"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
             version="2.5">
      <description>
        LivingKnowledge Testbed Example Application
      </description>
      <display-name>Testbed Examples</display-name>
      <context-param>
        <param-name>applicationDef</param-name>
        <param-value>conf/testbed/tutorial-application-sentence.xml</param-value>
        <description>The Living Knowledge application description XML file </description>
      </context-param>
    </web-app>
    ant deploy-testbed
  • 66. OPINION APPLICATION
    http://localhost:8983/testbed/results.jsp?query=putin
  • 67. HTML EXTRACTION
  • 68. HTML EXTRACTION
    Boilerplate can lead to false positive results and inaccurate facet aggregation
    Real example – before extraction was developed, the most common person for most queries was whoever appeared in a top story title (shown on all pages) on the day of the crawl!
    Titles, Authors and Dates are important for bias and diversity aware search
  • 69. PROVENANCE
    How an annotation is derived is often as important as the annotation itself
    Users want to verify results
    Developers need to validate results
    Open Provenance provides an open source solution
    Testbed annotations can be extended with Open Provenance chains
  • 70. Provenance Diagram
  • 71. SCALING TO LARGE COLLECTIONS
    In the real world, even “small” datasets have millions of documents
    NLP/Image processing is expensive – at 1 doc/sec, 1 million docs take about 11.6 days!
    Hadoop Mapper allows for scaling – scales linearly with number of machines
    ZipCollection writer allows partitioning data into subsets for processing
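
The processing-time estimate on this slide is plain arithmetic, sketched below. The class name `ProcessingTime` is invented for this example, and the division by the number of machines assumes ideal linear scaling as stated on the slide, which ignores job startup and data transfer costs.

```java
class ProcessingTime {
    static final double SECONDS_PER_DAY = 86400.0;

    /** Estimated wall-clock days to process a collection, assuming ideal linear scaling. */
    static double days(long documents, double docsPerSecondPerMachine, int machines) {
        double seconds = documents / (docsPerSecondPerMachine * machines);
        return seconds / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        // 1,000,000 docs at 1 doc/sec on one machine: 1,000,000 s / 86,400 s per day
        System.out.printf("%.2f days on 1 machine%n", days(1000000L, 1.0, 1));
        System.out.printf("%.2f days on 10 machines%n", days(1000000L, 1.0, 10));
    }
}
```

One machine needs roughly 11.57 days; under the same assumption ten machines need about 1.16 days.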
  • 72. COMPONENTS- OPINIONS
  • 73. COMPONENTS - TIME
  • 74. COMPONENTS - GEO
  • 75. FUTURE WORK
    More components
    Maven to manage dependencies
    Better integration of Timeline and Geo visualization components
    Integration of ranking algorithms
    Better Documentation
  • 76. Thanks!
    LivingKnowledge Partners!
    You for coming!!
    Questions?