ESSIR LivingKnowledge DiversityEngine tutorial

    • SYMPOSIUM ON BIAS AND DIVERSITY IN IR
      A TESTBED FOR DIVERSIFICATION IN SEARCH
      Koblenz, August 31, 2011
      Michael Matthews, Barcelona Media/Yahoo! Research
    • OVERVIEW
      Introduction to LivingKnowledge Testbed – The Diversity Engine
      Getting started – Our first application!
      Adding text analysis
      Adding multimedia analysis
      Evaluation
      Indexing and search
      Developing applications
      Future work
    • DIVERSITY ENGINE
      Provide collections, annotation tools and an evaluation framework to allow for collaborative and comparable research
      Support indexing and searching on a wide variety of document annotations, including entities, bias, trust, polarity, and multimedia features
      Support development of bias- and diversity-aware applications
    • ARCHITECTURE
      [Architecture diagram: Document Collections (NYT, Yahoo! News, ARC Crawls), Analysis Pipeline, Index/Search, Application Development, Evaluation Framework]
    • DESIGN DECISIONS
      Use Open Source tools when available
      Programming Language - Java 1.6
      Data format – LK XML
      Operating system for analysis tools – Linux (tools may be written in any language)
      Indexing/Search - Solr
      GUI – JSP, HTML, JavaScript, CSS
    • LK-XML format.
    • DOCUMENT COLLECTIONS
      Supported formats – ARC (Internet Memory crawls), Text, HTML, Kyoto, BBN, NYT
      Collections
      Testing examples included with Diversity Engine
      Large ARCs available from Internet Memory
      Converters provided for other collections (MPQA, BBN, NYT) that have licensing restrictions
    • ANALYSIS MODULES
    • INDEXING/SEARCH
      Solr
      Enterprise search platform built on top of Lucene
      XML input and output allows for easy integration with Diversity Engine
      Plug-in framework allows customization
      Built-in facet capabilities support indexing and searching on annotations
      Integration
      Converter from LK XML to Solr XML
      Plug-in for facet ranking and speed improvements
    • APPLICATION DEVELOPMENT
      • Basis for LivingKnowledge Applications
      • Future Predictor
      • Media Content Analysis
      • Support development – coding required!
      • Real World Problems
      • HTML Extraction
      • Scaling to Large Collections
      • Provenance
      • Some pluggable GUI components
      • Examples to ease learning curve
    • APPLICATION DEVELOPMENT
    • APPLICATION DEVELOPMENT
    • EVALUATION FRAMEWORK
      • Framework for the evaluation of analysis tools
      • Evaluates any possible annotation pipeline
      • Measures correctness and quality
      • Outputs Precision + Recall
      • Compares annotation output of pipeline with ground truth data
    • OUR FIRST APPLICATION
      Download Diversity Engine release from SourceForge
      tar xzvf [release file]
      cd testbed
      ant build
      apps/testbed conf/testbed/tutorial-application.xml
      What happened?
      197 text files and 127 image files converted from ARC format to LK XML and stored in devapps/example/data/lkxml
      2 annotators were run over the collection
      OpenNLP for tokenization, sentence splitting, POS tags
      SST named entity recognizer
      Results stored in devapps/example/data/lkxml
      Files were converted to Solr XML format and indexed using Solr
      Solr XML stored in devapps/example/data/solr
      HTML visualization files stored in devapps/example/data/html
      ant deploy-testbed
      Solr running at http://localhost:8983/solr/
      Example app running at http://localhost:8983/testbed/
    • EXAMPLE SOLR OUTPUT
      http://localhost:8983/solr/select/?q=putin
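      The query URL above can be extended to request facet counts as well. The sketch below only builds such a URL; it assumes the standard Solr request parameters facet=true and facet.field, and that the index exposes the per and loc fields configured later in this tutorial. The class and method names are illustrative, not part of the Diversity Engine.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative helper (not part of the Diversity Engine): builds a Solr
// select URL that also asks for facet counts on the given fields.
final class SolrQueryUrlSketch {

    // URL-encode one parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // base is the Solr URL used in the tutorial, e.g. http://localhost:8983/solr/
    static String selectUrl(String base, String query, String... facetFields) {
        StringBuilder url = new StringBuilder(base)
                .append("select/?q=").append(enc(query));
        if (facetFields.length > 0) {
            url.append("&facet=true");
            for (String field : facetFields) {
                url.append("&facet.field=").append(enc(field));
            }
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // Field names "per" and "loc" are assumed to exist in the index.
        System.out.println(selectUrl("http://localhost:8983/solr/", "putin", "per", "loc"));
    }
}
```

      Opening the printed URL in a browser should return the usual result list plus a facet section, provided the fields are indexed.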
    • EXAMPLE APPLICATION
      http://localhost:8983/testbed/results.jsp?query=putin
    • EXAMPLE DOCUMENT
    • CONFIGURATION FILE
      <lk-application logDir="log" appDir="devapps/example">
        <corpus dir="corpora/examples/smallarc" format="arc"/>
        <image-pipeline>
          <annotators>
          </annotators>
        </image-pipeline>
        <pipeline>
          <annotators>
            <annotator exec="./opennlp"/>
            <annotator exec="./sst"/>
          </annotators>
        </pipeline>
        <visualize/>
        <indexer solrHomeDir="solr/solr"
                 solrDataDir="solr/solr/data"
                 converter="conf/testbed/tutorial-lk2solr.xml"/>
        <searcher appTitle="LivingKnowledge - Example Application"
                  appShortTitle="Example Application"
                  appUrl="http://localhost:8983/solr/">
          <facets>
            <facet field="per" description="Person"/>
            <facet field="loc" description="Location"/>
          </facets>
        </searcher>
      </lk-application>
    • TEXT ANALYSIS
      <pipeline>
        <annotators>
          <annotator exec="./opennlp"/>
          <annotator exec="./sst"/>
        </annotators>
      </pipeline>
      <pipeline>
        <annotators>
          <annotator exec="./opennlp"/>
          <annotator exec="./sst"/>
          <annotator exec="./facts"/>
          <annotator exec="./unitn_tagger"/>
          <annotator exec="./unitn_subjexpr"/>
        </annotators>
      </pipeline>
      apps/testbed –run pipeline conf/testbed/tutorial-application.xml
      apps/testbed –run visualization conf/testbed/tutorial-application.xml
    • TEXT ANALYSIS - FACTS
      devapps/example/data/lkxml/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.facts.xml
    • TEXT ANALYSIS - FACTS
      devapps/example/data/html/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.html
    • IMAGE ANALYSIS
      <image-pipeline>
        <annotators>
          <annotator exec="./soton_haarfacedetector"/>
        </annotators>
      </image-pipeline>
      <pipeline>
        <annotators>
          <annotator exec="./opennlp"/>
          <annotator exec="./sst"/>
          <annotator exec="./facts"/>
          <annotator exec="./unitn_tagger"/>
          <annotator exec="./unitn_subjexpr"/>
          <annotator exec="./imageannots"/>
        </annotators>
      </pipeline>
      apps/testbed –run pipeline,image-pipeline –pipeline imageannots conf/testbed/tutorial-application.xml
      ls devapps/example/data/lkxml/img/*
    • ANALYSIS API
      Documents in LK XML format
      Annotators are passed a single document directory – they should add annotations for each document in the directory
      Files will have a consistent naming convention
      LkText file = id + ".lktext.xml"
      LkMedia = id + ".lkmedia.xml"
      LkAnnotation = id + "." + annotatorId + ".xml"
      Annotators will be processed sequentially in the order listed in the XML file
      Annotators can be written in any language but must run on Linux – helper classes will exist for Java, but there is no obligation to use them.
      Add application calling your new annotator to apps directory
      Add your application to the configuration file as before
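      The naming convention above can be captured in a few helper methods. This is a minimal sketch under the stated convention only; the class LkFileNames and its methods are hypothetical and are not the helper classes shipped with the testbed.

```java
// Hypothetical helpers that encode the file naming convention from the slide.
final class LkFileNames {

    static final String LKTEXT_SUFFIX = ".lktext.xml";
    static final String LKMEDIA_SUFFIX = ".lkmedia.xml";

    // LkText file = id + ".lktext.xml"
    static String lkText(String id) {
        return id + LKTEXT_SUFFIX;
    }

    // LkMedia = id + ".lkmedia.xml"
    static String lkMedia(String id) {
        return id + LKMEDIA_SUFFIX;
    }

    // LkAnnotation = id + "." + annotatorId + ".xml"
    static String lkAnnotation(String id, String annotatorId) {
        return id + "." + annotatorId + ".xml";
    }

    // Recovers the document id from an LkText file name, or null if the
    // name does not follow the convention.
    static String idFromLkText(String fileName) {
        if (!fileName.endsWith(LKTEXT_SUFFIX)) {
            return null;
        }
        return fileName.substring(0, fileName.length() - LKTEXT_SUFFIX.length());
    }

    public static void main(String[] args) {
        String id = idFromLkText("WSJ0375.lktext.xml");
        // An annotator with id "sst" would write its output to this file name.
        System.out.println(lkAnnotation(id, "sst"));
    }
}
```

      An annotator written without the Java helper classes could list the document directory, keep the files for which idFromLkText is not null, and write one annotation file per id.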
    • ANALYSIS API – JAVA
      Extend class org.diversityengine.annotator.AbstractAnnotator
      Implement Methods
      getName()
      getType() - TEXT OR IMAGE
      For Image Analysis implement
      LkAnnotation getLkAnnotation(ImageDocument document)
      For Text Analysis implement
      LkAnnotation getLkAnnotation(TextDocument document)
      In main, instantiate and call annotator
      NewAnnotator annotator = new NewAnnotator();
      annotator.processDirectory(args[0]);
      Add application calling your new annotator to apps directory
      Add your application to the configuration file as before
    • EVALUATION
      Evaluation works with the same configuration file: simply add an evaluation element
      <lk-application logDir="log" appDir="devapps/evaluation">
        <corpus dir="corpora/evaluation/sst/text/" format="bbn"/>
        <pipeline>
          <annotators>
            <annotator exec="./sst"/>
          </annotators>
        </pipeline>
        <evaluation evalDir="evaluation/sst/">
          <evaluator provides="ENTITIES"
                     goldDir="corpora/evaluation/sst/gold/"
                     goldAnnotator="sstgold"
                     annotator="sst" />
        </evaluation>
      </lk-application>
      apps/testbed conf/evaluation/sst.xml
    • EVALUATION RESULTS
      <evaluation goldDir="/home/mikemat/code/livingknowledge/WP6/testbed/corpora/evaluation/sst/gold/"
                  lkDir="/home/mikemat/code/livingknowledge/WP6/testbed/devapps/evaluation/data/lkxml"
                  annotation="sst" goldAnnotation="sstgold" provides="ENTITIES">
        <docs>
          <doc id="WSJ0375" N="19" tp="18" fp="1" fn="1" />
          <doc id="WSJ0380" N="19" tp="15" fp="4" fn="1" />
          <doc id="WSJ0376" N="72" tp="61" fp="11" fn="7" />
          <doc id="WSJ0377" N="26" tp="17" fp="9" fn="6" />
          <doc id="WSJ0378" N="10" tp="10" fp="0" fn="0" />
          <doc id="WSJ0379" N="24" tp="19" fp="5" fn="2" />
        </docs>
        <totals N="170" tp="140" fp="30" fn="17" p="0.8235294117647058" r="0.89171974522293" f="0.8562691131498471" />
      </evaluation>
      cat evaluation/sst/sst.ENTITIES.xml
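      The totals row can be checked by hand. The sketch below uses the standard definitions of precision, recall and F-measure; that the evaluation framework uses exactly these formulas is an assumption, although they do reproduce the p, r and f values reported above from tp=140, fp=30, fn=17. The class name is illustrative.

```java
// Illustrative check of the totals row; not the framework's own code.
final class PrfSketch {

    // Precision = tp / (tp + fp): share of produced annotations that are correct.
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // Recall = tp / (tp + fn): share of gold annotations that were found.
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // F-measure: harmonic mean of precision and recall.
    static double f1(double p, double r) {
        return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Counts copied from the <totals> element of the evaluation report.
        double p = precision(140, 30);
        double r = recall(140, 17);
        System.out.println("p=" + p + " r=" + r + " f=" + f1(p, r));
    }
}
```

      Note that N in the report equals tp + fp, i.e. the number of annotations the pipeline produced.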
    • INDEXING AND SEARCH
      Search Engines - Traditional
      Bag-of-words representation
      Inverted index (words -> documents) for efficiency
      10 docs ranked according to tf-idf similarity with the query
      Search Engines – Today
      Much metadata associated with documents
      Ranking based on 100s of features (date, location, PageRank, click data, personalization, etc.)
      Richer display
      Facets for exploratory search
      Answers when appropriate
      etc.
      Many open source options - Lucene/Solr most widely used
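      The "traditional" search engine described above can be illustrated in a few lines. This is a toy sketch, not how Lucene/Solr is implemented: it keeps an inverted index from terms to documents and ranks by a summed tf * log(N/df) score, which is one common tf-idf variant chosen here as an assumption. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy bag-of-words search engine: inverted index plus tf-idf ranking.
final class TfIdfSketch {

    // Inverted index: term -> (document id -> term frequency in that document).
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private int docCount = 0;

    // Lower-cases the text and splits it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Adds a document and returns its id (ids are assigned 0, 1, 2, ...).
    int add(String text) {
        int docId = docCount++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    // Returns at most k document ids, best score first.
    List<Integer> search(String query, int k) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            Map<Integer, Integer> docs = postings.get(term);
            if (docs == null) {
                continue; // term occurs in no document
            }
            double idf = Math.log((double) docCount / docs.size());
            for (Map.Entry<Integer, Integer> e : docs.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }

    public static void main(String[] args) {
        TfIdfSketch index = new TfIdfSketch();
        index.add("Putin visits Berlin");
        index.add("Berlin hosts election debate");
        index.add("Putin Putin election");
        System.out.println(index.search("putin", 10));
    }
}
```

      Only documents sharing at least one query term are ever touched, which is the efficiency argument for the inverted index.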
    • APACHE LUCENE/SOLR
      Lucene/Solr
    • FACETED SEARCH
      Diagram by Yonik Seeley
    • FACETED SEARCH
      • Summarize query results by aggregating properties of returned pages
      • price ranges for product query
      • related people or locations for news query
      • Exploratory Search
      • Show documents matching the query term and a selected facet
      • Make inferences not clear from simple document list
      • Living Knowledge Analysis is modeled very well by facets
      • Topics as determined by entity and fact extraction
      • Location and Time diversity dimensions
      • Opinions as determined by opinion extraction
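      Both facet operations listed above, summarizing results and drilling down on a selected facet value, can be sketched over an in-memory result list. The documents and field values below are made-up sample data, and the class is illustrative; Solr computes facets inside the index rather than like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative facet aggregation over a list of result documents.
final class FacetCountSketch {

    // A result document is modelled as field name -> list of values.
    static Map<String, List<String>> doc(String[] per, String[] loc) {
        Map<String, List<String>> d = new HashMap<>();
        d.put("per", Arrays.asList(per));
        d.put("loc", Arrays.asList(loc));
        return d;
    }

    // Summarize: in how many result documents does each value of the field occur?
    static Map<String, Integer> facetCounts(List<Map<String, List<String>>> results, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<String>> d : results) {
            List<String> values = d.get(field);
            if (values == null) {
                continue;
            }
            // The set removes repeats so a document counts once per value.
            for (String v : new LinkedHashSet<>(values)) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Drill down: keep only the documents that carry the selected facet value.
    static List<Map<String, List<String>>> drillDown(
            List<Map<String, List<String>>> results, String field, String value) {
        List<Map<String, List<String>>> kept = new ArrayList<>();
        for (Map<String, List<String>> d : results) {
            List<String> values = d.get(field);
            if (values != null && values.contains(value)) {
                kept.add(d);
            }
        }
        return kept;
    }

    // Made-up sample results for a news query.
    static List<Map<String, List<String>>> sampleResults() {
        return Arrays.asList(
                doc(new String[] {"Vladimir Putin", "Angela Merkel"}, new String[] {"Berlin"}),
                doc(new String[] {"Vladimir Putin"}, new String[] {"Moscow"}),
                doc(new String[] {"Angela Merkel"}, new String[] {"Berlin"}));
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> results = sampleResults();
        System.out.println(facetCounts(results, "loc"));
        System.out.println(facetCounts(drillDown(results, "loc", "Berlin"), "per"));
    }
}
```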
    • LK XML TO SOLR
      • Solr has a well-defined XML input format for adding new documents
      • Diversity Engine provides a simple language to map LK XML to Solr XML
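      For orientation, Solr's XML update format wraps each document in add/doc elements with one field element per value. The sketch below emits that shape from a field map; it is not the Diversity Engine converter, and the class name, the id field and the sample values are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative writer for Solr's <add><doc><field name="..."> update format.
final class SolrAddXmlSketch {

    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Multi-valued fields are written as repeated <field> elements.
    static String toAddXml(Map<String, List<String>> fields) {
        StringBuilder sb = new StringBuilder("<add>\n  <doc>\n");
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            for (String value : e.getValue()) {
                sb.append("    <field name=\"").append(escape(e.getKey())).append("\">")
                  .append(escape(value)).append("</field>\n");
            }
        }
        return sb.append("  </doc>\n</add>\n").toString();
    }

    public static void main(String[] args) {
        // Sample values only; "id" is an assumed unique key field.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("id", Arrays.asList("doc1"));
        fields.put("per", Arrays.asList("Vladimir Putin", "Angela Merkel"));
        fields.put("loc", Arrays.asList("Berlin"));
        System.out.print(toAddXml(fields));
    }
}
```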
    • LK2SOLR CONVERSION
      <indexer solrHomeDir="solr/solr"
               solrDataDir="solr/solr/data"
               converter="conf/testbed/tutorial-lk2solr.xml"/>
      <lktosolr>
        <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
               filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
        <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
               filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
        <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
        <field solr="pubdate" annotation="metainfo:lktext" value="date"
               type="date"/>
      </lktosolr>
      solr – Name of the field in Solr
      annotation – Name of the LK XML annotation
      value – Value of annotation
      filter – Allows post-processing on annotation
      type – Only date supported currently
    • ADDING FACTS TO INDEX
      <lktosolr>
        <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
               filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
        <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
               filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
        <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
        <field solr="pubdate" annotation="metainfo:lktext" value="date"
               type="date"/>
        <field solr="yago" annotation="yago-entities" value="$text" />
        <field solr="yago-country" annotation="facts"
               value="xpath:/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()" />
      </lktosolr>
      apps/testbed –run convert-solr conf/testbed/tutorial-application.xml
      ls devapps/example/data/solr/*
      apps/testbed –run index conf/testbed/tutorial-application.xml
    • FACTS TO SOLR
      <field solr="yago" annotation="yago-entities" value="$text" />
    • FACTS TO SOLR
      <field solr="yago-country" annotation="facts"
             value="xpath:/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()" />
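      To see what the XPath in the yago-country mapping selects, it can be run against a small sample with the standard javax.xml.xpath API. The sample XML below is hypothetical: its layout is guessed from the expression itself, and the real facts annotation may be structured differently. The xpath: prefix is part of the converter's syntax and is dropped here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustrative evaluation of the yago-country XPath on made-up sample XML.
final class FactsXPathSketch {

    // The expression from the mapping above, without the converter's "xpath:" prefix.
    static final String COUNTRY_ID_XPATH =
            "/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()";

    // Hypothetical annotation layout, inferred only from the XPath.
    static String sampleEntity(String id, String type) {
        return "<entity-information><id>" + id + "</id>"
                + "<facts><type>" + type + "</type></facts></entity-information>";
    }

    // Returns the string value of the first selected node, or "" if nothing matches.
    static String evaluate(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // An entity typed as a country yields its id ...
        System.out.println(evaluate(sampleEntity("Russia", "wordnet_country_108544813"), COUNTRY_ID_XPATH));
        // ... while any other type yields an empty string (the type name here is made up).
        System.out.println(evaluate(sampleEntity("Vladimir_Putin", "some_other_type"), COUNTRY_ID_XPATH));
    }
}
```

      The predicate in square brackets acts as the filter, so only entities with the country type contribute a value to the yago-country field.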
    • ADDING IMAGES TO INDEX
      <lktosolr>
        <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
               filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
        <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
               filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
        <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
        <field solr="yago" annotation="yago-entities" value="$text" />
        <field solr="yago-country" annotation="facts"
               value="xpath:/entity-information[facts/type/text()='wordnet_country_108544813']/id/text()" />
        <field solr="pubdate" annotation="metainfo:lktext" value="date"
               type="date"/>
        <field solr="image" annotation="IMAGE_ANNOTS" value="$text" />
        <field solr="bestimage" annotation="BEST_IMAGES" value="$text" />
      </lktosolr>
      apps/testbed –run convert-solr conf/testbed/tutorial-application.xml
      ls devapps/example/data/solr/*
      apps/testbed –run index conf/testbed/tutorial-application.xml
    • APPLICATION DEVELOPMENT
      Examples
      HTML Extraction
      Scaling to Large Collections
      Provenance
      Some pluggable GUI components
    • FACT/IMAGE APPLICATION
      <searcher appTitle="LivingKnowledge - Example Application"
                appShortTitle="Example Application"
                appUrl="http://localhost:8983/solr/">
        <facets>
          <facet field="yago" description="Yago"/>
          <facet field="yago-country" description="Country"/>
          <facet field="per" description="Person"/>
          <facet field="loc" description="Location"/>
          <facet field="image" description="Images"/>
        </facets>
      </searcher>
      ant deploy-testbed
    • FACT/IMAGE APPLICATION
      http://localhost:8983/testbed/results.jsp?query=putin
    • OPINION APPLICATION
      Opinions are at sentence level, not document level – same analysis, but different indexing
      cat conf/testbed/tutorial-lk2solr-sentence.xml
      <lktosolr solrDoc="SENTENCES" contextSize="1">
        <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
               filter="org.diversityengine.solr.converter.filters.PerValueFilter"
               source="solrdoc" />
        <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
               filter="org.diversityengine.solr.converter.filters.LocValueFilter"
               source="solrdoc" />
        <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
        <field solr="yago" annotation="yago-entities" value="$text"
               source="solrdoc" />
        <field solr="image" annotation="IMAGE_ANNOTS" value="$text" />
        <field solr="bestimage" annotation="BEST_IMAGES" value="$text" />
        <field solr="pubdate" annotation="metainfo:lktext" value="date"
               type="date"/>
        <field solr="polarity"
               annotation="MPQA-expressive-subjectivity,MPQA-direct-subjective"
               value="xpath:/node()[@pol]/@pol" source="solrdoc"
               filter="org.diversityengine.solr.converter.filters.PolarityValueFilter"/>
        <field solr="pol-int"
               annotation="MPQA-expressive-subjectivity,MPQA-direct-subjective"
               value="xpath:concat(/node()[@pol and @int]/@pol,/node()[@int and @pol]/@int)"
               source="solrdoc"/>
      </lktosolr>
      apps/testbed –run convert-solr,index conf/testbed/tutorial-application-sentence.xml
      ls devapps/example/data/solr/*
    • SOLR XML – SENTENCE
    • OPINION APPLICATION
      Modify webapp/WEB-INF/web.xml
      <web-app xmlns="http://java.sun.com/xml/ns/javaee"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
               version="2.5">
        <description>
          LivingKnowledge Testbed Example Application
        </description>
        <display-name>Testbed Examples</display-name>
        <context-param>
          <param-name>applicationDef</param-name>
          <param-value>conf/testbed/tutorial-application-sentence.xml</param-value>
          <description>The Living Knowledge application description XML file</description>
        </context-param>
      </web-app>
      ant deploy-testbed
    • OPINION APPLICATION
      http://localhost:8983/testbed/results.jsp?query=putin
    • HTML EXTRACTION
    • HTML EXTRACTION
      Boilerplate can lead to false positive results and inaccurate facet aggregation
      Real example – before extraction was developed, the most common person for most queries was one named in a top story title (shown on all pages) on the day of the crawl!
      Titles, authors and dates are important for bias- and diversity-aware search
    • PROVENANCE
      How an annotation is derived is often as important as the annotation itself
      Users want to verify results
      Developers need to validate results
      Open Provenance provides an open source solution
      Testbed annotations can be extended with Open Provenance chains
    • Provenance Diagram
    • SCALING TO LARGE COLLECTIONS
      In the real world, even “small” datasets have millions of documents
      NLP/image processing is expensive – 1 doc/sec = over 11 days for 1 million docs!
      Hadoop Mapper allows for scaling – scales linearly with number of machines
      ZipCollection writer allows partitioning data into subsets for processing
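      The arithmetic behind the slide is easy to reproduce. The sketch below assumes perfectly linear scaling, as the slide states, and ignores job start-up and data transfer costs; the class and method names are illustrative.

```java
// Back-of-the-envelope processing time under ideal linear scaling.
final class ScalingSketch {

    static final double SECONDS_PER_DAY = 24 * 60 * 60;

    // Days needed to process `docs` documents when each of `machines` machines
    // handles `docsPerSecond` documents per second.
    static double daysNeeded(long docs, double docsPerSecond, int machines) {
        double seconds = docs / (docsPerSecond * machines);
        return seconds / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long docs = 1000000L;
        for (int machines : new int[] {1, 10, 100}) {
            System.out.println(machines + " machine(s): "
                    + daysNeeded(docs, 1.0, machines) + " days");
        }
    }
}
```

      One machine at 1 doc/sec needs about 11.6 days for a million documents; ten machines bring that down to a little over one day.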
    • COMPONENTS - OPINIONS
    • COMPONENTS - TIME
    • COMPONENTS - GEO
    • FUTURE WORK
      More components
      Maven to manage dependencies
      Better integration of Timeline and Geo visualization components
      Integration of ranking algorithms
      Better documentation
    • Thanks!
      LivingKnowledge Partners!
      You for coming!!
      Questions?