Harvesting and semantically tagging media releases from political websites using web services

Presented at VALA2012 by Peter Neish on February 9, 2012, describing how media releases were automatically harvested from political websites by polling the RSS feeds of relevant sites. Media releases were semantically tagged using the OpenCalais web service.

  • Talk about the Parliamentary Library: Established in 1851, building itself 1858–60
  • What will be covered in today’s talk

Presentation Transcript

  • Harvesting and semantically tagging media releases from political websites using web services
    • Peter Neish, Systems Officer, Victorian Parliamentary Library
    • @peterneish
  • What will be covered
    • Background – why are we interested in media releases
    • What we did and how we went about it
      • Part 1: Automatic Harvester
      • Part 2: Semantic Tagging
    • Lessons Learnt, where to from here – that kind of thing
  • About the library
    • Established in 1851
    • Clients:
      • Members of Parliament and their staff
      • Department of Parliamentary Services (especially committees)
      • Academics and public
    • Approx. 25 staff in client support, research, reference, tech services and e-services.
  • Media Releases
    • Media releases play an important part in the political process
    • Establish a party’s position on an issue at a particular point in time
    • Often used in reference requests
    • Library established a media release database in 1992
  • Number of media releases per year [chart: counts from 0 to 7,000; years 1992 to 2010]
  • Media releases by party [chart: counts from 0 to 5,000; years 1993 to 2011; series: ALP, Coalition, Green, Independent]
  • Project aims
    • Automate the process of adding media releases to the database
    • Combine our different databases together – better user experience
    • Examine possibility of automatically applying tags to media releases using web services
  • Part 1: Automation
    • Political parties have websites
    • Websites run on a Content Management System (CMS)
    • The CMS provides an RSS feed
    • RSS can be used as a standard input to software
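The idea of treating RSS as a standard input can be sketched as follows. This is a minimal illustration only: the talk used Java with the Rome parser, whereas this sketch swaps in Python's standard library, and the sample feed, the `example.org` URLs and the function names are all hypothetical. It parses an RSS 2.0 document and returns only the items whose links have not been seen on an earlier poll.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content standing in for a party website's RSS feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example party media releases</title>
    <item>
      <title>Release one</title>
      <link>http://example.org/media/1</link>
      <pubDate>Mon, 05 Jul 2010 09:00:00 +1000</pubDate>
    </item>
    <item>
      <title>Release two</title>
      <link>http://example.org/media/2</link>
      <pubDate>Tue, 06 Jul 2010 10:30:00 +1000</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return a list of dicts (title, link, pub_date), one per RSS item."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "pub_date": (item.findtext("pubDate") or "").strip(),
        })
    return items


def new_items(feed_xml, seen_links):
    """Return items whose link has not been harvested yet, and record them."""
    fresh = []
    for item in parse_items(feed_xml):
        if item["link"] and item["link"] not in seen_links:
            seen_links.add(item["link"])
            fresh.append(item)
    return fresh
```

Calling `new_items` twice with the same feed and the same `seen_links` set returns the items once and then an empty list, which is the behaviour a polling harvester needs to avoid adding duplicates.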
  • What we built [diagram labels: Servlet; Polls RSS feed for links; wkhtmltopdf; Metadata; DB/Textworks]
  • Technologies used
    • Java / Tomcat
    • Rome RSS parser https://rome.dev.java.net/
    • wkhtmltopdf http://code.google.com/p/wkhtmltopdf/
    • DB/Textworks
    • MySQL (for semantic tags – later)
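As a rough illustration of the capture step, the sketch below builds a wkhtmltopdf command line for one harvested link. It relies only on the tool's basic usage of an input URL followed by an output file. The file-naming scheme and the function names are assumptions made for this example, and Python is used here although the talk's servlet was written in Java.

```python
import hashlib
import subprocess


def pdf_name_for(link):
    """Derive a stable file name from the release URL (hypothetical scheme)."""
    digest = hashlib.sha1(link.encode("utf-8")).hexdigest()[:12]
    return "release-" + digest + ".pdf"


def wkhtmltopdf_command(link, output_path):
    """Build the command line: wkhtmltopdf <input URL> <output file>."""
    return ["wkhtmltopdf", link, output_path]


def capture(link, output_dir="."):
    """Run wkhtmltopdf; requires the binary to be installed and on PATH."""
    output_path = output_dir.rstrip("/") + "/" + pdf_name_for(link)
    subprocess.run(wkhtmltopdf_command(link, output_path), check=True)
    return output_path
```

Hashing the URL gives the same file name on every run, so a release fetched twice overwrites its earlier PDF instead of creating a second copy.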
  • Results
    • It works – since July 2010 we’ve harvested 11,000 media releases
    • Saved c. 2 days of staff time per week
  • Problems
    • Non-standard content in feeds (e.g. dates)
      • Addressed with Yahoo! Pipes
    • Websites changing their structure or CMS
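The talk handled non-standard feed content with Yahoo! Pipes. As an alternative illustration only, the sketch below shows how date strings might be normalised in code. The fallback formats are assumptions (the slides do not list which formats were encountered), and the day-first reading of slash dates is assumed because the sites are Australian.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Fallback patterns are assumptions; the slides do not list the formats seen.
FALLBACK_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y"]


def normalise_date(raw):
    """Return an ISO 8601 date (YYYY-MM-DD), or None if nothing matches."""
    text = raw.strip()
    try:
        # RFC 822 style dates are what RSS 2.0 expects in pubDate.
        return parsedate_to_datetime(text).date().isoformat()
    except (TypeError, ValueError):
        pass
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning `None` for unrecognised input lets the harvester flag a release for manual checking instead of storing a wrong date.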
  • Part 2: Semantic Tagging
    • Increasing number of media releases meant that manual indexing was too time consuming
    • Examined ways of automatically tagging media releases without human intervention
    • Services examined:
      • Alchemy API
      • Evri
      • OpenAmplify
      • OpenCalais
      • Yahoo Term Extractor
      • Zemanta
  • Open Calais
    • Product of Thomson Reuters – focus is on news articles
    • Good number of tags (not too low or high)
    • Minimal false matches
    • Good documentation and community
    • Generous limits on API calls
    • However: closed box (the algorithm is secret), and the company appears to have recently scaled back development
  • Example Open Calais
    • http://viewer.opencalais.com/
  • Number of tags assigned by OpenCalais [chart: axes "Tags per item" (0 to 120) and "Total number" (0 to 4,500)]
  • User interface
  • Tag quality [chart: values 85%, 4%, 6%, 5%; categories: Correct Tags, Incorrect Tags, Repeated Tags, Redundant Tags]
  • Problems - disambiguation
    • Victoria, Australia
      • Population: 5.5 million
      • Area: 237,629 km²
      • Google hits: 48 million
    • Victoria, Seychelles
      • Population: 25,000
      • Area: 451 km²
      • Google hits: 2.5 million
    • Photo by http://www.flickr.com/photos/meckimac/2971992/
    • Photo by http://www.flickr.com/photos/eclogite/257560117/
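One possible way to reduce this kind of ambiguity is a simple post-processing rule that prefers a place in the collection's home country. The sketch below is not described in the talk and is not part of OpenCalais. The candidate structure, the `HOME_COUNTRY` constant and the function name are all hypothetical.

```python
# Assumption: the collection is about Victoria, Australia.
HOME_COUNTRY = "Australia"


def pick_place(candidates, home_country=HOME_COUNTRY):
    """Prefer a candidate in the home country; else keep the first candidate."""
    for place in candidates:
        if place["country"] == home_country:
            return place
    return candidates[0] if candidates else None
```

A rule like this would mislabel a release that really is about the Seychelles, so it is a trade-off that suits a collection with a strong local focus.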
  • Linked Data
    • OpenCalais links to its own ontology (rich in data for companies, but other classes have limited data)
    • SameAs or web links to:
      • DBpedia
      • Wikipedia
      • Freebase
      • Reuters.com
      • GeoNames
      • Shopping.com
      • IMDB
      • LinkedMDB
  • Conclusion
    • Been able to save 2 days per week of staff time with automatic harvesting
    • Media releases now available as they are released (no backlog)
    • Using web services we have been able to tag content with meaningful semantic tags to enrich our data
    • Can link our databases together using common tags and to other databases in the Linked Data ecosystem
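Linking databases through common tags can be pictured as an inverted index from each tag to the records that carry it. The sketch below is a hypothetical illustration: the record layout, database names and function names are assumptions, not the library's actual schema.

```python
from collections import defaultdict


def build_tag_index(records):
    """Map each normalised tag to the set of (database, record_id) pairs."""
    index = defaultdict(set)
    for record in records:
        for tag in record["tags"]:
            index[tag.strip().lower()].add((record["db"], record["id"]))
    return index


def related(index, tag):
    """Return the records sharing a tag, sorted for stable output."""
    return sorted(index.get(tag.strip().lower(), set()))
```

Looking up one tag then returns matching records from every database in a single list, which is the cross-database link the conclusion describes.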