Information Retrieval Challenges

Invited lecture at POSI (Post Graduation on Information Systems at INESC, Lisbon, Portugal) about Information Retrieval Challenges: real-time retrieval, context awareness and inferring identity from content.

Usage Rights

© All Rights Reserved

  • Speaker note for slide 24 (Strategy: decentralized processing): Google Chrome plugin; get identity information from the Google Social Graph API or other means.

Information Retrieval Challenges: Presentation Transcript

  • Information Retrieval Challenges • Bruno Pedro • March 2010
  • Bruno Pedro • An experienced Web developer and entrepreneur. Has extensive background in large-scale projects and technical writing. http://tarpipe.com/user/bpedro
  • What is tarpipe? User
  • What is tarpipe?
  • 3 Challenges • Real-Time Retrieval • Understanding Context • Inferring Identity
  • Real-Time Retrieval http://www.flickr.com/photos/josephrobertson/127758523/
  • WordPress source: wordpress.com • Average ~10K posts/hour • ~3 new posts every second
  • twitter source: mashable.com • Average ~1.1M tweets/hour • ~300 new tweets every second
  • Challenge • ~300 reads/second • 160 bytes X 300 = 48 KB/second, or about 4 GB/day (approximate calculation) • How to process all this information?
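
The estimate on this slide can be reproduced with a few lines of arithmetic. This is only a sketch: the 160 bytes per message and 300 messages per second are the slide's own round figures, and the function and constant names are made up for illustration.

```python
# Back-of-the-envelope ingest volume, using the round figures from the slide.
BYTES_PER_MESSAGE = 160       # assumed size of one message, as on the slide
MESSAGES_PER_SECOND = 300     # ~300 new tweets every second
SECONDS_PER_DAY = 24 * 60 * 60


def ingest_volume(bytes_per_message, messages_per_second):
    """Return (bytes per second, bytes per day) for a steady message stream."""
    per_second = bytes_per_message * messages_per_second
    return per_second, per_second * SECONDS_PER_DAY


per_second, per_day = ingest_volume(BYTES_PER_MESSAGE, MESSAGES_PER_SECOND)
print(f"{per_second / 1000:.0f} KB/second")  # 48 KB/second
print(f"{per_day / 1e9:.2f} GB/day")         # 4.15 GB/day
```

The daily figure comes out slightly above 4 GB, which matches the slide's "approximate calculation".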
  • Strategy • Read and store immediately: • High performance write storage • No locks allowed • Prepare for lots of reading errors
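
One possible reading of "read and store immediately" is a single writer that appends every raw message to a log without parsing it, and that counts failed reads instead of stopping. The sketch below assumes that reading; `store_raw`, `ok` and `broken` are hypothetical names, and an in-memory buffer stands in for an append-only file.

```python
import io


def store_raw(fetchers, log):
    """Append each raw message to `log` unparsed; count and skip failed reads."""
    stored, errors = 0, 0
    for fetch in fetchers:
        try:
            raw = fetch()
        except OSError:
            errors += 1  # reading errors are expected, so keep going
            continue
        # One message per line; no parsing happens at this stage.
        log.write(raw.replace("\n", " ") + "\n")
        stored += 1
    return stored, errors


def ok(text):
    """Hypothetical source that returns one message successfully."""
    return lambda: text


def broken():
    """Hypothetical source that fails, simulating a network timeout."""
    raise TimeoutError("simulated network timeout")


log = io.StringIO()  # stand-in for a file opened in append mode
stored, errors = store_raw([ok("first post"), broken, ok("second\npost")], log)
print(stored, errors)  # 2 1
```

Because there is only one writer and it only ever appends, nothing here waits on a reader; a real system would need a storage engine that offers the same property under load.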
  • Strategy • Process later: • Regular expressions • Term extraction • Machine learning
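
The "process later" step could start with something as simple as the sketch below, which covers the first two techniques on the slide: regular expressions pull out URLs and @mentions, and a word count serves as a naive form of term extraction. The patterns, the stopword list and the `extract` function are illustrative assumptions, not the approach used in the talk.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")
WORD_RE = re.compile(r"[a-z']+")
# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "at", "with"}


def extract(message):
    """Split a stored message into URLs, mentions and counted terms."""
    urls = URL_RE.findall(message)
    text = URL_RE.sub(" ", message)        # drop URLs before looking for words
    mentions = MENTION_RE.findall(text)
    text = MENTION_RE.sub(" ", text)       # drop mentions as well
    words = WORD_RE.findall(text.lower())
    terms = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return {"urls": urls, "mentions": mentions, "terms": terms}


result = extract(
    "Walking the dogs with @user at the park, dogs love it http://example.com/p/1"
)
print(result["urls"])                  # ['http://example.com/p/1']
print(result["mentions"])              # ['user']
print(result["terms"].most_common(1))  # [('dogs', 2)]
```

The extracted terms could then feed the machine learning step, which is out of scope for this sketch.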
  • Context For me context is the key - from that comes the understanding of everything. — Kenneth Noland
  • source: joelonsoftware.com Dogs?
  • source: Google Reader Play Dogs with unmatched title?
  • source: Google Buzz Still doesn’t make a lot of sense...
  • This is the worst case scenario
  • Challenge • Find context from associated content: • Pictures • Comments • Location information • Timelines • Authors
  • Strategy • Associate content through common identifiers • Establish timeline of different pieces • Group pieces by same author • Present in a comprehensible fashion
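
The first three steps of this strategy can be sketched as a grouping pass over stored pieces of content. The `Piece` record, its fields and the sample data are assumptions made for the example; in particular, `ref` stands for whatever common identifier links the pieces, for instance the URL of the item being discussed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    author: str
    timestamp: int  # seconds since the epoch
    kind: str       # e.g. "picture", "comment", "post"
    ref: str        # common identifier shared by related pieces
    text: str


def build_context(pieces):
    """Group pieces by common identifier, then order and group each set."""
    by_ref = defaultdict(list)
    for piece in pieces:
        by_ref[piece.ref].append(piece)

    context = {}
    for ref, group in by_ref.items():
        timeline = sorted(group, key=lambda p: p.timestamp)
        by_author = defaultdict(list)
        for piece in timeline:
            by_author[piece.author].append(piece)
        context[ref] = {"timeline": timeline, "by_author": dict(by_author)}
    return context


pieces = [
    Piece("alice", 300, "comment", "http://example.com/photo/1", "Cute dogs!"),
    Piece("bob", 100, "picture", "http://example.com/photo/1", "Two dogs on a sofa"),
    Piece("alice", 200, "post", "http://example.com/photo/1", "Look at this"),
    Piece("carol", 150, "post", "http://example.com/other", "Unrelated"),
]
context = build_context(pieces)
photo = context["http://example.com/photo/1"]
print([p.kind for p in photo["timeline"]])  # ['picture', 'post', 'comment']
print(sorted(photo["by_author"]))           # ['alice', 'bob']
```

Presenting the result "in a comprehensible fashion" is a design problem that this sketch does not address.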
  • Identity source: abc Australia
  • Many Identifiers • E-mail: user@example.com • facebook: @User Name • flickr: user or User Name (?) • Google Buzz: @user@example.com • twitter: @user ...
  • Addressable • http://facebook.com/user • http://flickr.com/user • http://www.google.com/profiles/user • http://twitter.com/user ...
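
Since each of these identifiers is addressable, a lookup table is enough to turn a (service, user) pair into a profile URL. The templates below are copied from this slide and reflect the URL schemes as listed there; the table and function names are hypothetical.

```python
from urllib.parse import quote

# URL patterns taken from the slide; they may not match the services today.
PROFILE_URL_TEMPLATES = {
    "facebook": "http://facebook.com/{user}",
    "flickr": "http://flickr.com/{user}",
    "google": "http://www.google.com/profiles/{user}",
    "twitter": "http://twitter.com/{user}",
}


def profile_url(service, user):
    """Return the profile URL for `user` on `service`, or None if unknown."""
    template = PROFILE_URL_TEMPLATES.get(service.lower())
    if template is None:
        return None
    # Strip a leading "@" and escape anything unsafe in a URL path segment.
    return template.format(user=quote(user.lstrip("@"), safe=""))


print(profile_url("twitter", "@user"))  # http://twitter.com/user
print(profile_url("Google", "user"))    # http://www.google.com/profiles/user
print(profile_url("unknown", "user"))   # None
```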
  • How to make sense?
  • Challenge • Parse every message, tweet or post • Find possible user identifiers • Substitute for meaningful information: • A link to the original profile • Equivalent identity on destination
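
A minimal sketch of the parse-and-substitute step follows, limited to twitter-style @user mentions and written in Python for illustration. The pattern, the 15-character limit and the `link_mentions` function are assumptions; the lookarounds are there so that e-mail addresses and Google Buzz identifiers of the form @user@example.com are left untouched.

```python
import re

# "@" followed by 1 to 15 word characters, not glued to a word or another "@".
MENTION_RE = re.compile(r"(?<![\w@])@(\w{1,15})(?![\w@])")


def link_mentions(message, base_url="http://twitter.com/"):
    """Replace each @user mention with a link to the original profile."""

    def to_link(match):
        user = match.group(1)
        return f'<a href="{base_url}{user}">@{user}</a>'

    return MENTION_RE.sub(to_link, message)


linked = link_mentions("Lunch with @user, mail me at user@example.com")
print(linked)
# Lunch with <a href="http://twitter.com/user">@user</a>, mail me at user@example.com
```

Mapping a mention to the equivalent identity on a destination service needs an identity source in addition to pattern matching, which this sketch does not attempt.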
  • Strategy • Decentralized processing: • Browser based (plugin) • Extract identities from page • Process • Replace with meaningful information
  • Food for thought • PubSubHubbub http://code.google.com/p/pubsubhubbub/ • Activity Streams http://activitystrea.ms/ • WebFinger http://webfinger.org/
  • "tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop." (Adam Pash, lifehacker) • "tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload." (Rafe Needleman, CNET news) • "Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation." (Jeff Barr, Amazon.com) • thank you share your life