Information Retrieval Challenges

Invited lecture at POSI (Post Graduation on Information Systems at INESC, Lisbon, Portugal) about Information Retrieval Challenges: real-time retrieval, context awareness and inferring identity from content.

Published in: Technology

  1. Information Retrieval Challenges
     Bruno Pedro, March 2010
  2. Bruno Pedro
     An experienced Web developer and entrepreneur. Has an extensive background in large-scale projects and technical writing.
     http://tarpipe.com/user/bpedro
  3. What is tarpipe?
     User
  4. What is tarpipe?
  5. 3 Challenges
     • Real-Time Retrieval
     • Understanding Context
     • Inferring Identity
  6. Real-Time Retrieval
     http://www.flickr.com/photos/josephrobertson/127758523/
  7. WordPress (source: wordpress.com)
     • Average ~10K posts/hour
     • ~3 new posts every second
  8. twitter (source: mashable.com)
     • Average ~1.1M tweets/hour
     • ~300 new tweets every second
  9. Challenge
     • ~300 reads/second
     • 160 bytes × 300 messages = 48 KB/second ≈ 4 GB/day (approximate calculation)
     • How to process all this information?
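The estimate on slide 9 can be reproduced with a few lines of Python. This is only a sanity check of the arithmetic: the per-hour figure comes from slide 8, while reading "160" as bytes per message and all constant and function names are assumptions, not something the slides state.

```python
# Back-of-the-envelope check of the numbers on slides 8 and 9.
TWEETS_PER_HOUR = 1_100_000   # from slide 8 (source: mashable.com)
BYTES_PER_MESSAGE = 160       # assumption: the slide's "160" read as bytes per message

SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def messages_per_second(per_hour: int) -> float:
    """Convert an hourly message count into an average per-second rate."""
    return per_hour / SECONDS_PER_HOUR


def bytes_per_day(rate_per_second: float, size_bytes: int) -> float:
    """Total volume per day for a steady rate of fixed-size messages."""
    return rate_per_second * size_bytes * SECONDS_PER_DAY


rate = messages_per_second(TWEETS_PER_HOUR)
rounded_rate = 300  # the slide rounds the rate down to ~300 per second
per_second = rounded_rate * BYTES_PER_MESSAGE
per_day = bytes_per_day(rounded_rate, BYTES_PER_MESSAGE)

print(f"{rate:.1f} messages/s, {per_second / 1000:.0f} KB/s, {per_day / 1e9:.2f} GB/day")
```

With the rounded rate, the result is 48,000 bytes per second and roughly 4.15 GB per day, which matches the slide's "4 GB/day" once rounded.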
  10. Strategy
      • Read and store immediately:
        • High performance write storage
        • No locks allowed
        • Prepare for lots of reading errors
  11. Strategy
      • Process later:
        • Regular expressions
        • Term extraction
        • Machine learning
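The two Strategy slides (10 and 11) describe a split between a cheap write path and a deferred processing pass. A minimal sketch of that split follows, under several assumptions: the slides do not name a storage engine, so an append-only text file stands in for "high performance write storage"; "reading errors" is interpreted here as records that fail to parse; and the function names, the JSON record shape and both regular expressions are hypothetical.

```python
import json
import re
import tempfile
from pathlib import Path

# Hypothetical patterns for the "process later" pass; the slides only say
# "regular expressions" without listing any.
MENTION = re.compile(r"(?<!\w)@(\w+)")
URL = re.compile(r"https?://\S+")


def store(log_path: Path, raw: str) -> None:
    """Write path: append the raw payload as one line, without parsing it."""
    with log_path.open("a", encoding="utf-8") as log:
        log.write(raw.replace("\n", " ") + "\n")


def process(log_path: Path):
    """Deferred pass: parse each stored record and skip the ones that fail."""
    results, errors = [], 0
    with log_path.open(encoding="utf-8") as log:
        for line in log:
            try:
                text = json.loads(line)["text"]
            except (json.JSONDecodeError, KeyError, TypeError):
                errors += 1  # tolerate bad records instead of stopping
                continue
            results.append({"mentions": MENTION.findall(text), "urls": URL.findall(text)})
    return results, errors


# Usage with one well-formed and one broken record (both invented for illustration).
log_path = Path(tempfile.mkdtemp()) / "incoming.log"
store(log_path, json.dumps({"text": "reading @bpedro on http://tarpipe.com/"}))
store(log_path, "{truncated payload")
results, errors = process(log_path)
```

Keeping the write path free of parsing means a malformed message costs nothing at ingestion time; it is only counted, and skipped, when the later pass reads it back.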
  12. Context
      "For me context is the key - from that comes the understanding of everything." (Kenneth Noland)
  13. Dogs? (source: joelonsoftware.com)
  14. Dogs with unmatched title? (source: Google Reader Play)
  15. Still doesn’t make a lot of sense... (source: Google Buzz)
  16. This is the worst case scenario
  17. Challenge
      • Find context from associated content:
        • Pictures
        • Comments
        • Location information
        • Timelines
        • Authors
  18. Strategy
      • Associate content through common identifiers
      • Establish timeline of different pieces
      • Group pieces by same author
      • Present in a comprehensible fashion
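The strategy on slide 18 (group by author, order by time, present the result) could be sketched as below. The `Piece` record, its field names and the sample data are all invented for illustration; the slides do not describe a data model, and a shared `author` string is assumed to be the "common identifier".

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Piece:
    author: str        # hypothetical normalized author identifier
    kind: str          # e.g. "picture", "comment", "location"
    created: datetime
    body: str


def timelines(pieces):
    """Group pieces by author and order each group chronologically."""
    grouped = defaultdict(list)
    for piece in pieces:
        grouped[piece.author].append(piece)
    return {author: sorted(items, key=lambda p: p.created)
            for author, items in grouped.items()}


def render(timeline):
    """Present one author's timeline as readable lines."""
    return [f"{p.created:%H:%M} {p.kind}: {p.body}" for p in timeline]


# Invented sample data: a picture and a later comment by the same author.
pieces = [
    Piece("user", "comment", datetime(2010, 3, 1, 10, 5), "Dogs?"),
    Piece("user", "picture", datetime(2010, 3, 1, 10, 0), "dogs.jpg"),
    Piece("other", "location", datetime(2010, 3, 1, 9, 30), "Lisbon"),
]
by_author = timelines(pieces)
lines = render(by_author["user"])
```

Read in order, the picture now precedes the comment that refers to it, which is the kind of context the isolated "Dogs?" title on slide 13 lacks.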
  19. Identity (source: abc Australia)
  20. Many Identifiers
      • E-mail: user@example.com
      • facebook: @User Name
      • flickr: user or User Name (?)
      • Google Buzz: @user@example.com
      • twitter: @user
      ...
  21. Addressable
      • http://facebook.com/user
      • http://flickr.com/user
      • http://www.google.com/profiles/user
      • http://twitter.com/user
      ...
  22. How to make sense?
  23. Challenge
      • Parse every message, tweet or post
      • Find possible user identifiers
      • Substitute for meaningful information:
        • A link to the original profile
        • Equivalent identity on destination
  24. Strategy
      • Decentralized processing:
        • Browser based (plugin)
        • Extract identities from page
        • Process
        • Replace with meaningful information
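The "extract identities" and "replace with meaningful information" steps from slides 23 and 24 can be illustrated with a small matcher. The slides propose doing this in a browser plugin; the Python below only sketches the matching step. The URL templates come from slide 21, but the regular expression, the function names and the choice to leave e-mail and Buzz identifiers untouched are assumptions. Display names with spaces, such as facebook's "@User Name", are not handled.

```python
import re

# URL templates taken from the "Addressable" slide.
PROFILE_URL = {
    "facebook": "http://facebook.com/{user}",
    "flickr": "http://flickr.com/{user}",
    "google": "http://www.google.com/profiles/{user}",
    "twitter": "http://twitter.com/{user}",
}

# One combined pattern, so "@user@example.com" is tried before the shorter forms.
IDENTIFIER = re.compile(
    r"(?<![\w@.])(?:"
    r"@(?P<buzz>\w+)@[\w.-]+\.\w+"            # Google Buzz style: @user@example.com
    r"|(?P<email>\w[\w.+-]*@[\w.-]+\.\w+)"    # plain e-mail address
    r"|@(?P<handle>\w+)"                      # twitter style: @user
    r")"
)


def find_identities(text):
    """Return (kind, value) pairs for every identifier found in the text."""
    return [(m.lastgroup, m.group(m.lastgroup)) for m in IDENTIFIER.finditer(text)]


def link_handles(text, service):
    """Replace @handle mentions with the profile URL of the given service."""
    template = PROFILE_URL[service]

    def replace(match):
        if match.lastgroup != "handle":
            return match.group(0)  # leave e-mail and Buzz forms as they are
        return template.format(user=match.group("handle"))

    return IDENTIFIER.sub(replace, text)


found = find_identities("cc @user@example.com and @bpedro, mail user@example.com")
linked = link_handles("thanks @bpedro!", "twitter")
```

The `service` argument reflects the ambiguity shown on slide 20: the same "@user" text means different things depending on where the message originated, so the caller has to supply that context.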
  25. Food for thought
      • PubsubHubbub http://code.google.com/p/pubsubhubbub/
      • Activity Streams http://activitystrea.ms/
      • Web Finger http://webfinger.org/
  26. "tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop." (Adam Pash, lifehacker)
      "tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload." (Rafe Needleman, CNET news)
      "Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation." (Jeff Barr, Amazon.com)
      thank you
      share your life
