Information Retrieval Challenges


Invited lecture at POSI (Post Graduation on Information Systems at INESC, Lisbon, Portugal) about Information Retrieval Challenges: real-time retrieval, context awareness and inferring identity from content.

Published in: Technology
  • - Google Chrome plugin
    - Get identity information from Google Social Graph API or other means


  • Information Retrieval Challenges

    1. Information Retrieval Challenges
       Bruno Pedro, March 2010
    2. Bruno Pedro
       An experienced Web developer and entrepreneur. Has extensive background in large scale projects and technical writing.
       http://tarpipe.com/user/bpedro
    3. What is tarpipe? (diagram label: User)
    4. What is tarpipe?
    5. 3 Challenges
       • Real-Time Retrieval
       • Understanding Context
       • Inferring Identity
    6. Real-Time Retrieval
       http://www.flickr.com/photos/josephrobertson/127758523/
    7. WordPress (source: wordpress.com)
       • Average ~10K posts/hour
       • ~3 new posts every second
    8. twitter (source: mashable.com)
       • Average ~1.1M tweets/hour
       • ~300 new tweets every second
    9. Challenge
       • ~300 reads/second
       • 160 × 300 = 48 KB/second ≈ 4 GB/day (approximate calculation)
       • How to process all this information?
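
The approximate calculation on slide 9 can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 160 on the slide means bytes per message (the slide gives no unit) and uses decimal units (1 KB = 1,000 bytes).

```python
# Back-of-the-envelope throughput estimate using the figures from slides 8 and 9.
TWEETS_PER_HOUR = 1_100_000   # "~1.1M tweets/hour" (slide 8)
READS_PER_SECOND = 300        # "~300 reads/second" (slide 9)
BYTES_PER_MESSAGE = 160       # assumption: the slide's 160 is bytes per message

tweets_per_second = TWEETS_PER_HOUR / 3600
bytes_per_second = BYTES_PER_MESSAGE * READS_PER_SECOND
bytes_per_day = bytes_per_second * 86_400   # seconds in a day

print(f"{tweets_per_second:.0f} tweets/second")   # 306 tweets/second
print(f"{bytes_per_second / 1000:.0f} KB/second")  # 48 KB/second
print(f"{bytes_per_day / 1e9:.2f} GB/day")         # 4.15 GB/day
```

The exact daily figure is about 4.15 GB, which the slide rounds to 4 GB/day.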
    10. Strategy
        • Read and store immediately:
          • High performance write storage
          • No locks allowed
          • Prepare for lots of reading errors
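
One way to read the "store immediately, no locks" strategy on slide 10 is an append-only log: each incoming message is written as one line, untouched, and readers skip anything they cannot decode. The sketch below is an illustration, not the storage tarpipe actually used. The helper names `append_raw` and `read_tolerant`, the JSON-lines layout, and the reading of "reading errors" as damaged records are all assumptions.

```python
import json
import os
import tempfile
import time

def append_raw(path, payload):
    """Append one raw message as a JSON line; no parsing or indexing at write time."""
    record = json.dumps({"received_at": time.time(), "raw": payload})
    # O_APPEND makes the OS position every write at the end of the file,
    # so writers on POSIX systems do not need an explicit lock for small records.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, (record + "\n").encode("utf-8"))
    finally:
        os.close(fd)

def read_tolerant(path):
    """Yield stored records, skipping lines that fail to decode."""
    with open(path, "rb") as handle:
        for line in handle:
            try:
                yield json.loads(line.decode("utf-8"))
            except (UnicodeDecodeError, json.JSONDecodeError):
                continue  # damaged record: skip it rather than stop the reader

# Usage: two good records with a simulated damaged one in between.
path = os.path.join(tempfile.mkdtemp(), "stream.log")
append_raw(path, "first message")
with open(path, "ab") as handle:
    handle.write(b"{truncated\n")
append_raw(path, "second message")
stored = [record["raw"] for record in read_tolerant(path)]
```

Keeping the write path this thin defers all interpretation to a later processing step, at the cost of storing some records that turn out to be unusable.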
    11. Strategy
        • Process later:
          • Regular expressions
          • Term extraction
          • Machine learning
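
The first two "process later" techniques on slide 11 can be sketched with the standard library alone. The patterns, the stopword list and the function name `extract_features` below are illustrative assumptions rather than anything taken from the talk, and the machine learning step is left out.

```python
import re
from collections import Counter

# Illustrative patterns; real services have stricter identifier rules.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"(?<!\w)@(\w+)")
HASHTAG = re.compile(r"(?<!\w)#(\w+)")
WORD = re.compile(r"[a-z']+")
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "at", "my", "with"}

def extract_features(text):
    """Pull URLs, mentions and hashtags out with regexes, then count remaining terms."""
    urls = URL.findall(text)
    without_urls = URL.sub(" ", text)
    mentions = MENTION.findall(without_urls)
    hashtags = HASHTAG.findall(without_urls)
    plain = HASHTAG.sub(" ", MENTION.sub(" ", without_urls)).lower()
    terms = Counter(
        word for word in WORD.findall(plain)
        if word not in STOPWORDS and len(word) > 2
    )
    return {"urls": urls, "mentions": mentions, "hashtags": hashtags, "terms": terms}

features = extract_features(
    "Walking the dog with @anna at the park #dogs http://example.com/p/1 dog park"
)
```

The term counts are a crude stand-in for term extraction, but they show what becomes available once the raw message has been stored and can be revisited offline.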
    12. Context
        "For me context is the key - from that comes the understanding of everything." (Kenneth Noland)
    13. Dogs? (source: joelonsoftware.com)
    14. Dogs with unmatched title? (source: Google Reader Play)
    15. Still doesn’t make a lot of sense... (source: Google Buzz)
    16. This is the worst case scenario
    17. Challenge
        • Find context from associated content:
          • Pictures
          • Comments
          • Location information
          • Timelines
          • Authors
    18. Strategy
        • Associate content through common identifiers
        • Establish timeline of different pieces
        • Group pieces by same author
        • Present in a comprehensible fashion
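
The first three steps of the strategy on slide 18 (associate by common identifier, establish a timeline, group by author) map naturally onto a small grouping routine. The sketch below assumes every piece of content already carries a shared identifier. The `Piece` type, its `ref` field and the sample data are hypothetical, since the slides do not say which identifiers were used.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Piece:
    source: str
    author: str
    posted_at: datetime
    body: str
    ref: str  # hypothetical common identifier shared by related pieces

def build_context(pieces):
    """Group pieces by shared identifier, order each group by time, then split by author."""
    by_ref = defaultdict(list)
    for piece in pieces:
        by_ref[piece.ref].append(piece)
    context = {}
    for ref, group in by_ref.items():
        timeline = sorted(group, key=lambda piece: piece.posted_at)
        by_author = defaultdict(list)
        for piece in timeline:
            by_author[piece.author].append(piece)
        context[ref] = {"timeline": timeline, "by_author": dict(by_author)}
    return context

pieces = [
    Piece("flickr", "anna", datetime(2010, 3, 1, 10, 5), "Photo: two dogs in the park", "walk"),
    Piece("twitter", "anna", datetime(2010, 3, 1, 10, 0), "Out for a walk", "walk"),
    Piece("blog", "bob", datetime(2010, 3, 1, 11, 30), "Comment: which park is that?", "walk"),
]
context = build_context(pieces)
timeline_bodies = [piece.body for piece in context["walk"]["timeline"]]
```

With the tweet, the photo and the comment lined up in time, the photo of the dogs is no longer an isolated item. Presenting that grouping in a comprehensible fashion is the remaining step.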
    19. Identity (source: abc Australia)
    20. Many Identifiers
        • E-mail: user@example.com
        • facebook: @User Name
        • flickr: user or User Name (?)
        • Google Buzz: @user@example.com
        • twitter: @user
        ...
    21. Addressable
        • http://facebook.com/user
        • http://flickr.com/user
        • http://www.google.com/profiles/user
        • http://twitter.com/user
        ...
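
Because each identifier on slide 21 is addressable, turning a (service, user) pair into a profile URL is a simple lookup. The sketch below uses only the URL patterns listed on the slide. The `profile_url` helper and the service keys are assumptions made for illustration, and real profile URL schemes may differ or have changed since the talk.

```python
from urllib.parse import quote

# URL patterns copied from slide 21; the dictionary keys are illustrative names.
PROFILE_URLS = {
    "facebook": "http://facebook.com/{user}",
    "flickr": "http://flickr.com/{user}",
    "google": "http://www.google.com/profiles/{user}",
    "twitter": "http://twitter.com/{user}",
}

def profile_url(service, user):
    """Return the profile URL for a known service, or None if the service is unknown."""
    template = PROFILE_URLS.get(service.lower())
    if template is None:
        return None
    # Percent-encode the user name so display names with spaces stay valid in a URL.
    return template.format(user=quote(user, safe=""))
```

For example, `profile_url("twitter", "user")` returns `http://twitter.com/user`, and an unknown service returns `None` rather than a guessed address.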
    22. How to make sense?
    23. Challenge
        • Parse every message, tweet or post
        • Find possible user identifiers
        • Substitute for meaningful information:
          • A link to the original profile
          • Equivalent identity on destination
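
The parse, find and substitute steps on slide 23 can be sketched for the simplest case, twitter-style `@user` mentions replaced with a link to the original profile. The regular expression, the `link_mentions` helper and the HTML output format are assumptions for illustration. Mapping to an equivalent identity on the destination would need a lookup that the slides do not describe.

```python
import re

# Matches @user but not e-mail addresses (bob@example.com) or
# Google Buzz style identifiers (@user@example.com).
MENTION = re.compile(r"(?<![\w@])@(\w+)\b(?!@)", re.ASCII)

def link_mentions(text, profile_template="http://twitter.com/{user}"):
    """Replace each @user mention with an HTML link to that user's profile."""
    def replace(match):
        user = match.group(1)
        url = profile_template.format(user=user)
        return f'<a href="{url}">@{user}</a>'
    return MENTION.sub(replace, text)

linked = link_mentions("lunch with @anna, mail me at bob@example.com")
# 'lunch with <a href="http://twitter.com/anna">@anna</a>, mail me at bob@example.com'
```

Restricting the pattern this way avoids rewriting e-mail addresses, which is the kind of ambiguity the "Many Identifiers" slide points at.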
    24. Strategy
        • Decentralized processing:
          • Browser based (plugin)
          • Extract identities from page
          • Process
          • Replace with meaningful information
    25. Food for thought
        • PubsubHubbub http://code.google.com/p/pubsubhubbub/
        • Activity Streams http://activitystrea.ms/
        • Web Finger http://webfinger.org/
    26. • "tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop." (Adam Pash, lifehacker)
        • "tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload." (Rafe Needleman, CNET news)
        • "Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation." (Jeff Barr, Amazon.com)
        thank you
        share your life