
Information Retrieval Challenges

Invited lecture at POSI (Post Graduation on Information Systems at INESC, Lisbon, Portugal) about Information Retrieval Challenges: real-time retrieval, context awareness and inferring identity from content.


  1. Information Retrieval Challenges. Bruno Pedro, March 2010
  2. Bruno Pedro: an experienced Web developer and entrepreneur with an extensive background in large-scale projects and technical writing. http://tarpipe.com/user/bpedro
  3. What is tarpipe? [diagram; the only surviving label is "User"]
  4. What is tarpipe?
  5. 3 Challenges • Real-Time Retrieval • Understanding Context • Inferring Identity
  6. Real-Time Retrieval (photo: http://www.flickr.com/photos/josephrobertson/127758523/)
  7. WordPress (source: wordpress.com) • Average ~10K posts/hour • ~3 new posts every second
  8. twitter (source: mashable.com) • Average ~1.1M tweets/hour • ~300 new tweets every second
  9. Challenge • ~300 reads/second • 160 × 300 = 48 KB/second ≈ 4 GB/day (approximate calculation) • How to process all this information?
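The back-of-the-envelope figure on the Challenge slide can be checked with a few lines of Python. This is only a sketch: it assumes the 160 on the slide means bytes per tweet (the slide does not state the unit) and that the rate stays at 300 tweets per second all day. The constant names are made up for illustration.

```python
# Assumption: 160 is the size of one tweet in bytes; the slide gives no unit.
BYTES_PER_TWEET = 160
TWEETS_PER_SECOND = 300
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = BYTES_PER_TWEET * TWEETS_PER_SECOND
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

print(f"{bytes_per_second / 1000:.0f} KB/second")  # 48 KB/second
print(f"{bytes_per_day / 1e9:.2f} GB/day")         # 4.15 GB/day
```

With decimal units this comes to roughly 4.15 GB/day, which matches the "approximately 4 GB/day" on the slide.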
  10. Strategy • Read and store immediately: high-performance write storage; no locks allowed; prepare for lots of reading errors
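One possible reading of "read and store immediately" is an append-only raw log: every message is written as it arrives, nothing is parsed, and a failed read is counted and skipped rather than stopping the stream. The sketch below is an illustration, not tarpipe's implementation. `store_raw`, `ok` and `broken` are hypothetical names, and the "no locks" property here comes only from assuming a single writer per file.

```python
import os
import tempfile

def store_raw(fetchers, path):
    """Append each fetched message to `path` as one line, without parsing it.

    `fetchers` is an iterable of zero-argument callables (hypothetical) that
    return the raw bytes of one message or raise OSError on a failed read.
    Returns a (stored, failed) tuple.
    """
    stored = failed = 0
    # Assumption: one writer per file, opened in append mode, so no lock is taken.
    with open(path, "ab") as log:
        for fetch in fetchers:
            try:
                raw = fetch()
            except OSError:
                failed += 1  # expect many read errors; count and move on
                continue
            # Keep one message per line so later processing can stream the file.
            log.write(raw.replace(b"\n", b" ") + b"\n")
            stored += 1
    return stored, failed

def ok(payload):
    return lambda: payload

def broken():
    raise OSError("connection reset")

path = os.path.join(tempfile.mkdtemp(), "raw.log")
stored, failed = store_raw([ok(b"first tweet"), broken, ok(b"second\ntweet")], path)
```

All interpretation is deferred to a later pass, which keeps the write path as cheap as possible.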
  11. Strategy • Process later: regular expressions; term extraction; machine learning
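The first two "process later" techniques, regular expressions and term extraction, can be sketched as follows. The patterns, the stopword list and the `extract` function are assumptions chosen for illustration; the slides do not say which patterns or tokenizer were actually used, and the machine learning step is not covered here.

```python
import re
from collections import Counter

# Assumed, deliberately simple patterns; real messages need more careful rules.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
WORD = re.compile(r"[a-z']+")
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "at", "with"}

def extract(text):
    """Pull URLs, mentions and hashtags out of a message, then count the remaining terms."""
    urls = URL.findall(text)
    without_urls = URL.sub(" ", text)
    mentions = MENTION.findall(without_urls)
    hashtags = HASHTAG.findall(without_urls)
    # Remove the markers already extracted, then count the plain words.
    plain = HASHTAG.sub(" ", MENTION.sub(" ", without_urls)).lower()
    terms = Counter(w for w in WORD.findall(plain) if w not in STOPWORDS)
    return {"urls": urls, "mentions": mentions, "hashtags": hashtags, "terms": terms}

result = extract("Lunch with @bpedro at #posi http://tarpipe.com lunch was great")
```

Here `result["terms"]` counts "lunch" twice, while the mention, hashtag and URL are kept apart from the plain terms.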
  12. Context: "For me context is the key - from that comes the understanding of everything." (Kenneth Noland)
  13. Dogs? (source: joelonsoftware.com)
  14. Dogs with unmatched title? (source: Google Reader Play)
  15. Still doesn’t make a lot of sense... (source: Google Buzz)
  16. This is the worst-case scenario
  17. Challenge • Find context from associated content: pictures; comments; location information; timelines; authors
  18. Strategy • Associate content through common identifiers • Establish timeline of different pieces • Group pieces by same author • Present in a comprehensible fashion
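The four steps of the context strategy can be sketched as a small grouping routine. It assumes each piece of content already carries a shared identifier (`ref`), an author and a timestamp. Those field names, `build_threads` and `render` are hypothetical, and the sample data is invented for the example.

```python
from collections import defaultdict
from datetime import datetime

def build_threads(items):
    """Group items by shared identifier, order each group in time, and index it by author."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["ref"]].append(item)  # associate through a common identifier

    threads = {}
    for ref, pieces in grouped.items():
        timeline = sorted(pieces, key=lambda piece: piece["at"])  # establish the timeline
        by_author = defaultdict(list)
        for piece in timeline:
            by_author[piece["author"]].append(piece["kind"])  # group by same author
        threads[ref] = {"timeline": timeline, "by_author": dict(by_author)}
    return threads

def render(thread):
    """Present one thread as readable lines, oldest first."""
    return [f'{p["at"]:%H:%M} {p["author"]}: {p["kind"]}' for p in thread["timeline"]]

items = [
    {"ref": "photo-1", "author": "bob", "kind": "comment", "at": datetime(2010, 3, 1, 12, 30)},
    {"ref": "photo-1", "author": "alice", "kind": "picture", "at": datetime(2010, 3, 1, 12, 0)},
    {"ref": "photo-1", "author": "alice", "kind": "location", "at": datetime(2010, 3, 1, 12, 1)},
    {"ref": "post-7", "author": "bob", "kind": "comment", "at": datetime(2010, 3, 2, 9, 0)},
]
threads = build_threads(items)
lines = render(threads["photo-1"])
```

For the sample data, `lines` lists alice's picture and location first and bob's comment last. Finding the common identifier in the first place is the hard part and is not addressed by this sketch.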
  19. Identity (source: abc Australia)
  20. Many Identifiers • E-mail: user@example.com • facebook: @User Name • flickr: user or User Name (?) • Google Buzz: @user@example.com • twitter: @user ...
  21. Addressable • http://facebook.com/user • http://flickr.com/user • http://www.google.com/profiles/user • http://twitter.com/user ...
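The address patterns on the Addressable slide can be turned into a small lookup table. This is a sketch under assumptions: the service keys and the `profile_url` helper are made up for illustration, and identifiers that are not plain usernames (for example "User Name" with a space, or the Google Buzz form @user@example.com) are not handled.

```python
# URL templates copied from the slide; the dictionary keys are an assumed naming.
PROFILE_URLS = {
    "facebook": "http://facebook.com/{user}",
    "flickr": "http://flickr.com/{user}",
    "google": "http://www.google.com/profiles/{user}",
    "twitter": "http://twitter.com/{user}",
}

def profile_url(service, user):
    """Return the profile address for a plain username, or None for an unknown service."""
    template = PROFILE_URLS.get(service.lower())
    if template is None:
        return None
    # Drop a leading "@" so "@user" and "user" give the same address.
    return template.format(user=user.lstrip("@"))

twitter_profile = profile_url("twitter", "@user")
unknown_profile = profile_url("myspace", "user")
```

`twitter_profile` is `http://twitter.com/user`, and a service that is not in the table yields `None`.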
  22. How to make sense?
  23. Challenge • Parse every message, tweet or post • Find possible user identifiers • Substitute with meaningful information: a link to the original profile; equivalent identity on destination
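The parse, find and substitute steps can be sketched for the simplest case, twitter-style @user identifiers replaced by a link to the original profile. The pattern and the `link_mentions` function are assumptions for illustration. The pattern deliberately skips e-mail addresses and the @user@example.com form, and the sketch does not escape HTML in the surrounding text.

```python
import re

# Assumption: an identifier is "@" plus word characters, not preceded by a word
# character or "@" (so e-mail addresses are skipped) and not followed by "@".
MENTION = re.compile(r"(?<![\w@])@(\w+)\b(?!@)")

def link_mentions(text, base="http://twitter.com/"):
    """Replace each @user in `text` with an HTML link to the user's profile."""
    def to_link(match):
        user = match.group(1)
        return f'<a href="{base}{user}">@{user}</a>'
    return MENTION.sub(to_link, text)

linked = link_mentions("thanks @user, mail me at user@example.com")
```

In this example only `@user` becomes a link; the e-mail address is left untouched. Mapping the identifier to an equivalent identity on a destination service would need an extra lookup that is not shown.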
  24. Strategy • Decentralized processing: browser based (plugin); extract identities from page; process; replace with meaningful information
  25. Food for thought • PubSubHubbub http://code.google.com/p/pubsubhubbub/ • Activity Streams http://activitystrea.ms/ • WebFinger http://webfinger.org/
  26. "tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop." (Adam Pash, lifehacker) • "tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload." (Rafe Needleman, CNET news) • "Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation." (Jeff Barr, Amazon.com) • Thank you • share your life

Editor's Notes

  • Google Chrome plugin
  • Get identity information from Google Social Graph API or other means

