Information Retrieval Challenges

Bruno Pedro
March 2010
Bruno Pedro
An experienced Web developer and
entrepreneur. Has extensive background in
large scale projects and technical writing.

http://tarpipe.com/user/bpedro
What is tarpipe?




3 Challenges
• Real-Time Retrieval
• Understanding Context
• Inferring Identity
Real-Time Retrieval




             http://www.flickr.com/photos/josephrobertson/127758523/
WordPress


                     source: wordpress.com




• Average ~10K posts/hour
• ~3 new posts every second
twitter


                      source: mashable.com




• Average ~1.1M tweets/hour
• ~300 new tweets every second
Challenge
• ~300 reads/second
• 160 bytes × 300 tweets/second = 48 KB/second ≈ 4 GB/day
  (approximate calculation)




• How to process all this information?
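The estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes that 160 is the average size of one tweet in bytes and that KB and GB are decimal units, neither of which the slide states explicitly.

```python
# Assumption: 160 is the average size of one tweet in bytes.
BYTES_PER_TWEET = 160
TWEETS_PER_SECOND = 300
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = BYTES_PER_TWEET * TWEETS_PER_SECOND
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

# Assumption: decimal units (1 KB = 1000 bytes, 1 GB = 1000**3 bytes).
kb_per_second = bytes_per_second / 1000
gb_per_day = bytes_per_day / 1000**3

print(f"{kb_per_second:.0f} KB/second")
print(f"{gb_per_day:.2f} GB/day")
```

The daily figure comes out slightly above 4 GB, which matches the slide's rounded value.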
Strategy
• Read and store immediately:
 • High performance write storage
 • No locks allowed
 • Prepare for lots of reading errors
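One possible reading of "read and store immediately" is a single writer that appends raw payloads to a log and simply counts failed reads. The sketch below is an illustration under that assumption, not tarpipe's implementation; the `ingest` function, the JSON-lines format and the in-memory log are all hypothetical.

```python
import io
import json
from typing import Callable, Iterable, TextIO


def ingest(fetchers: Iterable[Callable[[], str]], log: TextIO) -> dict:
    """Append every successfully read payload to `log`, one JSON line each."""
    stats = {"stored": 0, "errors": 0}
    for fetch in fetchers:
        try:
            payload = fetch()
        except Exception:
            # Reading errors are expected: count them and keep going.
            stats["errors"] += 1
            continue
        # Store the raw payload untouched; parsing happens in a later pass.
        # A single appending writer needs no locks (assumption of this sketch).
        log.write(json.dumps({"raw": payload}) + "\n")
        stats["stored"] += 1
    return stats


def flaky() -> str:
    # Hypothetical source that always fails, standing in for a network error.
    raise OSError("connection reset")


log = io.StringIO()  # stands in for a fast append-only store
stats = ingest([lambda: "first post", flaky, lambda: "second post"], log)
print(stats)
```

Keeping the write path free of parsing means a slow or failing source only costs one skipped record.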
Strategy
• Process later:
 • Regular expressions
 • Term extraction
 • Machine learning
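The first two techniques in the list can be sketched as a later pass over stored text. The patterns, the stopword list and the `extract` function below are illustrative assumptions; real term extraction would need proper tokenization, and the machine learning step is not covered here.

```python
import re
from collections import Counter

# Illustrative patterns, not a complete grammar for any service.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
WORD = re.compile(r"[a-z']+")
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for", "with"}


def extract(text: str) -> dict:
    """Pull URLs, mentions, hashtags and a term count out of one message."""
    urls = URL.findall(text)
    without_urls = URL.sub(" ", text)
    mentions = MENTION.findall(without_urls)
    hashtags = HASHTAG.findall(without_urls)
    # Drop mentions and hashtags before counting the remaining words.
    plain = HASHTAG.sub(" ", MENTION.sub(" ", without_urls)).lower()
    terms = Counter(
        word for word in WORD.findall(plain)
        if word not in STOPWORDS and len(word) > 2
    )
    return {"urls": urls, "mentions": mentions, "hashtags": hashtags, "terms": terms}


result = extract("Walking the dog with @user in the park #dogs http://example.com/p/1 dog")
print(result["mentions"], result["hashtags"], result["terms"].most_common(1))
```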
Context




     For me context is the key - from
     that comes the understanding of
     everything. — Kenneth Noland
source: joelonsoftware.com




Dogs?
source: Google Reader Play




Dogs with unmatched title?
source: Google Buzz




Still doesn’t make a lot of sense...
This is the
worst case
scenario
Challenge
• Find context from associated content:
 • Pictures
 • Comments
 • Location information
 • Timelines
 • Authors
Strategy
• Associate content through common
  identifiers
• Establish timeline of different pieces
• Group pieces by same author
• Present in a comprehensible fashion
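The grouping and timeline steps above can be sketched as follows. The `Item` record and its fields are hypothetical, and the author name is assumed to be the common identifier that links the pieces together.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Item:
    author: str        # assumed common identifier across services
    kind: str          # e.g. "picture", "comment", "location"
    created: datetime
    body: str


def timelines_by_author(items):
    """Group items by author, then order each group chronologically."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.author].append(item)
    for timeline in grouped.values():
        timeline.sort(key=lambda item: item.created)
    return dict(grouped)


items = [
    Item("alice", "comment", datetime(2010, 3, 1, 12, 5), "Cute dogs!"),
    Item("bob", "location", datetime(2010, 3, 1, 11, 0), "Lisbon"),
    Item("alice", "picture", datetime(2010, 3, 1, 12, 0), "dogs.jpg"),
]
timelines = timelines_by_author(items)
for author, timeline in timelines.items():
    print(author, [item.kind for item in timeline])
```

Presented this way, a comment that makes little sense on its own appears next to the picture it refers to.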
Identity




           source: abc Australia
Many Identifiers
• E-mail: user@example.com
• facebook: @User Name
• flickr: user or User Name (?)
• Google Buzz: @user@example.com
• twitter: @user
 ...
Addressable
• http://facebook.com/user
• http://flickr.com/user
• http://www.google.com/profiles/user
• http://twitter.com/user
  ...
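Because these profiles are addressable, a user name on a known service can be turned into a link. The sketch below uses only the URL patterns listed on this slide; the service keys and the `profile_url` helper are assumptions made for illustration.

```python
from urllib.parse import quote

# URL patterns taken from the slide; the dictionary keys are hypothetical labels.
PROFILE_URL = {
    "facebook": "http://facebook.com/{user}",
    "flickr": "http://flickr.com/{user}",
    "google": "http://www.google.com/profiles/{user}",
    "twitter": "http://twitter.com/{user}",
}


def profile_url(service: str, user: str):
    """Return the profile URL for `user`, or None for an unknown service."""
    template = PROFILE_URL.get(service.lower())
    if template is None:
        return None
    # Percent-encode the user name so it is safe inside a URL path.
    return template.format(user=quote(user, safe=""))


print(profile_url("twitter", "user"))
print(profile_url("Google", "user"))
```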
How to make sense?
Challenge
• Parse every message, tweet or post
• Find possible user identifiers
• Substitute for meaningful information:
 • A link to the original profile
 • Equivalent identity on destination
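The parse, find and substitute steps can be sketched for the simplest case, a twitter-style "@user" mention. The pattern and the `link_mentions` helper are assumptions for illustration; e-mail addresses and Google Buzz style "@user@example.com" identifiers are deliberately left untouched.

```python
import re

# Matches "@user" only when it is not part of an e-mail address or of a
# "@user@example.com" identifier (illustrative pattern, not exhaustive).
MENTION = re.compile(r"(?<![\w@])@(\w+)(?![\w@])")


def link_mentions(text: str, base: str = "http://twitter.com/") -> str:
    """Append a profile link after every "@user" mention found in `text`."""
    return MENTION.sub(lambda m: f"{m.group(0)} ({base}{m.group(1)})", text)


print(link_mentions("thanks @user!"))
print(link_mentions("mail me at user@example.com"))
```

Mapping the mention to an equivalent identity on the destination service would need a lookup that this sketch does not attempt.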
Strategy
• Decentralized processing:
 • Browser based (plugin)
 • Extract identities from page
 • Process
 • Replace with meaningful information
Food for thought
• PubSubHubbub
  http://code.google.com/p/pubsubhubbub/
• Activity Streams
  http://activitystrea.ms/
• WebFinger
  http://webfinger.org/
tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop.
Adam Pash
lifehacker

tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload.
Rafe Needleman
CNET news

Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation.
Jeff Barr
Amazon.com

thank you
share your life

Editor's Notes

  1. • Google Chrome plugin
     • Get identity information from Google Social Graph API or other means