Unstructured Or: How I Learned to Stop Worrying and Love the XML, Presented by Mike Nibeck and Asim Shaikh
Published in Technology
Transcript

  • 1. Un-Structured. Or: How I Learned to Stop Worrying and Love the XML. Mike Nibeck, Asim Shaikh
  • 2. 1st NF, 2nd NF, 3rd NF: It’s The Way It’s Done
  • 3. Maintainability vs. Performance
  • 4. I’m Feeling Lucky
  • 5. Solr (v4.4)
    • Extension of Apache Lucene
    • Full Text Search
    • Open Interfaces (XML, JSON, HTTP)
    • Faceted Search
    • Database Ingest
    • Document Indexing (PDF, Word, etc.)
    • Spelling Suggestions
    • Auto Suggest
    • “Cloudy”
    • Advanced Input Parsing
    • Relevance Ranking
  • 6. You got your chocolate in my peanut butter!
  • 7. It’s a Hammer. A really nice, efficient and free hammer.
  • 8. A Mental Shift Pancakes & Relevancy
  • 9. Solr deployments
    • Chronicling America: 6.8 million documents; 10 billion vectors; 50,000 queries/day; 250 GB index; +100K documents per month
    • Congress.gov: 4 million documents; 3.3+ million queries/day (user and system); 36 GB indexes; adding many thousands/month
    • Library Web Search: 18+ million documents; 9,000 queries/day; 28 GB index size; + many thousands/month
    • World Digital Library: 120k documents; 7 different languages; 10-50k queries/day; index < 1 GB; +100 documents/month
  • 10. Solr Architecture - congress.gov (diagram labels: Master Data Sources; Legacy Systems; Data Partners; ETL Processing: Extract, Translate, Load; Database; Filesystem; Indexing; SOLR Cores; Load Balancer; App Servers; Web Cache; Users)
  • 11. Analyzers, Tokenizers and Filters. Oh My!
  • 12. Cores? We Don’t Need No Stinkin' Cores
  • 13. Data Import Handler
  • 14. Next Steps
  • 15. Open Source Tools
    • PHP / Zend
    • Python / Django
    • MySQL
    • RabbitMQ
    • Varnish
    • Jenkins
    • Graphite, Statsd
  • 16. Mike Nibeck - mnib@loc.gov; Asim Shaikh - ashaikh@loc.gov
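
Slide 5 lists open HTTP/JSON interfaces and faceted search among Solr's features. As a rough illustration only (not taken from the presentation), the sketch below builds a Solr select URL that asks for JSON output and facet counts. The host, the core name "documents", the facet fields "language" and "year", and the helper name build_select_url are all hypothetical; q, wt, rows, facet and facet.field are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, text, facet_fields, rows=10):
    """Build a Solr /select URL requesting JSON results with facet counts.

    base_url, core and facet_fields are placeholders for illustration;
    they do not describe any system named in the slides.
    """
    params = [
        ("q", text),          # full-text query string
        ("wt", "json"),       # response writer: JSON
        ("rows", str(rows)),  # number of documents to return
        ("facet", "true"),    # turn faceting on
    ]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical local Solr instance and core name.
url = build_select_url(
    "http://localhost:8983/solr", "documents", "civil war", ["language", "year"]
)
print(url)
```

Sending an HTTP GET to a URL of this shape against a running Solr core would return matching documents plus per-field facet counts; the parameters are kept as a list of pairs so that facet.field can repeat.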