Unstructured Or: How I Learned to Stop Worrying and Love the XML, Presented by Mike Nibeck and Asim Shaikh

Published in: Technology

  1. Un-Structured, Or: How I Learned to Stop Worrying and Love the XML (Mike Nibeck, Asim Shaikh)
  2. 1st NF, 2nd NF, 3rd NF: It’s The Way It’s Done
  3. Maintainability vs. Performance
  4. I’m Feeling Lucky
  5. Solr
     • Extension of Apache Lucene
     • Full Text Search
     • Open Interfaces (XML, JSON, HTTP)
     • Faceted Search
     • Database Ingest
     • Document Indexing (PDF, Word, etc.)
     • Spelling Suggestions
     • Auto Suggest
     • “Cloudy”
     • Advanced Input Parsing
     • Relevance Ranking
     • v4.4
  6. You got your chocolate in my peanut butter!
  7. It’s a Hammer. A really nice, efficient and free hammer.
  8. A Mental Shift: Pancakes & Relevancy
  9. Solr deployments
     Chronicling America
     • 6.8 million documents
     • 10 billion vectors
     • 50,000 queries/day
     • Index 250 GB
     • +100K documents per month
     Congress.gov
     • 4 million documents
     • 3.3+ million queries/day (user and system)
     • 36 GB indexes
     • Adding many thousands/month
     Library Web Search
     • 18+ million documents
     • 9,000 queries/day
     • 28 GB index size
     • + many thousands/month
     World Digital Library
     • 120k documents
     • 7 different languages
     • 10-50k queries/day
     • Index < 1 GB
     • +100 documents/month
  10. Solr Architecture - congress.gov (diagram labels): Load Balancer, Database, Filesystem, Indexing, SOLR Cores, SOLR Cores, Users, App Servers, Web Cache, Legacy Systems, Data Partners, ETL Processing (Extract, Translate, Load), Master Data Sources
  11. Analyzers, Tokenizers and Filters. Oh My!
  12. Cores? We Don’t Need No Stinkin' Cores
  13. Data Import Handler
  14. Next Steps
  15. Open Source Tools
     • PHP / Zend
     • Python / Django
     • MySQL
     • RabbitMQ
     • Varnish
     • Jenkins
     • Graphite, Statsd
  16. Mike Nibeck - mnib@loc.gov; Asim Shaikh - ashaikh@loc.gov
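
Slide 5 lists open interfaces (XML, JSON, HTTP) and faceted search among Solr's features. As a rough illustration of what that combination looks like from a client, the sketch below builds a request URL for Solr's standard `select` handler using the common `q`, `wt`, `rows`, `facet` and `facet.field` parameters. The host, the core name `demo_core`, and the field names `language` and `year` are assumptions made for the example; none of them come from the slides, and the code only constructs the URL rather than contacting a server.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_select_url(base_url, core, text, facet_fields, rows=10):
    """Build a Solr /select URL asking for JSON results plus facet counts."""
    params = [
        ("q", text),          # the user's full-text query
        ("wt", "json"),       # response writer: JSON instead of XML
        ("rows", str(rows)),  # number of documents to return
        ("facet", "true"),    # turn faceting on
    ]
    # facet.field may be repeated, once per field to count values for
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical host, core and field names, for illustration only
url = build_select_url(
    "http://localhost:8983/solr", "demo_core", "civil war", ["language", "year"]
)
print(url)
```

Because the whole request is an ordinary HTTP GET, any of the tools named on slide 15 (PHP, Python, a cache such as Varnish) could sit in front of it without a Solr-specific client library.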

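
Slide 11 names analyzers, tokenizers and filters without showing how they relate. A minimal sketch of the idea, in plain Python rather than Solr's own configuration: a tokenizer splits text into tokens, each filter transforms the token stream, and an analyzer is the tokenizer plus its chain of filters. The function names and the tiny stopword list are invented for the example and are not Solr APIs.

```python
import re

# Illustrative stopword list, not Solr's default
STOPWORDS = {"a", "an", "and", "the", "of", "to"}


def tokenize(text):
    """Tokenizer: split raw text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    """Filter: normalize case so 'Library' and 'library' match."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Filter: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters=(lowercase, remove_stopwords)):
    """Analyzer: run the tokenizer, then each filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Library of Congress"))  # ['library', 'congress']
```

Running the same chain over documents at index time and over user input at query time is what lets a search for "library congress" find a document titled "The Library of Congress".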