Un-Structured, Or: How I Learned to Stop Worrying and Love the XML. Presented by Mike Nibeck and Asim Shaikh

  1. Un-Structured. Or: How I Learned to Stop Worrying and Love the XML. Mike Nibeck, Asim Shaikh
  2. 1st NF, 2nd NF, 3rd NF. It’s The Way It’s Done
  3. Maintainability vs. Performance
  4. I’m Feeling Lucky
  5. Solr
     • Extension of Apache Lucene
     • Full Text Search
     • Open Interfaces (XML, JSON, HTTP)
     • Faceted Search
     • Database Ingest
     • Document Indexing (PDF, Word, etc.)
     • Spelling Suggestions
     • Auto Suggest
     • “Cloudy”
     • Advanced Input Parsing
     • Relevance Ranking
     • v4.4
  6. You got your chocolate in my peanut butter!
  7. It’s a Hammer. A really nice, efficient and free hammer.
  8. A Mental Shift: Pancakes & Relevancy
  9. Chronicling America
     • 6.8 million documents
     • 10 billion vectors
     • 50,000 queries/day
     • Index 250 GB
     • +100K documents per month
     Congress.gov
     • 4 million documents
     • 3.3+ million queries/day (user and system)
     • 36 GB indexes
     • Adding many thousands/month
     Library Web Search
     • 18+ million documents
     • 9,000 queries/day
     • 28 GB index size
     • + many thousands/month
     World Digital Library
     • 120k documents
     • 7 different languages
     • 10-50k queries/day
     • Index < 1 GB
     • +100 documents/month
  10. Solr Architecture - congress.gov [diagram; labels: Users, Load Balancer, Web Cache, App Servers, SOLR Cores, Indexing, Database, Filesystem, ETL Processing (Extract, Translate, Load), Master Data Sources, Legacy Systems, Data Partners]
  11. Analyzers, Tokenizers and Filters. Oh My!
  12. Cores? We Don’t Need No Stinkin' Cores
  13. Data Import Handler
  14. Next Steps
  15. Open Source Tools
      • PHP / Zend
      • Python / Django
      • MySQL
      • RabbitMQ
      • Varnish
      • Jenkins
      • Graphite, Statsd
  16. Mike Nibeck - mnib@loc.gov, Asim Shaikh - ashaikh@loc.gov
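
The Solr slide lists open HTTP/JSON interfaces and faceted search without showing what a request looks like. As a rough illustration only, the sketch below builds a query URL for Solr's standard `select` handler with faceting turned on. The host, the core name `newspapers`, and the facet fields `state` and `year` are made-up placeholders, not details from the talk; the query parameters (`q`, `wt`, `rows`, `facet`, `facet.field`, `facet.mincount`) are standard Solr request parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, core, text, facet_fields, rows=10):
    """Build a Solr /select URL that asks for JSON results plus facet counts."""
    params = [
        ("q", text),              # the full-text query
        ("wt", "json"),           # response format: JSON
        ("rows", str(rows)),      # number of documents to return
        ("facet", "true"),        # enable faceting
        ("facet.mincount", "1"),  # omit facet values with zero matches
    ]
    # facet.field may be repeated, once per field to facet on
    params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr", "newspapers", "pancakes", ["state", "year"]
)
print(url)
```

Fetching that URL from a running Solr instance (with `urllib.request.urlopen`, for example) would return a JSON document containing both the matching documents and per-field facet counts. The sketch stops at building the URL so that it runs without a server.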
