Unstructured Or: How I Learned to Stop Worrying and Love the XML, presented by Mike Nibeck and Asim Shaikh

Published in: Technology
Transcript of "Unstructured Or: How I Learned to Stop Worrying and Love the XML, presented by Mike Nibeck and Asim Shaikh"

  1. Un-Structured. Or: How I Learned to Stop Worrying and Love the XML (Mike Nibeck, Asim Shaikh)
  2. 1st NF, 2nd NF, 3rd NF: It’s The Way It’s Done
  3. Maintainability vs. Performance
  4. I’m Feeling Lucky
  5. Solr (v4.4)
     • Extension of Apache Lucene
     • Full Text Search
     • Open Interfaces (XML, JSON, HTTP)
     • Faceted Search
     • Database Ingest
     • Document Indexing (PDF, Word, etc.)
     • Spelling Suggestions
     • Auto Suggest
     • “Cloudy”
     • Advanced Input Parsing
     • Relevance Ranking
  6. You got your chocolate in my peanut butter!
  7. It’s a Hammer. A really nice, efficient and free hammer.
  8. A Mental Shift: Pancakes & Relevancy
  9. Solr deployments by site
     Chronicling America
     • 6.8 million documents
     • 10 billion vectors
     • 50,000 queries/day
     • Index 250 GB
     • +100K documents per month
     Congress.gov
     • 4 million documents
     • 3.3+ million queries/day (user and system)
     • 36 GB indexes
     • Adding many thousands/month
     Library Web Search
     • 18+ million documents
     • 9,000 queries/day
     • 28 GB index size
     • + many thousands/month
     World Digital Library
     • 120k documents
     • 7 different languages
     • 10-50k queries/day
     • Index < 1 GB
     • +100 documents/month
  10. Solr Architecture - congress.gov
      [Diagram labels: Users, Load Balancer, Web Cache, App Servers, SOLR Cores (two groups), Indexing, Database, Filesystem, Legacy Systems, Data Partners, Master Data Sources, ETL Processing (Extract, Translate, Load)]
  11. Analyzers, Tokenizers and Filters. Oh My!
  12. Cores? We Don’t Need No Stinkin' Cores
  13. Data Import Handler
  14. Next Steps
  15. Open Source Tools
      • PHP / Zend
      • Python / Django
      • MySQL
      • RabbitMQ
      • Varnish
      • Jenkins
      • Graphite, Statsd
  16. Mike Nibeck - mnib@loc.gov; Asim Shaikh - ashaikh@loc.gov
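
Slide 5 lists open interfaces (XML, JSON, HTTP) and faceted search among Solr's features. As a rough illustration of what that means in practice, the sketch below builds a request URL for Solr's standard `/select` handler, asking for a JSON response with facet counts. The slides do not show any queries, so the host, the core name `example_core`, and the field names `language` and `year` are all hypothetical; only the query parameters (`q`, `rows`, `wt`, `facet`, `facet.field`) are standard Solr ones.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; neither comes from the slides.
SOLR_SELECT_URL = "http://localhost:8983/solr/example_core/select"


def build_select_url(query, facet_fields=(), rows=10):
    """Build a Solr /select URL that requests a JSON response.

    facet_fields is a sequence of field names to facet on; the names
    used in the example call below are placeholders.
    """
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", name) for name in facet_fields)
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Prints the URL only; no request is sent.
    print(build_select_url("pancakes", facet_fields=["language", "year"]))
```

Because the whole query is a plain HTTP GET, any client that can build a URL can talk to the index, which is presumably part of what the "open interfaces" bullet refers to.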