ProjectHub

Crawling, Indexing, and Searching Software Project Data with Droids, Tika, Solr & friends

Published in: Technology
1 Comment
  • I was interested to see nobody seems to find Nutch that usable - It seems pretty hard to use - that helped me to decide to drop nutch from my 'todo' list - Thanks!

  1. ProjectHub: Crawling, Indexing, and Searching Software Project Data with Droids, Tika, Solr & friends
       Otis Gospodnetić ◦ [email_address] ◦ @otisg
       Sematext Int'l ◦ www.sematext.com ◦ @sematext
  2. What I Will Cover
       • Who I am
       • What, Why, Where
       • Architecture
       • Info Gathering & Indexing
       • Search & Extra Search Dog Food
       • Performance & Analytics
       • Ops & Stats
  3. About Otis Gospodnetić
       • Lucene/Solr/Nutch/Mahout/... committer
       • Lucene in Action 1 & 2 co-author
       • Lucene Consulting since 2005
       • Sematext International since 2007
  4. About Sematext
       • Search (Lucene, Solr, Elastic Search...)
       • Web Crawling (Nutch)
       • Machine Learning (Mahout)
       • Big Data (Hadoop, HBase, Voldemort...)
  5. What
       • Search everything about a Software Project
       • Lucene & Hadoop
            • All sub-projects
            • All content
                 • Mailing list archives
                 • JIRA issues
                 • Web site & Wiki pages
                 • Source code (local syntax highlighting), trunk
                 • Javadoc, trunk
  6.
  7. Why
       • We need it
       • Other Hadoop, Lucene, Solr... users need it
       • Our own playground
       • Live product demos
       • Yummy dog food
  8. Where
       • search-lucene.com
       • search-hadoop.com
       • Other suggestions / needs?
       • In your Enterprise?
  9. Architecture
  10. Tool Matrix
       Data Source | Fetch                                                               | Parse
       JIRA        | URLConnection (feed)                                                | Digester (feed), DOM (item)
       ML          | FileInputStream (fs), URLConnection (feed), Droids (works, unused)  | Digester (feed), MIME4J (mbox)
       Web site    | Droids                                                              | Tika via Droids
       Wiki        | Droids                                                              | Tika via Droids
       Source code | svn co                                                              | QDox
       Javadoc     | svn co                                                              | QDox
  11. Information Gathering
       • Multiple independent JVM processes (cron)
       • Different polling frequencies
       • Different data sources / formats:
            • RSS (JIRA, Mailing Lists)
            • Mbox (Mailing Lists)
            • HTTP/HTML (Web site, Wiki)
            • Subversion (source code, Javadoc)
       • Nutch is a beast. Droids is light & simple.
       • ML thread detection is tricky
       • Finding deleted docs (Wiki, Web, Javadoc...)
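The slides do not say how deleted documents are found. One simple possibility, sketched here purely as an assumption, is to compare the IDs already in the index for a data source against the IDs seen during the latest full crawl of that source; anything indexed but no longer seen is a candidate for deletion. The function name, the ID format, and the `source` labels are all hypothetical.

```python
def find_stale_ids(indexed, seen, source):
    """Return IDs indexed for `source` that the latest crawl did not see.

    indexed: dict mapping document ID -> data source name (hypothetical shape)
    seen:    set of document IDs encountered in the latest crawl of `source`
    """
    return sorted(
        doc_id
        for doc_id, doc_source in indexed.items()
        if doc_source == source and doc_id not in seen
    )


indexed = {"wiki/A": "wiki", "wiki/B": "wiki", "jira/1": "jira"}
print(find_stale_ids(indexed, {"wiki/A"}, "wiki"))
```

Scoping the comparison to one source matters because the crawlers run as separate processes at different frequencies, so a Wiki crawl says nothing about JIRA documents.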
  12. Thread Detection
       • Email clients are kaput
       • SMTP headers are unreliable
       • Heuristics are needed
            • Try headers
            • Fall back to subjects (get subject skeleton, calculate hash)
            • Factor in time (4 weeks)
            • Use index for thread info retrieval
       • Q: Are there any libraries for this?
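The heuristic order above (headers first, then subject skeleton hash, bounded by a 4-week window) can be sketched roughly as follows. This is not the ProjectHub implementation: the prefix regex, the choice of SHA-1, and the `ThreadIndex` class are assumptions, and the in-memory dictionaries stand in for the lookups the slides say are done against the search index.

```python
import hashlib
import re
from datetime import datetime, timedelta

# Leading "Re:", "Fwd:" and "[list-tag]" noise; the exact pattern is an assumption.
_NOISE = re.compile(r"^(?:\s*(?:(?:re|fw|fwd)\s*:|\[[^\]]*\]))+", re.IGNORECASE)
WINDOW = timedelta(weeks=4)  # the "factor in time" limit from the slide


def subject_skeleton(subject):
    """Strip reply/forward prefixes and list tags, lowercase, collapse spaces."""
    return " ".join(_NOISE.sub("", subject).lower().split())


def subject_hash(subject):
    return hashlib.sha1(subject_skeleton(subject).encode("utf-8")).hexdigest()


class ThreadIndex:
    """Hypothetical stand-in for thread lookups against the search index."""

    def __init__(self):
        self.by_message_id = {}  # message ID -> thread ID
        self.by_subject = {}     # subject hash -> (thread ID, time last seen)

    def assign(self, message_id, subject, sent, references=()):
        thread_id = None
        # 1. Try headers: any referenced message we already know wins.
        for ref in references:
            if ref in self.by_message_id:
                thread_id = self.by_message_id[ref]
                break
        # 2. Fall back to the subject hash, but only within the time window.
        key = subject_hash(subject)
        if thread_id is None:
            hit = self.by_subject.get(key)
            if hit is not None and sent - hit[1] <= WINDOW:
                thread_id = hit[0]
        # 3. Otherwise this message starts a new thread.
        if thread_id is None:
            thread_id = message_id
        self.by_message_id[message_id] = thread_id
        self.by_subject[key] = (thread_id, sent)
        return thread_id
```

The time window keeps perennial subjects such as "help" or "question" from collapsing into one endless thread.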
  13. Indexing
       • Use StreamingUpdateSolrServer
       • AutoCommit use-case
       • Solr index abuse: track seen/unseen
       • &qsrc=indexer
       • &warmUp=true
       • Separate processes: easier reindexing (esp. with frequent project infra changes)
       • Treating quoted portions of ML messages
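The slide only names the problem of quoted text in mailing list messages; it does not describe the treatment. A minimal sketch, assuming the goal is to separate a sender's new text from quoted replies so the two can be indexed or weighted differently, might look like this. The attribution-line pattern and the function name are assumptions.

```python
import re

# Typical "On <date>, <name> wrote:" attribution line; the pattern is an assumption.
_ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def split_quoted(body):
    """Split a plain-text message body into (new text, quoted text)."""
    fresh, quoted = [], []
    for line in body.splitlines():
        if line.lstrip().startswith(">") or _ATTRIBUTION.match(line):
            quoted.append(line)
        else:
            fresh.append(line)
    return "\n".join(fresh).strip(), "\n".join(quoted).strip()


body = (
    "Thanks, that fixed it.\n\n"
    "On Mon, Oct 4, 2010, Jane wrote:\n"
    "> Try optimizing the index.\n"
)
new_text, quoted_text = split_quoted(body)
print(new_text)
```

Real mail needs more care than this (top-posting, HTML parts, signature blocks), so treat the line-prefix rule as a starting point only.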
  14. Search
       • Facets (multi-select)
            • Project
            • Data source/type
            • Author (based on names only)
       • Boosting more recent documents vs. pure relevance vs. newest/oldest first
       • Give equivalent of 0.5 year to docs w/ empty updateDate field (e.g. javadocs)
       • recip(map(ms(NOW,updateDate),6.32e11,3.16e12,1.58e10),3.16e-11,4,1)^4
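To make the boost function above easier to read, here is a small Python evaluation of the same arithmetic. In Solr, `recip(x,m,a,b)` computes `a/(m*x+b)` and `map(x,min,max,target)` replaces any value in `[min,max]` with `target`. The constants line up with 3.16e10 ms being roughly one year: 6.32e11 is about 20 years, 3.16e12 about 100 years, and 1.58e10 about half a year. The sketch assumes a missing `updateDate` evaluates to the epoch, which gives an apparent age inside the 20 to 100 year band, and it leaves out the trailing `^4`, which is presumably a boost weight on the function rather than an exponent. The Python function names are illustrative only.

```python
def solr_map(x, lo, hi, target):
    """Mimic Solr's map(): values inside [lo, hi] become target."""
    return target if lo <= x <= hi else x


def solr_recip(x, m, a, b):
    """Mimic Solr's recip(): a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms):
    """Evaluate recip(map(age,6.32e11,3.16e12,1.58e10),3.16e-11,4,1)."""
    mapped = solr_map(age_ms, 6.32e11, 3.16e12, 1.58e10)
    return solr_recip(mapped, 3.16e-11, 4, 1)


for label, age in [("new", 0), ("1 year", 3.16e10), ("no date", 1.29e12)]:
    print(label, round(recency_boost(age), 3))
```

A brand-new document scores 4, a one-year-old document scores about 2, and an undated document is treated as if it were half a year old.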
  15. Search cont'd
       • Query Spellchecker
       • Sematext components:
            • ReSearcher & Relaxer
            • AutoComplete
            • Key Phrase Extractor (2 approaches)
       • Threaded vs. flat view
       • In-document search term highlighting
       • Short URLs
  16. Search cont'd
  17. Dog food #1: Auto-Complete
       • Source: nightly refreshed subject and titles
       • Approach: go directly to selection
       • sematext.com/products/autocomplete/
  18. Dog food #2: ReSearcher & Relaxer
       • Avoid “sorry, no/poor matches”
       • Multiple algos trigger re-searching
       • Different forms of relaxing
       • sematext.com/products/dym-researcher/
  19. Dog food #3: Key Phrases
       • Help narrow search results, like facets
       • 2 types:
            • Stored in index vs. calculated from top N hits
       • sematext.com/products/key-phrase-extractor/
  20. Basic Search Analytics
       • Top queries, top terms...
       • Daily, weekly, monthly
       • MRR
            • http://en.wikipedia.org/wiki/Mean_reciprocal_rank
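For reference, mean reciprocal rank averages 1/rank of the first relevant result over a set of queries, counting queries with no relevant result as 0. The sketch below assumes the relevance signal is the rank of the first clicked result per query, which is an assumption about the analytics, not something the slides state.

```python
def mean_reciprocal_rank(first_hit_ranks):
    """Compute MRR from 1-based ranks; None means no relevant result."""
    if not first_hit_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_hit_ranks if rank is not None)
    return total / len(first_hit_ranks)


# Four queries: first useful hit at rank 1, rank 2, none at all, rank 4.
print(mean_reciprocal_rank([1, 2, None, 4]))
```

An MRR near 1 means users usually find what they want in the top slot; tracking it daily, weekly, and monthly shows whether relevance changes help.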
  21. Very Basic Search Analytics
  22. Real Search Analytics
  23. Performance & Monitoring: RPM
  24. Availability: Site24x7.com
  25. Operations
       • Small EC2 instance: 1.7 GB RAM
       • EBS for data - got burnt once
       • Local disk for index
       • Solr 1.4.1 multi-core
       • Performance monitoring via RPM
       • Availability & performance via site24x7.com
  26. Statistics
       • search-hadoop.com:
            • 110K+ documents
            • ~700 MB optimized
       • search-lucene.com:
            • 170K+ documents
            • ~900 MB optimized
  27. Future
       • Field collapsing (threads)
       • Bot detection (load) DONE
       • Solr duplicate detection (release notes)
       • Relevance tuning (MRR)
       • Open sourcing?
  28. WE ARE HIRING
       • World-wide!
       • Search & Data Analytics
       • Machine Learning & NLP
       • Big Data
       • [email_address]
  29. Questions
       • ?
  30. Contact
       • sematext.com
       • blog.sematext.com
       • @sematext
       • @otisg
       • [email_address]