ApacheCon NA 2011 report


Published in: Technology

  1. 1. ApacheCon NA 2011 Report 2011/12/19 @ijokarumawak
  2. 2. About myself <ul><li>Nutch </li></ul><ul><li>Cloudera Certified </li></ul><ul><ul><li>Hadoop Developer </li></ul></ul><ul><ul><li>Hadoop Administrator </li></ul></ul><ul><li>CouchDB JP </li></ul>
  3. 3. ApacheCon <ul><li>http://na11.apachecon.com/ </li></ul><ul><li>2 days training </li></ul><ul><li>3 days sessions </li></ul><ul><ul><li>Keynotes, 5 tracks </li></ul></ul><ul><ul><li>Over 80 sessions </li></ul></ul><ul><li>Slide and audio files </li></ul><ul><ul><li>http://lanyrd.com/2011/apachecon-north-america/ </li></ul></ul>
  4. 4. Why did I go there? <ul><li>Because I wanted to! </li></ul>Nov 3: Left Japan Nov 5,6: CouchHack Nov 7: CouchConf Berlin Nov 9-11: ApacheCon Nov 12: Apache BarCamp Nov 14: Came back Image from: http://en.wikipedia.org/wiki/File:World_map_blank_gmt.svg
  5. 5. Keynote | Building in Security and Innovation <ul><li>David A. Wheeler </li></ul><ul><ul><li>A specialist in developing secure open source software </li></ul></ul><ul><li>The importance of developing secure software </li></ul><ul><li>Do not repeat the same mistakes </li></ul><ul><li>Learn how to make software secure before you start developing it </li></ul>
  6. 6. Keynote | The Apache Way Done Right: The Success of Hadoop <ul><li>Eric Baldeschwieler </li></ul><ul><ul><li>co-founder and the CEO of </li></ul></ul><ul><li>History of Hadoop </li></ul><ul><li>Difficulty of leading a huge community </li></ul><ul><li>“Being optimistic and good things will happen.” </li></ul>
  7. 7. Keynote | Watson, a Reasoning System: based on Apache Inside! <ul><li>David Boloker </li></ul><ul><ul><li>CTO of IBM's Emerging Internet Technology group </li></ul></ul><ul><li>IBM’s Watson won Jeopardy </li></ul><ul><li>Commercialization of Watson </li></ul><ul><ul><li>Its target is the medical field </li></ul></ul>
  8. 8. Lucene/Solr Meet up <ul><li>Discussion with core committers of Lucene/Solr </li></ul><ul><ul><li>Erik Hatcher </li></ul></ul><ul><ul><li>Chris Hostetter </li></ul></ul><ul><ul><li>Simon Willnauer </li></ul></ul><ul><li>We are supposed to drink beer, aren't we? </li></ul>
  9. 9. Sessions I attended <ul><li>Lucene 4.0 - next generation open source search </li></ul><ul><ul><li>Simon Willnauer </li></ul></ul><ul><li>Solr Flair </li></ul><ul><ul><li>Erik Hatcher </li></ul></ul><ul><li>And more… 20 sessions! </li></ul><ul><li>http://www.atware.co.jp/category/column/apachecon-na-2011/ </li></ul>
  10. 10. <ul><li>Lucene 4.0 </li></ul><ul><li>- next generation open source search - </li></ul><ul><li>by Simon Willnauer </li></ul>
  11. 11. About the speaker <ul><li>Lucene core committer </li></ul><ul><li>Project Management Committee (PMC) chair </li></ul><ul><li>Berlin Buzzwords co-founder </li></ul><ul><ul><li>http://berlinbuzzwords.de/ </li></ul></ul><ul><li>Community portal targeting open source search </li></ul><ul><ul><li>http://www.searchworkings.org/ </li></ul></ul>
  12. 12. Lucene 4.0 <ul><li>The latest release is currently Lucene 3.5.0 </li></ul><ul><li>When will Lucene 4.0 come out? </li></ul><ul><ul><li>Any time. He doesn’t know. </li></ul></ul>
  13. 13. IndexWriter & IndexReader <ul><li>Both talk to a Directory (file system) </li></ul><ul><li>A Directory is just a factory for input and output streams </li></ul><ul><li>From Lucene 4 </li></ul><ul><ul><li>Flex API on the Codec layer </li></ul></ul><ul><li>Codec </li></ul><ul><ul><li>Defines the file format </li></ul></ul><ul><ul><li>Data structures </li></ul></ul><ul><ul><li>Fields, term dictionaries </li></ul></ul><ul><ul><li>You can use MySQL as a backup </li></ul></ul><ul><ul><ul><li>(it’s not a good idea though) </li></ul></ul></ul><ul><li>90% of users won’t touch this layer </li></ul><ul><ul><li>The other 10% might be researchers </li></ul></ul><ul><li>Backward compatibility </li></ul>(Diagram layers: File System, Directory, Codec, Flex API, IndexWriter & Reader)
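
The split between a Directory (which only hands out streams) and a Codec (which defines the file format) can be sketched roughly as follows. This is a toy Python illustration, not Lucene's API: `InMemoryDirectory`, `write_terms`, `read_terms`, and the length-prefixed format are all hypothetical names and choices made up for this example.

```python
import io
import struct


class InMemoryDirectory:
    """Hypothetical directory: a factory for streams, unaware of any file format."""

    def __init__(self):
        self._files = {}

    def create_output(self, name):
        # Keep the stream so its bytes can be read back later.
        # In this toy version the caller must not close the stream.
        stream = io.BytesIO()
        self._files[name] = stream
        return stream

    def open_input(self, name):
        return io.BytesIO(self._files[name].getvalue())


# Hypothetical "codec": it alone decides how terms are laid out in a file.
def write_terms(directory, name, terms):
    out = directory.create_output(name)
    for term in terms:
        data = term.encode("utf-8")
        out.write(struct.pack(">H", len(data)))  # 2-byte big-endian length prefix
        out.write(data)


def read_terms(directory, name):
    inp = directory.open_input(name)
    terms = []
    while True:
        header = inp.read(2)
        if not header:
            break
        (length,) = struct.unpack(">H", header)
        terms.append(inp.read(length).decode("utf-8"))
    return terms


directory = InMemoryDirectory()
write_terms(directory, "terms.bin", ["lucene", "solr"])
print(read_terms(directory, "terms.bin"))
```

Swapping the file format here only means replacing `write_terms` and `read_terms`; the directory stays untouched, which is the kind of separation the slide describes.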
  14. 14. Storing Strings in UTF-8 <ul><li>Lucene 3 uses UTF-16 </li></ul><ul><li>From Lucene 4, UTF-8 </li></ul><ul><li>Performance will improve when you switch to Lucene 4 </li></ul>
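
The slide does not say where the gain comes from. As one plausible contributing factor, the small Python sketch below compares the encoded size of a few sample terms. The terms are made up for illustration, and this is not Lucene code.

```python
# Compare the encoded size of the same term in UTF-16 and UTF-8.
# The sample terms are invented for this illustration.
terms = ["lucene", "solr", "検索"]


def encoded_sizes(term):
    """Return (utf16_bytes, utf8_bytes) for a term, without a byte-order mark."""
    return len(term.encode("utf-16-le")), len(term.encode("utf-8"))


for term in terms:
    utf16, utf8 = encoded_sizes(term)
    print(f"{term!r}: UTF-16 = {utf16} bytes, UTF-8 = {utf8} bytes")
```

ASCII-heavy terms take half the space in UTF-8, while some scripts (for example Japanese) take more, so the size effect depends on the data.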
  15. 15. PostingsFormat <ul><li>PostingsFormat can be defined per field </li></ul><ul><li>field:uid = Pulsing – PostingsFormat </li></ul><ul><ul><li>Usually 1 doc per uid </li></ul></ul><ul><ul><li>Inlines postings into term dictionary </li></ul></ul><ul><ul><li>Saves an additional disk lookup </li></ul></ul><ul><li>field:spell = Memory – PostingsFormat </li></ul><ul><ul><li>Spelling correction doesn’t need posting list traversal </li></ul></ul><ul><ul><li>Large number of key lookups </li></ul></ul><ul><ul><li>Load terms into RAM </li></ul></ul><ul><li>field:body = Default – PostingsFormat </li></ul><ul><li>Primary Key lookup </li></ul><ul><ul><li>170K qps -> 550K qps with Memory PostingsFormat </li></ul></ul>(Diagram labels: Term Dictionary, Term, Posting List, RAM, Terms)
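
The "pulsing" idea for a uid-like field can be illustrated with a toy term dictionary in Python. This is only a sketch of the concept under simplifying assumptions (everything lives in memory, and `build_index` and `lookup` are hypothetical helper names), not how Lucene implements it.

```python
from collections import defaultdict


def build_index(docs):
    """docs: dict of doc_id -> list of terms. Returns (term_dict, posting_lists)."""
    postings = defaultdict(list)
    for doc_id, terms in sorted(docs.items()):
        for term in set(terms):
            postings[term].append(doc_id)

    term_dict, posting_lists = {}, []
    for term, doc_ids in postings.items():
        if len(doc_ids) == 1:
            # "Pulsed": the single doc ID is stored inline in the term dictionary.
            term_dict[term] = ("inline", doc_ids[0])
        else:
            # Otherwise the dictionary entry only points to a separate posting list.
            posting_lists.append(doc_ids)
            term_dict[term] = ("pointer", len(posting_lists) - 1)
    return term_dict, posting_lists


def lookup(term, term_dict, posting_lists):
    kind, value = term_dict[term]
    if kind == "inline":
        return [value]           # answered from the term dictionary alone
    return posting_lists[value]  # needs a second lookup


docs = {1: ["uid-1", "lucene"], 2: ["uid-2", "lucene"]}
term_dict, posting_lists = build_index(docs)
print(lookup("uid-1", term_dict, posting_lists))
print(lookup("lucene", term_dict, posting_lists))
```

For a primary-key field, nearly every term falls into the inline case, which is where the saved lookup comes from.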
  16. 16. IndexDocValues <ul><li>Lucene uses an inverted index ( Term to Doc ) </li></ul><ul><ul><li>It is not good at getting the value of a certain field from a document </li></ul></ul><ul><li>Fast access to a certain field’s value for every document </li></ul><ul><ul><li>Needed to sort documents or to display a doc’s values, not only its ID </li></ul></ul><ul><ul><li>Stored Fields </li></ul></ul><ul><ul><ul><li>They work, but not efficiently </li></ul></ul></ul><ul><ul><ul><li>They are designed for bulk reads </li></ul></ul></ul><ul><ul><li>FieldCache ( in RAM ) </li></ul></ul><ul><ul><ul><li>Undoes the work done at indexing time to build an array (un-inverting) </li></ul></ul></ul><ul><ul><ul><li>It works well up to a certain index size </li></ul></ul></ul><ul><ul><ul><li>It can be a problem in real-time or near-real-time use cases </li></ul></ul></ul><ul><ul><li>IndexDocValues </li></ul></ul><ul><ul><ul><li>1 value per field, type safe </li></ul></ul></ul><ul><ul><ul><li>It can reside on disk </li></ul></ul></ul><ul><li>Reading 10M docs from disk </li></ul><ul><ul><li>FieldCache: 3161 ms </li></ul></ul><ul><ul><li>DocValues: 90 ms </li></ul></ul>(Diagram: Term to Doc, Doc, Doc. How to sort docs?)
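
The difference between un-inverting and a per-document value column can be sketched in Python. The "price" field, its values, and the helper names are assumptions made up for this example, and it ignores on-disk layout entirely.

```python
def uninvert(inverted, num_docs):
    """Rebuild a doc -> value array from a value -> doc IDs index (FieldCache-style)."""
    values = [None] * num_docs
    for value, doc_ids in inverted.items():
        for doc_id in doc_ids:
            values[doc_id] = value
    return values


def sort_docs(values):
    """Return doc IDs ordered by their field value."""
    return sorted(range(len(values)), key=lambda doc_id: values[doc_id])


# Inverted index for a hypothetical "price" field: value -> doc IDs.
inverted_price = {10: [2], 25: [0, 3], 40: [1]}

# Doc-values style: the same data already laid out as one value per document,
# written once at indexing time instead of being rebuilt at search time.
doc_values_price = [25, 40, 10, 25]

print(sort_docs(uninvert(inverted_price, 4)))  # pays the un-inverting cost first
print(sort_docs(doc_values_price))             # reads the column directly
```

Both paths give the same ordering; the point is that the second one skips the walk over every term in the index.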
  17. 17. DWPT (DocumentsWriterPerThread) <ul><li>In Lucene 3 </li></ul><ul><ul><li>IndexWriter merges the in-memory segments and flushes them to disk </li></ul></ul><ul><ul><li>While data is being flushed, the indexing threads have to wait </li></ul></ul><ul><li>From Lucene 4 </li></ul><ul><ul><li>IndexWriter no longer merges in-memory data at flush time </li></ul></ul><ul><ul><li>Each thread flushes its own segment to disk concurrently </li></ul></ul><ul><ul><li>Less RAM, more concurrency </li></ul></ul>
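
The per-thread flushing idea can be sketched with plain Python threads. This is a simplified model, not Lucene's implementation: `PerThreadWriter`, `index_concurrently`, and the `max_buffered` threshold are hypothetical, and "flushing" here just appends a finished segment to a shared list.

```python
import threading


class PerThreadWriter:
    """Buffers documents privately and flushes them as its own segment."""

    def __init__(self, segments, segments_lock, max_buffered=2):
        self.buffer = []
        self.segments = segments
        self.segments_lock = segments_lock
        self.max_buffered = max_buffered

    def add_document(self, doc):
        self.buffer.append(doc)  # touches only this thread's private buffer
        if len(self.buffer) >= self.max_buffered:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        segment, self.buffer = self.buffer, []
        with self.segments_lock:  # only publishing the finished segment is synchronized
            self.segments.append(segment)


def index_concurrently(batches, max_buffered=2):
    segments, lock = [], threading.Lock()

    def worker(batch):
        writer = PerThreadWriter(segments, lock, max_buffered)
        for doc in batch:
            writer.add_document(doc)
        writer.flush()  # flush whatever is left at the end

    threads = [threading.Thread(target=worker, args=(batch,)) for batch in batches]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return segments


segments = index_concurrently([["a1", "a2", "a3"], ["b1", "b2"]])
print(segments)
```

Each segment contains documents from a single thread only, and no thread waits for another thread's flush. The order of segments in the output can vary between runs.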
  18. 18. Automaton Query <ul><li>Automaton Query </li></ul><ul><ul><li>RegExp: (ftp|http).* </li></ul></ul><ul><ul><li>Fuzzy: dogs~1 </li></ul></ul><ul><ul><li>Fuzzy-Prefix: (dogs~1).* </li></ul></ul><ul><li>Fuzzy query was too slow to use in production </li></ul><ul><ul><li>Prior to 4.0, Fuzzy query took the simple yet horribly costly brute force approach </li></ul></ul><ul><ul><li>In Lucene 3 this is about 0.1 - 0.2 QPS </li></ul></ul><ul><ul><li>Now it’s 50 QPS, 20k% improvement! </li></ul></ul><ul><ul><li>http://java.dzone.com/news/lucenes-fuzzyquery-100-times </li></ul></ul>
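
To make the "brute force" cost concrete, here is a minimal Python sketch of the approach the slide attributes to pre-4.0 fuzzy queries: compute the edit distance between the query and every term in the dictionary. The term list is invented, the function names are hypothetical, and the automaton-based approach itself is not shown. The regular expression example uses Python's `re` syntax, which may differ from Lucene's in general.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_brute_force(query, terms, max_edits):
    """Check every term in the dictionary, like dogs~1 done the slow way."""
    return [term for term in terms if edit_distance(query, term) <= max_edits]


def regexp_match(pattern, terms):
    """Keep the terms that match the whole pattern, like (ftp|http).*"""
    return [term for term in terms if re.fullmatch(pattern, term)]


terms = ["dog", "dogs", "dots", "cats", "doges", "frogs"]
print(fuzzy_brute_force("dogs", terms, 1))
print(regexp_match(r"(ftp|http).*", ["http://a", "ftp://b", "file://c"]))
```

The cost of `fuzzy_brute_force` grows with the number of terms in the index, which is why it becomes impractical for large term dictionaries.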
  19. 19. <ul><li>Solr Flair </li></ul><ul><li>by Erik Hatcher </li></ul>
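  20. 20. Solr Flair <ul><li>User Interfaces </li></ul><ul><li>User Interactions </li></ul><ul><li>Ajax suggestion </li></ul><ul><li>Did you mean? – Spell Checking </li></ul><ul><li>Facet </li></ul><ul><li>Cluster </li></ul><ul><li>And so on </li></ul>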
  21. 21. wt = velocity <ul><li>http://wiki.apache.org/solr/VelocityResponseWriter </li></ul><ul><li>Solritas </li></ul><ul><li>/browse </li></ul>
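
As a rough illustration of what a `wt=velocity` request looks like, the Python sketch below only builds the URL. The host, port, `/select` handler path, and the `browse` template name are assumptions about a local example setup; check the VelocityResponseWriter wiki page above for the parameters your Solr version supports.

```python
from urllib.parse import urlencode


def velocity_url(base, query, template="browse"):
    """Build a Solr query URL that asks for the Velocity response writer."""
    params = {
        "q": query,
        "wt": "velocity",        # select the VelocityResponseWriter
        "v.template": template,  # template name; "browse" is an assumed default
    }
    return f"{base}/select?{urlencode(params)}"


# Assumed local Solr instance; adjust to your installation.
print(velocity_url("http://localhost:8983/solr", "lucene"))
```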
  22. 22. Prism <ul><li>https://github.com/lucidimagination/Prism </li></ul><ul><li>Requires </li></ul><ul><ul><li>LucidWorks Enterprise </li></ul></ul><ul><ul><li>JRuby with Sinatra gem installed </li></ul></ul><ul><li>Production use of LucidWorks Enterprise requires an annual subscription </li></ul><ul><ul><li>It’s free to play :’) </li></ul></ul>
  23. 23. Blacklight <ul><li>http://projectblacklight.org/ </li></ul><ul><li>Ruby on Rails </li></ul><ul><li>DEMO </li></ul><ul><ul><li>http://demo.projectblacklight.org/ </li></ul></ul><ul><li>Used by universities </li></ul><ul><ul><li>University of Virginia </li></ul></ul><ul><ul><ul><li>http://search.lib.virginia.edu/catalog?portal=all&q=lucene </li></ul></ul></ul><ul><ul><li>Stanford University </li></ul></ul><ul><ul><ul><li>http://searchworks.stanford.edu/?q=lucene+in+action&search_field=search </li></ul></ul></ul>
  24. 24. VuFind <ul><li>http://vufind.org/ </li></ul><ul><li>Blacklight competitor </li></ul><ul><li>Library resource portal </li></ul><ul><li>PHP </li></ul><ul><li>DEMO </li></ul><ul><ul><li>http://vufind.org/demo/ </li></ul></ul>
  25. 25. TwigKit <ul><li>http://twigkit.com/ </li></ul><ul><li>JSP tag library </li></ul><ul><li>Search UI components </li></ul><ul><li>Samples </li></ul><ul><ul><li>http://twigkit.com/components.html </li></ul></ul>
  26. 26. Ajax Solr <ul><li>https://github.com/evolvingweb/ajax-solr </li></ul><ul><li>JavaScript library that works with jQuery </li></ul><ul><li>DEMO </li></ul><ul><ul><li>http://evolvingweb.github.com/ajax-solr/examples/reuters/index.html </li></ul></ul>
  27. 27. ApacheCon 2012 <ul><li>ApacheCon EUROPE </li></ul><ul><li>November 2012 </li></ul><ul><li>Germany!!? </li></ul>
  28. 28. Thank you!