Lucene Revolution 2011: MontySolr presentation

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

Speaker notes:
  • mention the transition/collaboration: CERN-DESY-Fermilab-SLAC
  • paradigm of a full result set
  • Python: fast prototyping, easy for students (who write a lot of the code)
  • X - do not spend time on the code
    “I was waiting for the point ‘this is the solution’” :)
    ad #1: solr.client.solrj.embedded.EmbeddedSolrServer
      - Solr is running as an embedded process, not inside a servlet container
      - the default/recommended way
    ad #2: solr.servlet.DirectSolrConnection
      - like the previous, but simpler
      - all the queries are sent as strings, everything is just a string
      - very flexible and probably suitable for quick integration
  • I don't mention some options, like writing JNI ourselves or using intermediaries other than remote objects (e.g. shared memory, if that would be possible)
  • everybody thinks Jython, right? No!
  • These are only some important features; omitted are simplicity and beauty (JEPP eval is just an ugly way of doing things), documentation, community, support etc.
  • Make sure that it is clear that processes can have threads - here it is not clear what is a process and what is a thread (it is not visible)
  • Truly bi-directional
      - We can call Python functions and pass Java objects
      - From inside Python we can call Java objects/methods
  • the real-code example is in appendix #3
  • the real code is in appendix #4
  • note: don't forget to mention how multiprocessing saves memory on Linux systems (due to copy-on-write after forking). This is effectively an alternative to Python WSGI, which cannot run multiprocessing. We show that it is possible to use multiprocessing effectively.
  • the real code is in appendix #3
  • more precise - montysolr intro (include)
  • TODO:
      - Invenio is the same as Django
      - Today, Solr can now do 2nd order operations

    1. MontySolr: Embedding CPython in Solr
        Roman Chyla, CERN, roman.chyla@cern.ch, May 26, 2011
    2. Why should I care?
        - Our challenge is to connect Python and Java
        - Without compromises
        - We created the MontySolr extension
          - Robust, tested (will be used by our system)
          - But works for any Python application (e.g. Django)
          - And for any C/C++ app that Python understands!
          - Open source (GPL v2)
        - Try it out!
          - https://github.com/romanchyla/montysolr
    3. Outline
        ‣ Context
        - The Challenge
        - Key components
          - Available technologies
          - Our approach
          - Problems solved
        - Evaluation
        - Wrap-up
    4. CERN
        - European Organization for Nuclear Research
          - Switzerland, Geneva
        - The largest laboratory for High Energy Physics
        - Home to the Large Hadron Collider
        - 40-50K HEP scientists worldwide
    12. SPIRES
        - Stanford Linear Accelerator Center (SLAC)
        - High-Energy Physics Literature Database
        - Started December 1991
          - The first web server outside Europe/CERN
          - The first database on the web
    16. Invenio
        - Integrated digital library software behind INSPIRE
        - Used by very large institutional repositories
          - http://repositories.webometrics.info/toprep_inst.asp
        - Customizable virtual collections
        - Flexible management of metadata
          - 3 000 authors per article
        - Powerful search engine
          - Incl. citation map analysis
        - Written in Python (since 2001)
          - 290 000 lines of code
    17. Outline (‣ The Challenge)
    18. The Challenge
        - HEP scientific community
          - Searches are metadata-oriented
        - However, fulltexts are changing the situation
        - And we want to provide even better service
          - Bigger volumes of data
          - NLP processing
          - Semantic search
    29. The Challenge (diagram, built up over slides 19-29)
        - Query: supersymmetry AND author:ellis
        - Diagram labels: Invenio; fulltext:supersymmetry; IDs: 1;2;3;9....; 1-6M IDs
        - Annotations:
          1. only IDs, no score = no ranking
          2. score merging difficult (if available)
          3. push IDs? (e.g. faceting)
    30. What is the “best” solution?
        - We love Python...
        - ...and our applications are written in Python...
        - But what if Solr is the master search engine?
        - Merge results inside Solr?
          - Typical size: 1-10 million IDs
          - Expected latency: 1-2 s
        - What we want to achieve:
          - Fast transfer of hits from Invenio to Solr
          - Leverage the power of both (no compromises)
          - Developer-friendly integration, simplicity
        - Additional concerns:
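
To make the "merge results" problem concrete, here is a minimal sketch in plain Python, not MontySolr code. It assumes one engine returns scored hits and the other returns only a bare list of IDs, and it keeps the scored hits that appear in both. The function name `merge_hits` and the sample data are hypothetical.

```python
def merge_hits(scored_hits, id_only_hits):
    """Intersect scored hits with an ID-only result set, ranked by score.

    scored_hits  -- iterable of (doc_id, score) pairs (e.g. from a full-text engine)
    id_only_hits -- iterable of doc IDs with no score (e.g. from a metadata search)
    """
    allowed = set(id_only_hits)  # set membership keeps the intersection linear
    merged = [(doc_id, score) for doc_id, score in scored_hits if doc_id in allowed]
    # Highest score first; ties are broken by document ID for a stable order.
    merged.sort(key=lambda pair: (-pair[1], pair[0]))
    return merged


# Hypothetical sample data
scored = [(1, 0.9), (2, 0.4), (3, 0.7), (9, 0.1)]
ids_only = [2, 3, 9, 12]
ranked = merge_hits(scored, ids_only)
print(ranked)
```

With millions of IDs, the expensive part is moving the ID list between the two systems rather than the intersection itself, which is why the slides focus on fast transfer of hits.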
    31. Outline (‣ Key components)
    32. To embed Solr (in Java app)
        - Your app simulates a Java web container?
          - use EmbeddedSolrServer
        - It knows nothing about Java servlets?
          - use the DirectSolrConnection class
        - Maybe we are too lazy?
          - Embed the web container (in my case Jetty)
          - Seemed strange (webserver inside webserver)
          - ... but it worked well
    37. To use Solr in non-Java app
        - Solr is already usable via HTTP requests, but we need something else here...
        - Remote objects/calls?
          - Pyro, execnet, CORBA, SOAP...
          - or simply pipes?
        - Access Python from Java?
          - Jython
          - JEPP
        - Access Java from Python?
          - JPype
          - JCC
    38. Jython?
        - Implementation of Python in 100% Java
        - Both Java and Python code
        - Truly multithreaded
        - C modules will not work
          - but see http://bit.ly/iTRYbb
        - Slower than CPython
    41. JEPP - Java Embedded Python
        - Python code runs inside the Python interpreter
        - Embeds the CPython interpreter in Java via the Java Native Interface (JNI)
        - http://jepp.sourceforge.net/
          - recently updated (27-Jan)
          - but JCC is more active
    42. JEPP - Java Embedded Python (diagram)
    43. JCC
        - Embeds JVM in Python
        - C++ code generator
        - C++ object interface wraps a Java library
        - C++ wrappers conform to Python's C type system
        - result: complete Python extension module
    44. JCC (diagram)
    47. To use Solr in non-Java app (comparison of Jython, JCC, JEPP)
        - Python C modules: ✓ ✓
        - Speed: ✓ ?
        - No code changes: ✓ ✓
        - Access from Python: ✓ ✓
        - Access from Java: ✓ ... ✓
    48. The first try (diagram: Invenio; Solr; JCC)
    49. The devil is in the details...
    50. GIL - Global Interpreter Lock
        Unfortunately, a Python webapp is not like Java...
    51. GIL - Global Interpreter Lock
        We can have 200 threads, but only 4 will run at a time...
    52. GIL - Global Interpreter Lock (diagram)
    53. Fortunately, a solution exists
        - JCC can embed Python inside Java
          - Special thanks to Andi Vajda! (JCC creator)
        - We write ‘empty’ classes in Java ...
        - ... and implement them in Python
        (diagram: Python w/ Java inside; Java w/ Python inside)
    54. The second try (diagram: Solr w/ Invenio (backend); Invenio frontend; XML; JCC)
    55. Implementing the bridge
        - Special Java class
        - With method pythonExtension()
        - Native method pythonDecRef()
          - JCC provides its implementation
        - And a number of other native methods
          - These will be implemented using Python
        - Like writing JNI Java/C code but without compilation...
    56. MontySolr extension
        - JCC has great potential, but also added complexity...
        - So the MontySolr project was born
          - Modules must be built in shared mode
          - JCC dynamic library loaded and started from the main thread
          - Simple mechanism of the Python bridge and message
          - Configurable handlers on the Python side
          - Secured dereferencing of the native objects
          - Threading on the Java side
          - Multiprocessing on the Python side
          - Easy ant targets (compilation) ...
    57. Hello World - Java part
        public class MontySolrBridge extends BasicBridge implements PythonBridge {

            // Pointer to the Python object that implements this class
            private long pythonObject;

            public void pythonExtension(long pythonObject) {
                this.pythonObject = pythonObject;
            }

            public long pythonExtension() {
                return this.pythonObject;
            }

            public void finalize() throws Throwable {
                pythonDecRef();
            }

            // Implementation provided by JCC
            public native void pythonDecRef();

            public void sendMessage(PythonMessage message) {
                PythonVM vm = PythonVM.get();
                vm.acquireThreadState();
                receive_message(message);
                vm.releaseThreadState();
            }

            // Implemented on the Python side
            public native void receive_message(PythonMessage message);
        }
    58. Hello World - Python part
        from montysolr import MontySolrBridge

        class SimpleBridge(MontySolrBridge):

            def __init__(self):
                super(SimpleBridge, self).__init__()

            # Called from Java through the native receive_message() method
            def receive_message(self, message):
                query = message.getParam('query')
                message.setResults('Hello world!')
                print('Python received from Java: %s' % query)
    59. Example - running MontySolr
        - Java side
          - JRE (32/64 bit)
          - Standard Solr/Lucene jars
          - JCC dynamic library
        - Python side
          - Python interpreter (32/64 bit)
          - 4 Python modules (jcc, solr, lucene, montysolr)
        - In the main thread
          - First we load JCC
          - Then start the Python interpreter ...
          - ... load Python handlers
    60. Solr as search service (diagram: Solr w/ Invenio (backend); Invenio frontend; XML; JCC)
    61. Example (diagram: Solr; MyCustom Handler)
    62. Example (diagram: query refersto:author:ellis; Solr; MyCustom Handler)
    63. Example - Solr custom handler
        // Build a message for the Python handler registered as "perform_search"
        PythonMessage msg = MontySolrVM.INSTANCE
            .createMessage("perform_search")
            .setSender("Invenio")
            .setParam("query", "refersto:author:ellis");

        // The Python handler fills in the results during this call
        MontySolrVM.INSTANCE.sendMessage(msg);

        Object result = msg.getResults();
        if (result != null) {
            int[] hits = (int[]) result;
        }
    64. Example - JNI connection (diagram: query refersto:author:ellis; Solr; MyCustom Handler; Python Bridge)
    65. Example - JNI connection (diagram: query refersto:author:ellis; Solr; MyCustom Handler; Python Bridge; Invenio wrappers)
    66. Example - Python side
        # search time - called from Java
        def perform_search(message):
            query = message.getParam("query")
            hits = call_real_search(query)
            # cast Python list into Java array
            message.setResults(JArray_ints(hits))

        # handler is made 'visible' at startup
        SolrpieTarget("Invenio:perform_search", perform_search)
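
The handler mechanism on the slides (a sender plus a message name selecting a Python function that writes its results back into the message) can be sketched without JCC or MontySolr installed. Everything below is a hypothetical stand-in: `Message` imitates only the `getParam`/`setResults`/`getResults` calls shown on the slides, and `Dispatcher` is an assumed registry, not MontySolr's actual implementation.

```python
class Message:
    """Hypothetical stand-in for the Java-side PythonMessage object."""

    def __init__(self, sender, recipient, **params):
        self.sender = sender
        self.recipient = recipient
        self._params = dict(params)
        self._results = None

    def getParam(self, name):
        return self._params.get(name)

    def setResults(self, results):
        self._results = results

    def getResults(self):
        return self._results


class Dispatcher:
    """Assumed registry mapping 'sender:recipient' keys to handler functions."""

    def __init__(self):
        self._handlers = {}

    def register(self, key, handler):
        self._handlers[key] = handler

    def send(self, message):
        key = "%s:%s" % (message.sender, message.recipient)
        handler = self._handlers.get(key)
        if handler is None:
            raise KeyError("no handler registered for %r" % key)
        handler(message)  # the handler stores its results in the message
        return message


def perform_search(message):
    """Toy handler: looks the query up in a hard-coded index."""
    fake_index = {"refersto:author:ellis": [1, 2, 3, 9]}
    message.setResults(fake_index.get(message.getParam("query"), []))


dispatcher = Dispatcher()
dispatcher.register("Invenio:perform_search", perform_search)
reply = dispatcher.send(
    Message("Invenio", "perform_search", query="refersto:author:ellis"))
print(reply.getResults())
```

Passing one mutable message object in both directions keeps the bridge interface down to a single call, which matches the single `receive_message` method shown in the Hello World slides.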
    67. Example (diagram: query refersto:author:ellis; Solr; MyCustom Handler; Python Bridge; Invenio wrappers; several Invenio instances)
    68. Example - Java side again (the same handler code as on slide 63: once sendMessage() returns, the hits are read from the message)
    69. Solr as search service (diagram: Solr w/ Invenio (backend); Apache webserver; XML; Invenio instances; JCC)
    70. Outline (‣ Evaluation)
    71. Memory and garbage collection
    72. Comparing speed and load...
    73. The effect of cache
    74. Robust?
        - Extensive siege tests show very good performance and stability under high load
          - 100-200 users, complex searches
          - 50 concurrent users, citation analysis
          - JCC incurs small overhead
        - We detected no memory leaks
          - The same as dbpedia.org
        - But watch out for errors in C
          - An error in a C module brings down the whole JVM
          - (errors in a pure Python module can be handled)
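
The last point, that errors in pure Python can be handled, might look like the following minimal sketch. It assumes a handler is wrapped so that a Python exception is recorded on the message instead of propagating to the caller. The wrapper `safe_call` and the dictionary-based message are hypothetical and not part of MontySolr. A crash inside a C extension cannot be caught this way.

```python
def safe_call(handler, message):
    """Run a handler; on failure, store the error text in the message."""
    try:
        handler(message)
        message["ok"] = True
    except Exception as exc:
        # Only Python-level exceptions end up here; a segfault in C would not.
        message["ok"] = False
        message["error"] = "%s: %s" % (type(exc).__name__, exc)
    return message


def broken_handler(message):
    raise ValueError("bad query: %s" % message["query"])


def working_handler(message):
    message["results"] = [1, 2, 3]


result = safe_call(broken_handler, {"query": "author:"})
print(result["error"])
```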
    75. Easy to develop/maintain?
        - Added complexity
          - Java in the toolbox
          - Need to compile C++ extensions
          - Python/OS version dependencies
        - For this we get
          - Easy integration with Invenio
          - The best of two applications
          - A lot of features for free
          - And we can control Solr from Python!
    76. Outline (‣ Wrap-up)
    77. Wrap-up
        - Our challenge was to connect two different languages/systems
        - And we wanted to get the best of the two...
          - So we had to plug Python into Solr
          - And now our Solr knows citation analysis!
        - We created the MontySolr extension
          - Robust, tested (will be used by INSPIRE)
          - Works for any Python application (e.g. Django)
          - And for any C/C++ app that Python understands!
          - Free software license
        - Try it out! Help us make it better!
          - https://github.com/romanchyla/montysolr
    78. Questions?
        - MontySolr
          - https://github.com/romanchyla/montysolr
        - Roman Chyla
          - Fellow, CERN Scientific Information Service
          - roman.chyla@cern.ch
          - @rchyla
          - https://svnweb.cern.ch/trac/rcarepo
    79. Additional information
    80. Links
        - Invenio platform
          - http://invenio-software.org/
        - INSPIRE Digital library
          - http://inspirebeta.net/
        - Diagrams of JCC and JEPP
          - Andreas Schreiber: Mixing Java and Python
          - http://www.slideshare.net/onyame/mixing-python-and-java
        - On Jython C Extension API
          - http://stackoverflow.com/questions/3097466/using-numpy-and-cpython-with-jython
        - Demo of a running service:
          - http://insdev01.cern.ch
    81. #1 - How to embed Solr (standard)
        - solr.client.solrj.embedded.EmbeddedSolrServer
    82. #2 - How to embed Solr (simplified)
        - solr.servlet.DirectSolrConnection
        - like the previous, but simpler
        - all the queries are sent as strings, everything is just a string
        - very flexible and probably suitable for quick integration
    84. #3 - Example of a Solr custom handler
    85. #4 - Example Python handler
