Lucene Revolution 2011: MontySolr presentation

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

  • mention the transition/collaboration: CERN-DESY-Fermilab-SLAC
  • paradigm of a full result set
  • Python: fast prototyping, easy for students (who write a lot of the code)
  • X - do not spend time on the code
    "I was waiting for the point 'this is the solution'" :)
    ad #1
      solr.client.solrj.embedded.EmbeddedSolrServer
      Solr is running as an embedded process, not inside a servlet container
      the default/recommended way
    ad #2
      solr.servlet.DirectSolrConnection
      like the previous, but simpler
      all the queries are sent as strings, everything is just a string
      very flexible and probably suitable for quick integration
  • I don't mention some options like writing JNI ourselves or using intermediaries other than remote objects (e.g. shared memory, if that would be possible)
  • everybody thinks Jython, right? No!
  • These are only some of the important features; omitted are simplicity and beauty (JEPP eval is just an ugly way of doing things), documentation, community, support, etc.
  • Make sure it is clear that processes can have threads - here it is not clear what is a process and what is a thread (it is not visible)
  • Truly bi-directional
    We can call Python functions and pass Java objects
    From inside Python we can call Java objects/methods
  • the real-code example is in appendix #3
  • the real code is in appendix #4
  • note: don't forget to mention how multiprocessing saves memory on Linux systems (due to copy-on-write and forking). This is effectively an alternative to Python WSGI, which cannot run multiprocessing. We show that it is possible to use multiprocessing effectively.
  • the real code is in appendix #3
  • more precise - MontySolr intro (include)
  • TODO:
    Invenio is the same as Django
    Today, Solr can do 2nd-order operations

Transcript

  • 1. MontySolr: Embedding CPython in Solr
    Roman Chyla, CERN, roman.chyla@cern.ch, May 26, 2011
  • 2. Why should I care?
    - Our challenge is to connect Python and Java
    - Without compromises
    - We created MontySolr extension
      - Robust, tested (will be used by our system)
      - But works for any Python application (e.g. Django)
      - And for any C/C++ app that Python understands!
      - Open source (GPL v2)
    - Try it out!
      - https://github.com/romanchyla/montysolr
  • 3. Outline
    ‣ Context
    - The Challenge
    - Key components
      - Available technologies
      - Our approach
      - Problems solved
    - Evaluation
    - Wrap-up
  • 4. CERN
    - European Organization for Nuclear Research
      - Switzerland, Geneva
    - The largest laboratory for High Energy Physics
    - Home to the Large Hadron Collider
    - 40-50K HEP scientists worldwide
  • 12. SPIRES
    - Stanford Linear Accelerator Center
      - SLAC
    - High-Energy Physics Literature Database
    - Started December 1991
      - The first website outside Europe/CERN
      - The first database on the web
  • 16. Invenio
    - Integrated digital library software behind INSPIRE
    - Used by very large institutional repositories
      - http://repositories.webometrics.info/toprep_inst.asp
    - Customizable virtual collections
    - Flexible management of metadata
      - 3 000 authors per article
    - Powerful search engine
      - Incl. citation map analysis
    - Written in Python (since 2001)
      - 290 000 lines of code
  • 17. Outline
    - Context
    ‣ The Challenge
    - Key components
      - Available technologies
      - Our approach
      - Problems solved
    - Evaluation
    - Wrap-up
  • 18. The Challenge
    - HEP scientific community
      - Searches are metadata-oriented
    - However, fulltexts are changing the situation
    - And we want to provide even better service
      - Bigger volumes of data
      - NLP processing
      - Semantic search
  • 19. The Challenge (diagram, built up over several slides)
    - Query: supersymmetry AND author:ellis
    - Invenio
    - fulltext:supersymmetry
    - IDs: 1;2;3;9.... (1-6M IDs)
    - 1. only IDs, no score = no ranking
    - 2. score merging difficult (if no score available)
    - 3. push IDs? (e.g. faceting)
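The merging problem in this diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `merge_hits` is a hypothetical helper (not Invenio or Solr code), and the IDs and scores are made up. It shows that intersecting two hit lists is easy, while ranking works only as long as one side still carries scores.

```python
def merge_hits(metadata_ids, fulltext_hits):
    """Intersect metadata hits (bare IDs) with fulltext hits (ID -> score).

    Both arguments are illustrative assumptions, not real Invenio/Solr types.
    """
    common = set(metadata_ids) & set(fulltext_hits)
    # Ranking is possible only because one side still has scores; with two
    # bare ID sets the result could merely be sorted by ID.
    return sorted(common, key=lambda rec_id: (-fulltext_hits[rec_id], rec_id))


invenio_ids = [1, 2, 3, 9]                     # e.g. author:ellis
solr_hits = {2: 0.4, 3: 1.7, 5: 0.9, 9: 1.1}   # e.g. fulltext:supersymmetry
ranked = merge_hits(invenio_ids, solr_hits)
print(ranked)
```

With 1-6M IDs per query, the expensive part is not this intersection but moving the ID list between the two systems, which is the concern the following slides address.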
  • 30. What is the "best" solution?
    - We love Python...
    - ...and our applications are written in Python...
    - But what if Solr is the master search engine?
    - Merge results inside Solr?
      - Typical size: 1-10 mil. IDs
      - Expected latency: 1-2 s.
    - What we want to achieve:
      - Fast transfer of hits from Invenio to Solr
      - Leverage the power of both (no compromises)
      - Developer-friendly integration, simplicity
    - Additional concerns:
  • 31. Outline
    - Context
    - The Challenge
    ‣ Key components
      - Available technologies
      - Our approach
    - Evaluation
    - Demonstration
    - Wrap-up
  • 32. To embed Solr (in Java app)
    - Your app simulates a Java web container?
      - use EmbeddedSolrServer
    - It knows nothing about Java servlets?
      - use the DirectSolrConnection class
    - Maybe we are too lazy?
      - Embed the web container (in my case Jetty)
      - Seemed strange (webserver inside webserver)
      - ... but it worked well
  • 37. To use Solr in non-Java app
    - Solr is already usable via HTTP requests, but we need something else here...
    - Remote objects/calls?
      - Pyro, execnet, CORBA, SOAP...
      - or simply pipes?
    - Access Python from Java?
      - Jython
      - JEPP
    - Access Java from Python?
      - JPype
      - JCC
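For comparison, the plain HTTP route mentioned on this slide looks roughly like the sketch below. It only builds the request URL for Solr's standard `/select` handler and does not contact a server; the host, port and path are assumptions for illustration, and `build_select_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10):
    """Return a URL for Solr's standard /select request handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Host, port and path are assumed values, not taken from the talk.
url = build_select_url("http://localhost:8983/solr", "fulltext:supersymmetry")
print(url)
```

Every result set fetched this way has to be serialized and parsed again on the other side, which is presumably why HTTP is set aside here given the 1-10 million IDs and 1-2 s latency quoted earlier.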
  • 38. Jython?
    - Implementation of Python in 100% Java
    - Both Java and Python code
    - Truly multithreaded
    - C modules will not work
      - but see http://bit.ly/iTRYbb
    - Slower than CPython
  • 41. JEPP - Java Embedded Python
    - Python code runs inside the Python interpreter
    - Embeds the CPython interpreter via Java Native Interface (JNI) in Java
    - http://jepp.sourceforge.net/
      - recently updated (27-Jan)
      - but JCC is more active
  • 42. JEPP - Java Embedded Python (diagram)
  • 43. JCC
    - Embeds JVM in Python
    - C++ code generator
    - C++ object interface wraps a Java library
    - C++ wrappers conform to Python's C type system
    - result: complete Python extension module
  • 44. JCC (diagram)
  • 47. To use Solr in non-Java app (comparison table; columns: Jython, JCC, JEPP)
    - Python C modules: ✓ ✓
    - Speed: ✓ ?
    - No code changes: ✓ ✓
    - Access from Python: ✓ ✓
    - Access from Java: ✓ ... ✓
  • 48. The first try (diagram: Invenio, Solr, JCC)
  • 49. The devil is in the details...
  • 50. GIL - Global Interpreter Lock
    Unfortunately, a Python webapp is not like Java...
  • 51. GIL - Global Interpreter Lock
    We can have 200 threads, but only 4 will run at a time...
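The effect of the GIL can be illustrated with a small thread pool. This is a minimal sketch, assuming a standard CPython build with the GIL enabled; `count_down` and the pool size of 8 are arbitrary choices for illustration. The threads all complete their work, but within one interpreter they take turns executing Python bytecode rather than running in parallel.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """Pure-Python CPU-bound loop; the running thread holds the GIL."""
    while n > 0:
        n -= 1
    return n


# Eight worker threads exist, but in a standard (GIL) CPython build only one
# of them executes Python bytecode at any given instant.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(count_down, [100_000] * 8))

print(results)
# Interval (in seconds) after which a running thread is asked to release the GIL.
print(sys.getswitchinterval())
```

Running several interpreter processes instead of threads is the usual way around this limit, which matches the "multiprocessing on the Python side" item listed later for MontySolr.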
  • 53. Fortunately, a solution exists
    - JCC can embed Python inside Java
      - Special thanks to Andi Vajda! (JCC creator)
    - We write 'empty' classes in Java ...
    - ... and implement them in Python
    (diagram: Python w/ Java inside, Java w/ Python inside)
  • 54. The second try (diagram: Solr w/ Invenio (backend), Invenio frontend, XML, JCC)
  • 55. Implementing the bridge
    - Special Java class
    - With method pythonExtension()
    - Native method pythonDecRef()
      - JCC provides its implementation
    - And a number of other native methods
      - These will be implemented using Python
    - Like writing JNI Java/C code but without compilation...
  • 56. MontySolr extension
    - JCC has great potential, but also added complexity...
    - So the MontySolr project was born
      - Modules must be built in shared mode
      - JCC dynamic library loaded and started from the main thread
      - Simple mechanism of the Python bridge and message
      - Configurable handlers on the Python side
      - Secured dereferencing of the native objects
      - Threading on the Java side
      - Multiprocessing on the Python side
      - Easy ant targets (compilation) ...
  • 57. Hello World - Java part
      public class MontySolrBridge extends BasicBridge implements PythonBridge {
          private long pythonObject;

          public void pythonExtension(long pythonObject) {
              this.pythonObject = pythonObject;
          }

          public long pythonExtension() {
              return this.pythonObject;
          }

          public void finalize() throws Throwable {
              pythonDecRef();
          }

          public native void pythonDecRef();

          public void sendMessage(PythonMessage message) {
              PythonVM vm = PythonVM.get();
              vm.acquireThreadState();
              receive_message(message);
              vm.releaseThreadState();
          }

          public native void receive_message(PythonMessage message);
      }
  • 58. Hello World - Python part
      from montysolr import MontySolrBridge

      class SimpleBridge(MontySolrBridge):
          def __init__(self):
              super(SimpleBridge, self).__init__()

          def receive_message(self, message):
              query = message.getParam('query')
              message.setResults('Hello world!')
              print('Python received from Java: %s' % query)
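The round trip of the two Hello World slides can be tried without a JVM by replacing the Java side with plain Python stand-ins. In the sketch below, `FakeMessage` and `FakeBridge` are hypothetical substitutes for `PythonMessage` and the JCC-generated `MontySolrBridge`; they only mimic the method names shown on the slides and say nothing about how JCC implements them.

```python
class FakeMessage:
    """Hypothetical stand-in for the Java PythonMessage object."""

    def __init__(self, **params):
        self._params = params
        self._results = None

    def getParam(self, name):
        return self._params.get(name)

    def setResults(self, results):
        self._results = results

    def getResults(self):
        return self._results


class FakeBridge:
    """Hypothetical stand-in for the JCC-generated MontySolrBridge."""

    def sendMessage(self, message):
        # The Java original acquires the Python thread state here, then
        # calls the 'native' method that the Python subclass provides.
        self.receive_message(message)

    def receive_message(self, message):
        raise NotImplementedError("implemented by the Python subclass")


class SimpleBridge(FakeBridge):
    def receive_message(self, message):
        query = message.getParam("query")
        message.setResults("Hello world!")
        print("Python received from Java: %s" % query)


msg = FakeMessage(query="refersto:author:ellis")
SimpleBridge().sendMessage(msg)
print(msg.getResults())
```

The point of the pattern is that the caller only sees `sendMessage` and `getResults`; which language implements `receive_message` is hidden behind the bridge.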
  • 59. Example - running MontySolr
    - Java side
      - JRE (32/64 bit)
      - Standard Solr/Lucene jars
      - JCC dynamic library
    - Python side
      - Python interpreter (32/64 bit)
      - 4 Python modules (jcc, solr, lucene, montysolr)
    - In the main thread
      - First we load JCC
      - Then start Python interpreter ...
      - ... load Python handlers
  • 60. Solr as search service (diagram: Solr w/ Invenio (backend), Invenio frontend, XML, JCC)
  • 62. Example (diagram: refersto:author:ellis, Solr, MyCustom Handler)
  • 63. Example - Solr custom handler
      PythonMessage msg = MontySolrVM.INSTANCE
          .createMessage("perform_search")
          .setSender("Invenio")
          .setParam("query", "refersto:author:ellis");
      MontySolrVM.INSTANCE.sendMessage(msg);
      Object result = msg.getResults();
      if (result != null) {
          int[] hits = (int[]) result;
      }
  • 64. Example - JNI connection (diagram: refersto:author:ellis, Solr, MyCustom Handler, Python Bridge, Invenio wrappers)
  • 66. Example - Python side
      # handler is made 'visible' at startup
      SolrpieTarget('Invenio:perform_search', perform_search)

      # search time - called from Java
      def perform_search(message):
          query = message.getParam("query")
          hits = call_real_search(query)
          # cast Python list into Java array
          message.setResults(JArray_ints(hits))
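The registration step on this slide amounts to a lookup table from a 'sender:recipient' key to a Python function. The sketch below is an assumption about the general pattern, not MontySolr's actual routing code: `register`, `dispatch`, `handlers` and the fake index are hypothetical names and data; only the key format `Invenio:perform_search` comes from the slide.

```python
handlers = {}


def register(sender, recipient, func):
    """Make a handler visible under a 'sender:recipient' key."""
    handlers["%s:%s" % (sender, recipient)] = func


def dispatch(sender, recipient, params):
    """Look up the handler for a message and return its results."""
    key = "%s:%s" % (sender, recipient)
    if key not in handlers:
        return None  # the caller sees 'no results', like the null check in Java
    return handlers[key](params)


def perform_search(params):
    # Stand-in for the real Invenio search call; returns record IDs.
    fake_index = {"refersto:author:ellis": [1, 2, 3, 9]}
    return fake_index.get(params["query"], [])


register("Invenio", "perform_search", perform_search)
hits = dispatch("Invenio", "perform_search", {"query": "refersto:author:ellis"})
print(hits)
```

Keeping the table on the Python side means new handlers can be added without recompiling anything on the Java side.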
  • 67. Example (diagram: refersto:author:ellis, Solr, MyCustom Handler, Python Bridge, Invenio wrappers, several Invenio instances)
  • 68. Example - Java side again
      PythonMessage msg = MontySolrVM.INSTANCE
          .createMessage("perform_search")
          .setSender("Invenio")
          .setParam("query", "refersto:author:ellis");
      MontySolrVM.INSTANCE.sendMessage(msg);
      Object result = msg.getResults();
      if (result != null) {
          int[] hits = (int[]) result;
      }
  • 69. Solr as search service (diagram: Solr w/ Invenio (backend), Apache webserver, XML, Invenio, Invenio, JCC)
  • 70. Outline
    - Context
    - The Challenge
    - Key components
      - Available technologies
      - Our approach
      - Problems solved
    ‣ Evaluation
    - Wrap-up
  • 71. Memory and garbage collection
  • 72. Comparing speed and load...
  • 73. The effect of cache
  • 74. Robust?
    - Extensive siege tests show very good performance and stability under high load
      - 100-200 users, complex searches
      - 50 concurrent users, citation analysis
      - JCC incurs small overhead
    - We detected no memory leaks
      - The same as dbpedia.org
    - But watch out for errors in C
      - An error in a C module brings down the whole JVM
      - (errors in a pure Python module can be handled)
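The last point, that errors in pure Python can be handled, can be sketched as a wrapper around a handler call. `safe_call` and `broken_handler` are hypothetical names used only for illustration; a crash inside a C extension would not raise a Python exception and would therefore not be caught this way.

```python
def safe_call(handler, message):
    """Run a Python handler and turn any exception into an error result."""
    try:
        return {"ok": True, "results": handler(message)}
    except Exception as exc:  # pure-Python failures are catchable here
        return {"ok": False, "error": "%s: %s" % (type(exc).__name__, exc)}


def broken_handler(message):
    return 1 / message["divisor"]


outcome = safe_call(broken_handler, {"divisor": 0})
print(outcome)
```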
  • 75. Easy to develop/maintain?
    - Added complexity
      - Java in the toolbox
      - Need to compile C++ extensions
      - Python/OS version dependencies
    - For this we get
      - Easy integration with Invenio
      - The best of two applications
      - A lot of features for free
      - And we can control Solr from Python!
  • 76. Outline
    - Context
    - The Challenge
    - Key components
      - Available technologies
      - Our approach
      - Problems solved
    - Evaluation
    ‣ Wrap-up
  • 77. Wrap-up
    - Our challenge was to connect two different languages/systems
    - And we wanted to get the best of the two...
      - So we had to plug Python into Solr
      - And now our Solr knows citation analysis!
    - We created MontySolr extension
      - Robust, tested (will be used by INSPIRE)
      - Works for any Python application (e.g. Django)
      - And for any C/C++ app that Python understands!
      - Free software license
    - Try it out! Help us make it better!
      - https://github.com/romanchyla/montysolr
  • 78. Questions?
    - MontySolr
      - https://github.com/romanchyla/montysolr
    - Roman Chyla
      - Fellow, CERN Scientific Information Service
      - roman.chyla@cern.ch
      - @rchyla
      - https://svnweb.cern.ch/trac/rcarepo
  • 79. Additional information
  • 80. Links
    - Invenio platform
      - http://invenio-software.org/
    - INSPIRE Digital library
      - http://inspirebeta.net/
    - Diagrams of JCC and JEPP
      - Andreas Schreiber: Mixing Java and Python
      - http://www.slideshare.net/onyame/mixing-python-and-java
    - On Jython C Extension API
      - http://stackoverflow.com/questions/3097466/using-numpy-and-cpython-with-jython
    - Demo of a running service:
      - http://insdev01.cern.ch
  • 81. #1 - How to embed Solr (standard)
    - solr.client.solrj.embedded.EmbeddedSolrServer
  • 82. #2 - How to embed Solr (simplified)
    - solr.servlet.DirectSolrConnection
    - like the previous, but simpler
    - all the queries are sent as strings, everything is just a string
    - very flexible and probably suitable for quick integration
  • 84. #3 - Example of a Solr custom handler
  • 85. #4 - Example Python handler