This document summarizes a presentation about MontySolr, an extension that allows embedding CPython in Solr. It was created by Roman Chyla of CERN to connect Python and Java applications without compromises. MontySolr uses JCC to embed a Python interpreter in Java, allowing Python code to interface with Solr. This provides a robust, tested integration that works for any Python or C/C++ application and leverages the strengths of both Solr and Invenio.
How to Gain Greater Business Intelligence from Lucene/Solr
Presented by Patrick Beaucamp | Bpm-Conseil - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Vanilla, an Open Source business intelligence application by bpm-conseil.com, offers unique features such as report indexing through an embedded Lucene integration. Using Vanilla and Lucene, developers can manage both report indexing and external document indexing, which ultimately saves end users time when they search for specific keywords such as "product code" or "customer code." Vanilla can build upon an existing Solr/Lucene installation that takes care of all the indexing processes while Vanilla takes care of the reporting/dashboard creation. During this presentation, attendees will learn how we moved from an embedded Lucene API to a Solr/Lucene platform, and about the technical and business benefits of this architecture in terms of clustering, caching and access modes.
How The Guardian Embraced the Internet using Content, Search, and Open Source
This talk covers how The Guardian opened up its business, enriched it, and reached new markets with its Open Platform strategy. Stephen will cover the technical architecture, the implementation of Solr (the key technology powering the platform), and how The Guardian has used it to embrace disruption in the media space while finding new sources of revenue and innovation.
In the mid-1990s, the high-energy physics community (think FermiLab and CERN) started planning for the Large Hadron Collider. Managing the petabytes of data that would be generated by the facility and sharing it with the globally distributed community of over 10,000 researchers would be a major infrastructure and technology problem. This same community that brought us the web has now developed standards, software, and infrastructure for grid computing. In this seminar I'll present some of the exciting science that is being done on the Open Science Grid, the US national cyberinfrastructure linking 60 institutions (Harvard included) into a massive distributed computing and data processing system.
Introduction, overview, and caveats for Part 3 of our New Energy curriculum. In Part 3, we look at the key sciences which make energy from the quantum vacuum possible and how we can apply these principles in a myriad of specific technologies that will revolutionize life in the twenty-first century.
The Royal Society of Chemistry has an archive of published journals and books stretching back to 1841. In the past decade we have digitized this archive and semantically enriched our frontfile data with chemical structures linked to our free online chemical compound database, ChemSpider. In this talk we will survey our recent efforts to extract all kinds of data – chemical structures, experimental and bibliographic data – from both our backfile and frontfile. We will also discuss our future work to extract chemical reactions to host in our ChemSpider Reactions database and will discuss the potential applications of optical structure recognition technologies for converting structure images to structures as well as using similar techniques to convert experimental spectral data into interactive data formats. A key aspect of this project is the delivery of a crowdsourcing platform for the interactive annotation and validation of the extracted data.
Jean-Claude Bradley presents at the Science Commons Symposium on Feb 20, 2010 at the Microsoft Campus in Redmond. The talk covers doing Open Notebook Science using free and hosted tools, including new archiving protocols developed with Andrew Lang.
Text Classification Powered by Apache Mahout and Lucene
Presented by Isabel Drost-Fromm, Software Developer, Apache Software Foundation/Nokia Gate 5 GmbH at Lucene/Solr Revolution 2013 Dublin
Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents into feature vectors. Though this step is highly domain specific, Apache Mahout provides a lot of easy-to-use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation and filtering. This session shows how to use faceting to quickly get an understanding of the fields in your documents. It will walk you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use, including a few anecdotes on drafting domain-specific features.
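The first step the abstract describes, turning text into feature vectors over a fixed vocabulary, can be sketched in plain Python. The tokenizer and term-frequency vectorizer below are illustrative stand-ins for Lucene's analysis chain and Mahout's vectorizers, not their actual APIs:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-letters, loosely mimicking an analyzer chain
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def to_feature_vector(text, vocabulary):
    # Map a document to raw term frequencies over a fixed, ordered vocabulary
    counts = Counter(tokenize(text))
    return [counts.get(term, 0) for term in vocabulary]

docs = ["Solr powers search", "Mahout powers classification"]
vocabulary = sorted({t for d in docs for t in tokenize(d)})
vectors = [to_feature_vector(d, vocabulary) for d in docs]
```

Real pipelines would add TF-IDF weighting, n-grams and stop-word filtering, but the shape of the output, one numeric vector per document, is the same.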
Presented by Markus Klose, Search + Big Data Consultant SHI Elektronische Medien GmbH at Lucene/Solr Revolution 2013 Dublin
Kibana4Solr is search-driven, scalable, browser-based and extremely user-friendly, also for non-technical users. Logs are everywhere. Any device, system or human can potentially produce a huge amount of information saved in logs. The amount of available logs and their semi-structured nature make meaningful real-time processing quite a difficult task. Thus, valuable business insights stored in logs might never be found. Kibana4Solr is a search-driven approach to handling that challenge. It offers a user-friendly, browser-based dashboard which can be easily customized to particular needs. In this session Kibana4Solr will be introduced. Some light will be shed on its architectural features, some ideas will be given in terms of possible business use cases, and finally a live demo of Kibana4Solr will be shown.
Building Client-side Search Applications with Solr
Presented by Daniel Beach, Search Application Developer, OpenSource Connections
Solr is a powerful search engine, but creating a custom user interface can be daunting. In this fast-paced session I will present an overview of how to implement a client-side search application using Solr. Using open-source frameworks like SpyGlass (to be released in September) can be a powerful way to jumpstart your development by giving you out-of-the-box results views with support for faceting, autocomplete, and detail views. During this talk I will also demonstrate how we have built and deployed lightweight applications that are able to be performant under large user loads, with minimal server resources.
Integrate Solr with real-time stream processing applications
Presented by Timothy Potter, Founder, Text Centrix
Storm is a real-time distributed computation system used to process massive streams of data. Many organizations are turning to technologies like Storm to complement batch-oriented big data technologies, such as Hadoop, to deliver time-sensitive analytics at scale. This talk introduces an emerging architectural pattern of integrating Solr and Storm to process big data in real time. There are a number of natural integration points between Solr and Storm, such as populating a Solr index or supplying data to Storm using Solr’s real-time get support. In this session, Timothy will cover the basic concepts of Storm, such as spouts and bolts. He’ll then provide examples of how to integrate Solr into Storm to perform large-scale indexing in near real time. In addition, we'll see how to embed Solr in a Storm bolt to match incoming tuples against pre-configured queries, commonly known as a percolator. Attendees will come away from this presentation with a good introduction to stream processing technologies and several real-world use cases of how to integrate Solr with Storm.
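The percolator idea, matching each incoming tuple against a set of pre-registered queries rather than the other way around, can be sketched without Storm or Solr at all. Here the "queries" are just hypothetical sets of required terms, standing in for real parsed Lucene queries:

```python
# Minimal percolator sketch: stored "queries" are sets of required terms,
# and each incoming tuple's text is matched against every registered query.
stored_queries = {
    "solr-news": {"solr"},
    "storm-alerts": {"storm", "topology"},
}

def percolate(text):
    terms = set(text.lower().split())
    # A query fires when all of its required terms appear in the tuple
    return sorted(name for name, required in stored_queries.items()
                  if required <= terms)

matches = percolate("Deploying a Storm topology that indexes into Solr")
```

In the architecture the talk describes, this matching step would live inside a Storm bolt, with an embedded in-memory index replacing the set intersection.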
Configure your Solr cluster to handle hundreds of millions of documents without even noticing, handle queries in milliseconds, and use Near Real Time indexing and searching with document versioning. Scale your cluster both horizontally and vertically by using shards and replicas. In this session you'll learn how to make your indexing process blazing fast and your queries efficient even with large amounts of data in your collections. You'll also see how to optimize your queries to leverage caches as much as your deployment allows and how to observe your cluster with the Solr administration panel, JMX, and third-party tools. Finally, learn how to make changes to already deployed collections: split their shards and alter their schema by using the Solr API.
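The horizontal-scaling idea rests on routing each document to a shard by a stable hash of its id. The sketch below illustrates the principle only; real Solr uses MurmurHash ranges over the id (compositeId routing), not CRC32:

```python
import zlib

def shard_for(doc_id, num_shards):
    # A stable hash guarantees the same id always routes to the same shard,
    # which is what makes lookups and updates by id possible.
    return zlib.crc32(doc_id.encode()) % num_shards

assignments = {d: shard_for(d, 4) for d in ("doc1", "doc2", "doc3")}
# Routing is deterministic: recomputing gives identical shard assignments
consistent = all(shard_for(d, 4) == s for d, s in assignments.items())
```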
Presented by Rafal Kuć, Consultant and Software Engineer, Sematext Group, Inc.
Even though Solr can run without causing any trouble for long periods of time, it is very important to monitor and understand what is happening in your cluster. In this session you will learn how to use various tools to monitor how Solr is behaving at a high level, but also at the Lucene, JVM, and operating system levels. You'll see how to react to what you see and how to make changes to configuration, index structure and shard layout using the Solr API. We will also discuss different performance metrics to which you ought to pay extra attention. Finally, you'll learn what to do when things go awry: we will share a few examples of troubleshooting and then dissect what was wrong and what had to be done to make things work again.
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
In a recent project with the United States Patent and Trademark Office, OpenSource Connections was asked to prototype the next generation of patent search using Solr and Lucene. An important aspect of this project was the implementation of BRS, a specialized search syntax used by patent examiners during the examination process. In this fast-paced session we will relate our experiences and describe how we used a combination of Parboiled (a Parsing Expression Grammar [PEG] parser), Lucene Queries and SpanQueries, and an extension of Solr's QParserPlugin to build BRS search functionality in Solr. First we will characterize the patent search problem and then define the BRS syntax itself. We will then introduce the Parboiled parser and discuss various considerations that one must make when designing a syntax parser. Following this we will describe the methodology used to implement the search functionality in Lucene/Solr. Finally, we will include an overview of our syntactic and semantic testing strategies. The audience will leave this session with an understanding of how Solr, Lucene, and Parboiled may be used to implement their own custom search parser.
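To make the "syntax parser" step concrete, here is a hand-rolled recursive-descent parser for a hypothetical mini-grammar of terms joined by AND/OR. Real BRS is far richer (proximity, wildcards, field restriction), and Parboiled would generate the parser from grammar rules rather than loop by hand; this only shows the query-string-to-AST shape:

```python
import re

# Hypothetical mini-grammar, much simpler than real BRS:
#   expr := term (("AND" | "OR") term)*
# parsed left-associatively into nested ("OP", left, right) tuples.
def parse(query):
    tokens = re.findall(r"\w+", query)
    if not tokens:
        raise ValueError("empty query")
    node = ("term", tokens[0])
    i = 1
    while i < len(tokens):
        op = tokens[i].upper()
        if op not in ("AND", "OR") or i + 1 >= len(tokens):
            raise ValueError("expected AND/OR followed by a term")
        node = (op, node, ("term", tokens[i + 1]))
        i += 2
    return node

ast = parse("engine AND turbine OR rotor")
```

An AST like this is what a QParserPlugin would then walk to emit Lucene Query and SpanQuery objects.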
Many of us tend to hate or simply ignore logs, and rightfully so: they’re typically hard to find, difficult to handle, and cryptic to the human eye. But can we make logs more valuable and more usable if we index them in Solr, so we can search and run real-time statistics on them? Indeed we can, and in this session you’ll learn how to make that happen. In the first part of the session we’ll explain why centralized logging is important and what valuable information one can extract from logs, and we’ll introduce the leading tools from the logging ecosystem everyone should be aware of, from syslog and log4j to LogStash and Flume. In the second part we’ll teach you how to use these tools in tandem with Solr. We’ll show how to use Solr in a SolrCloud setup to index large volumes of logs continuously and efficiently. Then, we'll look at how to scale the Solr cluster as your data volume grows. Finally, we'll see how you can parse your unstructured logs and convert them to nicely structured Solr documents suitable for analytical queries.
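The final step, turning an unstructured log line into a structured document, amounts to pattern extraction. The line format and field names below are hypothetical; in practice LogStash or Flume grok patterns do this job:

```python
import re

# Hypothetical syslog-like line format: "<timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)

def to_structured_doc(line):
    m = LOG_PATTERN.match(line)
    if m is None:
        # Fall back to an unparsed document rather than dropping the line
        return {"message": line}
    return m.groupdict()

doc = to_structured_doc("2024-06-06 10:15:00 ERROR Disk quota exceeded")
```

Once every line becomes a field-per-value document like this, facet and range queries over level and timestamp come for free.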
Real-time Inverted Search in the Cloud Using Lucene and Storm
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr's full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
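The scaling trick the abstract describes, partitioning tens of thousands of subscription queries across workers so each document is matched against every slice in parallel, can be sketched as follows. Plain term-sets stand in for the per-worker in-memory Lucene indices, and the worker count and query names are made up:

```python
# Partition subscription queries across workers by hash of the query name,
# so each worker only evaluates its own slice against incoming documents.
def partition(queries, n_workers):
    slices = [dict() for _ in range(n_workers)]
    for name, required_terms in queries.items():
        slices[hash(name) % n_workers][name] = required_terms
    return slices

def match_slice(slice_queries, doc_terms):
    # A subscription fires when all its required terms appear in the document
    return {name for name, req in slice_queries.items() if req <= doc_terms}

queries = {f"q{i}": {f"term{i}"} for i in range(100)}
slices = partition(queries, 4)
doc_terms = {"term7", "term42"}
# In Storm this union would happen downstream of the parallel bolts
hits = set().union(*(match_slice(s, doc_terms) for s in slices))
```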
Solr's Admin UI - Where does the data come from?
Like many web applications in the past, the Solr Admin UI up until 4.0 was entirely server based. It used separate code on the server to generate its dashboards, overviews and statistics. All that code had to be maintained, and still you weren't really able to use that kind of data for the things you needed it for. It was wrapped in HTML, most of the time difficult to extract, and its structure changed from time to time without announcement. After a short look back, we're going to look into the current state of the Solr Admin UI: a client-side application, running completely in your browser. We'll see how it works, where it gets its data from, and how you can get the very same data and wire it into your own custom applications, dashboards and/or monitoring systems.
Steve will show how and why to use Solr’s new Schemaless Mode, under which document indexing can be performed with no up-front schema configuration. Solr uses content clues to choose among a predefined set of field types and then automatically adds previously unseen fields to the schema.
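The "content clues" idea can be sketched as trying increasingly general parses until one succeeds. The type names below only loosely follow Solr's field type naming, and real schemaless mode applies a configurable chain of update processors rather than this toy cascade:

```python
from datetime import datetime

def guess_field_type(value):
    # Try the most specific types first, falling back to plain text
    for parse, type_name in ((int, "plong"), (float, "pdouble")):
        try:
            parse(value)
            return type_name
        except ValueError:
            pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "pdate"
    except ValueError:
        return "text_general"

guessed = [guess_field_type(v) for v in ["42", "3.14", "2013-11-04", "Dublin"]]
```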
High Performance JSON Search and Relational Faceted Browsing with Lucene
Presented by Renaud Delbru, Co-Founder, SindiceTech
In this presentation, we will discuss how Lucene and Solr can be used for very efficient search of tree-shaped schemaless documents, e.g. JSON or XML, and can then be made to address both graph and relational data search. We will discuss the capabilities of SIREn, a Lucene/Solr plugin we have developed to deal with huge collections of tree-shaped schemaless documents, and how SIREn is built using Lucene extensibility capabilities (Analysis, Codec, Flexible Query Parser). We will compare it with Lucene's BlockJoin Query API in nested, schemaless, data-intensive scenarios. We will then go through use cases that show how relational or graph data can be turned into JSON documents using Hadoop and Pig, and how this can be used in conjunction with SIREn to create relational faceting systems with unprecedented performance. Take-away lessons from this session will be an awareness of using Lucene/Solr and Hadoop for relational and graph data search, as well as the awareness that it is now possible to have relational faceted browsers with sub-second response time on commodity hardware.
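One common way to make a tree-shaped schemaless document searchable is to flatten it into (path, value) pairs; this sketch shows that representation only, not SIREn's actual encoding:

```python
# Flatten a JSON-like tree into (dotted-path, leaf-value) pairs,
# the kind of representation a tree-search index can consume directly.
def flatten(node, path=""):
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for item in node:
            yield from flatten(item, path)
    else:
        yield (path, node)

doc = {"name": "SIREn", "tags": ["lucene", "solr"], "meta": {"year": 2013}}
pairs = sorted(flatten(doc))
```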
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
In this session we will show how to build a text classifier using Apache Lucene/Solr with the libSVM library. We classify our corpus of job offers into a number of predefined categories. Each indexed document (a job offer) then belongs to zero, one or more categories. Known machine learning techniques for text classification include the naïve Bayes model, logistic regression, neural networks, support vector machines (SVM), etc. We use Lucene/Solr to construct the feature vectors. Then we use the libSVM library, known as the reference implementation of the SVM model, to classify the documents. We construct as many one-vs-all SVM classifiers as there are classes in our setting, then using the Hadoop MapReduce framework we reconcile the results of our classifiers. The end result is a scalable multi-class classifier. Finally, we outline how the classifier is used to enrich basic Solr keyword search.
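The one-vs-all reconciliation step can be sketched independently of libSVM: each binary classifier scores the document, and the document receives every label whose score clears a threshold, which is how one document ends up with zero, one, or several categories. The linear scorers and weights below are toy stand-ins for trained SVMs:

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def ova_classify(weights_per_label, features, threshold=0.0):
    # Reconcile the binary classifiers: keep every label whose
    # one-vs-all score exceeds the threshold.
    return sorted(label for label, w in weights_per_label.items()
                  if dot(w, features) > threshold)

weights = {"engineering": [1.0, -0.5], "sales": [-1.0, 1.0]}
labels = ova_classify(weights, [2.0, 1.0])
```

In the talk's setting the per-label scoring runs as parallel MapReduce tasks, with this reconciliation in the reduce step.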
Faceted search is a powerful technique to let users easily navigate the search results. It can also be used to develop rich user interfaces, which give an analyst quick insights about the documents space. In this session I will introduce the Facets module, how to use it, under-the-hood details as well as optimizations and best practices. I will also describe advanced faceted search capabilities with Lucene Facets.
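At its core, a facet over a field is just a count of its distinct values across the matching documents. This toy sketch shows that idea; Lucene's Facets module adds taxonomy indexes, hierarchies and efficient per-segment counting on top:

```python
from collections import Counter

def facet_counts(docs, field):
    # Count how often each distinct value of `field` occurs in the result set
    return Counter(doc[field] for doc in docs if field in doc)

matching_docs = [
    {"title": "A", "category": "books"},
    {"title": "B", "category": "books"},
    {"title": "C", "category": "music"},
]
counts = facet_counts(matching_docs, "category")
```

A faceted UI then renders these counts ("books (2), music (1)") as drill-down links that narrow the query.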
Presented by Shai Erera, Researcher, IBM
Lucene's arsenal has recently expanded to include two new modules: Index Sorting and Replication. Index sorting lets you keep an index consistently sorted based on some criterion (e.g. modification date). This allows for efficient search early-termination as well as better index compression. Index replication lets you replicate a search index to achieve high availability and fault tolerance, as well as take hot index backups. In this talk we will introduce these modules and discuss implementation and design details as well as best practices.
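Why does a sorted index enable early termination? If documents are stored newest-first and the query wants the k newest matches, collection can stop as soon as k hits are found, without scanning the rest. A minimal sketch of that reasoning (the document shape and match predicate are made up):

```python
def top_k_newest(sorted_docs, matches, k):
    # `sorted_docs` is assumed already sorted newest-first, so the first
    # k matches are guaranteed to be the k newest; stop there.
    hits, scanned = [], 0
    for doc in sorted_docs:
        scanned += 1
        if matches(doc):
            hits.append(doc)
            if len(hits) == k:
                break   # early termination: the remaining docs can't win
    return hits, scanned

docs = [{"id": i} for i in range(100)]          # newest first by construction
hits, scanned = top_k_newest(docs, lambda d: d["id"] % 2 == 0, 3)
```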
As part of their work with large media monitoring companies, Flax has developed a technique for applying tens of thousands of stored Lucene queries to a document in under a second. We'll talk about how we built intelligent filters to reduce the number of actual queries applied and how we extended Lucene to extract the exact hit positions of matches, the challenges of implementation, and how it can be used, including applications that monitor hundreds of thousands of news stories every day.
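One way such an "intelligent filter" can work (a plausible sketch, not necessarily Flax's actual implementation) is to index each stored query under one of its required terms, so a document only triggers full evaluation of the few queries whose gating term it actually contains:

```python
from collections import defaultdict

def build_prefilter(queries):
    # Gate each stored query on one required term, chosen deterministically
    by_term = defaultdict(list)
    for name, required_terms in queries.items():
        by_term[sorted(required_terms)[0]].append(name)
    return by_term

def candidate_queries(by_term, doc_terms):
    # Only queries whose gating term appears in the document survive the filter
    return sorted(name for term in doc_terms for name in by_term.get(term, []))

queries = {"oil": {"brent", "crude"}, "tech": {"solr"}, "sport": {"football"}}
by_term = build_prefilter(queries)
candidates = candidate_queries(by_term, {"brent", "prices", "rise"})
```

The surviving candidates still need full evaluation (the gate is necessary but not sufficient), which is where position extraction against the document would run.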
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecker
Presented by Xavier Sanchez Loro, Ph.D, Trovit Search SL
This session explains the implementation and use case for spellchecking in the Trovit search engine. Trovit is a classified-ads search engine supporting several different sites, one for each country and vertical. Our search engine supports multiple indexes in multiple languages, each with several million indexed ads. Those indexes are segmented into several different sites depending on the type of ads (homes, cars, rentals, products, jobs and deals). We have developed a multi-language spellchecking system using Solr and Lucene in order to help our users better find the desired ads and avoid the dreaded 0 results as much as possible. As such, our goal is not pure orthographic correction, but also suggestion of correct searches for a given site.
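The core of such a spellchecker is edit distance against the terms that actually occur in the site's index, which is what keeps suggestions from leading to 0 results. A minimal sketch, with a made-up index vocabulary and distance cutoff:

```python
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(term, index_terms, max_distance=2):
    # Suggest only terms present in the index, so the corrected
    # search is guaranteed to return results.
    best = min(index_terms, key=lambda t: edit_distance(term, t))
    return best if edit_distance(term, best) <= max_distance else None

suggestion = suggest("apartmant", ["apartment", "car", "house"])
```

A contextual, per-site checker would draw `index_terms` from each site's own index, so "apartmant" suggests "apartment" on a homes site but something else on a cars site.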
Climate Impact of Software Testing at Nordic Testing Days, by Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help mitigate climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, which can then be measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
A tale of scale & speed: How the US Navy is enabling software delivery from l...
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
UiPath Test Automation using UiPath Test Suite series, part 5
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Epistemic Interaction - tuning interfaces to provide information for AI support, by Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Communications Mining Series - Zero to Hero - Session 1
CPython embedded in Solr - By Roman Chyla
1. MontySolr:
Embedding CPython in Solr
Roman Chyla, CERN
roman.chyla@cern.ch, May 26, 2011
Thursday, May 26, 2011
2. Why should I care?
- Our challenge is to connect Python and Java
- Without compromises
- We created MontySolr extension
- Robust, tested (will be used by our system)
- But works for any Python application (e.g. Django)
- And for any C/C++ app that Python understands!
- Open source (GPL v2)
- Try it out!
- https://github.com/romanchyla/montysolr
3. Outline
‣ Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
- Wrap-up
4. CERN
- European Organization for Nuclear Research
- Switzerland, Geneva
- The largest laboratory for High Energy Physics
- Home to the Large Hadron Collider
- 40-50K HEP scientists worldwide
12. SPIRES
- Stanford Linear Accelerator Center - SLAC
- High-Energy Physics Literature Database
- Started December 1991
- The first web outside Europe/CERN
- The first database on web
16. Invenio
- Integrated digital library software behind INSPIRE
- Used by very large institutional repositories
- http://repositories.webometrics.info/toprep_inst.asp
- Customizable virtual collections
- Flexible management of metadata
- 3 000 authors per article
- Powerful search engine
- Incl. citation map analysis
- Written in Python (since 2001)
- 290 000 lines of code
17. Outline
- Context
‣ The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
- Wrap-up
18. The Challenge
- HEP scientific community
- Searches are metadata-oriented
- However, full texts are changing the situation
- And we want to provide even better service
- Bigger volumes of data
- NLP processing
- Semantic search
20. The Challenge
Query: supersymmetry AND author:ellis
[Diagram: Invenio sends fulltext:supersymmetry to Solr; Solr returns 1-6M IDs back as IDs: 1;2;3;9...]
1. only IDs, no score = no ranking
2. score merging difficult (if no score available)
3. push IDs? (e.g. faceting)
30. What is the “best” solution?
- We love Python...
- ...and our applications are written in Python...
- But what if Solr is the master search engine?
- Merge results inside Solr?
- Typical size: 1-10 mil. IDs
- Expected latency: 1-2 s.
- What we want to achieve:
- Fast transfer of hits from Invenio to Solr
- Leverage the power of both (no compromises)
- Developer-friendly integration, simplicity
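The merge step behind these goals can be sketched in plain Python (illustrative names only, not MontySolr's API): Solr holds scored fulltext hits, Invenio returns a plain set of record IDs, and the final result keeps Solr's scores for the intersection.

```python
# Hypothetical sketch of merging Invenio hits with scored Solr results.
# solr_hits: list of (doc_id, score) pairs; invenio_ids: iterable of ints.
def merge_hits(solr_hits, invenio_ids):
    allowed = set(invenio_ids)  # O(1) membership tests, even for millions of IDs
    return [(doc, score) for doc, score in solr_hits if doc in allowed]

merged = merge_hits([(1, 2.5), (2, 1.1), (9, 0.7)], [2, 3, 9])
# merged == [(2, 1.1), (9, 0.7)]
```

The set-based filter keeps Solr's ranking intact while restricting results to what Invenio allows, which is the "no compromises" behaviour the slide asks for.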
31. Outline
- Context
- The Challenge
‣ Key components
- Available technologies
- Our approach
- Evaluation
- Demonstration
- Wrap-up
32. To embed Solr (in Java app)
- Your app simulates Java web container?
- use EmbeddedSolrServer
- It knows nothing about Java servlets?
- use DirectConnect class
- Maybe we are too lazy?
- Embed the web container (in my case Jetty)
- Seemed strange (webserver inside webserver)
- ... but it worked well
37. To use Solr in non-Java app
- Solr is already usable via HTTP requests, but we need something else here...
- Remote objects/calls?
- Pyro, execnet, CORBA, SOAP...
- or simply pipes?
- Access Python from Java?
- Jython
- JEPP
- Access Java from Python?
- JPype
- JCC
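For reference, the HTTP baseline the slide mentions looks like this from Python, using only the standard library; the base URL and core name are placeholders for a local Solr instance.

```python
# Querying Solr's select handler over plain HTTP (stdlib only).
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def solr_select(base_url, query, rows=10):
    # build e.g. http://host:8983/solr/core/select?q=...&rows=...&wt=json
    params = urlencode({'q': query, 'rows': rows, 'wt': 'json'})
    with urlopen('%s/select?%s' % (base_url, params)) as resp:
        return json.load(resp)

# e.g. solr_select('http://localhost:8983/solr/invenio', 'supersymmetry')
```

This is fine for loose coupling, but as the slide says, shipping millions of IDs per query over HTTP is exactly what MontySolr tries to avoid.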
38. Jython?
- Implementation of Python in 100% Java
- Both Java and Python code
- Truly multithreaded
- C modules will not work
- but see http://bit.ly/iTRYbb
- Slower than CPython
41. JEPP - Java Embedded Python
- Python code runs inside the Python interpreter
- Embeds the CPython interpreter in Java via the Java Native Interface (JNI)
- http://jepp.sourceforge.net/
- recently updated (27-Jan)
- but JCC is more active
43. JCC
- Embeds a JVM in Python
- C++ code generator
- C++ object interface wraps a Java library
- C++ wrappers conform to Python's C type system
- result: a complete Python extension module
47. To use Solr in non-Java app
                     Jython   JCC   JEPP
Python C modules       -       ✓      ✓
Speed                  -       ✓      ?
No code changes        ✓       -      ✓
Access from Python     ✓       ✓      -
Access from Java       ✓      ...     ✓
48. The first try
[Diagram: Invenio embedding Solr via JCC]
49. The devil is in the details...
50. GIL - Global Interpreter Lock
Unfortunately, a Python web app is not like a Java one...
51. GIL - Global Interpreter Lock
We can have 200 threads, but only 4 will run at a time...
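The point is easy to reproduce with a small pure-Python sketch: CPU-bound work gains nothing from extra threads, because the GIL lets only one thread execute Python bytecode at a time.

```python
import threading
import time

def burn(n):
    # CPU-bound loop; holds the GIL the whole time
    while n:
        n -= 1

N = 1_000_000

start = time.time()
burn(N)
burn(N)
serial = time.time() - start

start = time.time()
threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start
# threaded is roughly the same as serial: the two threads cannot run in parallel
```

This is why, as a later slide notes, MontySolr uses threading on the Java side but multiprocessing on the Python side.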
52. GIL - Global Interpreter Lock
53. Fortunately solution exists
- JCC can embed Python inside Java
- Special thanks to Andi Vajda! (JCC creator)
- We write ‘empty’ classes in Java ...
- ... and implement them in Python
Python with Java inside vs. Java with Python inside
54. The second try
[Diagram: Invenio frontend exchanging XML with Solr, which embeds the Invenio backend via JCC]
55. Implementing the bridge
- A special Java class
- With method pythonExtension()
- Native method pythonDecRef()
- JCC provides its implementation
- And a number of other native methods
- These will be implemented using Python
- Like writing JNI Java/C code, but without compilation...
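The bridge pattern can be mimicked in pure Python (illustrative names, not the real MontySolr classes): the base class plays the role of the Java side with its "native" method, and a subclass supplies the Python implementation.

```python
class BasicBridge:
    """Stands in for the Java class whose native methods JCC wires up."""
    def send_message(self, message):
        # in MontySolr this call crosses JNI and acquires the Python
        # thread state before dispatching to the Python side
        return self.receive_message(message)

    def receive_message(self, message):
        raise NotImplementedError('implemented on the Python side')

class EchoBridge(BasicBridge):
    def receive_message(self, message):
        return 'Python received: ' + message

reply = EchoBridge().send_message('hello')
# reply == 'Python received: hello'
```

The real Java and Python halves of this pattern follow on the "Hello World" slides.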
56. MontySolr extension
- JCC has great potential, but also added complexity...
- So the MontySolr project was born
- Modules must be built in shared mode
- JCC dynamic library loaded and started from the main thread
- Simple mechanism of the Python bridge and message
- Configurable handlers on the Python side
- Secured dereferencing of the native objects
- Threading on the Java side
- Multiprocessing on the Python side
- Easy ant targets (compilation) ...
57. Hello World - Java part
public class MontySolrBridge extends BasicBridge implements PythonBridge {
    private long pythonObject;

    public void pythonExtension(long pythonObject) {
        this.pythonObject = pythonObject;
    }

    public long pythonExtension() {
        return this.pythonObject;
    }

    public void finalize() throws Throwable {
        pythonDecRef();
    }

    public native void pythonDecRef();

    public void sendMessage(PythonMessage message) {
        PythonVM vm = PythonVM.get();
        vm.acquireThreadState();
        receive_message(message);
        vm.releaseThreadState();
    }

    public native void receive_message(PythonMessage message);
}
58. Hello World - Python part
from montysolr import MontySolrBridge

class SimpleBridge(MontySolrBridge):
    def __init__(self):
        super(SimpleBridge, self).__init__()

    def receive_message(self, message):
        query = message.getParam('query')
        message.setResults('Hello world!')
        print 'Python received from Java:', query
59. Example - running MontySolr
- Java side
- JRE (32/64 bit)
- Standard Solr/Lucene jars
- JCC dynamic library
- Python side
- Python interpreter (32/64 bit)
- 4 Python modules (jcc, solr, lucene, montysolr)
- In the main thread
- First we load JCC
- Then start Python interpreter ...
- ... load Python handlers
60. Solr as search service
[Diagram: Invenio frontend exchanging XML with Solr, which embeds the Invenio backend via JCC]
61. Example
[Diagram: Solr with MyCustomHandler]
62. Example
[Diagram: query refersto:author:ellis sent to Solr's MyCustomHandler]
63. Example - Solr custom handler
PythonMessage msg = MontySolrVM.INSTANCE
    .createMessage("perform_search")
    .setSender("Invenio")
    .setParam("query", "refersto:author:ellis");
MontySolrVM.INSTANCE.sendMessage(msg);
Object result = msg.getResults();
if (result != null) {
    int[] hits = (int[]) msg.getResults();
}
64. Example - JNI connection
refersto:author:ellis
[Diagram: Solr's MyCustomHandler calling the Python Bridge over JNI, which in turn calls the Invenio wrappers]
66. Example - Python side
# handler is made 'visible' at startup
SolrpieTarget('Invenio:perform_search', perform_search)

# search time - called from Java
def perform_search(message):
    query = message.getParam("query")
    hits = call_real_search(query)
    # cast Python list into Java array
    message.setResults(JArray_ints(hits))
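The SolrpieTarget registration above amounts to a name-to-handler registry. A minimal pure-Python stand-in (hypothetical names, not MontySolr's classes) might look like:

```python
class MessageBus:
    """Maps 'Sender:message_name' keys to Python handler functions."""
    def __init__(self):
        self._targets = {}

    def register(self, key, func):
        self._targets[key] = func

    def dispatch(self, key, message):
        # in MontySolr, the Java side triggers this with the incoming message
        return self._targets[key](message)

bus = MessageBus()
bus.register('Invenio:perform_search', lambda msg: sorted(msg['ids']))
hits = bus.dispatch('Invenio:perform_search', {'ids': [9, 1, 3, 2]})
# hits == [1, 2, 3, 9]
```

Keeping the dispatch table on the Python side is what makes the handlers "configurable" without touching the Java code.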
67. Example
refersto:author:ellis
[Diagram: Solr's MyCustomHandler dispatching through the Python Bridge and Invenio wrappers to multiple Invenio processes]
68. Example - Java side again
PythonMessage msg = MontySolrVM.INSTANCE
    .createMessage("perform_search")
    .setSender("Invenio")
    .setParam("query", "refersto:author:ellis");
MontySolrVM.INSTANCE.sendMessage(msg);
Object result = msg.getResults();
if (result != null) {
    int[] hits = (int[]) msg.getResults();
}
69. Solr as search service
[Diagram: Apache webserver frontend exchanging XML with Solr, which embeds the Invenio backend and multiple Invenio processes via JCC]
70. Outline
- Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
‣ Evaluation
- Wrap-up
74. Robust?
- Extensive siege tests show very good performance and stability under high load
- 100-200 users, complex searches
- 50 concurrent users, citation analysis
- JCC incurs small overhead
- We detected no memory leaks
- The same as dbpedia.org
- But watch out for errors in C
- An error in a C module brings down the whole JVM
- (errors in a pure Python module can be handled)
75. Easy to develop/maintain?
- Added complexity
- Java in the toolbox
- Need to compile C++ extensions
- Python/OS version dependencies
- For this we get
- Easy integration with Invenio
- The best of two applications
- A lot of features for free
- And we can control Solr from Python!
76. Outline
- Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
‣ Wrap-up
77. Wrap-up
- Our challenge was to connect two different languages/systems
- And we wanted to get the best of the two...
- So we had to plug Python into Solr
- And now our Solr knows citation analysis!
- We created the MontySolr extension
- Robust, tested (will be used by INSPIRE)
- Works for any Python application (e.g. Django)
- And for any C/C++ app that Python understands!
- Free software license
- Try it out! Help us make it better!
- https://github.com/romanchyla/montysolr
78. Questions?
- MontySolr
- https://github.com/romanchyla/montysolr
- Roman Chyla
- Fellow, CERN Scientific Information Service
- roman.chyla@cern.ch
- @rchyla
- https://svnweb.cern.ch/trac/rcarepo
80. Links
- Invenio platform
- http://invenio-software.org/
- INSPIRE Digital library
- http://inspirebeta.net/
- Diagrams of JCC and JEPP
- Andreas Schreiber: Mixing Java and Python
- http://www.slideshare.net/onyame/mixing-python-and-java
- On the Jython C Extension API
- http://stackoverflow.com/questions/3097466/using-numpy-and-cpython-with-jython
- Demo of a running service:
- http://insdev01.cern.ch
81. #1 - How to embed Solr (standard)
- solr.client.solrj.embedded.EmbeddedSolrServer
82. #2 - How to embed Solr (simplified)
- solr.servlet.DirectSolrConnection
- like the previous, but simpler
- all the queries are sent as strings; everything is just a string
- very flexible and probably suitable for quick integration
84. #3 - Example of a Solr custom handler
85. #4 - Example Python handler