MontySolr: Embedding CPython in Solr
Roman Chyla, CERN
roman.chyla@cern.ch, May 26, 2011

Why should I care?
- Our challenge is to connect Python and Java
- Without compromises
- We created the MontySolr extension
- Robust, tested (will be used by our system)
- But works for any Python application (e.g. Django)
- And for any C/C++ app that Python understands!
- Open source (GPL v2)
- Try it out!
- https://github.com/romanchyla/montysolr

Outline

‣ Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
- Wrap-up

CERN
- European Organization for Nuclear Research
- Geneva, Switzerland
- The largest laboratory for High Energy Physics
- Home to the Large Hadron Collider
- 40-50K HEP scientists worldwide

SPIRES
- Stanford Linear Accelerator Center - SLAC
- High-Energy Physics Literature Database
- Started December 1991
- The first web server outside Europe/CERN
- The first database on the web

Invenio
- Integrated digital library software behind INSPIRE
- Used by very large institutional repositories
- http://repositories.webometrics.info/toprep_inst.asp
- Customizable virtual collections
- Flexible management of metadata
- 3 000 authors per article
- Powerful search engine
- Incl. citation map analysis
- Written in Python (since 2001)
- 290 000 lines of code

Outline

- Context
‣ The Challenge
- Key components
- Our approach
- Problems solved
- Evaluation
- Wrap-up

The Challenge
- HEP scientific community
- Searches are metadata-oriented
- However, fulltexts are changing the situation
- And we want to provide even better service
- Bigger volumes of data
- NLP processing
- Semantic search

The Challenge

[Diagram: the query "supersymmetry AND author:ellis" reaches Invenio; Invenio passes on "fulltext:supersymmetry" and gets back a list of IDs: 1;2;3;9.... (1-6M IDs)]

1. only IDs, no score = no ranking
2. score merging difficult (if available)
3. push IDs? (e.g. faceting)

What is the “best” solution?
- We love Python...
- ...and our applications are written in Python...

- But what if Solr is the master search engine?
- Merge results inside Solr?
- Typical size: 1-10 million IDs
- Expected latency: 1-2 s.
- What we want to achieve:
- Fast transfer of hits from Invenio to Solr
- Leverage the power of both (no compromises)
- Developer-friendly integration, simplicity
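
To make the merging problem concrete, here is a minimal sketch in plain Python. It is not MontySolr or Solr code: it assumes the fulltext engine returns a mapping of record ID to score while Invenio returns a bare list of IDs, and the function name merge_hits and the sample data are hypothetical.

```python
def merge_hits(invenio_ids, scored_hits, limit=10):
    """Intersect unscored Invenio hits with scored fulltext hits.

    invenio_ids: iterable of record IDs (no scores available)
    scored_hits: dict mapping record ID -> relevance score
    Returns (id, score) pairs for records present in both, best first.
    """
    allowed = set(invenio_ids)
    merged = [(rid, score) for rid, score in scored_hits.items()
              if rid in allowed]
    # Sort by descending score; break ties by ID so the order is stable.
    merged.sort(key=lambda pair: (-pair[1], pair[0]))
    return merged[:limit]


# Hypothetical sample data, for illustration only.
invenio_ids = [1, 2, 3, 9]
scored_hits = {2: 0.7, 3: 1.4, 5: 2.0, 9: 0.7}
print(merge_hits(invenio_ids, scored_hits))
# prints [(3, 1.4), (2, 0.7), (9, 0.7)]
```

With 1-10 million IDs the set lookup itself stays cheap; the expensive part is moving the IDs between the two systems, which is why fast transfer of hits is on the list above.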

Outline

- Context
- The Challenge
‣ Key components
- Our approach
- Evaluation
- Demonstration
- Wrap-up

To embed Solr (in Java app)

- Your app simulates Java web container?
- use EmbeddedSolrServer
- It knows nothing about Java servlets?
- use the DirectSolrConnection class
- Maybe we are too lazy?
- Embed the web container (in my case Jetty)
- Seemed strange (webserver inside webserver)
- ... but it worked well

To use Solr in non-Java app
- Solr is already usable via HTTP requests, but we need something else here...
- Remote objects/calls?
- Pyro, execnet, CORBA, SOAP...
- or simply pipes?
- Access Python from Java?
- Jython
- JEPP
- Access Java from Python?
- JPype
- JCC

Jython?
- Implementation of Python in 100% Java
- Both Java and Python code
- Truly multithreaded

- C modules will not work
- but see http://bit.ly/iTRYbb
- Slower than CPython

JEPP - Java Embedded Python
- Python code runs inside the Python interpreter
- Embeds CPython interpreter via Java Native Interface (JNI) in Java
- http://jepp.sourceforge.net/
- recently updated (27-Jan)
- but JCC is more active

JCC
- Embeds JVM in Python
- C++ code generator
- C++ object interface wraps a Java library
- C++ wrappers conform to Python's C type system
- result: complete Python extension module

To use Solr in non-Java app

Jython JCC JEPP

Python C modules: ✓ ✓
Speed: ✓ ?
No code changes: ✓ ✓
Access from Python: ✓ ✓
Access from Java: ✓ ... ✓

The first try

[Diagram: Invenio and Solr connected through JCC]

The devil is in the details...

GIL - Global Interpreter Lock
Unfortunately, a Python webapp is not like a Java one...

We can have 200 threads, but only 4 will run at a time...
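
The effect of the GIL can be observed with a small Python 3 sketch like the one below. It is an illustration only, not part of MontySolr: the function names and the job sizes are arbitrary choices. It runs the same CPU-bound loop serially and then in threads.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop; it holds the GIL while it runs."""
    while n > 0:
        n -= 1
    return n


def run_serial(jobs, n):
    """Run the loop `jobs` times, one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(jobs):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(jobs, n):
    """Run the loop in `jobs` threads at once; return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,))
               for _ in range(jobs)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    jobs, n = 4, 2_000_000
    print("serial:   %.2f s" % run_serial(jobs, n))
    print("threaded: %.2f s" % run_threaded(jobs, n))
```

On a standard CPython build with the GIL, the threaded run is typically no faster than the serial one, because only one thread executes Python bytecode at any moment. Exact timings depend on the machine and the interpreter version.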

Fortunately, a solution exists
- JCC can embed Python inside Java
- Special thanks to Andi Vajda! (JCC creator)
- We write ‘empty’ classes in Java ...
- ... and implement them in Python

[Diagram captions: Python w/ Java inside; Java w/ Python inside]

The second try

[Diagram: Invenio frontend <-> XML <-> Solr w/ Invenio (backend), joined through JCC]

Implementing the bridge
- Special Java class
- With method pythonExtension()
- Native method pythonDecRef()
- JCC provides its implementation
- And a number of other native methods
- These will be implemented using Python
- Like writing JNI Java/C code but without compilation...

MontySolr extension
- JCC has great potential, but also added complexity...
- So the MontySolr project was born
- Modules must be built in shared mode
- JCC dynamic library loaded and started from the main thread
- Simple mechanism of the Python bridge and message
- Configurable handlers on the Python side
- Secured dereferencing of the native objects
- Threading on the Java side
- Multiprocessing on the Python side
- Easy ant targets (compilation) ...

Hello World - Java part
public class MontySolrBridge extends BasicBridge implements PythonBridge {

    // pointer to the Python object that implements this class
    private long pythonObject;

    public void pythonExtension(long pythonObject) {
        this.pythonObject = pythonObject;
    }

    public long pythonExtension() {
        return this.pythonObject;
    }

    // release the Python reference when Java collects this object
    public void finalize() throws Throwable {
        pythonDecRef();
    }

    public native void pythonDecRef();

    public void sendMessage(PythonMessage message) {
        PythonVM vm = PythonVM.get();
        vm.acquireThreadState();
        try {
            receive_message(message);
        } finally {
            // always release, even if the Python side raises
            vm.releaseThreadState();
        }
    }

    // implemented on the Python side
    public native void receive_message(PythonMessage message);
}

Hello World - Python part

from montysolr import MontySolrBridge

class SimpleBridge(MontySolrBridge):

    def __init__(self):
        super(SimpleBridge, self).__init__()

    def receive_message(self, message):
        # called from Java through the bridge
        query = message.getParam('query')
        message.setResults('Hello world!')
        print('Python received from Java: %s' % query)

Example - running MontySolr
- Java side
- JRE (32/64 bit)
- Standard Solr/Lucene jars
- JCC dynamic library
- Python side
- Python interpreter (32/64 bit)
- 4 Python modules (jcc, solr, lucene, montysolr)
- In the main thread
- First we load JCC
- Then start Python interpreter ...
- ... load Python handlers

Solr as search service

[Diagram: Invenio frontend <-> XML <-> Solr w/ Invenio (backend), joined through JCC]

Example

[Diagram: the query refersto:author:ellis arrives at Solr and goes to MyCustomHandler]

Example - Solr custom handler

// the key call is MontySolrVM.INSTANCE.sendMessage(message); in full:

PythonMessage msg = MontySolrVM.INSTANCE
    .createMessage("perform_search")
    .setSender("Invenio")
    .setParam("query", "refersto:author:ellis");

MontySolrVM.INSTANCE.sendMessage(msg);
Object result = msg.getResults();
if (result != null) {
    // the Python side stored the hits as a Java int array
    int[] hits = (int[]) result;

}

Example - JNI connection

[Diagram: inside Solr, MyCustomHandler -> Python Bridge -> Invenio wrappers]

Example - Python side

# search time - called from Java
def perform_search(message):
    query = message.getParam("query")
    hits = call_real_search(query)
    # cast Python list into Java array
    message.setResults(JArray_ints(hits))

# handler is made 'visible' at startup
SolrpieTarget('Invenio:perform_search',
              perform_search)
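
How a message might find its handler can be sketched without JCC at all. The Python 3 code below is a self-contained mock of registration and dispatch by a "sender:recipient" key, as in the 'Invenio:perform_search' target above. The Message and Dispatcher classes and the fake index are hypothetical stand-ins, not MontySolr's real classes.

```python
# Hypothetical stand-ins: none of these names are MontySolr's real API.
class Message(dict):
    """Minimal message: parameters go in, results come out."""

    def __init__(self, sender, recipient, **params):
        super().__init__(params)
        self.sender = sender
        self.recipient = recipient
        self.results = None


class Dispatcher:
    """Maps 'sender:recipient' keys to Python callables."""

    def __init__(self):
        self._handlers = {}

    def register(self, key, func):
        self._handlers[key] = func

    def receive_message(self, message):
        key = "%s:%s" % (message.sender, message.recipient)
        handler = self._handlers.get(key)
        if handler is None:
            raise KeyError("no handler registered for %r" % key)
        handler(message)


def perform_search(message):
    # Stand-in for the real Invenio search call.
    fake_index = {"refersto:author:ellis": [1, 2, 3, 9]}
    message.results = fake_index.get(message["query"], [])


dispatcher = Dispatcher()
dispatcher.register("Invenio:perform_search", perform_search)

msg = Message("Invenio", "perform_search", query="refersto:author:ellis")
dispatcher.receive_message(msg)
print(msg.results)
# prints [1, 2, 3, 9]
```

Keeping the routing in a plain dictionary means new handlers can be added on the Python side without touching or recompiling the Java classes.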

Example

[Diagram: inside Solr, MyCustomHandler -> Python Bridge -> Invenio wrappers, surrounded by four boxes labeled Invenio]

Example - Java side again

// the key call is MontySolrVM.INSTANCE.sendMessage(message); in full:

PythonMessage msg = MontySolrVM.INSTANCE
    .createMessage("perform_search")
    .setSender("Invenio")
    .setParam("query", "refersto:author:ellis");

MontySolrVM.INSTANCE.sendMessage(msg);
Object result = msg.getResults();
if (result != null) {
    // the Python side stored the hits as a Java int array
    int[] hits = (int[]) result;

}

Solr as search service

[Diagram: Apache webserver running two Invenio instances <-> XML <-> Solr w/ Invenio (backend), joined through JCC]

Outline

- Context
- The Challenge
- Key components
- Our approach
- Problems solved
‣ Evaluation
- Wrap-up

Memory and garbage collection

Comparing speed and load...

The effect of cache

Robust?
- Extensive siege tests show very good performance and stability under high load
- 100-200 users, complex searches
- 50 concurrent users, citation analysis
- JCC incurs small overhead
- We detected no memory leaks
- The same as dbpedia.org
- But watch out for errors in C
- An error in a C module brings down the whole JVM
- (errors in a pure Python module can be handled)
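
One way the "can be handled" part might look is a wrapper that turns any Python exception into an error field on the message, so it never reaches the Java caller as a crash. This is a sketch under assumptions: the message is modelled as a plain dict, and safe_handler and broken_search are hypothetical names. It cannot protect against a crash inside a C extension.

```python
import functools
import traceback


def safe_handler(func):
    """Wrap a handler so a Python exception becomes an error result."""

    @functools.wraps(func)
    def wrapper(message):
        try:
            func(message)
        except Exception as exc:
            # Record the failure on the message instead of propagating it.
            message["error"] = "%s: %s" % (type(exc).__name__, exc)
            message["traceback"] = traceback.format_exc()
            message["results"] = None
        return message

    return wrapper


@safe_handler
def broken_search(message):
    # Simulated bug in a pure-Python handler: the index is out of range.
    message["results"] = [1, 2, 3][message["offset"]]


msg = broken_search({"offset": 10})
print(msg["error"])
# prints IndexError: list index out of range
```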

Easy to develop/maintain?
- Added complexity
- Java in the toolbox
- Need to compile C++ extensions
- Python/OS version dependencies

- For this we get
- Easy integration with Invenio
- The best of two applications
- A lot of features for free
- And we can control Solr from Python!

Outline

- Context
- The Challenge
- Key components
- Our approach
- Problems solved
- Evaluation
‣ Wrap-up

Wrap-up
- Our challenge was to connect two different languages/systems
- And we wanted to get the best of the two...
- So we had to plug Python into Solr
- And now our Solr knows citation analysis!
- We created the MontySolr extension
- Robust, tested (will be used by INSPIRE)
- Works for any Python application (e.g. Django)
- And for any C/C++ app that Python understands!
- Free software license
- Try it out! Help us make it better!

Questions?
- MontySolr

- Roman Chyla
- Fellow, CERN Scientific Information Service
- roman.chyla@cern.ch
- @rchyla
- https://svnweb.cern.ch/trac/rcarepo


Additional information

Links
- Invenio platform
- http://invenio-software.org/
- INSPIRE Digital library
- http://inspirebeta.net/
- Diagrams of JCC and JEPP
- Andreas Schreiber: Mixing Java and Python
- http://www.slideshare.net/onyame/mixing-python-and-java
- On Jython C Extension API
- http://stackoverflow.com/questions/3097466/using-numpy-and-cpython-with-jython
- Demo of a running service:
- http://insdev01.cern.ch

#1 - How to embed Solr (standard)
- solr.client.solrj.embedded.EmbeddedSolrServer

#2 - How to embed Solr (simplified)
- solr.servlet.DirectSolrConnection
- like previous, but simpler
- all the queries are sent as strings, everything is just a string
- very flexible and probably suitable for quick integration

#3 - Example of a Solr custom handler

#4 - Example Python handler
