How to integrate python into a scala stack

Published in: Technology

  1. Scala and Python
     Integrating scikit-learn into a Scala stack to build realtime predictive models
     Dan Chiao, VP Engineering
  2. Why it was necessary
     We pivoted
  3. The original product
     • Social data append
       – PeopleGraph: match email addresses to public demographics and social profiles
       – BrandGraph: match company URLs to public firmographics and social profiles
     • Requirements
       – Integrate a large (and expanding) number of web data sources (REST, SOAP, flat files)
       – Realtime processing of large volumes of contacts (60 queries/s)
  4. The original technology stack
     • Scala
       – Best of both worlds
         • Concise functional syntax
         • Java libraries and deployment architecture
         • Scala-specific libraries (Dispatch, Lift Web Framework)
     • Twitter (soon to be Apache) Storm
       – Streaming intake and normalization of large amounts of data
     • MongoDB
       – Expanding data sources = constantly updating schema
       – Most sophisticated query syntax of NoSQL options
     • AWS and Azure
       – Well, duh
  5. The new product
     • Moving up the application stack
       – Focus on the most compelling single-use case for our data
       – Fliptop SpendScore
         • Predictive analytics for sales and marketing teams
         • “Machine learning for Salesforce”
  6. The updated technology stack
     • Still need to wrangle large amounts of data, so no changes there
     • New requirement: fast, scalable machine learning
  7. Why not Scala (Java) native?
     • The options
       – Apache Mahout
         • Only skeleton implementations for most sophisticated machine learning techniques (e.g. Random Forest, AdaBoost)
         • Customer-specific models: don’t need Big Data
       – Weka
         • GPL
       – Scala-native libraries
         • Too early to use in production
  8. Why Python?
     • scikit-learn
       – Mature: around since 2006
       – Actively developed: last stable release Aug 2013
       – Sophisticated: Random Forest and AdaBoost classifiers show comparable performance to R
     • Why not R? Not really production grade.
  9. Requirements
     • APIs to exploit Python’s modeling power
       – Train, predict, model info query, etc.
     • Scalability
       – On-demand Python serving nodes
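The API surface named on the Requirements slide (train, predict, model info query) could look roughly like the following in Python. This is a minimal sketch, not Fliptop's implementation: `ModelService`, its method names, the `customer_id` key, and the `MajorityClassModel` stand-in are all hypothetical. The stand-in only mimics the `fit`/`predict` methods that scikit-learn classifiers expose, so the block runs without scikit-learn installed.

```python
from collections import Counter


class MajorityClassModel:
    """Hypothetical stand-in with the fit/predict shape of a scikit-learn classifier."""

    def fit(self, X, y):
        # Remember the most frequent label seen during training.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Predict the remembered label for every input row.
        return [self.label_ for _ in X]


class ModelService:
    """Hypothetical API surface: train, predict, and model-info queries."""

    def __init__(self, model_factory=MajorityClassModel):
        self._factory = model_factory
        self._models = {}  # customer_id -> (fitted model, metadata)

    def train(self, customer_id, X, y):
        # One model per customer, matching the "customer-specific models" idea.
        model = self._factory().fit(X, y)
        metadata = {"rows": len(X), "classes": sorted(set(y))}
        self._models[customer_id] = (model, metadata)

    def predict(self, customer_id, X):
        model, _ = self._models[customer_id]
        return list(model.predict(X))

    def info(self, customer_id):
        # Return a copy so callers cannot mutate the stored metadata.
        return dict(self._models[customer_id][1])


service = ModelService()
service.train("acme", X=[[1, 0], [0, 1], [1, 1]], y=["won", "lost", "won"])
predictions = service.predict("acme", [[0, 0], [1, 0]])
model_info = service.info("acme")
```

Because the service depends only on `fit` and `predict`, a real estimator could presumably be passed in as `model_factory` without changing the three API methods.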
  10. Tools for Scala-Python Integration
      • Reimplementation of Python
        – Jython (JPython)
      • Communication through JNI
        – Jepp
      • Communication through IPC
        – Apache Thrift
      • Communication through REST API calls
        – Bottle
  11. Jython
      • Reimplementation of Python in Java
      • Can import and use any Java class
      • Includes almost all of the modules in the standard Python distribution
        – Except some of the modules implemented originally in C
      • Compiles to Java bytecode
        – Either on demand or statically
  12. Jython
      [Diagram: JVM containing Scala Code, Python Code, and Jython]
  13. Jython
      • Lacks support for lots of extensions for scientific computing
        – NumPy, SciPy, etc.
      • JyNI (Jython Native Interface) to the rescue?
        – Specifically designed to support CPython extensions like NumPy, SciPy
        – Still in alpha
  14. Communication through JNI
      • Jepp (Java Embedded Python)
        – Embeds CPython in Java
        – Runs Python code in CPython
        – Leverages both JNI and Python/C for integration
  15. Jepp
      [Diagram: Scala Code in the JVM connected through JNI and Jepp to Python Code in the Python Interpreter]
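  16. Jepp
      python_util.py:

          def python_add(a, b):
              return a + b

      TestJepp.scala:

          import jep.Jep

          object TestJepp extends App {
            val jep = new Jep()
            // Load the Python helper so python_add is defined in the interpreter
            jep.runScript("python_util.py")
            // invoke takes Object arguments, so box the Ints first
            val a = (2).asInstanceOf[AnyRef]
            val b = (3).asInstanceOf[AnyRef]
            val sumByPython = jep.invoke("python_add", a, b)
            println(sumByPython.asInstanceOf[Int])
            // Release the embedded interpreter
            jep.close()
          }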
  17. Communication through IPC
      • Apache Thrift
        – Developed and open-sourced by Facebook
        – More community support than Protobuf, Avro
        – IDL-based (Interface Definition Language)
        – Generates server/client code in specified languages
        – Takes care of protocol and transport layer details
        – Comes with generators for Java, Python, C++, etc.
          • No Scala generator
          • Scrooge (Twitter) to the rescue!
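To make "protocol and transport layer details" concrete, here is roughly what one would have to write by hand for a single add call without an RPC framework. This is only an illustration under stated assumptions: it is not Thrift's actual wire format (which also carries message and field headers), and the 8-byte request layout and the helper names `send_add_request`, `recv_exact`, and `serve_one_add` are invented for this sketch.

```python
import socket
import struct


def send_add_request(sock, a, b):
    # Hypothetical request layout: two signed 32-bit big-endian integers.
    sock.sendall(struct.pack(">ii", a, b))


def recv_exact(sock, n):
    # recv may return fewer bytes than asked for, so loop until n bytes arrive.
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        data += chunk
    return data


def serve_one_add(sock):
    # Decode one request, compute the sum, and encode the reply as one i32.
    a, b = struct.unpack(">ii", recv_exact(sock, 8))
    sock.sendall(struct.pack(">i", a + b))


# A connected socket pair stands in for the client/server network link.
client, server = socket.socketpair()
send_add_request(client, 3, 5)
serve_one_add(server)
(total,) = struct.unpack(">i", recv_exact(client, 4))
client.close()
server.close()
```

Every new method or argument type would need its own packing, framing, and error handling, which is the work an IDL-driven code generator removes.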
  18. Thrift – IDL
      TestThrift.thrift:

          namespace java python_service_test
          namespace py python_service_test

          service PythonAddService {
            i32 pythonAdd(1:i32 a, 2:i32 b),
          }

      Code generation:

          $ thrift --gen java --gen py TestThrift.thrift
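  19. Thrift – Python Server
      PythonAddService.py (generated interface):

          class Iface:
              def pythonAdd(self, a, b):
                  pass

      PythonAddServer.py:

          # Assumes the generated gen-py directory is on sys.path
          from thrift.protocol import TBinaryProtocol
          from thrift.server import TServer
          from thrift.transport import TSocket, TTransport

          from python_service_test import PythonAddService

          class ExampleHandler(PythonAddService.Iface):
              def pythonAdd(self, a, b):
                  return a + b

          handler = ExampleHandler()
          # The processor class comes from the generated PythonAddService module
          processor = PythonAddService.Processor(handler)
          # Pass the port by keyword so it is not mistaken for the host argument
          transport = TSocket.TServerSocket(port=9090)
          tfactory = TTransport.TBufferedTransportFactory()
          pfactory = TBinaryProtocol.TBinaryProtocolFactory()
          server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)
          server.serve()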
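  20. Thrift – Scala Client
      PythonAddClient.scala:

          import org.apache.thrift.protocol.{TBinaryProtocol, TProtocol}
          import org.apache.thrift.transport.{TSocket, TTransport}
          import python_service_test.PythonAddService

          object PythonAddClient extends App {
            val transport: TTransport = new TSocket("localhost", 9090)
            val protocol: TProtocol = new TBinaryProtocol(transport)
            val client = new PythonAddService.Client(protocol)

            transport.open()
            // The method name must match the IDL definition: pythonAdd
            val sumByPython = client.pythonAdd(3, 5)
            println("3 + 5 = " + sumByPython)
            transport.close()
          }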
  21. Thrift
      [Diagram: Scala Code in the JVM talks through Thrift to multiple Python Interpreters, each running Python Code behind Thrift]
      Auto Balancing, Built-in Encryption
  22. REST API Architecture
      [Diagram: Scala Code in the JVM calls multiple Bottle servers, each running Python Code]
      Auto Balancer? Encoding?
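The REST option can be sketched end to end with the same add function used in the Jepp and Thrift slides. To keep the block free of third-party dependencies, the standard library's `http.server` stands in for Bottle here, and a Python `urllib` client stands in for the Scala caller. The `/python_add` route, the JSON field names, and `call_python_add` are assumptions made for illustration, not details taken from the talk.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class AddHandler(BaseHTTPRequestHandler):
    """Handles POST /python_add with a JSON body such as {"a": 3, "b": 5}."""

    def do_POST(self):
        if self.path != "/python_add":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        args = json.loads(self.rfile.read(length))
        body = json.dumps({"sum": args["a"] + args["b"]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # suppress per-request logging in this demo


def call_python_add(base_url, a, b):
    """Client side: encode the arguments as JSON, POST them, decode the reply."""
    request = urllib.request.Request(
        base_url + "/python_add",
        data=json.dumps({"a": a, "b": b}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # An empty ProxyHandler keeps the local call from going through any proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read())["sum"]


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), AddHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = call_python_add("http://127.0.0.1:%d" % server.server_port, 3, 5)
server.shutdown()
server.server_close()
```

The JSON encoding and decoding on both sides is the manual work the "Encoding?" note on the slide refers to; with Thrift the generated code handles it.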
  23. Thrift vs. REST

                            Thrift   REST   Does it matter?
      Load Balancer         ✔               No (AWS & Azure)
      Encode/Decode         ✔               No (we’re already doing it)
      Low Learning Curve             ✔      Yes
      No Dependency                  ✔      Yes
  24. Fliptop’s Architecture
      [Diagram: Scala Code in the JVM → Load Balancer → multiple Bottle servers, each running Python Code]
      5 Python servers, ~5,000 requests/sec
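As a rough illustration of what the load balancer in this architecture does, the sketch below rotates requests across five nodes and counts how many each receives. It assumes the ~5,000 requests/sec figure is the total across all five servers and that the balancer uses plain round-robin; the talk states neither, and the node addresses are hypothetical.

```python
from collections import Counter
from itertools import cycle

# Hypothetical addresses for the five Python serving nodes.
PYTHON_NODES = ["http://py-node-%d:8080" % i for i in range(1, 6)]


def round_robin(nodes):
    """Return an endless iterator that hands out nodes in rotation."""
    return cycle(nodes)


balancer = round_robin(PYTHON_NODES)
# Simulate one second of traffic at 5,000 requests/sec in total.
load = Counter(next(balancer) for _ in range(5000))
per_node = sorted(load.values())
```

Under those assumptions each node handles about 1,000 requests per second, and adding a node only requires registering one more address with the balancer.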
  25. Summary
      • Jython
        (✓) Tight integration with Scala/Java
        (✗) Lacks support for C extensions (JyNI might help in the future)
      • Jepp
        (✓) Access high-quality Python extensions with CPython speed
        (✗) Two runtime environments
      • Thrift, REST
        (✓) Language-independent development
        (✗) Bigger communication overhead
  26. Questions? Ask this guy
  27. Thank You
