
Building distributed processing system from scratch - Part 2


Published in: Data & Analytics

  1. Distributed Systems from Scratch - Part 2: Handling third party libraries
  2. ● Madhukara Phatak ● Big data consultant and trainer ● Consults in Hadoop, Spark and Scala
  3. Agenda ● Idea ● Motivation ● Architecture of existing big data systems ● Function abstraction ● Third party libraries ● Implementing third party libraries ● MySQL task ● Code example
  4. Idea: “What does it take to build a distributed processing system like Spark?”
  5. Motivation ● The first version of Spark had only 1600 lines of Scala code ● It had all the basic pieces of RDD and the ability to run distributed using Mesos ● Recreating the same code with a step-by-step understanding ● Ample time in hand
  6. Distributed systems from 30,000 ft, layered top to bottom: Data Applications → Distributed Processing Systems (Spark/MapReduce) → Distributed Cluster Management (YARN/Mesos) → Distributed Storage (HDFS/S3)
  7. Our distributed system: Mesos for cluster management, a Scala function based abstraction on top, and Scala functions to express logic
  8. Function abstraction ● The whole Spark API can be summarized as a Scala function, which can be represented as () => T ● This Scala function can be parallelized and sent over the network to run on multiple machines using Mesos ● The function is represented as a task inside the framework ● FunctionTask.scala
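A minimal sketch of what such a task could look like. The class and object names here are assumptions mirroring the slide's FunctionTask.scala, not the deck's actual code; the key point is that a `() => T` closure can be round-tripped through Java serialization, as if shipped to an executor:

```scala
import java.io._

// A task wrapping a serializable zero-argument function (assumed shape).
// Scala anonymous functions are serializable as long as they capture only
// serializable values.
class FunctionTask[T](val body: () => T) extends Serializable {
  def run(): T = body()
}

object FunctionTaskDemo {
  // Round-trip a task through Java serialization, simulating the
  // scheduler -> network -> executor hop.
  def roundTrip[T](task: FunctionTask[T]): FunctionTask[T] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(task)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    in.readObject().asInstanceOf[FunctionTask[T]]
  }

  def main(args: Array[String]): Unit = {
    val task = new FunctionTask(() => 1 + 1)
    val copy = roundTrip(task)
    println(copy.run()) // prints 2
  }
}
```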
  9. Spark API as distributed function ● The initial API of Spark revolved around the Scala function abstraction for processing, with RDD as the data abstraction ● Every API like map and flatMap is represented as a function task which takes one parameter and returns one value ● Distribution of the functions was initially done by Mesos, and later ported to other cluster managers ● This shows how Spark started with functional programming
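The "takes one parameter, returns one value" shape can be sketched as a one-argument task; the name `UnaryFunctionTask` is an illustration, not a class from the deck or from Spark:

```scala
// Sketch of a map/flatMap-style task: one parameter in, one value out.
class UnaryFunctionTask[A, B](val f: A => B) extends Serializable {
  def run(input: A): B = f(input)
}

object UnaryFunctionTaskDemo {
  def main(args: Array[String]): Unit = {
    // a map-style task that doubles its input
    val double = new UnaryFunctionTask[Int, Int](_ * 2)
    println(double.run(21)) // prints 42
  }
}
```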
  10. Till now ● Discussed Mesos and its abstractions ● Hello world code on Mesos ● Defined the Function interface ● Implemented ○ Scheduler to run Scala code ○ Custom executor for Scala ○ Serialization and deserialization of Scala functions
  11. What can a local function do? ● Access local data; even in Spark, a function normally accesses HDFS-local data ● Access classes provided by the framework ● Run any logic which can be serialized What can it not do? ● Access classes outside the framework ● Access the results of other functions (shuffle) ● Access lookup data (broadcast)
  12. Need for third party libraries ● The ability to add third party libraries to a distributed processing framework is important ● Third party libraries allow us to ○ Connect to third party sources ○ Implement custom logic, like matrix manipulation, inside the function abstraction ○ Extend the base framework with a set of libraries, e.g. spark-sql ○ Optimize for specific hardware
  13. Approaches to third party libraries ● There are two different approaches to distributing third party jars ● UberJar - build all the dependencies together with your application code into a single jar ● The second approach is to distribute the libraries separately and add them to the classpath of the executors ● UberJar suffers from jar size and versioning issues ● So we are going to follow the second approach, which is similar to the one followed in Spark
  14. Design for distributing jars: the scheduler/driver runs a jar-serving HTTP server alongside the scheduler code; Executor 1 and Executor 2 each download the jars from it over HTTP
  15. Distributing jars ● Third party jars are distributed over the HTTP protocol across the cluster ● Whenever the scheduler/driver comes up, it starts an HTTP server to serve the jars passed to it by the user ● Whenever executors are created, the scheduler passes on the URI of the HTTP server to connect to ● Executors connect to the jar server, download the jars to their respective machines and add them to their classpath
  16. Code for implementing ● We need multiple changes to our existing code base to support third party jars ● The steps are ○ Implement an embedded HTTP server ○ Change the scheduler to start the HTTP server ○ Change the executor to download jars and add them to its classpath ○ A function which uses a third party library
  17. Http Server ● We implement an embedded HTTP server using Jetty ● Jetty is a popular HTTP server and J2EE servlet container from the Eclipse Foundation ● One of the strengths of Jetty is that it can be embedded inside another program to provide an HTTP interface to some functionality ● Initial versions of Spark used Jetty for jar distribution; newer versions use Netty ● HttpServer.scala
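The slide's HttpServer.scala embeds Jetty; as a self-contained stand-in, the same file-serving idea can be sketched with the JDK's built-in `com.sun.net.httpserver` (so no Jetty jar is needed to try it). The `JarServer` name and the routing are assumptions for illustration:

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress
import java.nio.file.{Files, Paths}

// Minimal jar-serving HTTP server: serves files out of baseDir.
// Stand-in for the deck's Jetty-based HttpServer.scala.
object JarServer {
  def start(baseDir: String, port: Int): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // map the request path to a file under the base directory
        val file = Paths.get(baseDir).resolve(
          exchange.getRequestURI.getPath.stripPrefix("/"))
        if (Files.isRegularFile(file)) {
          val bytes = Files.readAllBytes(file)
          exchange.sendResponseHeaders(200, bytes.length)
          exchange.getResponseBody.write(bytes)
        } else {
          exchange.sendResponseHeaders(404, -1)
        }
        exchange.close()
      }
    })
    server.start()
    server
  }
}
```

Passing port 0 lets the OS pick a free port; the scheduler can then read the actual port from `server.getAddress` when building the URI it hands to executors.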
  18. Scheduler change ● Once we have an HTTP server, we need to start it when we start our scheduler ● We will use the registered callback to create our jar server ● As part of starting the jar server, we copy all the jars provided by the user to a location which becomes the base directory for the server ● Once the server is running, we pass the server URI on to all the executors ● TaskScheduler.scala
  19. Executor side ● In the executor, we download the jars with calls to the jar server running on the master ● Once we have downloaded the jars, we add them to the classpath using a URLClassLoader ● We use this classloader to run our functions so that they have access to all the jars ● We plug this code into the registered callback of the executor so it runs only once ● TaskExecutor.scala
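A sketch of the executor-side download-and-classload step. `JarFetcher` and its parameters are assumed names, not the deck's TaskExecutor.scala; the `URLClassLoader` usage is the standard JDK API:

```scala
import java.net.{URL, URLClassLoader}
import java.nio.file.{Files, Paths, StandardCopyOption}

// Executor-side sketch: download each jar from the scheduler's jar server
// and build a URLClassLoader over the local copies.
object JarFetcher {
  def fetch(serverUri: String, jarNames: Seq[String], destDir: String): URLClassLoader = {
    val urls = jarNames.map { name =>
      val dest = Paths.get(destDir, name)
      val in = new URL(s"$serverUri/$name").openStream()
      try Files.copy(in, dest, StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
      dest.toUri.toURL
    }
    // parent the loader on the framework's classloader so that
    // framework classes remain visible to the loaded functions
    new URLClassLoader(urls.toArray, getClass.getClassLoader)
  }
}
```

Running the deserialized function with this loader as the thread's context classloader is what gives it access to the third party classes.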
  20. MySQL function ● This example is a function which accesses MySQL classes to run JDBC queries against a MySQL instance ● We ship the MySQL jar using our jar distribution framework, so it will not be part of our application jar ● There is no change in our function API, as it is a normal function like the other examples ● MySQLTask.scala
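A hypothetical sketch of such a task, illustrating that it is just an ordinary `() => T` whose body happens to use JDBC classes shipped via the jar server. The connection URL, credentials, table name and object name below are placeholders, not details from the slides:

```scala
import java.sql.DriverManager

// Sketch of a MySQL task: building the function touches nothing; the JDBC
// work happens only when the executor invokes the closure, at which point
// the downloaded mysql jar is already on its classpath.
object MySQLTaskSketch {
  def mysqlTask(url: String, user: String, password: String): () => Int = () => {
    val conn = DriverManager.getConnection(url, user, password)
    try {
      val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM users")
      rs.next()
      rs.getInt(1)
    } finally conn.close()
  }
}
```

Because the body is deferred, the scheduler can serialize and ship the task without ever having the MySQL driver on its own classpath.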
  21. 21. References ● setup-ubuntu/ ● scala/ ● executor-scala/ ● party-libraries-in-mesos/