
Building large scale, job processing systems with Scala Akka Actor framework


The Akka Actor framework is designed to be a fast message processing system. In this talk, we will explain how, at Box, we have used this framework to develop a large scale job processing system that works on billions of data files and achieves a high degree of throughput and fault tolerance. Over the course of the talk, we will explore the usage of Akka framework’s Supervisor functionality to provide a more controllable fault-tolerance strategy, and how we can use Futures to manage asynchronous jobs.

Published in: Technology


  1. Building massive scale, fault tolerant, job processing systems with Scala Akka framework• Vignesh Sukumar• SVCC 2012
  2. About me• Storage group, Backend Engineering at Box• Love enterprise software!• Interested in Big Data and building distributed systems in the cloud
  3. About Box• Leader in enterprise cloud collaboration and storage• Cutting-edge work in backend, frontend, platform and engineering services• A really fun place to work – we have a long slide!
  4. Talk outline• Job processing requirements• Traditional & new models for job processing• Akka actors framework• Achieving and controlling high IO throughput• Fine-grained fault tolerance
  5. Typical architecture in a cloud storage environment
  6. Practical realities• Storage nodes usually have varying configurations (OS, processing power, storage capacity, etc.), mainly because of rapid evolution in provisioning operations• Some nodes are more overworked than others (e.g., accepting live uploads)• Billions of files; petabytes
  7. Job processing requirements• Iterate over all files (billions, petabyte scale): e.g., check consistency of all files• High throughput• Fault tolerant• Secure
  8. Traditional job processing model
  9. Why traditional models fail in cloud storage environments• Not scalable: petabyte scale, billions of files• Insecure: cannot move files out of storage nodes• No performance control: easy to overwhelm any storage node• No fine grained fault tolerance
  10. Compute on Storage• Move job computation directly to storage nodes• Utilize abundant CPU on storage nodes• Metadata store still stays in a highly available system like an RDBMS• Results from operations on a file are completely independent
  11. Master – slave architecture
  12. Benefits• High IO throughput: Direct access; no transfer of files over a network• Secure: files do not leave storage nodes• Better performance control: compute can easily monitor system load and back off• Better fault tolerance handling: finer grained handling of errors
  13. Master node• Responsible for accepting job submissions and splitting them into tasks for slave nodes• Stateful: keeps a durable copy of jobs and tasks in Zookeeper• Horizontally scalable: service can be run on multiple nodes
  14. Agent• Runs directly on the storage nodes on a machine-independent JVM container• Stateless: no task state is maintained• Monitors system load with back-off• Reports results directly to master without synchronizing with other agents
  15. Implementation with the Scala Akka Actor framework
  16. Actors• Concurrent threads abstraction with no shared state• Exchange messages• Asynchronous, non-blocking• Multiple actors can map to a single OS thread• Parent-children hierarchical relationship
  17. Actors and messages
      class MyActor extends Actor {
        def receive = {
          case MsgType1 => // do something
        }
      }
      // instantiation and sending messages
      val actorRef = system.actorOf(Props(new MyActor))
      actorRef ! MsgType1
  18. Agent Actor System
  19. Achieving high IO throughput• Parallel, asynchronous IO through “Futures”
      val fileIOResult = Future {
        // issue high latency tasks like file IO
      }
      val networkIOResult = Future {
        // read from network
      }
      Futures.awaitAll(<wait time>, fileIOResult, networkIOResult)
      fileIOResult onSuccess {
        // do something
      }
      networkIOResult onFailure {
        // retry
      }
  20. Controlling system throughput• The problem: agents need to throttle themselves as storage nodes serve live traffic• Adjust number of parallel workers dynamically through a monitoring service
  21. Controlling throughput: Examples• Parallelism parameters can be fetched from a separate configuration service on a per-node basis• Some machines can be sped up and others slowed down this way• The configuration can be updated on a cron schedule to speed up during weekends
  22. Fine grained fault tolerance with Supervisors• Parents of child actors can define specific fault-handling strategies for each failure scenario in their children• Components can fail gracefully without affecting the entire system
  23. Supervision strategy: Examples
      class TaskActor extends Actor {
        // create child workers
        override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 3) {
          case _: SqlException => Resume // retry the same file
          case _: FileCorruptionException => Stop // don’t clobber it!
          case _: IOException => Restart // report and move on
        }
      }
  24. Unit testing
      • Scalatra test framework: very easy to read!
      TaskActorTest.receive(BadFileMsg) must throw FileNotFoundException
      • Mocks for network and database calls
      val mockHttp = mock[HttpExecutor]
      TaskActorTest ! doHttpPost
      there was atLeastOne(mockHttp).POST
      • Extensive testing of failure injection scenarios
  25. Takeaways• Keep your architecture simple by modeling actor message flow along the same paths as parent-child actor hierarchy (i.e., no message exchange between peer child actors)• Design and implement for component failures• Write unit tests extensively: we did not have any fundamental level functionality breakage• Box Engineering is awesome!
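
Slide 13 says the master splits each submitted job into tasks for the storage nodes, but the transcript does not show how. The following is a minimal sketch of one way that split could look, assuming each file is already known to live on one storage node. The names `FileRef`, `Task`, `splitJob`, and `maxFilesPerTask` are hypothetical and do not come from the talk.

```scala
object TaskSplitSketch {
  // Hypothetical types: a file and the storage node that holds it.
  final case class FileRef(id: Long, storageNode: String)
  // Hypothetical unit of work handed to the agent on one storage node.
  final case class Task(storageNode: String, fileIds: Vector[Long])

  // Group files by the node that stores them, then cut each group into
  // bounded batches so that no single task grows too large.
  def splitJob(files: Seq[FileRef], maxFilesPerTask: Int): Vector[Task] = {
    require(maxFilesPerTask > 0, "maxFilesPerTask must be positive")
    files
      .groupBy(_.storageNode)
      .toVector
      .sortBy(_._1) // deterministic ordering of nodes
      .flatMap { case (node, refs) =>
        refs.map(_.id).grouped(maxFilesPerTask).map(ids => Task(node, ids.toVector)).toVector
      }
  }
}
```

Because results for each file are independent (slide 10), tasks built this way can be dispatched and retried separately.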
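
The Futures pattern on slide 19 can be sketched with the standard library's `scala.concurrent` package, which is used here as a stand-in for the Akka Futures API shown on the slide. The helpers `readFileChecksum` and `fetchExpectedChecksum` are hypothetical placeholders for the slow file and network calls, and the consistency check itself is an assumption based on the example on slide 7.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object ParallelIoSketch {
  // Hypothetical stand-ins for high latency operations.
  def readFileChecksum(path: String): Long = { Thread.sleep(50); path.length.toLong }
  def fetchExpectedChecksum(path: String): Long = { Thread.sleep(50); path.length.toLong }

  def checkFile(path: String): Future[Boolean] = {
    // Both futures are created before either is awaited, so they run in parallel.
    val fileIOResult = Future(readFileChecksum(path))
    val networkIOResult = Future(fetchExpectedChecksum(path))
    // Combine the two results without blocking a thread.
    fileIOResult.zip(networkIOResult).map { case (actual, expected) => actual == expected }
  }

  def main(args: Array[String]): Unit = {
    // Block only at the edge of the program, and with a bounded wait time.
    Try(Await.result(checkFile("/data/file-0001"), 5.seconds)) match {
      case Success(consistent) => println(s"consistent: $consistent")
      case Failure(error)      => println(s"would retry: ${error.getMessage}")
    }
  }
}
```

In current Scala versions, `onComplete` or combinators such as `map` and `recover` take the place of the `onSuccess` and `onFailure` callbacks on the slide, which were deprecated in Scala 2.12.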
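
Slides 14, 20, and 21 describe agents that back off under load and adjust their number of parallel workers at runtime. The sketch below shows one possible shape for that, using a resizable JVM thread pool instead of Akka actors. The back-off formula in `targetParallelism` and the class `ThrottledWorkers` are assumptions made for illustration; in the talk the parallelism parameters come from a separate configuration service.

```scala
import java.lang.management.ManagementFactory
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

object ThrottleSketch {
  // Hypothetical policy: scale workers down linearly as the node gets busier,
  // but always keep at least one worker so the job still makes progress.
  def targetParallelism(loadAverage: Double, cores: Int, maxWorkers: Int): Int = {
    require(cores > 0 && maxWorkers > 0, "cores and maxWorkers must be positive")
    val busy = math.min(1.0, math.max(0.0, loadAverage / cores))
    math.max(1, math.round(maxWorkers * (1.0 - busy)).toInt)
  }

  // A worker pool whose size can be changed while tasks are running.
  final class ThrottledWorkers(initial: Int) {
    require(initial >= 1, "need at least one worker")
    private val pool = new ThreadPoolExecutor(
      initial, initial, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue[Runnable]())

    def parallelism: Int = pool.getCorePoolSize

    def setParallelism(n: Int): Unit = {
      require(n >= 1, "need at least one worker")
      // Order matters: the core size must never exceed the maximum size.
      if (n > pool.getMaximumPoolSize) {
        pool.setMaximumPoolSize(n)
        pool.setCorePoolSize(n)
      } else {
        pool.setCorePoolSize(n)
        pool.setMaximumPoolSize(n)
      }
    }

    def submit(task: () => Unit): Unit =
      pool.execute(new Runnable { def run(): Unit = task() })

    def shutdown(): Unit = {
      pool.shutdown()
      pool.awaitTermination(10, TimeUnit.SECONDS)
      ()
    }
  }

  def main(args: Array[String]): Unit = {
    val os = ManagementFactory.getOperatingSystemMXBean
    // getSystemLoadAverage returns a negative value when the platform cannot report it.
    val load = math.max(0.0, os.getSystemLoadAverage)
    val workers = new ThrottledWorkers(4)
    workers.setParallelism(targetParallelism(load, os.getAvailableProcessors, maxWorkers = 16))
    println(s"running with ${workers.parallelism} workers")
    workers.shutdown()
  }
}
```

Calling `setParallelism` from a periodic check lets a busy node slow down while an idle one speeds up, which matches the per-node behavior described on slide 21.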
