Building massive-scale, fault-tolerant job processing systems with the Scala Akka framework
Vignesh Sukumar
SVCC 2012
About me

• Storage group, Backend Engineering at Box
• Love enterprise software!
• Interested in Big Data and building distributed
  systems in the cloud
About Box

• Leader in enterprise cloud collaboration and
  storage
• Cutting-edge work in backend, frontend,
  platform and engineering services
• A really fun place to work – we have a long
  slide!
Talk outline
• Job processing requirements
• Traditional & new models for job processing

• Akka actors framework
• Achieving and controlling high IO throughput
• Fine-grained fault tolerance
Typical architecture in a cloud storage environment
Practical realities

• Storage nodes usually have varying configurations (OS, processing
  power, storage capacity, etc.), mainly because provisioning
  operations evolve rapidly
• Some nodes are more heavily loaded than others (e.g., those
  accepting live uploads)
• Billions of files; petabytes of data
Job processing requirements

• Iterate over all files (billions, petabyte scale):
  e.g., check the consistency of all files

• High throughput

• Fault tolerant

• Secure
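
As a rough illustration of the per-file work behind a consistency-check job, the sketch below streams a file and compares its digest with an expected value. The talk does not say how integrity is verified; the use of SHA-256 and the names `ConsistencyCheck`, `sha256Hex` and `isConsistent` are assumptions made for this example only.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

// Hypothetical helper; SHA-256 is an assumed choice of checksum.
object ConsistencyCheck {
  // Hex-encoded SHA-256 of a file, read in chunks so large files
  // are never loaded into memory in one piece.
  def sha256Hex(path: Path): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    val in = Files.newInputStream(path)
    try {
      val buffer = new Array[Byte](8192)
      var read = in.read(buffer)
      while (read != -1) {
        digest.update(buffer, 0, read)
        read = in.read(buffer)
      }
    } finally in.close()
    digest.digest().map(b => f"${b & 0xff}%02x").mkString
  }

  // True when the file content matches the checksum recorded elsewhere
  // (for example, in the metadata store).
  def isConsistent(path: Path, expectedSha256Hex: String): Boolean =
    sha256Hex(path).equalsIgnoreCase(expectedSha256Hex)
}
```

Because each file is checked on its own, a function like this can run for many files in parallel without any coordination between them.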
Traditional job processing model
Why traditional models fail in cloud storage environments
• Not scalable: petabyte scale, billions of files
• Insecure: requires moving files out of storage
  nodes
• No performance control: easy to overwhelm
  any storage node
• No fine-grained fault tolerance
Compute on Storage

• Move job computation directly to storage
  nodes
• Utilize abundant CPU on storage nodes
• Metadata store still stays in a highly available
  system like an RDBMS
• Results from operations on a file are
  completely independent of other files
Master – slave architecture
Benefits

• High IO throughput: Direct access; no transfer
  of files over a network
• Secure: files do not leave storage nodes
• Better performance control: compute can
  easily monitor system load and back off
• Better fault tolerance handling: finer grained
  handling of errors
Master node

• Responsible for accepting job submissions and
  splitting them into tasks for slave nodes
• Stateful: keeps a durable copy of jobs and tasks
  in ZooKeeper
• Horizontally scalable: service can be run on
  multiple nodes
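
One way the job-to-task split could look is sketched below. The talk does not describe the actual data model, so `Job`, `Task`, `JobSplitter` and the `batchSize` parameter are hypothetical; the sketch also assumes the master already knows which files live on which storage node.

```scala
// Hypothetical types for illustration; none of these names come from the talk.
final case class Job(id: String, fileIdsByNode: Map[String, Seq[Long]])
final case class Task(jobId: String, node: String, fileIds: Seq[Long])

object JobSplitter {
  // Split a job into tasks of at most `batchSize` files each, so that
  // every task only touches files stored on a single node.
  def split(job: Job, batchSize: Int): Seq[Task] = {
    require(batchSize > 0, "batchSize must be positive")
    for {
      (node, fileIds) <- job.fileIdsByNode.toSeq.sortBy(_._1) // stable node order
      batch           <- fileIds.grouped(batchSize)
    } yield Task(job.id, node, batch)
  }
}
```

Keeping each task local to one node fits the compute-on-storage idea: the agent on that node can process the whole task without reading files over the network.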
Agent

• Runs directly on the storage nodes in a
  machine-independent JVM container
• Stateless: no task state is maintained
• Monitors system load with back-off
• Reports results directly to master without
  synchronizing with other agents
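
The load monitoring with back-off could be approximated as follows. This is a minimal sketch, not the agent's real policy: the thresholds (0.5 and 1.0 load per core), the 30-second maximum pause and the name `LoadBackoff` are all assumptions; only `OperatingSystemMXBean` is a real JDK API.

```scala
import java.lang.management.ManagementFactory

// Hypothetical back-off policy; thresholds and names are illustrative only.
object LoadBackoff {
  // How long to pause before starting the next task, given the system
  // load average and the number of cores. No pause at or below 0.5 load
  // per core, the maximum pause at 1.0 or above, linear in between.
  def pauseMillis(loadAvg: Double, cores: Int, maxPauseMillis: Long = 30000L): Long = {
    val perCore = loadAvg / math.max(cores, 1)
    if (loadAvg < 0 || perCore <= 0.5) 0L // a negative value means "not available"
    else if (perCore >= 1.0) maxPauseMillis
    else ((perCore - 0.5) / 0.5 * maxPauseMillis).toLong
  }

  // Reads the current load from the JVM's operating system bean.
  def currentPauseMillis(): Long = {
    val os = ManagementFactory.getOperatingSystemMXBean
    pauseMillis(os.getSystemLoadAverage, os.getAvailableProcessors)
  }
}
```

Separating the pure policy (`pauseMillis`) from the measurement (`currentPauseMillis`) keeps the policy easy to unit test with made-up load values.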
Implementation with the Scala Akka Actor framework
Actors

• Abstraction for concurrency with no
  shared state
• Exchange messages
• Asynchronous, non-blocking
• Multiple actors can map to a single OS thread
• Parent-child hierarchical relationship
Actors and messages
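import akka.actor.{Actor, ActorSystem, Props}

case object MsgType1 // message type handled by the actor

class MyActor extends Actor {
  def receive = {
    case MsgType1 => // do something
  }
}

// instantiation and sending messages
val system = ActorSystem("example")
val actorRef = system.actorOf(Props(new MyActor))
actorRef ! MsgType1 // asynchronous, fire-and-forget send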
Agent Actor System
Achieving high IO throughput
• Parallel, asynchronous IO through “Futures”
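// Scala 2.10+ standard futures; Akka 2.0 provided akka.dispatch.{Await, Future}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global

val fileIOResult = Future {
  // issue high latency tasks like file IO
}
val networkIOResult = Future {
  // read from network
}

// wait up to <wait time> (a Duration) for the combined result
Await.ready(Future.sequence(Seq(fileIOResult, networkIOResult)), <wait time>)
fileIOResult.foreach { result => /* do something */ }    // runs on success
networkIOResult.failed.foreach { error => /* retry */ }  // runs on failure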
Controlling system throughput

• The problem: agents need to throttle
  themselves because storage nodes also serve live traffic

• Adjust number of parallel workers dynamically
  through a monitoring service
Controlling throughput: Examples

• Parallelism parameters can be fetched from a
  separate configuration service on a per-node
  basis
• Some machines can be sped up and others
  slowed down this way
• The configuration can be updated on a cron
  schedule to speed up during weekends
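
The idea of adjusting parallelism at runtime can be sketched with a small limiter whose ceiling is changed whenever new configuration arrives. This is an assumption-laden illustration rather than the agent's real mechanism: `AdjustableLimiter` and its methods are hypothetical names, and how the new limit is fetched from the configuration service is left out.

```scala
// Hypothetical limiter; the real agent may size an actor pool instead.
final class AdjustableLimiter(initialLimit: Int) {
  private var limit = initialLimit
  private var inFlight = 0

  // Called when a new per-node parallelism value is published.
  def setLimit(newLimit: Int): Unit = synchronized {
    require(newLimit >= 0, "limit must not be negative")
    limit = newLimit
  }

  // Returns true and reserves a slot if another worker may start now.
  def tryAcquire(): Boolean = synchronized {
    if (inFlight < limit) { inFlight += 1; true } else false
  }

  // Frees a slot when a worker finishes.
  def release(): Unit = synchronized {
    if (inFlight > 0) inFlight -= 1
  }

  // Number of workers currently holding a slot.
  def running: Int = synchronized(inFlight)
}
```

Lowering the limit does not interrupt workers that are already running; it only stops new ones from starting until enough of them have finished, which gives a gradual slow-down.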
Fine-grained fault tolerance with Supervisors

• Parents of child actors can define specific
  fault-handling strategies for each failure
  scenario in their children
• Components can fail gracefully without
  affecting the entire system
Supervision strategy: Examples


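import akka.actor.{Actor, OneForOneStrategy}
import akka.actor.SupervisorStrategy.{Restart, Resume, Stop}
import java.io.IOException
import java.sql.SQLException

class TaskActor extends Actor {
  // create child workers
  override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 3) {
    case _: SQLException            => Resume  // retry the same file
    case _: FileCorruptionException => Stop    // don’t clobber it!
    case _: IOException             => Restart // report and move on
  }
  // receive is omitted on the slide; FileCorruptionException is application-defined
}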
Unit testing

• Scalatra test framework: very easy to read!
  TaskActorTest.receive(BadFileMsg) must throwA[FileNotFoundException]
• Mocks for network and database calls
  val mockHttp = mock[HttpExecutor]
  TaskActorTest ! doHttpPost
  there was atLeastOne(mockHttp).POST


• Extensive testing of failure injection scenarios
Takeaways
• Keep your architecture simple by modeling
  actor message flow along the same paths as
  parent-child actor hierarchy (i.e., no message
  exchange between peer child actors)
• Design and implement for component failures
• Write unit tests extensively: we did not have
  any breakage of fundamental functionality
• Box Engineering is awesome!

Editor's Notes

  1. An example of a job is checking the consistency of all files: this involves iterating over every file on all storage nodes, reading each file, and verifying its content integrity.
  2. Not scalable: performance suffers because of the IO bottleneck in getting files to the application cluster. Insecure: application clusters can store the files locally. No performance control: it’s easy to melt a single storage node by reading or writing a lot to it. No fine-grained fault tolerance is possible.