KEY CONCEPTS FOR
SCALABLE STATEFUL
SERVICES
Nikolay Novik
https://github.com/jettify
PyConUA 2017
I AM ...
Software Engineer at DataRobot Ukraine
Github: https://github.com/jettify
Twitter: https://twitter.com/isinf
aio-libs: https://github.com/aio-libs
My Projects:
database clients: aiomysql, aioodbc, aiogibson
web and others: aiomonitor, aiohttp_debugtoolbar, aiobotocore, aiohttp_mako, aiohttp_admin, aiorwlock
POLL: HAVE YOU EVER READ THE DYNAMO PAPER?
1. I have read this paper.
2. I have heard about this paper and know the key ideas.
3. I think distributed systems are kinda cool.
AGENDA
1. Motivation: why and when we might want to use stateful services.
2. Industry examples: Uber, Halo 4, DragonAge, HPC
3. Problem statement, required components
4. Overview of consistent hashing, gossip dissemination and SWIM failure detection
5. Possible improvements
USE STATELESS (DUCT TAPE) WHEN YOU CAN!
Stateless protocols are a proven technique; use them like duct tape
ISSUES WITH STATELESS SERVICES
Soft real time is a requirement
State serialization
Wasteful data fetching
DB leaky transactions
STATELESS SERVICE EXAMPLE
Notice that user data is fetched several times and cached on multiple servers.
BENEFITS OF STATEFUL SERVICES
Data locality: logic is executed where the data is stored, with fast access
Lower latency: state is in memory, no need for extra network hops
Higher performance: no need to deserialize data
STATEFUL SERVICE EXAMPLE
Extra trips to the database are avoided, which reduces latency.
Even if the database is down, the request can still be handled.
INDUSTRY EXAMPLE: UBER
Geospatial index service that matches drivers and users
INDUSTRY EXAMPLE: HALO 4
Orleans is used as the backbone for the server part of the Halo game, including presence, statistics, cheat detection, etc.
INDUSTRY EXAMPLE: HPC
San Diego Supercomputer Center uses Serf to coordinate compute resources in multiple locations; the cluster size is about 2k nodes
LET'S TRY TO SOLVE A CLOSE TO REAL WORLD PROBLEM: PREDICTION SERVICE
A service that predicts resale prices of different products, based on the product specification
The user enters the specs of a used product and obtains a price estimate
Each product category
FUNCTIONAL REQUIREMENTS
Dynamic scaling
Fault tolerance
Exploit data locality
Flexible API
REQUIRED COMPONENTS
1. Work distribution and routing: move a job request to the appropriate node
2. Cluster membership update: provide means to determine the nodes participating in the cluster, both in stable conditions and while the cluster is resizing
3. Failure detector: periodically check nodes and remove unresponsive/dead ones
ROUTING. NAIVE SOLUTION WITH HARD CODED
CLUSTER NODES
Very easy to implement, a viable solution when dynamic resizing is not required
Does not support dynamic scaling in or scaling out
Requires a cluster restart to change the node configuration
ROUTING. CONSISTENT HASHING SOLUTION
This simple algorithm made Akamai a multi-billion-dollar company
CONSISTENT HASHING. BASIC IDEA
Consistent hashing minimizes the number of keys that need to be remapped
http://blog.carlosgaldino.com/consistent-hashing.html
CONSISTENT HASHING. ADDING NODE
When capacity is added, only a fraction of the keys will be moved
CONSISTENT HASHING. REMOVING NODE
In case of node failure, the next node on the ring will handle the related keys
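The ring behaviour on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk: the `HashRing` class, the node names and the choice of MD5 as a stable hash are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # MD5 is used only as a stable, well-distributed hash (assumption, not for security).
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical minimal consistent hash ring: one point per node."""

    def __init__(self, nodes=()):
        self._points = []   # sorted hash positions on the ring
        self._owners = {}   # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        point = _hash(node)
        bisect.insort(self._points, point)
        self._owners[point] = node

    def remove_node(self, node: str) -> None:
        point = _hash(node)
        self._points.remove(point)
        del self._owners[point]

    def get_node(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First node position clockwise from the key; wrap around at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owners[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"product-{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("node-d")                       # scale out
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

ring.remove_node("node-d")                    # node failure / scale in
restored = {k: ring.get_node(k) for k in keys}
print(f"{len(moved)} of {len(keys)} keys moved when node-d joined")
```

Every key that moves when `node-d` joins moves to `node-d` itself, and removing it hands those keys back to the next node on the ring; all other keys keep their owner.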
CONSISTENT HASHING. VIRTUAL NODES
Virtual nodes help with key distribution, moving each node's share close to 1/n
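A rough way to see the effect of virtual nodes is to place several points per physical node on the ring and measure each node's share of keys. The helper names (`build_ring`, `key_share`), the `node#i` naming of virtual nodes and the MD5 hash are assumptions for this sketch.

```python
import bisect
import hashlib
from collections import Counter


def _hash(value: str) -> int:
    # Stable hash for ring positions (assumption: MD5, not used for security).
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes, vnodes):
    """Return sorted (position, node) pairs with `vnodes` points per node."""
    return sorted((_hash(f"{node}#{i}"), node)
                  for node in nodes for i in range(vnodes))


def key_share(nodes, vnodes, n_keys=10_000):
    """Fraction of synthetic keys owned by each node."""
    ring = build_ring(nodes, vnodes)
    positions = [pos for pos, _ in ring]
    counts = Counter()
    for i in range(n_keys):
        idx = bisect.bisect_right(positions, _hash(f"key-{i}")) % len(ring)
        counts[ring[idx][1]] += 1
    return {node: counts[node] / n_keys for node in nodes}


nodes = ["node-a", "node-b", "node-c", "node-d"]
single = key_share(nodes, vnodes=1)
many = key_share(nodes, vnodes=200)
print("1 point per node:   ", single)
print("200 points per node:", many)
```

With one point per node the shares depend entirely on where the four hashes happen to land; with a few hundred virtual nodes each share should sit close to 1/n = 0.25.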
CLUSTER MEMBERSHIP PROBLEM
We have routing and job distribution; let's figure out how to add and remove nodes.
WHY NOT JUST USE ZOOKEEPER/CONSUL/ETCD (OR IN OTHER WORDS ZAB, PAXOS, RAFT)?
Issues
Availability
Performance
Network partitions
Operation overhead
TYPICAL SYSTEM WITH COORDINATION
Zookeeper forces its own view
Possible links: n(n−1)/2, but only n are used for FD
Node availability decisions are best made locally
CLUSTER MEMBERSHIP UPDATE PROBLEM. NAIVE
SOLUTION
Broadcast: could be used for cluster membership update
Use network broadcast (usually disabled)
Send the message one by one to each peer (not reliable)
GOSSIP PROTOCOL
Xerox invented gossip protocols: anti-entropy and rumor mongering.
GOSSIP OVERVIEW
Basic gossip protocol
Send message to k random peers
Peers retransmit the message to the next k random peers
In log(n) steps, the information will be disseminated
GOSSIP PROTOCOL VS PACKET LOSS
Heavy packet loss does not stop dissemination; it simply takes a bit longer, about 2 times longer for 50% loss.
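The push gossip described above is easy to simulate. The sketch below is a toy model under stated assumptions: rounds are synchronous, every informed node pushes to `k` uniformly random peers each round, and `loss` is an independent per-message drop probability. The function name `gossip_rounds` and all parameter values are made up for illustration.

```python
import random


def gossip_rounds(n, k, loss=0.0, seed=0, max_rounds=1000):
    """Rounds until all n nodes know a rumor that starts at node 0."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n and rounds < max_rounds:
        newly_informed = set()
        for _sender in informed:
            # Each informed node pushes the rumor to k random peers
            # (a sender may pick itself, which is harmless in this model).
            for peer in rng.sample(range(n), k):
                if rng.random() >= loss:      # message survived the network
                    newly_informed.add(peer)
        informed |= newly_informed
        rounds += 1
    return rounds


baseline = gossip_rounds(n=1000, k=3)
lossy = gossip_rounds(n=1000, k=3, loss=0.5)
print(f"no loss: {baseline} rounds, 50% loss: {lossy} rounds")
```

Because the informed set can at most quadruple per round with k = 3, full dissemination among 1000 nodes needs at least 5 rounds; the exact counts depend on the seed, so the comparison between the two runs is indicative rather than a measurement.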
FAILURE DETECTION PROTOCOL
We can route jobs and communicate cluster updates; the last component is the failure detector.
FAILURE DETECTORS FOR ASYNCHRONOUS SYSTEMS
In asynchronous distributed systems, the detection of crash failures is imperfect. There will be false positives and false negatives.
Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
FAILURE DETECTORS. PROPERTIES
Completeness - every crashed process is eventually
suspected
Accuracy - no correct process is ever suspected
Speed - how fast we can detect a faulty node
Network message load - number of messages required
during protocol period
BASIC FAILURE DETECTOR
Each process periodically sends out an incremented
heartbeat counter to the outside world.
Another process is detected as failed when a heartbeat is not
received from it for some time
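A heartbeat detector of this kind could look roughly like the sketch below. The class name, the `timeout` parameter and the injectable `clock` are assumptions for the example; network transport is left out entirely.

```python
import time


class HeartbeatFailureDetector:
    """Hypothetical heartbeat detector: a peer is failed after `timeout` of silence."""

    def __init__(self, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._last_seen = {}   # peer -> (heartbeat counter, local receive time)

    def on_heartbeat(self, peer, counter):
        known = self._last_seen.get(peer)
        # Ignore stale or duplicated heartbeats: the counter must increase.
        if known is None or counter > known[0]:
            self._last_seen[peer] = (counter, self._clock())

    def failed_peers(self):
        now = self._clock()
        return sorted(peer for peer, (_, seen) in self._last_seen.items()
                      if now - seen > self._timeout)


# Usage with a fake clock so the example is deterministic.
fake_now = [0.0]
detector = HeartbeatFailureDetector(timeout=3.0, clock=lambda: fake_now[0])
detector.on_heartbeat("node-a", 1)
detector.on_heartbeat("node-b", 1)
fake_now[0] = 2.0
detector.on_heartbeat("node-a", 2)   # node-a keeps beating, node-b goes silent
fake_now[0] = 4.0
suspected = detector.failed_peers()
print(suspected)
```

The timeout is the knob that trades detection speed against accuracy: a shorter timeout detects crashes sooner but turns more delayed heartbeats into false positives.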
BASIC FAILURE DETECTOR. PROPERTIES
Completeness: every crashed process eventually misses a heartbeat
Speed: configurable, as little as one protocol interval
Accuracy: high, depends on speed
Network message load: O(n^2), each node sends a message to all other nodes
SWIM FAILURE DETECTOR
SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol
SWIM FAILURE DETECTOR
On each protocol round, a node sends only ping messages
SWIM uses ping as the primary way to do FD, and indirect ping for better tolerance to network partitions
k = 3
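One probe of a SWIM-style detector can be sketched as a direct ping followed by indirect pings through up to k helpers. In this sketch `can_reach(src, dst)` is a hypothetical stand-in for a ping/ack round trip over the network, and the member names are made up; timeouts and the protocol period are omitted.

```python
import random


def probe(target, members, me, can_reach, k=3, rng=random):
    """One SWIM-style probe of `target`; returns True if it looks alive."""
    # 1. Direct ping.
    if can_reach(me, target):
        return True
    # 2. Indirect ping: ask up to k other members to ping the target for us.
    helpers = [m for m in members if m not in (me, target)]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        if can_reach(me, helper) and can_reach(helper, target):
            return True
    return False


members = ["a", "b", "c", "d", "e"]

# Case 1: only the link between a and b is broken; b itself is healthy.
broken_links = {("a", "b"), ("b", "a")}
link_failure = probe("b", members, "a",
                     can_reach=lambda src, dst: (src, dst) not in broken_links)

# Case 2: b has crashed, so nobody can reach it.
node_failure = probe("b", members, "a",
                     can_reach=lambda src, dst: dst != "b")

print(link_failure, node_failure)
```

The indirect path is what keeps a single bad link between two nodes from being reported as a node failure.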
SWIM FAILURE DETECTOR. PROPERTIES
Completeness: each process will eventually be pinged
Speed: configurable, 1 protocol interval
Accuracy: 99.9% with delivery probability 0.95 and k = 3
Network message load: O(n), (4k + 2)n messages
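To get a feel for the difference in message load, here is a worked comparison per protocol period. It assumes n = 2000 nodes (roughly the Serf cluster size mentioned earlier) and k = 3, and reads the (4k + 2) factor as the worst case per node: one ping and one ack, plus k ping requests, k forwarded pings, k acks and k forwarded acks.

```latex
\begin{align*}
\text{all-to-all heartbeats:}\quad & n(n-1) = 2000 \cdot 1999 = 3\,998\,000 \\
\text{SWIM (worst case):}\quad & (4k+2)\,n = (4 \cdot 3 + 2) \cdot 2000 = 28\,000
\end{align*}
```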
SWIM VS CONNECTION LOSS. SUSPICION
SUBPROTOCOL
Provides a mechanism to reduce the rate of false positives by
“suspecting” a process before “declaring” it as failed within
the group.
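The suspicion idea can be modelled as a small state machine: alive, suspect, dead. The sketch below is an assumption-laden illustration (class name, `suspicion_timeout` and the injectable clock are invented here), and it leaves out how suspicion is gossiped to other members.

```python
import time

ALIVE, SUSPECT, DEAD = "alive", "suspect", "dead"


class SuspicionTracker:
    """Hypothetical tracker: a failed probe suspects a peer before declaring it dead."""

    def __init__(self, suspicion_timeout, clock=time.monotonic):
        self._timeout = suspicion_timeout
        self._clock = clock
        self._state = {}          # peer -> ALIVE | SUSPECT | DEAD
        self._suspected_at = {}   # peer -> time the suspicion started

    def state(self, peer):
        return self._state.get(peer, ALIVE)

    def probe_failed(self, peer):
        # A failed probe only raises suspicion; it does not declare failure.
        if self.state(peer) == ALIVE:
            self._state[peer] = SUSPECT
            self._suspected_at[peer] = self._clock()

    def ack_received(self, peer):
        # Any sign of life before the timeout clears the suspicion.
        if self.state(peer) == SUSPECT:
            self._state[peer] = ALIVE
            self._suspected_at.pop(peer, None)

    def tick(self):
        # Promote suspicions that outlived the timeout to confirmed failures.
        now = self._clock()
        for peer, since in list(self._suspected_at.items()):
            if now - since >= self._timeout:
                self._state[peer] = DEAD
                del self._suspected_at[peer]


fake_now = [0.0]
tracker = SuspicionTracker(suspicion_timeout=5.0, clock=lambda: fake_now[0])
tracker.probe_failed("node-b")
tracker.probe_failed("node-c")
fake_now[0] = 3.0
tracker.ack_received("node-b")   # node-b was only slow
fake_now[0] = 6.0
tracker.tick()
print(tracker.state("node-b"), tracker.state("node-c"))
```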
SWIM VS PACKET ORDER
Ordering between messages is important, but total order is not required, only happens-before/causal ordering.
Logical timestamp for state updates
Peer specific and only incremented by the peer itself
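The per-peer logical timestamp (called an incarnation number in the SWIM paper) lets a node decide which of two conflicting updates about the same peer wins. The function below sketches the precedence rules as described in that paper; the tuple layout and names are assumptions, and real implementations may differ, for instance in how a node is allowed to rejoin.

```python
ALIVE, SUSPECT, DEAD = "alive", "suspect", "dead"


def overrides(new, old):
    """Return True if update `new` should replace `old` for the same peer.

    Each update is a (status, incarnation) tuple.
    """
    new_status, new_inc = new
    old_status, old_inc = old
    if old_status == DEAD:
        return False                 # a confirmed failure is final
    if new_status == DEAD:
        return True                  # confirmation overrides alive and suspect
    if new_status == SUSPECT and old_status == ALIVE:
        return new_inc >= old_inc    # suspicion wins ties against alive
    return new_inc > old_inc         # otherwise a strictly newer incarnation is needed


# A suspected peer refutes the suspicion by bumping its own incarnation number.
suspected = (SUSPECT, 1)
refutation = (ALIVE, 2)
print(overrides(refutation, suspected))
```

Since only the peer itself increments its number, a late-arriving "alive" message with an old incarnation can never undo a newer suspicion.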
SWIM VS NETWORK PARTITIONS
Nodes in each subnet can talk to each other and, as a result, declare peers on the other subnet dead.
How can we recover the cluster after the network heals?
Do not purge nodes once they are declared dead
Periodically try to rejoin
PROBLEM SOLVED! IMPLEMENTATION DETAILS
How can Python help with the implementation?
What frameworks to use?
OVERVIEW OF FRAMEWORKS FOR BUILDING
CLUSTER AWARE SYSTEMS
Name | Language | Developer | Description
??? | Python | ??? | ???
RingPop | node.js | Uber | Used as a service for matching users and drivers, with follow-up location updates
Serf | golang | HashiCorp | Used in a number of applications, for instance in HPC to manage computing resources
Orleans | .NET | Microsoft | General purpose framework, used in the Halo online game
Orbit/jGroups | Java | EA Games | Used in Bioware games, such as DragonAge, not sure where though. Inspired by Orleans
riak_core | Erlang | Basho | Building block for the Riak database and Erlang distributed systems
Akka | Scala | Lightbend | General purpose distributed systems framework, often used as a microservices platform
IMPROVEMENT: NETWORK COORDINATES
Famous paper from MIT that describes synthetic network coordinates based on ping delays; used in Serf/Consul for data center failover
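The core of the idea can be sketched as a spring relaxation: after each ping, a node nudges its own coordinate so that the distance to the peer better matches the measured round-trip time. This is the simple constant-timestep variant with an assumed `delta` of 0.25; the adaptive timestep and error weighting from the Vivaldi paper, and whatever Serf does on top, are not modelled.

```python
import math


def vivaldi_update(mine, theirs, rtt, delta=0.25):
    """Move `mine` so that its distance to `theirs` better matches `rtt`."""
    diff = [a - b for a, b in zip(mine, theirs)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist == 0.0:
        # Coincident points have no direction; pick an arbitrary unit vector.
        unit = [1.0] + [0.0] * (len(mine) - 1)
    else:
        unit = [d / dist for d in diff]
    error = rtt - dist   # > 0: too close in coordinate space, so push away
    return [a + delta * error * u for a, u in zip(mine, unit)]


# Two nodes that start 1 unit apart but measure a 10 ms round trip.
a, b = [0.0, 0.0], [1.0, 0.0]
for _ in range(50):
    a = vivaldi_update(a, b, rtt=10.0)
    b = vivaldi_update(b, a, rtt=10.0)
distance = math.dist(a, b)
print(round(distance, 6))
```

Each update shrinks the remaining error by a factor of (1 - delta), so the predicted distance between the two nodes converges to the measured RTT.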
IMPROVEMENT: NETWORK COORDINATES
VISUALIZATION
Notice the coordinates drifting in space while the distance between clusters stays stable
IMPROVEMENT: PARTIAL VIEW FOR HUGE
CLUSTERS
For huge clusters, full membership does not scale; the HyParView paper proposes a partial membership protocol
IMPROVEMENT: PARTIAL VIEW IN CASE OF NODE
FAILURES
Even for failure rates as high as 95%, HyParView still
manages to maintain a reliability value in the order of
deliveries to 90% of the active processes.
IMPROVEMENT: DHT FOR MORE BALANCING
Orleans uses a one-hop distributed hash table that maps actors to machines; as a result, actors can be moved across the cluster
STATEFUL SERVICES CHALLENGES
Work distribution
Code deployment
Unbounded data structures
Memory management
Persistent strategies
READ MORE PAPERS!
REFERENCES
1. Karger, David, et al. "Consistent hashing and random trees: Distributed caching protocols for
relieving hot spots on the World Wide Web." Proceedings of the twenty-ninth annual ACM
symposium on Theory of computing. ACM, 1997.
2. Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed
systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
3. Das, Abhinandan, Indranil Gupta, and Ashish Motivala. "Swim: Scalable weakly-consistent
infection-style process group membership protocol." Dependable Systems and Networks, 2002.
DSN 2002. Proceedings. International Conference on. IEEE, 2002.
4. Dabek, Frank, et al. "Vivaldi: A decentralized network coordinate system." ACM SIGCOMM
Computer Communication Review 34.4 (2004): 15-26.
5. Leitao, Joao, José Pereira, and Luis Rodrigues. "HyParView: A membership protocol for reliable
gossip-based broadcast." Dependable Systems and Networks, 2007. DSN'07. 37th Annual
IEEE/IFIP International Conference on. IEEE, 2007.
6. Stoica, Ion, et al. "Chord: A scalable peer-to-peer lookup service for internet applications."
ACM SIGCOMM Computer Communication Review 31.4 (2001): 149-160.
7. Bailis, Peter, and Kyle Kingsbury. "The network is reliable." Queue 12.7 (2014): 20.
8. Lamport, Leslie. "Time, clocks, and the ordering of events in a distributed system."
Communications of the ACM 21.7 (1978): 558-565.
THANK YOU!
aio-libs: https://github.com/aio-libs
slides: https://jettify.github.io/pyconua2017

KEY CONCEPTS FOR SCALABLE STATEFUL SERVICES

  • 1. KEY CONCEPTS FOR SCALABLE STATEFUL SERVICES Nikolay Novik https://github.com/jettify PyConUA 2017
  • 2. I AM ... Software Engineer: at DataRobot Ukraine Github: Twitter: aio-libs: My Projects: database clients: aiomysql, aioobc, aiogibson web and etc: aiomonitor, aiohttp_debugtoolbar, aiobotocore, aiohttp_mako, aiohttp_admin, aiorwlock https://github.com/jettify https://twitter.com/isinf https://github.com/aio-libs
  • 3. POLL: HAVE YOU EVER READ DYNAMO PAPER? 1. I read this papers. 2. I heard about this paper and know key ideas. 3. I think distributed systems is kinda cool.
  • 4. AGENDA 1. Motivation, why and when we might want to user stateful services. 2. Industry examples: Uber, Halo 4, DragonAge, HPC 3. Problem statement, required components 4. Overview of consistent hashing, gossip dissemination and swim failure detection 5. Possible improvements
  • 5. USE STATELESS (DUCK TAPE) WHEN YOU CAN! Stateless protocol is proved technique, use it like duck tape
  • 6. ISSUES WITH STATELESS SERVICES Soft real time is requirement State serialization Wasteful data fetching DB leaky transactions
  • 7. STATELESS SERVICE EXAMPLE Notice that user data fetched several times and cached on multiple servers.
  • 8. BENEFITS OF STATEFUL SERVICES Data locality, logic executed where data is stored with fast access Lower latency state in memory, no need extra network hops Higher performance no need to deserialize data
  • 9. STATEFUL SERVICE EXAMPLE Avoided are extra trips to the database which reduces latency. Even if the database is down the request can be handled.
  • 10. INDUSTRY EXAMPLE: UBER Geo spatial index service to match driver and user
  • 11. INDUSTRY EXAMPLE: HALO 4 Orleans used as backbone for server part of Halo game, including: presence, statistics, cheat detection, etc
  • 12. INDUSTRY EXAMPLE: HPC San Diego Supercomputer Center uses Serf to coordinate compute resources in multiple locations, cluster size is about 2k nodes
  • 13. LETS TRY TO SOLVE CLOSE TO REAL WORLD PROBLEM: PREDICTION SERVICE Services that predicts reselling prices of different products, based on product specification User enters used product specs, and obtains price estimate Each product category
  • 14. FUNCTIONAL REQUIREMENTS Dynamic scaling Fault tolerance Exploit data locality Flexible API
  • 15. REQUIRED COMPONENTS 1. Work distribution and routing move job request to appropriate node 2. Cluster membership update provide means to determine nodes participating in cluster in stable and cluster resizing conditions 3. Failure detector periodically check nodes and remove unresponsive/dead ones
  • 16. ROUTING. NAIVE SOLUTION WITH HARD CODED CLUSTER NODES Very easy to implement, viable solution when dynamic resizing is not required Does not support dynamic scaling in or scaling out Requires cluster restart for changing nodes configuration
  • 17. ROUTING. CONSISTENT HASHING SOLUTION This simple algorithms made Akamai multi billion worth company
  • 18. CONSISTENT HASHING. BASIC IDEA Consistent hashing minimizes number of keys, need to be remapped http://blog.carlosgaldino.com/consistent-hashing.html
  • 19. CONSISTENT HASHING. ADDING NODE In case of adding capacity, only fraction of keys will be moved
  • 20. CONSISTENT HASHING. REMOVING NODE In case of node failure next address will handle related keys
  • 21. CONSISTENT HASHING. VIRTUAL NODES Virtual nodes help with keys distribution, moving it close to 1/n
  • 22. CLUSTER MEMBERSHIP PROBLEM We have routing and job distribution, lets figure out how to add and remove nodes.
  • 23. WHY NOT JUST USE ZOOKEEPER/CONSUL/ECTD (OR IN OTHER WORDS ZAB, PAXOS, RAFT)? Issues Availability Performance Network partitions Operation overhead
  • 24. TYPICAL SYSTEM WITH COORDINATION Zookeeper forces own view Possible links: but for FD used only Nodes availability decision best when it is local n(n−1) 2 n
  • 25. CLUSTER MEMBERSHIP UPDATE PROBLEM. NAIVE SOLUTION Broadcast: could be used for cluster membership update Use network broadcast (usually disabled) Send message one by one to each peer(not reliable)
  • 26. Xerox invented gossip protocols: and . GOSSIP PROTOCOL anti-entropy rumor mongering
  • 27. GOSSIP OVERVIEW Basic gossip protocol Send message to k random peers peers retransmit message to next k random peers in steps, information will be disseminated log(n)
  • 28. GOSSIP PROTOCOL VS PACKET LOSS Heavy packet loss does not stop dissemination, it simply will take a bit longer, 2 times for 50% loss.
  • 29. FAILURE DETECTION PROTOCOL We can route jobs and communicate cluster update, last component is failure detector.
  • 30. Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267. FAILURE DETECTORS FOR ASYNCHRONOUS SYSTEMS In asynchronous distributed systems, the detection of crash failures is imperfect. There will be false positives and false negatives.
  • 31. FAILURE DETECTORS. PROPERTIES Completeness - every crashed process is eventually suspected Accuracy - no correct process is ever suspected Speed - how fast we can detect fault node Network message load - number of messages required during protocol period
  • 32. BASIC FAILURE DETECTOR Each process periodically sends out an incremented heartbeat counter to the outside world. Another process is detected as failed when a heartbeat is not received from it for some time
  • 33. BASIC FAILURE DETECTOR. PROPERTIES Completeness each process eventually miss heartbeat Speed configurable, as little as protocol interval Accuracy high, depends on speed Network message load each node sends message to all other nodes O( )n 2
  • 34. SWIM FAILURE DETECTOR SWIM: Scalable Weakly-consistent Infection-style Process Group Membership. Protocol
  • 35. SWIM FAILURE DETECTOR On each protocol round, node sends only pings messages SWIM uses ping as primary way to do FD, and indirect ping for better tolerance to network partitions k = 3
  • 36. SWIM FAILURE DETECTOR. PROPERTIES Completeness each process eventually will be pinged Speed configurable, 1 protocol interval Accuracy 99.9 % with delivery probability 0.95 and k=3 Network message load. ( )O(n) 4k + 2)n
  • 37. SWIM VS CONNECTION LOSS. SUSPICION SUBPROTOCOL Provides a mechanism to reduce the rate of false positives by “suspecting” a process before “declaring” it as failed within the group.
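The suspicion step can be pictured as a small timer-based state tracker: a peer that fails a probe is first marked as suspected, and only declared dead if nobody refutes the suspicion before a timeout. The class `SuspicionTracker` and its method names are assumptions for illustration, and time is passed in explicitly to keep the sketch deterministic.

```python
class SuspicionTracker:
    """Suspect first, declare dead only after an unrefuted timeout."""

    def __init__(self, suspicion_timeout=5.0):
        self.suspicion_timeout = suspicion_timeout
        self.suspected_at = {}  # peer -> time the suspicion started
        self.dead = set()

    def suspect(self, peer, now):
        # Keep the earliest suspicion time; dead peers stay dead here.
        if peer not in self.dead:
            self.suspected_at.setdefault(peer, now)

    def refute(self, peer):
        # Called when the peer (or anyone else) proves it is still alive.
        self.suspected_at.pop(peer, None)

    def expire(self, now):
        """Promote suspicions older than the timeout to dead."""
        for peer, since in list(self.suspected_at.items()):
            if now - since >= self.suspicion_timeout:
                del self.suspected_at[peer]
                self.dead.add(peer)
        return set(self.dead)


if __name__ == "__main__":
    tracker = SuspicionTracker(suspicion_timeout=5.0)
    tracker.suspect("b", now=0.0)
    tracker.suspect("c", now=1.0)
    tracker.refute("c")  # c answered in time
    print(tracker.expire(now=5.0))
```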
  • 38. SWIM VS PACKET ORDER Ordering between messages is important, but total order is not required, only happens-before (causal) ordering. A logical timestamp is attached to state updates; it is peer-specific and only incremented by that peer.
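With such a per-peer logical timestamp (called an incarnation number in the SWIM paper), a node can decide which of two conflicting updates about the same peer wins, regardless of arrival order. The sketch below follows the precedence rules of the SWIM paper as best understood here; the function name `overrides` and the tuple layout are assumptions, and real implementations may differ (for example by letting a rejoining node override a dead entry).

```python
RANK = {"alive": 0, "suspect": 1, "dead": 2}


def overrides(new, old):
    """Decide whether update `new` replaces `old` for the same peer.

    Both arguments are (status, incarnation) tuples.
    """
    new_status, new_inc = new
    old_status, old_inc = old
    if old_status == "dead":
        return False  # in this sketch nothing overrides a confirmed death
    if new_status == "dead":
        return True  # a death confirmation beats alive and suspect
    if new_inc != old_inc:
        return new_inc > old_inc  # a newer incarnation always wins
    # Same incarnation: suspect beats alive, equal statuses change nothing.
    return RANK[new_status] > RANK[old_status]


if __name__ == "__main__":
    # A peer refutes a suspicion by bumping its own incarnation number.
    print(overrides(("alive", 2), ("suspect", 1)))
    # A stale "alive" with the same incarnation cannot clear a suspicion.
    print(overrides(("alive", 1), ("suspect", 1)))
```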
  • 39. SWIM VS NETWORK PARTITIONS Nodes in each subnet can talk to each other and, as a result, declare peers on the other subnet dead. How can we recover the cluster after the network heals? Do not purge nodes when they are declared dead. Periodically try to rejoin.
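The two recovery rules above (keep dead nodes in the table, retry them periodically) can be sketched as follows. `MemberList`, `rejoin_round` and the `try_contact` callback are hypothetical names; in a real system the callback would be a network join or ping attempt run on a timer.

```python
class MemberList:
    """Membership table that remembers dead peers so they can rejoin."""

    def __init__(self):
        self.status = {}  # peer -> "alive" or "dead"

    def mark_alive(self, peer):
        self.status[peer] = "alive"

    def mark_dead(self, peer):
        # Keep the entry instead of deleting it: after a partition heals
        # this is the only record of where the other side lives.
        self.status[peer] = "dead"

    def rejoin_round(self, try_contact):
        """Try every dead peer once; return the peers that came back."""
        recovered = [
            peer
            for peer, status in self.status.items()
            if status == "dead" and try_contact(peer)
        ]
        for peer in recovered:
            self.mark_alive(peer)
        return recovered


if __name__ == "__main__":
    members = MemberList()
    for peer in ("a", "b", "c"):
        members.mark_alive(peer)
    members.mark_dead("b")
    members.mark_dead("c")
    # Pretend the partition healed only for peer "b".
    print(members.rejoin_round(lambda peer: peer == "b"))
```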
  • 40. PROBLEM SOLVED! IMPLEMENTATION DETAILS How can Python help with the implementation? What frameworks should we use?
  • 41. OVERVIEW OF FRAMEWORKS FOR BUILDING CLUSTER AWARE SYSTEMS
    Name | Language | Developer | Description
    ??? | Python | ??? | ???
    RingPop | node.js | Uber | Used as a service for matching users and drivers, with follow-up location updates
    Serf | Go | HashiCorp | Used in a number of applications, for instance in HPC to manage computing resources
    Orleans | .NET | Microsoft | General-purpose framework, used in the Halo online game
    Orbit/jGroups | Java | EA Games | Used in Bioware games such as DragonAge, not sure where though. Inspired by Orleans
    riak_core | Erlang | Basho | Building block for the Riak database and Erlang distributed systems
    Akka | Scala | Lightbend | General-purpose distributed systems framework, often used as a microservices platform
  • 42. IMPROVEMENT: NETWORK COORDINATES The well-known Vivaldi paper from MIT describes synthetic network coordinates derived from ping delays; they are used in Serf/Consul for data center failover.
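The core idea of such coordinates is a spring-like update: after measuring the round-trip time to a peer, a node moves its own coordinate so that the coordinate distance gets closer to the measured delay. The sketch below shows only the simple constant-timestep form of that update; the function name `vivaldi_update`, the fixed `delta` and the fallback direction for coincident points are simplifying assumptions (the full algorithm adapts the timestep using error estimates and picks a random direction).

```python
import math


def vivaldi_update(xi, xj, rtt, delta=0.25):
    """Move coordinate xi so its distance to xj better matches rtt."""
    diff = [a - b for a, b in zip(xi, xj)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist == 0:
        # Coincident points: pick an arbitrary axis (assumption of this
        # sketch, a random unit vector would be the usual choice).
        direction = [1.0] + [0.0] * (len(xi) - 1)
    else:
        direction = [d / dist for d in diff]
    # Positive error: nodes are too close in coordinate space, push away.
    # Negative error: nodes are too far apart, pull closer.
    error = rtt - dist
    return [a + delta * error * u for a, u in zip(xi, direction)]


if __name__ == "__main__":
    me, peer = [0.0, 0.0], [3.0, 4.0]  # coordinate distance is 5
    me = vivaldi_update(me, peer, rtt=10.0)
    print(me)  # moved away from the peer, the distance is now 6.25
```

Repeating this update against many peers lets the coordinates settle into a layout where distances approximate network latency.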
  • 43. IMPROVEMENT: NETWORK COORDINATES VISUALIZATION Notice that the coordinates drift in space while the distance between clusters stays stable.
  • 44. IMPROVEMENT: PARTIAL VIEW FOR HUGE CLUSTERS For huge clusters full membership is not scalable; the HyParView paper proposes a partial membership protocol.
  • 45. IMPROVEMENT: PARTIAL VIEW IN CASE OF NODE FAILURES Even for failure rates as high as 95%, HyParView still manages to maintain a reliability value in the order of deliveries to 90% of the active processes.
  • 46. IMPROVEMENT: DHT FOR MORE BALANCING Orleans uses a one-hop distributed hash table that maps actors to machines; as a result, actors can be moved across the cluster.
  • 47. STATEFUL SERVICES CHALLENGES Work distribution; code deployment; unbounded data structures; memory management; persistence strategies.
  • 49. REFERENCES
    1. Karger, David, et al. "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web." Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. ACM, 1997.
    2. Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
    3. Das, Abhinandan, Indranil Gupta, and Ashish Motivala. "SWIM: Scalable weakly-consistent infection-style process group membership protocol." Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on. IEEE, 2002.
    4. Dabek, Frank, et al. "Vivaldi: A decentralized network coordinate system." ACM SIGCOMM Computer Communication Review 34.4 (2004): 15-26.
    5. Leitao, Joao, José Pereira, and Luis Rodrigues. "HyParView: A membership protocol for reliable gossip-based broadcast." Dependable Systems and Networks, 2007. DSN'07. 37th Annual IEEE/IFIP International Conference on. IEEE, 2007.
    6. Stoica, Ion, et al. "Chord: A scalable peer-to-peer lookup service for internet applications." ACM SIGCOMM Computer Communication Review 31.4 (2001): 149-160.
    7. Bailis, Peter, and Kyle Kingsbury. "The network is reliable." Queue 12.7 (2014): 20.
    8. Lamport, Leslie. "Time, clocks, and the ordering of events in a distributed system." Communications of the ACM 21.7 (1978): 558-565.
  • 50. THANK YOU! aio-libs: https://github.com/aio-libs slides: https://jettify.github.io/pyconua2017