Call me maybe: Jepsen and flaky networks
Shalin Shekhar Mangar
@shalinmangar
Lucidworks Inc.
Typical first year for a new cluster
Source: Jeff Dean, Google, LADIS 2009
• ~5 racks out of 30 go wonky (50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
• ~3 router failures (have to immediately pull traffic for an hour)
Reliable networks are a myth
• GC pause
• Process crash
• Scheduling delays
• Network maintenance
• Faulty equipment
Network
[Figure: five nodes, n1 to n5]
Network partition
[Figure: the same five nodes, n1 to n5, under a partition]
Messages can be lost, delayed, reordered and duplicated
[Figure: message timelines between n1 and n2 showing Drop, Delay, Duplicate and Reorder]
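The four failure modes on this slide (drop, delay, duplicate, reorder) can be imitated in a few lines. The following is only a toy sketch: `flaky_channel` and its parameters are hypothetical names chosen for illustration, and the probabilities are arbitrary.

```python
import random

def flaky_channel(messages, rng, p_drop=0.2, p_dup=0.2, max_delay=3):
    """Deliver messages through a simulated unreliable link (toy model).

    Each message may be dropped, duplicated, or delayed by a random number
    of ticks; unequal delays are what produce reordering.
    """
    in_flight = []  # entries are (arrival_tick, tie_breaker, message)
    seq = 0
    for send_tick, msg in enumerate(messages):
        if rng.random() < p_drop:
            continue  # Drop: the message never arrives
        copies = 2 if rng.random() < p_dup else 1  # Duplicate
        for _ in range(copies):
            delay = rng.randint(0, max_delay)  # Delay
            in_flight.append((send_tick + delay, seq, msg))
            seq += 1
    in_flight.sort()  # Reorder: receiver sees arrival order, not send order
    return [msg for _, _, msg in in_flight]
```

Calling `flaky_channel(list(range(10)), random.Random(42))` returns what a receiver might see; a receiver that assumes exactly-once, in-order delivery will misbehave on such input.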
CAP recap
• Consistency (Linearizability): A total order on all operations such that each operation looks as if it were completed at a single instant.
• Availability: Every request received by a non-failing node in the system must result in a response.
• Partition Tolerance: Arbitrarily many messages between two nodes may be lost. Mandatory unless you can guarantee that partitions don’t happen at all.
Have you planned for these?
[Figure: Availability and Consistency, each marked with an X]
• Errors
• Connection timeouts
• Hung requests (read timeouts)
• Stale results
• Dirty results
• Data lost forever!
During and after a partition
Jepsen: Testing systems under stress
• Network partitions
• Random process crashes
• Slow networks
• Clock skew
http://github.com/aphyr/jepsen
Anatomy of a Jepsen test
• Automated DB setup
• Test definitions a.k.a. Client
• Partition types a.k.a. Nemesis
• Scheduler of operations (client & nemesis)
• History of operations
• Consistency checker
Data store specific (Mongo/Solr/Elastic)
Provided by Jepsen
[Figure: clients c1 to c3 talk to datastore nodes n1 to n3; results (OK, X, ?) are recorded in a history]
nem.e.sis: the inescapable agent of someone’s downfall
Nemesis
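[Figure: nodes n1 to n5 under partition-random-node]
[Figure: nodes n1 to n5 under kill-random-node and clock-scrambler]
Nemesis
[Figure: partition-halves over nodes n1 to n5]
[Figure: partition-random-halves, with n1, n4, n5 on one side and n2, n3 on the other]
[Figure: bridge, with n1, n2 on one side, n4, n5 on the other, and n3 connected to both]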
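The partition shapes named on these slides can be described as groupings of nodes. The sketch below is written in Python for illustration; the function names mirror the slide labels but are hypothetical, and splitting at the midpoint is an assumption rather than a statement about how Jepsen itself implements them.

```python
import random

NODES = ["n1", "n2", "n3", "n4", "n5"]

def partition_halves(nodes):
    """Split at the midpoint; with five nodes the first side is the minority."""
    mid = len(nodes) // 2
    return [nodes[:mid], nodes[mid:]]

def partition_random_halves(nodes, rng):
    """Shuffle first, then split, so each run isolates different nodes."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    return partition_halves(shuffled)

def partition_random_node(nodes, rng):
    """Isolate a single randomly chosen node from everyone else."""
    victim = rng.choice(nodes)
    return [[victim], [n for n in nodes if n != victim]]

def bridge(nodes):
    """Two sides that cannot see each other, plus a middle node that sees both."""
    mid = len(nodes) // 2
    middle = nodes[mid]
    return [nodes[:mid] + [middle], [middle] + nodes[mid + 1:]]

def can_talk(a, b, groups):
    """Two nodes can communicate if some group contains both of them."""
    return any(a in g and b in g for g in groups)
```

With `groups = bridge(NODES)`, `can_talk("n1", "n3", groups)` and `can_talk("n3", "n5", groups)` hold while `can_talk("n1", "n5", groups)` does not, which is the situation the bridge figure depicts.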
A set of integers: cas-set-client
• S = {1, 2, 3, 4, 5, …}
• Stored as a single document containing all the integers
• Update using compare-and-set
• Multiple clients try to update concurrently
• Create and restore partitions
• Finally, read the set of integers and verify consistency
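A minimal sketch of the client loop described above, assuming an in-memory stand-in for the data store. `VersionedDoc` and `add_integer` are hypothetical names; a real client would issue the read and the conditional update over the network against Solr, Elastic or MongoDB.

```python
import threading

class VersionedDoc:
    """In-memory stand-in for one document with optimistic concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.values = frozenset()

    def read(self):
        with self._lock:
            return self.version, self.values

    def cas(self, expected_version, new_values):
        """Replace the document only if nobody updated it since we read it."""
        with self._lock:
            if self.version != expected_version:
                return False  # someone else won the race
            self.version += 1
            self.values = frozenset(new_values)
            return True

def add_integer(doc, n, max_retries=10):
    """Read the whole set, add n, and write it back with compare-and-set."""
    for _ in range(max_retries):
        version, values = doc.read()
        if doc.cas(version, values | {n}):
            return True   # acknowledged
    return False          # gave up after repeated conflicts
```

Because the whole set lives in one document, every concurrent writer contends on the same version, which is what makes lost updates easy to detect at the end of the test.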
Compare and Set client
[Figure: timeline of Client 1 and Client 2 from t=0 to t=x]
cas({}, 1) -> {1}
cas(1, 2) -> {1, 2}
cas(1, 3) -> X
cas(2, 4) -> X
cas(2, 5) -> {1, 2, 5}
History = [(t, op, result)]
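Given a history in the `(t, op, result)` shape, a final-read check for the integer set could look like the sketch below. The result labels (`"ok"`, `"fail"`, `"timeout"`) and the `check_set` name are assumptions made for this illustration. It checks only that the final set agrees with the acknowledgements, not linearizability.

```python
def check_set(history, final_read):
    """Compare acknowledged adds in the history with the final read."""
    attempted, acked = set(), set()
    for _t, (kind, value), result in history:
        if kind != "add":
            continue
        attempted.add(value)
        if result == "ok":
            acked.add(value)
    final = set(final_read)
    lost = acked - final                    # acknowledged but missing
    unexpected = final - attempted          # present but never attempted
    recovered = (final & attempted) - acked # present without an acknowledgement
    return {
        "valid": not lost and not unexpected,
        "lost": lost,
        "recovered": recovered,
        "unexpected": unexpected,
    }

history = [
    (0, ("add", 1), "ok"),
    (1, ("add", 2), "ok"),
    (2, ("add", 3), "fail"),
    (3, ("add", 4), "timeout"),  # indeterminate: may or may not have applied
    (4, ("add", 5), "ok"),
]
report = check_set(history, final_read={1, 2, 4})
```

Here the acknowledged write of 5 is missing from the final read, so it is reported as lost, while 4 timed out but still shows up, so it is reported as recovered.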
Solr
• Search server built on Lucene
• Lucene index + transaction log
• Optimistic concurrency, linearizable CAS ops
• Synchronous replication to all ‘live’ nodes
• ZooKeeper for ‘consensus’
• http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
Add an integer every second and partition the network every 30 seconds, for 200 seconds
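One way to read this schedule is as a merged timeline of client and nemesis events. The sketch below assumes the nemesis alternates between starting and healing a partition at each 30-second mark and that the network is healed before the final read; both are assumptions, and `build_schedule` is a hypothetical helper.

```python
def build_schedule(duration_s=200, add_every_s=1, nemesis_every_s=30):
    """Return (time, actor, action) events sorted by time."""
    events = []
    # Client: add one new integer per interval.
    for t in range(0, duration_s, add_every_s):
        events.append((t, "client", ("add", t // add_every_s)))
    # Nemesis: alternate between starting and stopping a partition.
    partitioned = False
    for t in range(nemesis_every_s, duration_s, nemesis_every_s):
        partitioned = not partitioned
        events.append((t, "nemesis", "start" if partitioned else "stop"))
    # At equal timestamps, let the nemesis act before the client.
    events.sort(key=lambda e: (e[0], e[1] != "nemesis"))
    # Always heal before the final read, even if already healed.
    events.append((duration_s, "nemesis", "stop"))
    return events
```

A test runner would walk this list in order, dispatch each event to the client or the nemesis, and record every client result in the history.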
Solr - Are we safe?
• Leaders become unavailable for up to the ZK session timeout, typically 30 seconds (expected)
• Some writes ‘hang’ for a long time on partition. Timeouts are essential. (unexpected)
• Final reads under CAS are consistent but we haven’t proved linearizability (good!)
• Loss of availability for writes in minority partition. (expected)
• No data loss (yet!) which is great!
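The point about hung writes can be illustrated with a client-side deadline. This is a generic Python sketch, not Solr client code: `call_with_deadline` is a hypothetical helper, and the "hung" request is simulated with a thread that blocks until released.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_with_deadline(pool, operation, timeout_s):
    """Run operation in the pool and stop waiting after timeout_s seconds.

    A timeout is indeterminate: the request may still complete on the
    server, so it must not be recorded as a definite failure.
    """
    future = pool.submit(operation)
    try:
        return ("ok", future.result(timeout=timeout_s))
    except FutureTimeout:
        return ("timeout", None)
    except Exception as exc:  # the operation itself raised
        return ("fail", exc)

def demo():
    release = threading.Event()

    def hung_write():
        release.wait()  # stands in for a request stuck behind a partition
        return "late"

    with ThreadPoolExecutor(max_workers=2) as pool:
        fast = call_with_deadline(pool, lambda: "ack", timeout_s=1.0)
        hung = call_with_deadline(pool, hung_write, timeout_s=0.1)
        release.set()  # let the stuck worker finish so the pool can shut down
    return fast, hung
```

Without the deadline, the second call would block the client for as long as the partition lasts.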
Solr - Bugs, bugs & bugs
• SOLR-6530: Commits under network partition can put any node into ‘down’ state.
• SOLR-6583: Resuming connection with ZK causes log replay
• SOLR-6511: Requests threads hang under network partition
• SOLR-7636: A flaky cluster status API - times out during partitions
• SOLR-7109: Indexing threads stuck under network partition can mark leader as down
Elastic
• Search server built on Lucene
• It has a Lucene index and a transaction log
• Consistent single doc reads, writes & updates
• Eventually consistent search but a flush/commit should ensure that changes are visible
Elastic
• Optimistic concurrency control, a.k.a. CAS linearizability
• Synchronous acknowledgement from a majority of nodes
• “Instantaneous” promotion under a partition
• Homegrown ‘ZenDisco’ consensus
Elastic - Are we safe?
• “Instantaneous” promotion is not. 90-second timeouts to elect a new primary (worse in <1.5.0)
• Bridge partition: 645/1961 writes acknowledged and lost in 1.1.0. Better in 1.5.0, only 22/897 lost.
• Isolated primaries: 209/947 updates lost
• Repeated pauses (simulating GC): 200/2143 updates lost
• Getting better but not quite there. Good documentation on resiliency problems.
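The lost/total counts above are easier to compare as percentages. The short calculation below uses only the figures quoted on this slide; the dictionary keys are descriptive labels added for readability.

```python
# (lost, total) pairs as quoted on the slide
RESULTS = {
    "bridge partition, 1.1.0": (645, 1961),
    "bridge partition, 1.5.0": (22, 897),
    "isolated primaries": (209, 947),
    "repeated pauses (simulating GC)": (200, 2143),
}

def loss_percent(lost, total):
    """Share of acknowledged operations that were lost, in percent."""
    return 100.0 * lost / total

for scenario, (lost, total) in RESULTS.items():
    print(f"{scenario}: {loss_percent(lost, total):.1f}% lost")
```

The bridge partition improves from roughly a third of acknowledged writes lost in 1.1.0 to a few percent in 1.5.0, which is the "getting better but not quite there" trend the slide describes.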
MongoDB
• Document-oriented database
• Replica set has a single primary which accepts writes
• Primary asynchronously replicates writes to secondaries
• Replicas decide between themselves to promote/demote primaries
• Applies to 2.4.3 and 2.6.7
MongoDB
• Claims atomic writes per document and consistent reads
• But strict consistency only when reading from primaries
• Eventual consistency when reading from secondaries
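The difference between reading from a primary and reading from a secondary under asynchronous replication can be shown with a toy model. `ToyReplicaSet` is a hypothetical illustration only; it is not MongoDB's replication protocol and ignores elections, write concerns and failures.

```python
class ToyReplicaSet:
    """One primary, one secondary, replication applied only on demand."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self._pending = []  # writes accepted by the primary, not yet replicated

    def write(self, key, value):
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self):
        """Apply the pending writes to the secondary, in order."""
        for key, value in self._pending:
            self.secondary[key] = value
        self._pending.clear()

    def read(self, key, from_secondary=False):
        source = self.secondary if from_secondary else self.primary
        return source.get(key)
```

If a write arrives after the last `replicate()` call, `read(key)` returns the new value while `read(key, from_secondary=True)` still returns the old one until replication catches up.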
MongoDB - Are we safe?
Source: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads
MongoDB - Are we really safe?
• Inconsistent reads are possible even with majority write concern
• Read-uncommitted isolation
• A minority partition will allow both stale reads and dirty reads
Conclusion
• Network communication is flaky! Plan for it.
• Hackernews driven development (HDD) is not a good way of choosing data stores!
• Test the guarantees of your data stores.
• Help me find more Solr bugs!
References
• Kyle Kingsbury’s posts on Jepsen: https://aphyr.com/tags/jepsen
• Solr & Jepsen: http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
• Jepsen on github: github.com/aphyr/jepsen
• Solr fork of Jepsen: https://github.com/LucidWorks/jepsen
Solr/Lucene Meetup on 25th July 2015
Venue: Target Corporation, Manyata Embassy Business Park
Time: 9:30am to 1pm
Talks:
Crux of eCommerce Search and Relevancy
Creating Search Analytics Dashboards
Signup at http://meetu.ps/2KnJHM
Thank you
shalin@apache.org
@shalinmangar

Call me maybe: Jepsen and flaky networks

  • 2. Call me maybe: Jepsen and flaky networks Shalin Shekhar Mangar @shalinmangar Lucidworks Inc.
  • 3. Typical first year for a new cluster — Jeff Dean, Google • ~5 racks out of 30 go wonky (50% packetloss) • ~8 network maintenances (4 might cause ~30-minute random connectivity losses) • ~3 router failures (have to immediately pull traffic for an hour) LADIS 2009
  • 4. Reliable networks are a myth • GC pause • Process crash • Scheduling delays • Network maintenance • Faulty equipment
  • 7. Messages can be lost, delayed, reordered and duplicated (diagram: message timelines between n1 and n2 illustrating Drop, Delay, Duplicate and Reorder)
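The four failure modes on slide 7 can be sketched as a toy in-process channel. This is an illustration only, not part of Jepsen: `flaky_channel` and its parameters (`p_drop`, `p_dup`, `max_delay`) are hypothetical names invented for this sketch.

```python
import random

def flaky_channel(messages, rng, p_drop=0.2, p_dup=0.2, max_delay=3):
    """Return the messages in the order a receiver might see them on an unreliable link."""
    in_flight = []
    for send_time, msg in enumerate(messages):
        if rng.random() < p_drop:
            continue  # Drop: the message never arrives.
        copies = 2 if rng.random() < p_dup else 1  # Duplicate: it arrives twice.
        for _ in range(copies):
            delay = rng.randint(0, max_delay)  # Delay: random extra latency.
            in_flight.append((send_time + delay, msg))
    # Reorder: delivery follows arrival time, not send order.
    # list.sort() is stable, so messages arriving at the same time keep send order.
    in_flight.sort(key=lambda pair: pair[0])
    return [msg for _, msg in in_flight]

sent = list(range(10))
received = flaky_channel(sent, random.Random(42))
print("sent:    ", sent)
print("received:", received)
```

With all probabilities and delays set to zero the channel behaves like a reliable link, which makes it easy to compare a client's behaviour with and without faults.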
  • 8. CAP recap • Consistency (Linearizability): A total order on all operations such that each operation looks as if it were completed at a single instant. • Availability: Every request received by a non-failing node in the system must result in a response. • Partition Tolerance: Arbitrarily many messages between two nodes may be lost. Mandatory unless you can guarantee that partitions don’t happen at all.
  • 9. Have you planned for these? Availability Consistency X X • Errors • Connection timeouts • Hung requests (read timeouts) • Stale results • Dirty results • Data lost forever! During and after a partition
  • 10. Jepsen: Testing systems under stress • Network partitions • Random process crashes • Slow networks • Clock skew http://github.com/aphyr/jepsen
  • 11. Anatomy of a Jepsen test • Automated DB setup • Test definitions a.k.a Client • Partition types a.k.a Nemesis • Scheduler of operations (client & nemesis) • History of operations • Consistency checker Data store specific (Mongo/Solr/Elastic) Provided by Jepsen
  • 16. A set of integers: cas-set-client • S = {1, 2, 3, 4, 5, …} • Stored as a single document containing all the integers • Update using compare-and-set • Multiple clients try to update concurrently • Create and restore partitions • Finally, read the set of integers and verify consistency
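The cas-set-client steps on slide 16 can be sketched against an in-memory stand-in for the data store. Everything here is an assumption made for illustration: `VersionedDoc`, `add_integer` and the version-number check are hypothetical and are not the Solr, Elastic or MongoDB client used in the talk, and no partitions are injected.

```python
import threading

class VersionedDoc:
    """In-memory stand-in for a single document with optimistic concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._values = frozenset()

    def read(self):
        with self._lock:
            return self._version, self._values

    def cas(self, expected_version, new_values):
        """Write only if nobody else has written since expected_version."""
        with self._lock:
            if self._version != expected_version:
                return False
            self._version += 1
            self._values = frozenset(new_values)
            return True

def add_integer(doc, n, max_attempts=1000):
    """Read-modify-write loop: retry the CAS until it succeeds or we give up."""
    for _ in range(max_attempts):
        version, values = doc.read()
        if doc.cas(version, values | {n}):
            return True
    return False

doc = VersionedDoc()
acknowledged = []
ack_lock = threading.Lock()

def client(numbers):
    for n in numbers:
        if add_integer(doc, n):
            with ack_lock:
                acknowledged.append(n)

# Four concurrent clients, each adding a disjoint slice of 0..99.
threads = [threading.Thread(target=client, args=(range(i, 100, 4),)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Finally, read the set and verify that every acknowledged add is present.
_, final = doc.read()
print(len(final), set(acknowledged) <= set(final))
```

Against a real data store the interesting part is what happens to this final check once a nemesis partitions the network between the read and the CAS.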
  • 17. Compare and Set client (timeline diagram for Client 1 and Client 2 over t=0, t=1, …, t=x): cas({}, 1) → {1}; cas(1, 2) → {1, 2}; cas(1, 3) marked X; cas(2, 4) marked X; cas(2, 5) → {1, 2, 5}
  • 18. Compare and Set client (same timeline diagram, with the recorded history added): History = [(t, op, result)]
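A minimal sketch of how a history of `(t, op, result)` tuples might be checked against the final read. The `Op` type, the result labels (`"ok"`, `"fail"`, `"unknown"`) and the category names are assumptions chosen for this example; Jepsen's own checker is written in Clojure and is not reproduced here.

```python
from typing import NamedTuple

class Op(NamedTuple):
    t: float
    op: tuple    # e.g. ("add", 3)
    result: str  # "ok", "fail", or "unknown" (timed out, outcome not known)

def check_set(history, final_read):
    """Compare what clients were told with what the store finally contains."""
    final = set(final_read)
    attempted = {h.op[1] for h in history if h.op[0] == "add"}
    acked = {h.op[1] for h in history if h.op[0] == "add" and h.result == "ok"}
    return {
        "lost": sorted(acked - final),                     # acknowledged, then missing
        "recovered": sorted((final & attempted) - acked),  # not acknowledged, yet present
        "unexpected": sorted(final - attempted),           # never attempted at all
        "valid": acked <= final and final <= attempted,
    }

history = [
    Op(0, ("add", 1), "ok"),
    Op(1, ("add", 2), "ok"),
    Op(2, ("add", 3), "fail"),
    Op(3, ("add", 4), "unknown"),
    Op(4, ("add", 5), "ok"),
]
report = check_set(history, final_read={1, 2, 4})
print(report)
```

In this made-up history the add of 5 was acknowledged but is missing from the final read, so the run is flagged as invalid, while the timed-out add of 4 turns out to have been applied.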
  • 19. Solr • Search server built on Lucene • Lucene index + transaction log • Optimistic concurrency, linearizable CAS ops • Synchronous replication to all ‘live’ nodes • ZooKeeper for ‘consensus’ • http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
  • 20. Add an integer every second, partition network every 30 seconds for 200 seconds
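The schedule on slide 20 can be written out as an explicit list of timed events. This sketch assumes the nemesis alternates between starting and healing a partition at each 30-second mark, and that the network is healed before the final read; `build_schedule` and the event tuples are hypothetical names, not Jepsen's generator API.

```python
def build_schedule(duration=200, add_every=1, nemesis_every=30):
    """Return (time, actor, action) events for one test run."""
    events = []
    partitioned = False
    next_value = 0
    for t in range(duration):
        if t > 0 and t % nemesis_every == 0:
            # Assumption: each tick toggles between partitioning and healing.
            partitioned = not partitioned
            events.append((t, "nemesis", "start" if partitioned else "stop"))
        if t % add_every == 0:
            events.append((t, "client", ("add", next_value)))
            next_value += 1
    # Assumption: heal the network, then read the set once for verification.
    events.append((duration, "nemesis", "stop"))
    events.append((duration, "client", ("read", None)))
    return events

schedule = build_schedule()
nemesis_events = [e for e in schedule if e[1] == "nemesis"]
adds = [e for e in schedule if e[1] == "client" and e[2][0] == "add"]
print(len(adds), "adds;", [(t, action) for t, _, action in nemesis_events])
```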
  • 21. Solr - Are we safe? • Leaders become unavailable for up to the ZK session timeout, typically 30 seconds (expected) • Some writes ‘hang’ for a long time on partition. Timeouts are essential. (unexpected) • Final reads under CAS are consistent but we haven’t proved linearizability (good!) • Loss of availability for writes in minority partition. (expected) • No data loss (yet!) which is great!
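To illustrate why timeouts are essential for hung writes, here is a small sketch using only Python's standard `socket` module. The local listener that never replies stands in for a node stuck behind a partition; `request_with_timeout`, the payload and the outcome labels are hypothetical and have nothing to do with Solr's actual client.

```python
import socket

def request_with_timeout(address, payload, timeout_s):
    """Send payload and wait for a reply, never blocking longer than timeout_s per call."""
    try:
        with socket.create_connection(address, timeout=timeout_s) as conn:
            conn.sendall(payload)
            reply = conn.recv(1024)
            return ("ok", reply) if reply else ("fail", b"")
    except socket.timeout:
        # No reply in time: the write may or may not have been applied.
        return ("unknown", b"")
    except OSError:
        # Connection refused and similar errors are treated as failures in this sketch.
        return ("fail", b"")

# A listener that never calls accept() or replies, standing in for a hung node.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

outcome, _ = request_with_timeout(server.getsockname(), b"add 1\n", timeout_s=0.2)
print(outcome)
server.close()
```

Without the timeout the `recv` call would block indefinitely. With it, the client records the operation as indeterminate and moves on, which is the kind of result a history needs to capture.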
  • 22. Solr - Bugs, bugs & bugs • SOLR-6530: Commits under network partition can put any node into ‘down’ state. • SOLR-6583: Resuming connection with ZK causes log replay • SOLR-6511: Request threads hang under network partition • SOLR-7636: A flaky cluster status API - times out during partitions • SOLR-7109: Indexing threads stuck under network partition can mark leader as down
  • 23. Elastic • Search server built on Lucene • It has a Lucene index and a transaction log • Consistent single doc reads, writes & updates • Eventually consistent search but a flush/commit should ensure that changes are visible
  • 24. Elastic • Optimistic concurrency control a.k.a. CAS linearizability • Synchronous acknowledgement from a majority of nodes • “Instantaneous” promotion under a partition • Homegrown ‘ZenDisco’ consensus
  • 25. Elastic - Are we safe? • “Instantaneous” promotion is not. 90-second timeouts to elect a new primary (worse in <1.5.0) • Bridge partition: 645/1961 writes acknowledged and lost in 1.1.0. Better in 1.5.0, only 22/897 lost. • Isolated primaries: 209/947 updates lost • Repeated pauses (simulating GC): 200/2143 updates lost • Getting better but not quite there. Good documentation on resiliency problems.
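The loss counts on slide 25 are easier to compare as percentages. The sketch below only does the arithmetic on the figures quoted on the slide, treating each denominator as the number of writes in that run; the labels and the `loss_percent` helper are ad hoc names for this example.

```python
results = [
    ("Bridge partition, 1.1.0", 645, 1961),
    ("Bridge partition, 1.5.0", 22, 897),
    ("Isolated primaries", 209, 947),
    ("Repeated pauses", 200, 2143),
]

def loss_percent(lost, total):
    """Share of writes lost, as a percentage rounded to one decimal place."""
    return round(100.0 * lost / total, 1)

for label, lost, total in results:
    print(f"{label}: {lost}/{total} = {loss_percent(lost, total)}%")
```

That works out to roughly 32.9% for the bridge partition in 1.1.0 against about 2.5% in 1.5.0, with about 22.1% lost for isolated primaries and 9.3% for repeated pauses.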
  • 26. MongoDB • Document-oriented database • Replica set has a single primary which accepts writes • Primary asynchronously replicates writes to secondaries • Replicas decide among themselves to promote/demote primaries • Applies to 2.4.3 and 2.6.7
  • 27. MongoDB • Claims atomic writes per document and consistent reads • But strict consistency only when reading from primaries • Eventual consistency when reading from secondaries
  • 28. MongoDB - Are we safe? Source: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads
  • 29. MongoDB - Are we really safe? • Inconsistent reads are possible even with majority write concern • Read-uncommitted isolation • A minority partition will allow both stale reads and dirty reads
  • 30. Conclusion • Network communication is flaky! Plan for it. • Hackernews driven development (HDD) is not a good way of choosing data stores! • Test the guarantees of your data stores. • Help me find more Solr bugs!
  • 31. References • Kyle Kingsbury’s posts on Jepsen: https://aphyr.com/tags/jepsen • Solr & Jepsen: http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/ • Jepsen on GitHub: github.com/aphyr/jepsen • Solr fork of Jepsen: https://github.com/LucidWorks/jepsen
  • 32. Solr/Lucene Meetup on 25th July 2015 • Venue: Target Corporation, Manyata Embassy Business Park • Time: 9:30am to 1pm • Talks: Crux of eCommerce Search and Relevancy; Creating Search Analytics Dashboards • Signup at http://meetu.ps/2KnJHM