
Conflict-Free Replicated Data Types (PyCon 2022)

Jupyter Notebook may be one of the most controversial open source projects released in the last ten years! Love them or hate them, notebooks have become a mainstay of data science and machine learning, and a significant part of the Python ecosystem. While Jupyter can simplify experimentation, rapid prototyping, documentation, and visualization, it often impedes version control, code review, and test coverage. Dev teams must accept the good with the bad… but what if they didn’t have to? In this talk we introduce conflict-free replicated data types (CRDTs), special objects that support strong eventual consistency and can be used to enhance Jupyter notebooks for a truly collaborative experience.

First proposed by Shapiro et al. in 2011, conflict-free replicated data types (CRDTs) evolved out of the distributed systems community as a way to replicate data across a network of replicas. CRDTs are objects that come with a special guarantee: two different copies of the same object can always be merged without conflict, so replicas that have seen the same updates converge to the same state and can be kept in sync. While CRDTs have enjoyed a good amount of attention from academia in recent years, primarily among database and cloud researchers, they have not led to many practical applications for everyday developers. However, recent work by Kleppmann et al. shows that CRDTs can be used for real-time rich-text collaboration, creating a “Google Docs”-style experience with any document in a networked file system.

In this talk, we’ll present the basics of CRDTs and demonstrate how they work with examples written in Python. Next, we’ll explain how CRDTs can enable more collaborative Jupyter notebooks, opening up features such as synchronous insertions, diffs, and auto-merges, even with multiple simultaneous contributors!

  1. Conflict-Free Replicated Data Types (CRDTs). PyCon 2022. Rotational Labs, rotational.io
  2. Talk Outline: 01 Replication and Conflict; 02 Introduction to CRDTs; 03 CRDTs in Python; 04 Demo; 05 Key Takeaways
  3. About Us
     ● Rebecca Bilbro: Rebecca is a machine learning engineer & distributed systems researcher. She’s Founder/CTO at Rotational Labs. She prefers earth-bound activities.
     ● Patrick Deziel: Patrick is a software engineer & machine learning specialist. He was employee #1 at Rotational Labs. He’s also a rock climbing enthusiast.
  4. “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” (Leslie Lamport)
  5. 01 Replication and Conflict: What is consistency and why is it so hard to achieve?
  6. RB PD
  7. Two notebook files, side by side:

     {
       "cells": [
         {
           "cell_type": "code", "execution_count": 1, "id": "f4a87d2f",
           "metadata": {}, "outputs": [],
           "source": [
             "import warnings\n",
             "import pandas as pd\n",
             "from yellowbrick.datasets import load_bikeshare\n",
             "from yellowbrick.regressor import ResidualsPlot\n",
             "from sklearn.linear_model import LinearRegression\n",
             "from sklearn.model_selection import train_test_split"
           ]
         },
         {
           "cell_type": "code", "execution_count": null, "id": "712ed090",
           "metadata": {}, "outputs": [], "source": []
         }
       ],
       …
     }

     {
       "cells": [
         {
           "cell_type": "code", "execution_count": 1, "id": "311d6947",
           "metadata": {}, "outputs": [],
           "source": [
             "import warnings\n",
             "import pandas as pd\n",
             "from yellowbrick.datasets import load_bikeshare\n",
             "from yellowbrick.regressor import ResidualsPlot\n",
             "from sklearn.linear_model import LinearRegression\n",
             "from sklearn.model_selection import train_test_split\n",
             "# let’s try sklearn.model_selection.TimeSeriesSplit"
           ]
         },
         {
           "cell_type": "code", "execution_count": null, "id": "9cbea4df",
           "metadata": {}, "outputs": [], "source": []
         }
       ],
       …
     }
  8. The metadata of the same two files:

     {
       …
       "metadata": {
         "kernelspec": {
           "display_name": "Python 3 (ipykernel)",
           "language": "python",
           "name": "python3"
         },
         "language_info": {
           "codemirror_mode": { "name": "ipython", "version": 3 },
           "file_extension": ".py",
           "mimetype": "text/x-python",
           "name": "python",
           "nbconvert_exporter": "python",
           "pygments_lexer": "ipython3",
           "version": "3.10.3"
         }
       },
       "nbformat": 4,
       "nbformat_minor": 5
     }

     {
       …
       "metadata": {
         "kernelspec": {
           "display_name": "Python 3.8.2 64-bit",
           "language": "python",
           "name": "python3"
         },
         "language_info": {
           "codemirror_mode": { "name": "ipython", "version": 3 },
           "file_extension": ".py",
           "mimetype": "text/x-python",
           "name": "python",
           "nbconvert_exporter": "python",
           "pygments_lexer": "ipython3",
           "version": "3.8.2-final"
         }
       },
       "nbformat": 3,
       "nbformat_minor": 21
     }
  9. Context: A Single Server
     ● It is always consistent (responds predictably to requests), which is convenient!
     ● But what if there’s a failure? The entire system becomes unavailable.
     ● Data loss can occur for information stored in volatile memory.
     ● This is why we need distributed systems!
  10. Inconsistency & Concurrency
      ● Inconsistency: PUT(x, 42) → ok, then GET(x) → not found
      ● Concurrency: PUT(x, 42) and PUT(x, 27), then GET(x) → ?
      ● Servers in a distributed system need to communicate to remain in the same state.
      ● Communication takes time (latency); more servers means more latency.
      ● Delays in communication can allow two clients to perform operations concurrently. From the system’s perspective, they happen at the same time.
  11. Time in a Distributed System
      ● Due to clock drift, we can’t expect any two nodes in a distributed system to have the same perception of physical time.
      ● In the absence of specialized hardware (as used by Spanner), logical clocks can be used to impose a partial ordering of events.
      ● To obtain a total ordering of events, we need some arbitrary mechanism to break ties (e.g., the node name).
      ● Example: PUT(x, 42, t=Alice@1), PUT(x, 27, t=Bob@2), GET(x) → 27
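The logical-clock idea on slide 11 can be sketched as a Lamport-style clock whose timestamps are (counter, node name) pairs, so that ties on the counter are broken by name. This is a minimal sketch: the class `LamportClock` and its methods `tick` and `observe` are illustrative assumptions, not code from the talk.

```python
class LamportClock:
    """Hypothetical Lamport-style clock; names are illustrative, not from the talk."""

    def __init__(self, node):
        self.node = node
        self.counter = 0

    def tick(self):
        """Advance the clock for a local event and return its timestamp."""
        self.counter += 1
        return (self.counter, self.node)

    def observe(self, timestamp):
        """Fast-forward to a timestamp received from another node."""
        self.counter = max(self.counter, timestamp[0])


alice, bob = LamportClock("alice"), LamportClock("bob")

t_alice = alice.tick()   # PUT(x, 42, t=alice@1)
bob.observe(t_alice)     # bob learns about alice's write
t_bob = bob.tick()       # PUT(x, 27, t=bob@2)

# Tuples compare by counter first, then by node name, which gives a total order.
writes = {t_alice: 42, t_bob: 27}
latest_value = writes[max(writes)]
```

Because Python compares tuples element by element, `max(writes)` picks the write with the highest counter and falls back to the node name only when counters are equal.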
  12. 02 Introduction to CRDTs: Conflict-Free Replicated Data Types
  13. CRDT: A data structure designed for replication
      A CRDT consists of:
      ● Some representation of mutable state.
      ● Some function M which merges two states and produces a deterministic value.
      ● M is idempotent, commutative, and associative…
        M(A, A) = A
        M(A, B) = M(B, A)
        M(A, M(B, C)) = M(M(A, B), C)
        …not unlike a Python set!
      CRDTs are a good alternative to more expensive, heavyweight coordination methods, such as:
      ● Locking (shared lock, x-lock, etc.): limits collaboration between users
      ● Consensus algorithms (Paxos, Raft, EPaxos): network-intensive, difficult to implement
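The three merge properties on slide 13 can be checked directly with set union, which is presumably what the comparison to a Python set alludes to. In this small sketch the function name `merge` and the sample states are assumptions chosen for illustration.

```python
def merge(a, b):
    """Merge two grow-only set states by taking their union."""
    return a | b


A = frozenset({"x"})
B = frozenset({"y"})
C = frozenset({"x", "z"})

idempotent = merge(A, A) == A
commutative = merge(A, B) == merge(B, A)
associative = merge(A, merge(B, C)) == merge(merge(A, B), C)
```

All three flags evaluate to `True`, which is why replicas can exchange and merge states in any order, any number of times, and still agree.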
  14. Key Intuition: We can combine multiple CRDTs to make more complex CRDTs.
  15. Simple CRDTs
      Grow-only Counter
      ● A monotonically increasing counter across all replicas, each of which is assigned a unique ID
      ● The counter value at any point in time is equal to the sum of all values across the replicas
      ● Can be implemented using a dict() in Python
      Grow-only Set
      ● A set which only supports adding new items
      ● No way to “delete” an item
      ● Similar to Python’s set()
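A grow-only set as described on slide 15 might look like the sketch below. The class name `GSet` appears later in the talk, but its body is not shown there, so the attribute `items` and the method names here are assumptions.

```python
class GSet:
    """Hypothetical grow-only set: items can be added but never removed."""

    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        """Union the other replica's items into this one."""
        self.items |= other.items
        return self

    def get(self):
        return frozenset(self.items)


replica_a, replica_b = GSet(), GSet()
replica_a.add("cell-1")
replica_b.add("cell-2")

# Merging in either direction leaves both replicas with the same contents.
replica_a.merge(replica_b)
replica_b.merge(replica_a)
```

Since there is no remove operation, merging can only ever add items, so no update from one replica can conflict with an update from another.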
  16. Compound CRDTs
      ● Positive-Negative Counters: a combination of two Grow-only Counters that supports both increments and decrements
      ● Two-Phase Sets: a combination of two Grow-only Sets, one of which is a “tombstone” set to support deletion
      ● Last-Write-Wins-Element-Set: an improvement on the Two-Phase Set which includes a timestamp to allow items to be “undeleted”
      ● Observed-Remove Set: similar to the Last-Write-Wins-Element-Set, but uses unique tags rather than timestamps
      ● Sequence CRDTs: implement an ordered set with familiar list operations such as append, insert, remove. We can use this to build a collaborative editor!
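Two of the compound types on slide 16 can be sketched by pairing up simple grow-only structures. The classes `PNCounter` and `TwoPhaseSet` below, along with their attribute names, are hypothetical illustrations rather than the speakers' implementation.

```python
class PNCounter:
    """Hypothetical positive-negative counter built from two grow-only dicts."""

    def __init__(self, replica_id):
        self.id = replica_id
        self.increments = {}  # replica id -> total added by that replica
        self.decrements = {}  # replica id -> total subtracted by that replica

    def add(self, value):
        # Negative values are routed to the decrement counter.
        side = self.increments if value >= 0 else self.decrements
        side[self.id] = side.get(self.id, 0) + abs(value)

    def merge(self, other):
        # Each half merges like a grow-only counter: per-replica maximum.
        for mine, theirs in ((self.increments, other.increments),
                             (self.decrements, other.decrements)):
            for rid, count in theirs.items():
                mine[rid] = max(mine.get(rid, 0), count)
        return self

    def get(self):
        return sum(self.increments.values()) - sum(self.decrements.values())


class TwoPhaseSet:
    """Hypothetical two-phase set: an added set plus a tombstone set."""

    def __init__(self):
        self.added = set()
        self.removed = set()  # tombstones; a removed item can never return

    def add(self, item):
        self.added.add(item)

    def remove(self, item):
        if item in self.added:
            self.removed.add(item)

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed
        return self

    def get(self):
        return self.added - self.removed


a, b = PNCounter("a"), PNCounter("b")
a.add(5)
b.add(-2)
a.merge(b)
b.merge(a)

s1, s2 = TwoPhaseSet(), TwoPhaseSet()
s1.add("x")
s2.merge(s1)
s2.remove("x")
s1.add("x")   # re-adding has no visible effect once the tombstone arrives
s1.merge(s2)
```

The two-phase set shows the limitation that motivates the Last-Write-Wins and Observed-Remove variants: once an item is tombstoned it stays deleted, even if a replica adds it again.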
  17. 03 CRDTs in Python: An Example Implementation
  18. A Practical Example…
      Hypothesis: We can compound a few CRDTs together to create a collaborative “notebook” à la Jupyter.
      Our composite CRDT needs to support the following operations:
      ● High-level operations: insert and remove notebook “cells”
      ● Low-level operations: insert and remove characters within each cell
      ● Merging at both the notebook level and the cell level to enable consistency
      Key understanding:
      ● Individual cell data can be represented by Sequence CRDTs
      ● The list of “cells” in a notebook is also a Sequence!
  19. Total Ordering of Operations
      To achieve eventual consistency, all peers need to agree on:
      1. The set of operations
      2. The order of operations
      To achieve a total ordering of operations:
      1. Assign each operation a unique ID based on client name and timestamp, e.g. INSERT(0, “a”) ⇒ alice@1
      2. Lower timestamp values always go first
      3. Order by client name to break ties
      Example: alice@1 -> bob@2 -> alice@3 -> bob@3
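The ordering rules on slide 19 map naturally onto a sortable operation ID. The talk mentions an `OpId` type but does not show it, so the dataclass below, including its field names, is an assumed sketch that relies on `dataclass(order=True)` comparing fields in declaration order.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class OpId:
    """Hypothetical operation ID; field order defines the sort order."""
    timestamp: int  # compared first: lower timestamps go first
    client: str     # compared second: breaks ties between equal timestamps

    def __str__(self):
        return f"{self.client}@{self.timestamp}"


ops = [OpId(3, "bob"), OpId(1, "alice"), OpId(3, "alice"), OpId(2, "bob")]
ordered = " -> ".join(str(op) for op in sorted(ops))
```

Sorting the shuffled IDs reproduces the order shown on the slide, with `alice@3` placed before `bob@3` by the client-name tiebreak.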
  20. Realizing the Object Order
      Note: Object payloads are generic, so we can nest Sequences within Sequences. This advantage comes from Python being dynamically typed!
      Diagram nodes: alice@1 “a”, alice@2 “c”, bob@5 “b”, alice@5 “d”, bob@3 “x”
  21. Merging Sequences
      Operations:
      ● INSERT(“c”, before=end) ⇒ bob@1
      ● INSERT(“a”, before=end) ⇒ alice@1
      ● INSERT(“b”, after=alice@1) ⇒ bob@2
      Applied in order (shown twice on the slide, with identical results):
      ● do(alice@1) ⇒ [“a”], do(bob@1) ⇒ [“a”, “c”], do(bob@2) ⇒ [“a”, “b”, “c”]
      ● do(alice@1) ⇒ [“a”], do(bob@1) ⇒ [“a”, “c”], do(bob@2) ⇒ [“a”, “b”, “c”]
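The convergence shown on slide 21 can be reproduced with a deliberately simplified replay: sort all known operations by ID, then apply them one at a time. The `Insert` class and `replay` function are hypothetical, and rebuilding the list from scratch is a simplification of the talk's incremental approach. The sketch also assumes that an operation's target always sorts before the operation itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

OpId = Tuple[int, str]  # (timestamp, client), compared in that order


@dataclass(frozen=True)
class Insert:
    """Hypothetical insert operation; `after=None` means 'before end'."""
    id: OpId
    value: str
    after: Optional[OpId] = None


def replay(operations):
    """Rebuild the sequence by applying operations in their total order."""
    items = []  # list of (op id, value) pairs
    for op in sorted(operations, key=lambda o: o.id):
        if op.after is None:
            items.append((op.id, op.value))
        else:
            # Place the new item directly after the item it targets.
            index = [item_id for item_id, _ in items].index(op.after)
            items.insert(index + 1, (op.id, op.value))
    return [value for _, value in items]


ops = [
    Insert(id=(1, "bob"), value="c"),
    Insert(id=(1, "alice"), value="a"),
    Insert(id=(2, "bob"), value="b", after=(1, "alice")),
]

# Two replicas that received the same operations in different orders.
alice_view = replay(ops)
bob_view = replay(reversed(ops))
```

Both views come out as `["a", "b", "c"]`: the arrival order of operations does not matter, only the agreed total order in which they are applied.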
  22. 04 Demo: Synchronizing Collaboration
  23. Components
      ● Sequence: Composite CRDT containing an ordered set of items
      ● Notebook: Contains a Sequence of Cells
      ● Cell: Contains a Sequence of characters
      ● GCounter: The shared logical clock
      ● GSet: The entire history of operations
      ● Operation: A single insert or delete performed by a node
      ● OpId: Unique identifier for operations
      ● Object: Represents an item in a sequence
  24. GCounter

      class GCounter:
          """Implements a grow-only counter CRDT. It must be instantiated
          with a network-unique ID."""
          ...

          def add(self, value):
              """Adds a non-negative value to the counter."""
              if value < 0:
                  raise ValueError("Only non-negative values are allowed for add()")
              self.counts[self.id] += value

          def merge(self, other):
              """Merges another GCounter with this one."""
              if not isinstance(other, GCounter):
                  raise ValueError("Incompatible CRDT for merge(), expected GCounter")
              for id, count in other.counts.items():
                  self.counts[id] = max(self.counts.get(id, 0), count)
              return self

          def get(self):
              """Returns the current value of the counter."""
              return sum(self.counts.values())
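To see the slide 24 counter converge, the sketch below fills in the constructor that the slide elides. That constructor is an assumption: it stores the replica's ID and a `counts` dict seeded with a zero entry for the local replica, which is one layout consistent with the methods shown.

```python
class GCounter:
    """Grow-only counter following slide 24; the constructor is an assumption."""

    def __init__(self, id):
        self.id = id
        self.counts = {id: 0}  # assumed layout: replica id -> local count

    def add(self, value):
        if value < 0:
            raise ValueError("Only non-negative values are allowed for add()")
        self.counts[self.id] += value

    def merge(self, other):
        # Keep the larger count seen for each replica.
        for id, count in other.counts.items():
            self.counts[id] = max(self.counts.get(id, 0), count)
        return self

    def get(self):
        return sum(self.counts.values())


alice, bob = GCounter("alice"), GCounter("bob")
alice.add(2)
bob.add(3)

alice.merge(bob)
bob.merge(alice)
bob.merge(alice)  # merging again changes nothing (idempotence)
```

After the merges both replicas report the same total. Taking the per-replica maximum rather than a sum is what makes repeated merges harmless.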
  25. Sequence.merge

      from functools import cmp_to_key

      def merge(self, other):
          # Merge the two Sequences
          self.merge_operations(other)
          other.merge_operations(self)
          ...
          # Recursive merge of the sub-sequences
          for i in range(len(this_sequence)):
              if isinstance(this_sequence[i], Sequence) and isinstance(other_sequence[i], Sequence):
                  this_sequence[i].merge(other_sequence[i])
                  this_sequence[i].id = self.id
          return self

      def merge_operations(self, other):
          # Sync the local clock with the remote clock and apply the unseen operations
          self.clock = self.clock.merge(other.clock)
          patch_ops = other.operations.get().difference(self.operations.get())
          patch_log = sorted(patch_ops, key=cmp_to_key(self.compare_operations))
          for op in patch_log:
              op.do(self.objects)
          # Merge the two operation logs
          self.operations = self.operations.merge(other.operations)
  26. ObjectTree

      class ObjectTree():
          """Add-only data structure which stores a sequence of Objects."""

          def __init__(self):
              self.roots = []

          def find_insert(self, target, object, iter):
              for root, i, obj in iter:
                  op = obj.operation
                  if op == target:
                      # We found the target
                      return root, i
                  elif op.target == target and object.operation < op:
                      # Same target (conflicting operations), so order the operations
                      return root, i
              return None, -1

          def insert_node(self, target, object):
              root, i = self.find_insert(target, object, self.enumerate_nodes)
              if root is None:
                  self.roots[-1].nodes.append(object)
              else:
                  root.nodes.insert(i, object)
  27. 05 Key Takeaways: Possibilities for Real Time Collaboration
  28. CRDT Limitations, Possibilities, and Resources
      Limitations
      ● Strong eventual consistency
      ● Append-only data type
      ● Buffer size limitations
      ● Increasing egress costs
      ● Need for compaction/pruning
      Applications
      ● Testing
      ● Merging
      ● Branching
      ● Commenting
      ● Metadata Resolution
      ● Collaborative Editing
      Resources
      ● eirene: a client for collaborative Python development with CRDT
      ● nbdime: tools for diffing and merging of Jupyter Notebooks
      ● peritext: a CRDT for rich-text collaboration
      ● Martin Kleppmann, “CRDTs: The hard parts”
      ● Michael Whittaker, “Consistency in Distributed Systems”
  29. Thank you!
