Data Diffing Based Software Architecture Patterns

Huahai Yang, Co-founder & CTO at Juji, Inc.
What is diffing?
• Given two elements a and b, calculate the difference d between them
• Function (diff a b) ;=> d
• Function (patch a d)
• Such that (= b (patch a d))
• Or: (= b (patch a (diff a b)))
• These are normally true:
• (not= (diff a b) (diff b a))
• (= (diff a c) (concat (diff a b) (diff b c)))
• (< (size d) (min (size a) (size b)))
• (< (time (patch a d)) (time (diff a b)))
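The properties above can be checked against a toy implementation. The following is a minimal sketch for flat maps only, not how any real diffing library works; the edit shape [key op value] and the sample maps a, b and c are assumptions made for illustration. The composition property is checked in its patch-level form: applying (diff a b) followed by (diff b c) to a yields c.

```clojure
;; Toy diff/patch for flat maps (illustration only).
;; An edit is [key op value], with op :+ (add), :- (remove) or :r (replace).

(defn diff
  "Returns a vector of edits that turns map a into map b."
  [a b]
  (vec
   (concat
    ;; keys present in a but missing from b are removed
    (for [k (keys a) :when (not (contains? b k))]
      [k :-])
    ;; entries of b that are new or changed are added or replaced
    (for [[k v] b :when (not= (find a k) [k v])]
      [k (if (contains? a k) :r :+) v]))))

(defn patch
  "Applies the edits d to map a."
  [a d]
  (reduce (fn [m [k op v]]
            (case op
              :-      (dissoc m k)
              (:+ :r) (assoc m k v)))
          a
          d))

;; Hypothetical sample data
(def a {:name "rep-1" :mood :calm :turns 3})
(def b {:name "rep-1" :mood :happy :topic "weather"})
(def c {:name "rep-2" :topic "weather"})

(= b (patch a (diff a b)))                      ;=> true
(not= (diff a b) (diff b a))                    ;=> true
(= c (patch a (concat (diff a b) (diff b c))))  ;=> true
```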
Evolution of diffing (1)
• The earliest diff was developed by Doug McIlroy on Unix at Bell Labs in 1974
• Works on text files; the work units are lines of text
• Purpose: reduce the storage needed to maintain multiple versions of a file
• Use: comparing content, tracking changes, verifying output, version control
Evolution of diffing (2)
• Diffing in 3D graphics programming
• World modeled as a scene graph
• Only re-render changed subtrees
• Purpose: performance optimization
• Conceptually simple programming
model: render everything
• Inspired React.js
• A ClojureScript wrapper of React could be faster than React itself, because diffing immutable data is faster
Evolution of diffing (3)
• Data oriented programming
• Data, not text
• Data are directly meaningful for code, no need for parsing or decoding
• Generic data literals, not specialized opaque programming constructs
• Diff input and output are both data
• Diffing as a software architecture consideration, not just an
implementation detail, impacting
• Delineation of system components
• Data model design
• API design
Diffing enables decoupling
• diff & patch functions are generic and blind
• They don't have to understand their input for them to work
• Semantic asymmetry between sender and receiver enforces separation of
concerns
• Also supports a kind of natural encapsulation, not forced as in OOP
• d is still open for inspection if the receiver chooses to look
• Graded: the receiver doesn’t need to know a lot, but can know a lot if it chooses to
Sender: (diff a a’) ;=> d
Sender sends d to Receiver
Receiver: (patch a d) ;=> a’
Diffing encourages data model reuse
• Thanks to diffing, data duplication between components is faithful and cheap
• Advantageous to reuse the same data model throughout the system, dramatically simplifying the system
Diffing tracks changes
• Thanks to diffing, each version of the
world state can be cheaply saved
and replayed to recover originals
• Application statefulness can be
externalized and managed
Editscript: a Clojure data
diffing library
• https://github.com/juji-io/editscript
• Works for vectors, lists, sets and maps
• Edits are a vector of vectors, each holding:
• Path
• Op: :+ (add), :- (delete), or :r (replace)
• Value
• Diffing algorithms
• Quick: fast
• A* : optimal diff size
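To make the path/op/value shape concrete, here is a hand-written edit list and a small interpreter for it. This is a sketch for nested maps with non-empty paths, not Editscript's implementation; apply-edit, apply-edits and the sample document are hypothetical names and data.

```clojure
(defn apply-edit
  "Applies one [path op value] edit to a nested map m. Paths must be non-empty vectors."
  [m [path op value]]
  (case op
    (:+ :r) (assoc-in m path value)
    :-      (if (= 1 (count path))
              (dissoc m (first path))
              ;; remove the last key from the map found at the parent path
              (update-in m (pop path) dissoc (peek path)))))

(defn apply-edits
  "Applies a vector of edits in order."
  [m edits]
  (reduce apply-edit m edits))

;; Hypothetical document and edits
(def doc
  {:bot    {:name "Juji" :lang "en"}
   :topics {:t1 {:q "How are you?"}}})

(def edits
  [[[:bot :name]   :r "Juju"]
   [[:bot :lang]   :-]
   [[:topics :t2]  :+ {:q "What do you do?"}]])

(apply-edits doc edits)
;=> {:bot {:name "Juju"}
;    :topics {:t1 {:q "How are you?"} :t2 {:q "What do you do?"}}}
```

Because the edits are ordinary vectors, a receiver can filter or inspect them with regular sequence functions, for example (map second edits) to list the operations.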
Case study: Juji Studio UI Re-design
• Complete UI redesign
• Re-implementation
• One month
turnaround
• Mainly due to
switching from a
resource-oriented API
to a diffing based API
[Screenshot: Juji Studio UI before the redesign]
[Screenshot: Juji Studio UI after the redesign]
UI Data model: config doc
• Single Page Application (SPA) in ClojureScript
• States are kept in an EDN document, the config doc
• The SPA, the server and the DB all hold copies of the config doc
[Diagram: the SPA, the server and the DB each hold a copy of the config doc; the SPA reaches the server through the GraphQL API]
Traditional GraphQL API
• Resource-oriented (RESTful)
• The server-side config doc is the truth
• API is CRUD on server resources, i.e. paths in the config doc
• Repetitive CRUD calls for each and every type of node
• Thousands of lines of Lacinia schema
Diffing based GraphQL API
• All logic is in SPA
• API is CRUD on config doc
• Update is sending diffs
• SPA periodically sends to
server:
(diff doc-prev doc-now)
• Server applies the diff, saves
the doc in DB, replies with
config doc SHA
• SPA validates the SHA; if it differs from the SHA of its own copy, the SPA sends the full config doc to overwrite the server copy
• Removed all API calls on
paths and nodes
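The exchange described on this slide can be sketched in a single process. Everything here is an assumption made for illustration: an atom stands in for the server plus DB, Clojure's hash stands in for the SHA, the flat-map diff/patch is a toy, and server-apply-diff!, server-overwrite! and sync! are hypothetical names rather than Juji's API.

```clojure
;; Toy flat-map diff/patch, included so this block runs on its own.
(defn diff [a b]
  (vec (concat (for [k (keys a) :when (not (contains? b k))] [k :-])
               (for [[k v] b :when (not= (find a k) [k v])]
                 [k (if (contains? a k) :r :+) v]))))

(defn patch [a d]
  (reduce (fn [m [k op v]] (if (= op :-) (dissoc m k) (assoc m k v))) a d))

(def server-doc (atom {:title "Survey" :greeting "Hi"})) ; server + DB stand-in

(defn fingerprint [doc] (hash doc)) ; stand-in for the config doc SHA

(defn server-apply-diff!
  "Server side: applies the diff, stores the doc, replies with its fingerprint."
  [d]
  (fingerprint (swap! server-doc patch d)))

(defn server-overwrite!
  "Server side: replaces the stored doc with the one sent by the SPA."
  [doc]
  (fingerprint (reset! server-doc doc)))

(defn sync!
  "SPA side: sends the diff; on fingerprint mismatch sends the whole doc."
  [doc-prev doc-now]
  (let [remote (server-apply-diff! (diff doc-prev doc-now))]
    (if (= remote (fingerprint doc-now))
      :patched
      (do (server-overwrite! doc-now) :overwritten))))

(def doc-1 {:title "Survey" :greeting "Hi"})
(def doc-2 {:title "Survey" :greeting "Hello" :closing "Bye"})
(def doc-3 (assoc doc-2 :title "Survey v2"))

(sync! doc-1 doc-2)                 ; server copy matched doc-1, so the diff suffices
(swap! server-doc assoc :stale true) ; simulate the server copy drifting
(sync! doc-2 doc-3)                 ; mismatch detected, full doc is sent
(= @server-doc doc-3)
```

The first call returns :patched and the second returns :overwritten, after which the server copy equals doc-3 again.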
Case study: externalize application states
• How to scale a highly stateful application?
• E.g. Juji initiates an agent (rep) for each chat session on a server node; the state of each rep is stored in an atom
• What if the server node becomes unavailable?
[Diagram: API Gateway in front of a Server Node]
Case study: externalize application states
• Each rep sends diff of its state to a persistent log (e.g. Kafka)
• E.g. At each utterance, rep sends (diff state-prev state-now)
• When a server becomes unavailable, the API gateway forwards traffic to another server, which recovers the agent state from the persistent log by sequentially applying all diffs to a shared initial state
[Diagram: API Gateway in front of a Server Node, which sends diffs to the Persistent Log]
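Recovery by replay is a fold over the logged diffs. In this sketch an in-memory vector stands in for the persistent log (a Kafka topic in the slide), the flat-map diff/patch is a toy, and record!, recover and the sample states are hypothetical.

```clojure
;; Toy flat-map diff/patch, included so this block runs on its own.
(defn diff [a b]
  (vec (concat (for [k (keys a) :when (not (contains? b k))] [k :-])
               (for [[k v] b :when (not= (find a k) [k v])]
                 [k (if (contains? a k) :r :+) v]))))

(defn patch [a d]
  (reduce (fn [m [k op v]] (if (= op :-) (dissoc m k) (assoc m k v))) a d))

(def initial-state {:turn 0}) ; shared by every server node
(def log (atom []))           ; stand-in for the persistent log of one rep

(defn record!
  "Appends (diff state-prev state-now) to the log and returns state-now."
  [state-prev state-now]
  (swap! log conj (diff state-prev state-now))
  state-now)

(defn recover
  "Rebuilds the latest state by applying the logged diffs in order."
  [initial diffs]
  (reduce patch initial diffs))

;; A rep living on the first server, logging a diff at each utterance
(def s1 (record! initial-state (assoc initial-state :turn 1 :user "Ann")))
(def s2 (record! s1 (assoc s1 :turn 2 :mood :curious)))
(def s3 (record! s2 (-> s2 (assoc :turn 3) (dissoc :mood))))

;; Another server recovers the rep from the log alone
(= s3 (recover initial-state @log)) ;=> true
```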
Case study: reduce component dependency
• Stateful components depend on one another
• Introducing user-invokable system functions leads to circular dependencies, e.g.
(juji.func.system/cleanup-chat rep)
[Diagram: System, Reps, Rep, Subs and func.system components, with the label [:rt jujiid]]
• Instead of depending on
namespaces that contain
subscriptions
• Watch reps atom
• Inspect its diff between old
and new
• Handle the case when a rep
is removed or cleaned
• i.e. sending :user-left
message to channels, and let
the subscriptions clean
themselves up
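The watch-based approach can be sketched with add-watch from clojure.core. The reps map keyed by session id, the outbox atom standing in for the channels, and the notify-removed function are assumptions made for illustration; only removal of a rep is handled here.

```clojure
(require '[clojure.set :as set])

(def reps (atom {}))    ; hypothetical: session id -> rep state
(def outbox (atom []))  ; stand-in for the channels that subscriptions listen on

(defn notify-removed
  "Watch function: compares the old and new value of the reps atom and
   emits a :user-left message for every rep that disappeared."
  [_key _ref old-reps new-reps]
  (doseq [id (set/difference (set (keys old-reps)) (set (keys new-reps)))]
    (swap! outbox conj [id :user-left])))

(add-watch reps :cleanup notify-removed)

(swap! reps assoc "s1" {:turn 0} "s2" {:turn 0})
(swap! reps dissoc "s1")

@outbox ;=> [["s1" :user-left]]
```

The code that removes a rep needs no reference to the subscription namespaces; it only changes the atom, and the watcher reacts to the difference.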
Case study: synchronize collaborative editing
• Multiple parties sending diffs
• Parties get out of sync when their diffs cross paths
• Difficult yet common problem
• E.g. enabling multiple users to edit the same chat at the same time
• Locking has bad UX
• Three-way merge has high latency
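[Diagram: two parties start from the same state A; one sends (diff A A’) and the other sends (diff A A’’), and the two diffs cross in transit]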
Differential Synchronization
• Diffing based synchronization
method
• Scalable
• Fault-tolerant
• Low latency
• Developed by Neil Fraser in
2009
• Used by Google Docs
• Client-server case: use two shadows
• Fault-tolerant case: keep a backup shadow
• Scaling
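The two-shadow client-server case can be sketched as follows. This is a simplified, single-process reading of the method for flat maps, with no networking, no backup shadow and no conflict handling; half-sync, full-sync and the sample documents are hypothetical names and data.

```clojure
;; Toy flat-map diff/patch, included so this block runs on its own.
(defn diff [a b]
  (vec (concat (for [k (keys a) :when (not (contains? b k))] [k :-])
               (for [[k v] b :when (not= (find a k) [k v])]
                 [k (if (contains? a k) :r :+) v]))))

(defn patch [a d]
  (reduce (fn [m [k op v]] (if (= op :-) (dissoc m k) (assoc m k v))) a d))

(defn half-sync
  "One direction of a sync cycle. The sender diffs its doc against its shadow,
   then updates its shadow; the receiver patches both its shadow and its doc."
  [from to]
  (let [d (diff (:shadow from) (:doc from))]
    [(assoc from :shadow (:doc from))
     (-> to
         (update :shadow patch d)
         (update :doc patch d))]))

(defn full-sync
  "Client to server, then server back to client."
  [client server]
  (let [[client server] (half-sync client server)
        [server client] (half-sync server client)]
    [client server]))

(def base {:title "Chat" :greeting "Hi"})

;; Both sides start in sync, then each makes a local edit.
(def client {:doc (assoc base :greeting "Hello") :shadow base})
(def server {:doc (assoc base :closing "Bye")    :shadow base})

(let [[client' server'] (full-sync client server)]
  (= (:doc client') (:doc server')
     {:title "Chat" :greeting "Hello" :closing "Bye"}))
;=> true
```

After one full cycle both documents contain both edits, and each shadow equals its own document, so the next cycle starts from a common base.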
Data modeling guideline: Don’t use vector
• Minimize unnecessary use of ordered data structure, e.g. vector or
list
• Diffing algorithm is slow for ordered data, because order is a strong
constraint to satisfy
• Ordered O(mn) vs. Unordered O(m+n)
• The implicit order of data elements is often a source of incidental complexity
• Meaningful order is often based on data fields
• Sets or maps suffice in most cases
Bad: [ {} {} {} … ]
Good: { {} {} {} … } or #{ {} {} {} … }
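A small comparison shows why position is a costly constraint. The sketch below uses a naive slot-by-slot comparison for the vector; a real sequence diff would find the single insertion, but only by paying for alignment. positional-changes, keyed-changes and the sample data are hypothetical.

```clojure
(defn positional-changes
  "Counts the indices at which two vectors differ, comparing slot by slot."
  [v1 v2]
  (let [n (max (count v1) (count v2))]
    (count (remove #(= (get v1 %) (get v2 %)) (range n)))))

(defn keyed-changes
  "Counts the keys that were added, removed or changed between two maps."
  [m1 m2]
  (count (remove #(= (find m1 %) (find m2 %))
                 (distinct (concat (keys m1) (keys m2))))))

;; Ordered model: position is part of each element's identity.
(def qs-v  [{:id :q1 :rank 1} {:id :q2 :rank 2} {:id :q3 :rank 3}])
(def qs-v2 (into [{:id :q0 :rank 0}] qs-v)) ; one item prepended

;; Keyed model: the same items indexed by :id.
(def qs-m  (into {} (map (juxt :id identity)) qs-v))
(def qs-m2 (assoc qs-m :q0 {:id :q0 :rank 0}))

(positional-changes qs-v qs-v2) ;=> 4, every slot shifted
(keyed-changes qs-m qs-m2)      ;=> 1, only the new key

;; Meaningful order comes from a data field when it is needed.
(map :id (sort-by :rank (vals qs-m2))) ;=> (:q0 :q1 :q2 :q3)
```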
Conclusion
• Diffing offers a few properties that lead to
• Simplified software architecture
• Enhanced system decoupling
• Easier scaling of stateful applications
• Better solutions to data synchronization problems
• Worthwhile to consider diffing based software architecture
• Particularly for data-oriented programming
Thank you!
• Huahai Yang @huahaiy
• Juji Inc. https://juji.io