Data Diffing Based Software Architecture Patterns
Huahai Yang
Juji Inc.
What is diffing?
• Given two elements a and b, calculate the difference d between them
• Function (diff a b) ;=> d
• Function (patch a d)
• Such that (= b (patch a d))
• Or: (= b (patch a (diff a b)))
• These are normally true:
• (not= (diff a b) (diff b a))
• (= (diff a c) (concat (diff a b) (diff b c)))
• (< (size d) (min (size a) (size b)))
• (< (time (patch a d)) (time (diff a b)))
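The properties above can be tried out with a toy implementation. This is only a minimal sketch, assuming flat maps as the elements; the edit format and the names `a`, `b` and `d` are made up for illustration and do not reflect how any real diffing library works.

```clojure
;; Toy diff/patch for flat maps only (an illustrative assumption).
(defn diff
  "Returns a vector of [key op value] edits that turn map a into map b."
  [a b]
  (vec
   (concat
    ;; keys in a that are missing from b are deletions
    (for [k (keys a) :when (not (contains? b k))]
      [k :-])
    ;; keys in b that are new (:+) or whose value changed (:r)
    (for [[k v] b
          :when (or (not (contains? a k)) (not= v (get a k)))]
      [k (if (contains? a k) :r :+) v]))))

(defn patch
  "Applies the edits in d to map a."
  [a d]
  (reduce (fn [m [k op v]]
            (case op
              :-      (dissoc m k)
              (:+ :r) (assoc m k v)))
          a
          d))

(def a {:name "rep-1" :turn 3 :mood :calm})
(def b {:name "rep-1" :turn 4 :topic :weather})
(def d (diff a b))

(= b (patch a d))            ;=> true
(not= (diff a b) (diff b a)) ;=> true
```

Note that `patch` never looks at what the keys mean; it only follows the edits.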
Evolution of diffing (1)
• The earliest diff was developed by Doug
McIlroy on Unix at Bell Labs in 1974
• Works on text files; the units of work are
lines of text
• Purpose: reduce the storage necessary to
maintain multiple versions of a file
• Uses: compare content, track changes,
verify output, version control
Evolution of diffing (2)
• Diffing in 3D graphics programming
• World modeled as a scene graph
• Only re-render changed subtrees
• Purpose: performance optimization
• Conceptually simple programming
model: render everything
• Inspired React.js
• A ClojureScript wrapper of React could
be faster than plain React, due to faster
diffing with immutable data
Evolution of diffing (3)
• Data oriented programming
• Data, not text
• Data are directly meaningful for code, no need for parsing or decoding
• Generic data literals, not specialized opaque programming constructs
• Diff input and output are both data
• Diffing as a software architecture consideration, not just an
implementation detail, impacting
• Delineation of system components
• Data model design
• API design
Diffing enables decoupling
• diff & patch functions are generic and blind
• They don't have to understand their input for them to work
• Semantic asymmetry between sender and receiver enforces separation of
concerns
• Also supports a kind of natural encapsulation, not forced as in OOP
• d is still open for inspection if the receiver chooses to inspect it
• Graded: the receiver doesn’t need to know a lot, but can know a lot if it chooses to
Sender: (diff a a’) ;=> d
d is sent to the receiver
Receiver: (patch a d) ;=> a’
Diffing encourages data model reuse
• Thanks to diffing, data duplication between components is faithful and
cheap
• Advantageous to reuse the same data model throughout the system,
dramatically simplifying the system
Diffing tracks changes
• Thanks to diffing, each version of the
world state can be cheaply saved
and replayed to recover originals
• Application statefulness can be
externalized and managed
Editscript: a Clojure data diffing library
• https://github.com/juji-io/editscript
• Works for vectors, lists, sets and maps
• Edits are a vector of vectors:
• Path
• Op :+, :-, or :r
• Value
• Diffing algorithms
• Quick: fast
• A* : optimal diff size
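To make the path/op/value shape concrete, here is a small interpreter for such edits. It is a sketch under stated assumptions, not Editscript code: it handles nested maps only, `apply-edit` and `apply-edits` are hypothetical names, and the sample document is invented.

```clojure
;; Hypothetical interpreter for [path op value] edits on nested maps.
;; It mimics the shape of the edits listed above; it is not Editscript code.
(defn apply-edit
  [doc [path op value]]
  (case op
    ;; :+ adds and :r replaces the value at path
    (:+ :r) (assoc-in doc path value)
    ;; :- removes the last key of path from its parent map
    :-      (if (= 1 (count path))
              (dissoc doc (first path))
              (update-in doc (pop path) dissoc (peek path)))))

(defn apply-edits
  "Applies a vector of edits from left to right."
  [doc edits]
  (reduce apply-edit doc edits))

(def doc   {:bot {:name "Juji" :lang "en"} :draft true})
(def edits [[[:bot :name] :r "Juji 2"]
            [[:bot :tone] :+ :friendly]
            [[:draft]     :-]])

(apply-edits doc edits)
;=> {:bot {:name "Juji 2", :lang "en", :tone :friendly}}
```

Because the edits are plain vectors, they can be printed, stored, or sent over the wire like any other EDN value.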
Case study: Juji Studio UI Re-design
• Complete UI redesign
• Re-implementation
• One month
turnaround
• Mainly due to
switching from a
resource-oriented API
to a diffing based API
[Screenshots: Juji Studio UI before and after the redesign]
UI Data model: config doc
• Single Page Application (SPA) in cljs
• State lives in an EDN document, the config doc
• SPA, server and DB all hold copies of the
config doc
[Diagram: SPA, Server and DB each hold a copy of the config doc; the SPA reaches the server through the GraphQL API]
Traditional GraphQL API
• Resource-oriented
(RESTful)
• The server-side config doc is
the truth
• API is CRUD on server
resources
• i.e. paths in the config
doc
• Repetitive CRUD calls for
each and every type of
node
• Thousands of lines of Lacinia
schema
Diffing based GraphQL API
• All logic is in SPA
• API is CRUD on config doc
• Update is sending diffs
• SPA periodically sends to
server:
(diff doc-prev doc-now)
• Server applies the diff, saves
the doc in DB, replies with
config doc SHA
• SPA validates the SHA; if it is
different, sends the whole config
doc to overwrite
• Removed all API calls on
paths and nodes
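The exchange above can be sketched as follows. Everything here is a stand-in chosen for illustration: `hash` takes the place of the SHA, an atom takes the place of the DB, plain function calls take the place of GraphQL requests, and the diff is written by hand as [path op value] triples instead of being computed by (diff doc-prev doc-now).

```clojure
;; All names below are hypothetical; this is not the Juji Studio code.
(defn patch
  "Applies [path op value] edits to a nested map."
  [doc edits]
  (reduce (fn [m [path op v]]
            (if (= op :-)
              (if (next path)
                (update-in m (pop path) dissoc (peek path))
                (dissoc m (peek path)))
              (assoc-in m path v)))
          doc
          edits))

(def db (atom {:title "Demo" :topics {:t1 "Greeting"}})) ; server copy

(defn server-apply-diff!
  "Applies the diff to the stored doc and replies with its checksum."
  [edits]
  (hash (swap! db patch edits)))

(defn server-overwrite!
  "Replaces the stored doc with the one sent by the client."
  [doc]
  (hash (reset! db doc)))

(defn client-sync!
  "Sends the diff; sends the whole doc if the checksums disagree."
  [doc-now edits]
  (let [reply (server-apply-diff! edits)]
    (if (= reply (hash doc-now))
      :in-sync
      (do (server-overwrite! doc-now) :overwritten))))

(def doc-now {:title "Demo v2" :topics {:t1 "Greeting" :t2 "Farewell"}})

(client-sync! doc-now [[[:title] :r "Demo v2"]
                       [[:topics :t2] :+ "Farewell"]])
;=> :in-sync
```

The server side needs only two generic operations, apply a diff and overwrite the doc, which is why the per-path and per-node calls can go away.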
Case study: externalize application states
• How to scale a highly stateful application?
• E.g. Juji starts an agent (rep) for each chat session on a server node; the
state of each rep is stored in an atom
• What if the server node becomes unavailable?
[Diagram: API Gateway in front of a Server Node]
Case study: externalize application states
• Each rep sends diff of its state to a persistent log (e.g. Kafka)
• E.g. At each utterance, rep sends (diff state-prev state-now)
• When a server becomes unavailable, the API gateway forwards traffic to
another server, which recovers the agent state from the persistent
log by sequentially applying all diffs to a shared initial state.
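The recovery step is a left fold of patch over the logged diffs. The sketch below assumes flat maps for the rep state and uses a vector in an atom as a stand-in for the persistent log; the diff format ({:put .. :del ..}) and all names are hypothetical.

```clojure
;; Toy key-wise diff/patch for flat maps (an assumption for illustration).
(defn diff [a b]
  {:put (into {} (remove (fn [[k v]] (and (contains? a k) (= v (get a k)))) b))
   :del (set (remove #(contains? b %) (keys a)))})

(defn patch [a {:keys [put del]}]
  (apply dissoc (merge a put) del))

(def initial-state {:turn 0})
(def log (atom [])) ; stand-in for the persistent log

(defn record!
  "Appends the diff between two consecutive states to the log."
  [state-prev state-now]
  (swap! log conj (diff state-prev state-now))
  state-now)

;; A rep on the first server node goes through three utterances.
(def final-state
  (reduce (fn [s f] (record! s (f s)))
          initial-state
          [#(assoc % :turn 1 :topic :greeting)
           #(assoc % :turn 2 :user "Ann")
           #(-> % (assoc :turn 3) (dissoc :topic))]))

(defn recover
  "Rebuilds a state from the shared initial state and the logged diffs."
  [initial diffs]
  (reduce patch initial diffs))

(= final-state (recover initial-state @log)) ;=> true
```

The recovering node needs the log and the initial state, nothing else from the node that went away.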
[Diagram: API Gateway in front of Server Nodes; each node sends its diffs to the Persistent Log]
Case study: reduce component dependency
• Stateful components depend on one another
• Introducing user-invokable system functions
leads to circular dependencies, e.g.
(juji.func.system/cleanup-chat rep)
[Diagram: components System, Reps, Rep, Subs and func.system; label [:rt jujiid]]
• Instead of depending on
namespaces that contain
subscriptions
• Watch reps atom
• Inspect its diff between old
and new
• Handle the case when a rep
is removed or cleaned
• i.e. send a :user-left
message to the channels, and let
the subscriptions clean
themselves up
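A watch on the atom gets both the old and the new value, so the cleanup logic can be written without requiring the subscription namespaces. A minimal sketch, assuming `reps` maps a rep id to its state; `outbox` is a hypothetical stand-in for the channels, and the message shape is invented.

```clojure
(require '[clojure.set :as set])

(def reps   (atom {}))  ; rep id -> rep state
(def outbox (atom []))  ; stand-in for the channels

(defn removed-reps
  "Ids present in old-reps but gone from new-reps."
  [old-reps new-reps]
  (set/difference (set (keys old-reps)) (set (keys new-reps))))

;; The watch inspects the difference between old and new values and
;; emits one :user-left message per removed rep.
(add-watch reps ::cleanup
           (fn [_key _ref old-reps new-reps]
             (doseq [id (removed-reps old-reps new-reps)]
               (swap! outbox conj {:type :user-left :rep id}))))

(swap! reps assoc :rep-1 {:turn 0} :rep-2 {:turn 0})
(swap! reps dissoc :rep-1)

@outbox ;=> [{:type :user-left, :rep :rep-1}]
```

The dependency now points one way: the watcher knows about the atom, and the code that changes the atom knows nothing about the watcher.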
Case study: synchronize collaborative editing
• Multiple parties sending diffs
• Out of sync when diffs cross paths in flight
• Difficult yet common problem
• E.g. enable multiple users to edit the same
chat at the same time
• Locking has bad UX
• Three-way merge has high latency
[Diagram: two parties both start from A; one sends (diff A A’), the other sends (diff A A’’)]
Differential Synchronization
• Diffing based synchronization
method
• Scalable
• Fault-tolerant
• Low latency
• Developed by Neil Fraser in
2009
• Used by Google Docs
• Client-server case
• Use two shadows
• Fault-tolerant case
• Keep a backup shadow
• Scaling
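One round of the client-server case with two shadows can be sketched as below. This is a heavily simplified illustration, not the published algorithm in full: it assumes flat maps instead of text, a toy key-wise diff and patch, direct function calls instead of a network, and it leaves out the backup shadow and version numbers. All names are hypothetical.

```clojure
;; Toy key-wise diff/patch for flat maps (an assumption for illustration).
(defn diff [a b]
  {:put (into {} (remove (fn [[k v]] (and (contains? a k) (= v (get a k)))) b))
   :del (set (remove #(contains? b %) (keys a)))})

(defn patch [a {:keys [put del]}]
  (apply dissoc (merge a put) del))

(defn half-cycle
  "Syncs changes from one side to the other.
  Each side is a map {:doc current-doc :shadow last-synced-copy}."
  [from to]
  (let [edits (diff (:shadow from) (:doc from))]
    [(assoc from :shadow (:doc from))   ; sender's shadow catches up
     (-> to
         (update :shadow patch edits)   ; receiver's shadow mirrors the sender
         (update :doc patch edits))]))  ; receiver's doc merges the edits

(defn sync-round
  "Client to server, then server back to client."
  [client server]
  (let [[client server] (half-cycle client server)
        [server client] (half-cycle server client)]
    [client server]))

(def base   {:title "Chat" :greeting "Hi"})
(def client {:doc (assoc base :greeting "Hello") :shadow base})
(def server {:doc (assoc base :title "Survey chat") :shadow base})

(def synced (sync-round client server))

(= (:doc (first synced))
   (:doc (second synced))
   {:title "Survey chat" :greeting "Hello"})
;=> true
```

Both sides edited concurrently, yet neither had to lock the document or wait for a three-way merge.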
Data modeling guideline: Don’t use vector
• Minimize unnecessary use of ordered data structures, e.g. vectors or
lists
• Diffing algorithms are slow for ordered data, because order is a strong
constraint to satisfy
• Ordered O(mn) vs. Unordered O(m+n)
• The implicit order of data elements is often a source of incidental complexity
• Meaningful order is often based on data fields
• Sets or maps suffice in most cases
Bad: [ {} {} {} … ]
Good: { {} {} {} … } or #{ {} {} {} … }
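As a small illustration of the guideline, the sketch below keeps a collection in a map keyed by id and derives the display order from a data field. The topic data and the :pos field are invented for the example.

```clojure
;; Hypothetical data: chat topics keyed by id, with order stored in :pos.
(def topics
  {:t1 {:title "Greeting" :pos 1}
   :t2 {:title "Survey"   :pos 2}
   :t3 {:title "Farewell" :pos 3}})

;; Adding a topic touches exactly one key, whatever its display position.
(def topics' (assoc topics :t0 {:title "Consent" :pos 0}))

(defn ordered-titles
  "Derives the display order from the :pos field when it is needed."
  [m]
  (mapv :title (sort-by :pos (vals m))))

(ordered-titles topics')
;=> ["Consent" "Greeting" "Survey" "Farewell"]
```

With a vector, the same insertion at the front would shift the index of every existing element.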
Conclusion
• Diffing offers a few properties that lead to
• Simplified software architecture
• Enhanced system decoupling
• Easier scaling of stateful applications
• Better solutions to data synchronization problems
• Worthwhile to consider diffing based software architecture
• Particularly for data-oriented programming
Thank you!
• Huahai Yang @huahaiy
• Juji Inc. https://juji.io
