Palantir Revisioning Database
 Bob McGrew
 Director of Engineering




 © 2008 Palantir Technologies Inc. All rights reser...
  • Pitch whitevideo. At end: From now on, I will only discuss how we revision rows in a database table
  • Mention that “Update” refers to the SQL operation: when the
  • Don’t mention Lucene
  • Palantir's Revisioning Database

    1. Palantir Revisioning Database
        Bob McGrew, Director of Engineering
        © 2008 Palantir Technologies Inc. All rights reserved.
    2. Introduction
        • Imagine a community of thousands of analysts all reading and writing to a traditional database.
        • Problems for analysts:
          – No history of additions and edits for data
          – No ability to discover “what we knew when”
          – No prior review of shared data
          – No granular sharing
    3. Revisioning Database features
        • Online object history
        • Separate spaces for analysis
        • Granular sharing of changes
    4. Revisioning Database features
        • Online object history
        • Separate spaces for analysis
        • Granular sharing of changes
    5. Alternative approach: audit logs
        • Write every change to audit log
          – Separate data store (database or flat file)
          – Separate tools (SQL or grep)
        • Not integrated with analysis
          – Offline and inaccessible by design
          – No security info
          – No analyst-facing tools
          – Infeasible to determine “what we knew when”
        • Audit logging is important, but no substitute for an online history
    6. Integrated object history
        • Online history of every change
        • Ability to view “what we knew when”
    7. Modeling data: The object model
        • Each stored in separate database table
    8. Modeling time: Application events
        • Model time as a linear sequence of application events
        • Provides atomicity for changes
        • The database ID for an app event always increases with time
          – Total ordering for all app events
          – Wall clock time allows ties
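
The ordering argument on slide 8 can be illustrated with a small sketch. It assumes a hypothetical in-memory `AppEventLog` (the name and its `begin` method are not from the slides) that hands out strictly increasing IDs, so two events stamped with the same wall-clock time are still totally ordered.

```python
import itertools
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class AppEvent:
    id: int            # strictly increasing; gives the total order
    wall_clock: float  # may tie between two events


class AppEventLog:
    """Hypothetical allocator of application events (illustration only)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.events = []

    def begin(self, now=None):
        # Every group of changes made together shares one app event.
        event = AppEvent(next(self._ids), time.time() if now is None else now)
        self.events.append(event)
        return event


log = AppEventLog()
first = log.begin(now=100.0)
second = log.begin(now=100.0)  # same wall-clock time, still ordered by id
```

Sorting by `id` rather than by `wall_clock` is what makes "the state at app event N" well defined.
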
    9. Revisioning Database: the basic idea
        • Every edit to an object component inserts a new row into the relevant database table
        • Once inserted, rows are never updated
    10. Down to the database
        • Revisioning fields:
          – id: long-living component ID
          – object_id: ties the component to its object
          – app_event_id: totally ordered timestamp for row
          – version: primary key for row, based on ID and app_event_id
          – deleted: 0 if not deleted; non-zero if deleted
        • Metadata fields:
          – time_created: time component was created
          – created_by: user ID of component creator
          – last_modified: time row was created
        • Value fields:
          – <value fields>: everything else
    11. Modifying an object
        op      id   object_id  app_event_id  version  deleted  value
        Create  10   10         1             1001     0        Type:Person
        Create  101  10         1             1002     0        Name:Mike Fikri
        Create  102  10         2             1003     0        Phone#:650-494-1574
        Edit    101  10         3             1004     0        Name:Michael Fikri
        Delete  102  10         4             1005     1        Phone#:650-494-1574
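
As a rough illustration of slides 9 to 11, the sketch below models one revisioned table as an append-only list. The `RevisionedTable` class and its `write` method are hypothetical, the metadata fields are omitted, and the sequential `version` counter starting at 1001 is an assumption that simply mirrors the sample values on slide 11.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Row:
    id: int            # long-living component ID
    object_id: int     # object the component belongs to
    app_event_id: int  # totally ordered timestamp for the row
    version: int       # primary key of this row
    deleted: int       # 0 if not deleted; non-zero if deleted
    value: str         # stands in for the value fields


class RevisionedTable:
    """Append-only table: creates, edits and deletes all insert rows."""

    def __init__(self, first_version=1001):
        self.rows = []
        self._versions = count(first_version)  # assumed numbering scheme

    def write(self, id, object_id, app_event_id, value, deleted=0):
        row = Row(id, object_id, app_event_id, next(self._versions), deleted, value)
        self.rows.append(row)  # existing rows are never updated
        return row


table = RevisionedTable()
table.write(10, 10, 1, "Type:Person")                     # Create
table.write(101, 10, 1, "Name:Mike Fikri")                # Create
table.write(102, 10, 2, "Phone#:650-494-1574")            # Create
table.write(101, 10, 3, "Name:Michael Fikri")             # Edit
table.write(102, 10, 4, "Phone#:650-494-1574", deleted=1) # Delete
```

Note that the delete is just another inserted row with `deleted` set, which is why the earlier phone number remains available to historical loads.
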
    12. Loading an object
        op      id   object_id  app_event_id  version  deleted  value
        Create  10   10         1             1001     0        Type:Person
        Create  101  10         1             1002     0        Name:Mike Fikri
        Create  102  10         2             1003     0        Phone#:650-494-1574
        Edit    101  10         3             1004     0        Name:Michael Fikri
        Delete  102  10         4             1005     1        Phone#:650-494-1574

        App Event #    Object #10    Property #101        Property #102
        1              Type: Person  Name: Mike Fikri
        2                                                 Phone#: 650-494-1574
        3                            Name: Michael Fikri
        4                                                 Phone#: 650-494-1574 (deleted)
    13. Loading an object
        App Event #    Object #10    Property #101        Property #102
        1              Type: Person  Name: Mike Fikri
        2                                                 Phone#: 650-494-1574
        3                            Name: Michael Fikri
        4                                                 Phone#: 650-494-1574 (deleted)
        5
        • Choose all rows that
          – Belong to object 10
    14. Loading an object
        App Event #    Object #10    Property #101        Property #102
        1              Type: Person  Name: Mike Fikri
        2                                                 Phone#: 650-494-1574
        3                            Name: Michael Fikri
        4                                                 Phone#: 650-494-1574 (deleted)
        5              Type: Person  Name: Michael Fikri  Phone#: 650-494-1574 (deleted)
        • Choose all rows that
          – Belong to object 10
          – Have not been superseded by a later version
    15. Loading an object
        App Event #    Object #10    Property #101        Property #102
        1              Type: Person  Name: Mike Fikri
        2                                                 Phone#: 650-494-1574
        3                            Name: Michael Fikri
        4                                                 Phone#: 650-494-1574 (deleted)
        5              Type: Person  Name: Michael Fikri
        • Choose all rows that
          – Belong to object 10
          – Have not been superseded by a later version
          – Are not deleted
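
The three selection rules on slides 13 to 15 can be sketched as a single pass over the rows. This is a minimal in-memory illustration, not the query the product runs; `load_current` is a hypothetical name, and "superseded" is interpreted as "another row for the same component ID has a higher app_event_id".

```python
from typing import NamedTuple


class Row(NamedTuple):
    id: int
    object_id: int
    app_event_id: int
    version: int
    deleted: int
    value: str


# Rows from the "Modifying an object" slide.
ROWS = [
    Row(10, 10, 1, 1001, 0, "Type:Person"),
    Row(101, 10, 1, 1002, 0, "Name:Mike Fikri"),
    Row(102, 10, 2, 1003, 0, "Phone#:650-494-1574"),
    Row(101, 10, 3, 1004, 0, "Name:Michael Fikri"),
    Row(102, 10, 4, 1005, 1, "Phone#:650-494-1574"),
]


def load_current(rows, object_id):
    latest = {}
    for row in rows:
        if row.object_id != object_id:      # rule 1: belongs to the object
            continue
        held = latest.get(row.id)
        if held is None or row.app_event_id > held.app_event_id:
            latest[row.id] = row            # rule 2: not superseded
    # rule 3: drop components whose latest row is a delete
    return {cid: row.value for cid, row in latest.items() if row.deleted == 0}


current = load_current(ROWS, 10)
```

The deleted check has to run after the supersession check; filtering deleted rows first would resurrect the phone number from app event 2.
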
    16. Loading an object from history
        App Event #    Object #10    Property #101        Property #102
        1              Type: Person  Name: Mike Fikri
        2                                                 Phone#: 650-494-1574
        3                            Name: Michael Fikri
        4                                                 Phone#: 650-494-1574 (deleted)
        • To load the version of Mike at app event 2, choose all rows that
          – Belong to object 10
          – Have not been superseded by a later version at or before app event 2
          – Are not deleted
    17. Loading an object from history
        App Event #    Object #10    Property #101        Property #102
        1              Type: Person  Name: Mike Fikri
        2                                                 Phone#: 650-494-1574
        3                            Name: Michael Fikri
        4                                                 Phone#: 650-494-1574 (deleted)
        Loaded at app event 2: Type: Person, Name: Mike Fikri, Phone#: 650-494-1574
        • To load the version of Mike at app event 2, choose all rows that
          – Belong to object 10
          – Have not been superseded by a later version at or before app event 2
          – Are not deleted
    18. Loading an object from history
        App Event #    Object #10    Property #101        Property #102
        1              Type: Person  Name: Mike Fikri
        2                                                 Phone#: 650-494-1574
        3                            Name: Michael Fikri
        4                                                 Phone#: 650-494-1574 (deleted)
        Loaded at app event 2: Type: Person, Name: Mike Fikri, Phone#: 650-494-1574
        • To load the version of Mike at app event 2, choose all rows that
          – Belong to object 10
          – Have not been superseded by a later version at or before app event 2
          – Are not deleted
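
A historical load differs from a current load only by a cutoff on the app event. The sketch below assumes the cutoff is inclusive, since the slide's result at app event 2 contains the phone number written at app event 2; `load_as_of` is a hypothetical name.

```python
from typing import NamedTuple


class Row(NamedTuple):
    id: int
    object_id: int
    app_event_id: int
    version: int
    deleted: int
    value: str


ROWS = [
    Row(10, 10, 1, 1001, 0, "Type:Person"),
    Row(101, 10, 1, 1002, 0, "Name:Mike Fikri"),
    Row(102, 10, 2, 1003, 0, "Phone#:650-494-1574"),
    Row(101, 10, 3, 1004, 0, "Name:Michael Fikri"),
    Row(102, 10, 4, 1005, 1, "Phone#:650-494-1574"),
]


def load_as_of(rows, object_id, as_of):
    latest = {}
    for row in rows:
        # Ignore other objects and anything written after the cutoff.
        if row.object_id != object_id or row.app_event_id > as_of:
            continue
        held = latest.get(row.id)
        if held is None or row.app_event_id > held.app_event_id:
            latest[row.id] = row
    return {cid: row.value for cid, row in latest.items() if row.deleted == 0}


at_event_2 = load_as_of(ROWS, 10, as_of=2)
at_event_4 = load_as_of(ROWS, 10, as_of=4)
```

Because rows are never updated, no separate history table is needed: the same rows answer both the current and the "what we knew when" question.
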
    19. Speed of operations
        • Time efficient
          – Edits
          – Fetch current version of object
          – Fetch past version of object
          – Fetch all changes in a given time range
        • Space efficient
          – No overhead if object never changed
          – Only one row overhead per change
    20. Revisioning Database features
        • Online object history
        • Separate spaces for analysis
        • Granular sharing of changes
    21. Separate spaces for analysis
        • Explore competing hypotheses
        • Edit information without affecting others
        • Control over visibility of others’ edits
        • Solution:
          – Let analysts “check out” a copy of the data
          – Similar to source control systems like SVN or CVS
    22. A realm of one’s own
        • A realm is a complete view of all objects
          – Each investigation
          – Data Repository
        • Investigations inherit data from the Repository
          – Load and view Repository data
          – Search Repository data
    23. Down to the database
        • Revisioning fields:
          – id: long-living component ID
          – object_id: ties the component to its object
          – realm_id: the realm that this row is in
          – app_event_id: totally ordered timestamp
          – version: primary key, based on ID and app_event_id
          – deleted: 0 if not deleted; non-zero if deleted
    24. Down to the database
        • Revisioning fields:
          – id: long-living component ID
          – object_id: ties the component to its object
          – realm_id: the realm that this row is in
          – app_event_id: totally ordered timestamp
          – version: primary key, based on ID and app_event_id
          – deleted: 0 if not deleted; non-zero if deleted
    25. Making inheritance work
        • Copy-on-read approach
          – Copy on first viewing
          – Edit as normal
          – Expensive for large objects or many objects
        • Copy-on-write
          – Write a pointer to the right object state
          – Edits supersede pointed-to rows
          – Super-space efficient; no redundant info
    26. The realm_object relation
        • Pointer is relation between objects and realm
        • realm_object row:
          – object_id: the object that we are locking
          – realm_id: the investigative realm
          – source_realm_id: the Data Repository
          – source_realm_app_event_id: the app event that the object is locked into
        • Realm_objects are object components
        • Realm_objects are revisioned!
    27. Object modification
        • Object addition to realm
          – Write one realm_object row
          – source_realm_app_event_id is the current app event
        • Edit
          – Write only changed rows to investigative realm
          – Need not write full object
          – Data Repository untouched
    28. Loading an object
        • If there is no realm_object row,
          – Load the object from the Data Repository at the latest app event
        • If there is a realm_object row,
          – Load the object from the Data Repository at the app event specified
          – Load all rows from the investigative realm at the current app event
          – Investigative rows supersede Data Repository rows
    29. Loading with realm objects
        App Event #    Object #10    Property #101        Property #102
        1              Type: Entity  Name: Mike Fikri
        2                                                 Phone#: 650-494-1574
        3                            Name: Michael Fikri
        4                                                 Phone#: 650-494-1574 (deleted)
        5
        Data Repo      Type: Entity  Name: Michael Fikri

        App Event #    Object #10    Property #101        Property #102
        6              Type: Person
        Investigation  Type: Person
    30. Loading with realm objects
        Realm          Object #10    Property #101        Property #102
        Data Repo      Type: Entity  Name: Michael Fikri
        Investigation  Type: Person
        Loaded         Type: Person  Name: Michael Fikri
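
The copy-on-write load on slides 28 to 30 can be sketched as "repository state at the locked app event, overlaid with the investigation's own rows". The code below is an assumption-laden illustration: `latest_rows` and `load` are hypothetical helpers, the realm IDs 1 and 100 are borrowed from the publish slides, and the current realm_object row is passed in directly rather than being resolved from its own revision history.

```python
from typing import NamedTuple


class Row(NamedTuple):
    id: int
    object_id: int
    realm_id: int
    app_event_id: int
    deleted: int
    value: str


class RealmObject(NamedTuple):
    object_id: int
    realm_id: int
    source_realm_id: int
    source_realm_app_event_id: int


REPO, INVESTIGATION = 1, 100  # realm IDs as used on the publish slides


def latest_rows(rows, object_id, realm_id, as_of):
    """Latest row per component in one realm, at or before as_of."""
    latest = {}
    for row in rows:
        if (row.object_id, row.realm_id) != (object_id, realm_id):
            continue
        if row.app_event_id > as_of:
            continue
        held = latest.get(row.id)
        if held is None or row.app_event_id > held.app_event_id:
            latest[row.id] = row
    return latest


def load(rows, object_id, realm_id, pointer, now):
    if pointer is None:
        # No realm_object row: read the Data Repository at the latest event.
        merged = latest_rows(rows, object_id, REPO, now)
    else:
        # Repository state at the locked event, then investigative overrides.
        merged = latest_rows(rows, object_id, pointer.source_realm_id,
                             pointer.source_realm_app_event_id)
        merged.update(latest_rows(rows, object_id, realm_id, now))
    return {cid: row.value for cid, row in merged.items() if row.deleted == 0}


rows = [
    Row(10, 10, REPO, 1, 0, "Type:Entity"),
    Row(101, 10, REPO, 1, 0, "Name:Mike Fikri"),
    Row(102, 10, REPO, 2, 0, "Phone#:650-494-1574"),
    Row(101, 10, REPO, 3, 0, "Name:Michael Fikri"),
    Row(102, 10, REPO, 4, 1, "Phone#:650-494-1574"),
    Row(10, 10, INVESTIGATION, 6, 0, "Type:Person"),  # the only investigative edit
]
pointer = RealmObject(10, INVESTIGATION, REPO, 5)  # locked at app event 5
loaded = load(rows, 10, INVESTIGATION, pointer, now=6)
```

Only one investigative row exists for the whole object, which is the space saving the copy-on-write slide describes.
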
    31. Speed of operations
        • Time efficient
          – Edits
          – Fetch current version of object
          – Fetch past version of object
          – Fetch all changes in a given time range
        • Space efficient
          – Almost all data shared between realms
          – One row written per edit
    32. Revisioning Database features
        • Online object history
        • Separate spaces for analysis
        • Granular sharing of changes
    33. Granular data sharing
        • Changes developed in one realm need to be shared with other analysts
        • Data sharing must be per-object
        • Update
          – Bringing changes from the Data Repository into an investigation
        • Publish
          – Moving changes from an investigation into the Data Repository
    34. Update
        • Edit the realm_object
        • Insert new realm_object row:
          – object_id: unchanged
          – realm_id: unchanged
          – source_realm_id: unchanged
          – source_realm_app_event_id: overwrite to the new app event that the object is locked into
          – Set other revisioning fields appropriately
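
Since realm_objects are themselves revisioned, an Update can be sketched as appending one new pointer row rather than changing the old one. The `update` function and the `app_event_id` field on the pointer are illustrative assumptions; the slide only says to "set other revisioning fields appropriately".

```python
from typing import NamedTuple


class RealmObject(NamedTuple):
    object_id: int
    realm_id: int
    source_realm_id: int
    source_realm_app_event_id: int
    app_event_id: int  # assumed: when this pointer row itself was written


def update(pointers, object_id, realm_id, new_source_event, app_event_id):
    """Re-lock an object onto a newer Data Repository app event."""
    current = max(
        (p for p in pointers
         if p.object_id == object_id and p.realm_id == realm_id),
        key=lambda p: p.app_event_id,
    )
    # object_id, realm_id and source_realm_id carry over unchanged.
    newer = current._replace(
        source_realm_app_event_id=new_source_event,
        app_event_id=app_event_id,
    )
    pointers.append(newer)  # insert only; the old pointer row stays as history
    return newer


pointers = [RealmObject(10, 100, 1, 5, 5)]
update(pointers, 10, 100, new_source_event=8, app_event_id=9)
```

Keeping the earlier pointer row means the investigation's own history still records which repository state it was looking at before the update.
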
    35. Publish
        • Two-phase process
        • Data Repository
          – Copy rows from investigative realm
          – Only need to copy rows that are still current
        • Investigative realm
          – Need to allow Data Repository rows to be authoritative
          – Insert rows with deleted = -1
          – Rows with deleted = -1 do not supersede Data Repository rows
    36. Publish
        id   object_id  realm_id  app_event_id  deleted  value
        10   10         100       1             0        Type:Person
        101  10         100       1             0        Name:Mike Fikri
        102  10         100       2             0        Phone#:650-494-1574
        101  10         100       3             0        Name:Michael Fikri
        102  10         100       4             1        Phone#:650-494-1574
    37. Publish: Copy rows to Repository
        id   object_id  realm_id  app_event_id  deleted  value
        10   10         100       1             0        Type:Person
        101  10         100       1             0        Name:Mike Fikri
        102  10         100       2             0        Phone#:650-494-1574
        101  10         100       3             0        Name:Michael Fikri
        102  10         100       4             1        Phone#:650-494-1574

        id   object_id  realm_id  app_event_id  deleted  value
        10   10         1         5             0        Type:Person
        101  10         1         5             0        Name:Michael Fikri
    38. Publish: Delete rows in investigation
        id   object_id  realm_id  app_event_id  deleted  value
        10   10         100       1             0        Type:Person
        101  10         100       1             0        Name:Mike Fikri
        102  10         100       2             0        Phone#:650-494-1574
        101  10         100       3             0        Name:Michael Fikri
        102  10         100       4             1        Phone#:650-494-1574
        10   10         100       6             -1       Type:Person
        101  10         100       6             -1       Name:Michael Fikri

        id   object_id  realm_id  app_event_id  deleted  value
        10   10         1         5             0        Type:Person
        101  10         1         5             0        Name:Michael Fikri
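
The two publish phases on slides 35 to 38 can be reproduced with a short sketch. `publish` is a hypothetical function operating on an in-memory list; it covers only the case shown on the slides, where the still-current, non-deleted investigative rows are copied, and it does not address publishing a deletion of a component that already exists in the Data Repository.

```python
from typing import NamedTuple


class Row(NamedTuple):
    id: int
    object_id: int
    realm_id: int
    app_event_id: int
    deleted: int
    value: str


REPO, INVESTIGATION = 1, 100
PUBLISHED = -1  # marker: row no longer supersedes Data Repository rows


def publish(rows, object_id, realm_id, repo_event, realm_event):
    latest = {}
    for row in rows:
        if (row.object_id, row.realm_id) != (object_id, realm_id):
            continue
        held = latest.get(row.id)
        if held is None or row.app_event_id > held.app_event_id:
            latest[row.id] = row
    current = [row for row in latest.values() if row.deleted == 0]
    # Phase 1: copy the still-current rows into the Data Repository.
    rows.extend(r._replace(realm_id=REPO, app_event_id=repo_event)
                for r in current)
    # Phase 2: mark them in the investigation so the repository copy wins.
    rows.extend(r._replace(app_event_id=realm_event, deleted=PUBLISHED)
                for r in current)


rows = [
    Row(10, 10, INVESTIGATION, 1, 0, "Type:Person"),
    Row(101, 10, INVESTIGATION, 1, 0, "Name:Mike Fikri"),
    Row(102, 10, INVESTIGATION, 2, 0, "Phone#:650-494-1574"),
    Row(101, 10, INVESTIGATION, 3, 0, "Name:Michael Fikri"),
    Row(102, 10, INVESTIGATION, 4, 1, "Phone#:650-494-1574"),
]
publish(rows, 10, INVESTIGATION, repo_event=5, realm_event=6)
```

Five investigative rows produce only two repository rows here, which matches the slide's point that publish often writes considerably fewer rows than the original edits.
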
    39. Loading after publish
        App Event #    Object #10    Property #101        Property #102
        1              Type: Person  Name: Mike Fikri
        2                                                 Phone#: 650-494-1574
        3                            Name: Michael Fikri
        4                                                 Phone#: 650-494-1574 (deleted)
        6              Type: Person  Name: Michael Fikri  (both rows have deleted = -1)
        Investigation

        App Event #    Object #10    Property #101        Property #102
        5              Type: Person  Name: Michael Fikri
        Data Repo      Type: Person  Name: Michael Fikri
    40. Loading after publish
        Realm          Object #10    Property #101        Property #102
        Data Repo      Type: Person  Name: Michael Fikri
        Investigation
        Loaded         Type: Person  Name: Michael Fikri
    41. Speed of operations
        • Time efficient
          – Updates
          – Publish is bulk in-database insert
        • Space efficient
          – Updates
          – Publish writes no more rows than the original
          – Often considerably fewer
    42. Revisioning Database features
        • Online object history
        • Separate spaces for analysis
        • Granular sharing of changes
    43. Other issues not covered
        • Branching history in investigations
        • Loading links
        • Conflict resolution
        • Synchronizing external indices (search server)
    44. Palantir Revisioning Database
        Bob McGrew, Director of Engineering
        © 2008 Palantir Technologies Inc. All rights reserved.
