NoSQL Object DB & NewSQL Columnar DB, A Tale of Two Databases

Derek Laufenberg, Director of Systems Engineering, Actian/Versant

A Database Month event:
http://www.DatabaseMonth.com/database/actian-versant


  1. Derek Laufenberg, derek.laufenberg@actian.com, 262-754-4792
  2. “It was the best of times ….” With apologies to Dickens, today there are many choices in data management. It is truly a “best of times” moment for choice. That choice is a double-edged sword: databases are not created equal, and not all problems are created equal either. Database designs have inherent tradeoffs forced by the problem the DMS was intended to solve. Selecting the wrong technology can doom a project at worst, or end up costing it millions over the lifetime of the application.
  3. This talk isn't going to identify a “best database” between these two technologies; as we will see, best is determined by the fit to the particular problem being solved. What I hope you will gain from our time today is a better understanding of the core components, design tradeoffs, and intended use cases, so you can make better choices on your next data management project.
  4. Credit – Landscape according to 451 Group, 2012. Introduction. Databases have been around for over 50 years. From the beginning of electronic computation, data storage has always been fraught with challenges: what to store, the format in which to store it, how to retrieve it later, how to protect it, and how to share it. The challenges of persistence are persistent even today, after 60-odd years of computing. Being on the technical side of database sales for over 14 years, I've learned that “one size doesn't fit all” when it comes to data management. Different problems often demand different approaches. The last 5 or so years have given us an explosion in NewSQL and NoSQL technologies, all aimed at better solving some part of the persistence problem.
  5. Great summary: http://en.wikipedia.org/wiki/A_Tale_of_Two_Cities and http://www.sparknotes.com/lit/twocities/ I chose “A tale of two databases” as the title for today's talk, with apologies to Dickens, as a motivator to look at two very different database products within the Actian portfolio. Actian has a large offering of data management and integration products, and I encourage you to check out our website for the larger picture, but for this discussion we're going to focus on, and look under the covers of, only two products: Versant and ParAccel SMP (aka Vectorwise). We will see how they tick, and what makes one an operational database and the other a powerful analytic database. Both are enterprise databases, each with thousands of deployments, but what I find interesting as a systems engineer is where they share design concepts and the key areas where they differ.
  6. Architecture Overview. The flavor and color of a city is conveyed through its architecture and inhabitants; without straining the analogy, the style of a database is also understood through its architecture and the components from which it's made. We will see that, like any pair of modern cities, our two database protagonists have much in common, but there are important differences which should guide a systems designer's choice of technology. Cast of Characters – Our Cities: • Versant Object Database • ParAccel SMP, aka Vectorwise. A Tale of Two Cities is a story about, well, two cities, London and Paris, in the years around the French Revolution. The cities serve as the main characters, with their politics, geography, and inhabitants providing the details and coloring for the story. A third major character, or theme, in the Dickens novel is water.
  7. Daily life for early cities centered around the water. They were built on water to provide economic advantage and improve the quality of life. Water is life. Uncontrolled or uncontained, water can also be the ruin of a city. Early city inhabitants weren't always too careful with what was put into that life-giving river or lake. Fortunately, today we know how our water cycles work and are much more careful, even reclaiming the once mistreated bodies of water. In our modern-day story, our water is data. It flows, it changes, and it has a life cycle all its own. Data is life for companies today. How it is managed, shaped, and used by a company greatly affects its overall prosperity. Today a company's information is just as important. Like water, care must be taken to both store it and let it flow, creating value from its huge potential. ** The kite boarder pictured is the author enjoying water's potential on Lake Michigan.
  8. If our protagonists are the databases, our story needs some form of antagonists which our technical heroes can overcome. Data management projects have different concerns, and the tools used for the project must match the concerns.
  9. Application types can often fall into one of these broad categories. Model driven: thinking about your problem domain in classes and modelling in OO, including complex models in OO. Data driven: common rules used by many applications or reports. Aggregations found in reporting or data warehouses are a particular strength of Vectorwise. Copyright © 2013 Actian Corporation. Duplication prohibited. 19.06.2013
  12. Vectorwise is typically deployed at the heart of the BI/Reporting system to provide high speed reporting. Actian partners with the leading BI & Reporting vendors.
  13. Please forgive the marketing here, but the cost-effective commodity hardware shows how well Vectorwise's redesigned query engine takes advantage of the new CPU and multi-core designs. More on this later.
  14. Versant, on the other hand, is all about dealing with really complicated problem domains. The class diagram above shows just a few classes. Typical applications have hundreds, even thousands, of classes.
  15. A picture can explain the complexity better. This is actually a map of the schema of the SID (Shared Information and Data) model. Deep inheritance, sometimes 15 levels or more. Collections all over, most of them polymorphic.
  16. With those typical use cases in mind, let's see how these technologies approach the data management problem.
  17. Databases share some common structures when viewed at a high level. The common elements come from the fact that they are solving the same problems using different means or with a different focus. But these common structures vary greatly in their implementation, and the tradeoffs are what make one system excel at fast execution of ad hoc queries and another at navigating a complex telecommunications network.
  18. The data models employed by these two systems again have some similarities, albeit with different naming conventions and a few wrinkles in how their respective schemas are defined. Both systems support the basic data types (chars, ints, floats, strings) with minor variances in width. In both systems, these basic types are used to compose more complex structures, tables or classes, which on the surface look pretty similar. The Vectorwise data model is based on the SQL standard and supports most of the SQL types. The data definition language (DDL) and data manipulation language (DML) are both SQL: it is used to create table definitions and to insert, update, or delete rows. We won't be going into the SQL details here, as most people are familiar with the model, but let's compare it to the Versant model, because here we see some major differences.
  19. Where the two data models diverge is in the object database's need to support abstractions commonly found in object-oriented programming languages. These concepts include pointers, type inheritance, and collections. This doesn't imply that these concepts can't be expressed in an RDB like Vectorwise; in fact, ORM tools like JPA or Hibernate help manage the persistence problem by hiding the relational nature and SQL from the application developer. However, this hiding isn't without considerable cost in operational friction, also known as impedance mismatch in the OODB literature.
  20. We see here the central SQL focus for Vectorwise.
  21. With Versant, we see the application client built with object management resources: cache, transaction manager, and transport over the network. Part of the friction comes from dealing with the OO concepts mentioned above. The Versant back end supports these abstractions innately, which is best understood with an example.
  23. Our Versant Object Database Server, together with the respective client API, stores the objects (instances of application classes) directly in database storage. Typically, objects have references to other objects of varying types, including base class or interface types. Once stored, this network of objects, or any part of it, can be retrieved later by queries, followed by navigation across object references in the respective language. Only the objects accessed during a transaction are loaded into the client-side cache. Once a method is called on a reference to a not-yet-loaded object, that object is retrieved from the server by a lookup based on its type-independent logical object ID.
  24. The persistent capable class model of the application corresponds to the schema of the database. Persistent capable classes are marked in the source code, or are listed in a configuration file. Our tools read this information and generate the additional code that connects simple classes to our database system. We add the enhancer step. The enhancer takes the byte code of the application classes and adds the code that makes classes persistent capable and persistent aware, respectively. In the source code above, the marked lines create a database connection and control a transaction. Please note that only the Employee instance is made persistent explicitly. But because the Department and Phone instances are reachable from the Employee instance, they are made persistent as well. This is “persistence by reachability”. For the example, we'll use Java's de facto persistence standard, JPA, as our database binding language. With JPA we can highlight Versant's implementation details behind OO language abstractions. The DDL and DML for Versant is Java and the JPA API. This is truly a NoSQL interface to the database.
  25. Annotations within the Java code, coupled with an added compilation step to extract the schema, give the Java application a direct line into the database. With JPA, the persistent class's byte code is modified to support change tracking, data marshaling, cascading persistence, and on-demand object loading logic. Annotations indicate which classes are destined for the database and support the nuances of how attributes should be stored. Interestingly, with V/JPA you need far fewer attribute annotations because the database better understands OO concepts like inheritance and collections.
  26. Change tracking: we know which objects were modified in the current transaction, and we store them at commit. Transparent lazy loading: by default, objects are loaded only once they are dereferenced, that is, when a method is called on them. Persistence by reachability: new objects get stored if they are reachable from any already persisted object, so only the root object of a network of objects needs to be persisted explicitly. JPA is an ORM tool; JSR 220 was principally the work of the RDB community to eliminate the development friction found when using Java and JDBC to store complicated object models. Hiding the persistence implementation from the developer leads to more consistent and simpler programming. Object-relational mapping details are still needed, and many of the JPA annotations are used to identify special handling required for mapping a class into one or more tables. Versant has adopted JPA as the latest binding on top of its object database. Because the back end understands the object model natively, many mapping annotations aren't required.
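To make persistence by reachability concrete, here is a minimal sketch in plain Java. It does not use the Versant or JPA APIs; the `Node` class and `reachableFrom` method are hypothetical stand-ins that only show which objects a commit would pick up when the root alone is persisted explicitly.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Toy illustration of persistence by reachability; not the Versant or JPA API. */
class ReachabilityDemo {
    /** Hypothetical stand-in for a persistent capable object. */
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    /** Returns every object reachable from root, i.e. what a commit would store. */
    static Set<Node> reachableFrom(Node root) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> todo = new ArrayDeque<>();
        todo.push(root);
        while (!todo.isEmpty()) {
            Node n = todo.pop();
            if (seen.add(n)) {              // first visit: follow its references
                n.refs.forEach(todo::push);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Node employee = new Node("Employee");
        Node department = new Node("Department");
        Node phone = new Node("Phone");
        employee.refs.add(department);
        employee.refs.add(phone);
        department.refs.add(employee);      // cycles are handled by the seen set
        for (Node n : reachableFrom(employee)) {
            System.out.println("would be stored: " + n.name);
        }
    }
}
```

Only `employee` is handed to the traversal, yet `department` and `phone` are stored with it, which mirrors the Employee, Department, and Phone example in the notes.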
  28. Communications. Communications for both these systems are similar. A Java application for Vectorwise would use JDBC to query and return data sets, which could then be used to construct objects if required by the application's object model. A JPA O/RM layer could be used here to hide the data-set-to-object translation if desired, but that isn't really Vectorwise's nature; a more typical use would be a BI application accessing the contents. Versant JPA uses an internal protocol built with RPC against the object server to load or update objects within the JPA programming interface. Objects are marshaled in a binary form and instantiated in the JVM for use by the application. In some cases incomplete objects, called hollow objects, are created inside the VM, but the lazy loading protocol ensures they will be fully loaded prior to use by the application.
  29. Transactions are central to the operation of both systems. They are the means through which all data flows in and out of the server. Data creation, updates, deletions, and even schema manipulation itself are bounded by a transaction. In 1983, Haerder and Reuter coined the term ACID [1] to describe transactions. Both Versant and ParAccel SMP are ACID databases; however, they go about it through different mechanisms. This brings us to our next comparison, locking and versioning. [1] Haerder, T.; Reuter, A. (1983). "Principles of transaction-oriented database recovery". ACM Computing Surveys 15 (4): 287. doi:10.1145/289.291
  30. Locking vs Multiversioning. Versant uses a 2-phase locking protocol which gathers locks on all the objects being used to ensure no two transactions are attempting to write to the same data (object). This is mechanized with a locking table and transaction graph. Shared or read locks are collected as the transactions work with data. They are then followed up with update (semi-exclusive) or write (exclusive) locks when the transaction attempts to change the data. Deadlocks are detected, and a timeout prevents a transaction from waiting forever. With this approach, updates are done in place on the existing data. Very likely the same physical pages in memory and on disk are updated as the object was read from. The locks ensure transaction serialization. I should mention that Versant supports both pessimistic and optimistic locking schemes. Even optimistic locking uses the read and write locks temporarily as objects are read or the transaction commits. The counterpart in Vectorwise is a multiversion concurrency control (MVCC) system whereby each transaction sees a consistent database at a given point in time, a snapshot controlled by the transaction ID. A given transaction won't see a half-completed transaction operating on the same data because other transactions don't overwrite the original data; they create a new version with a later transaction ID to prevent contaminating earlier transactions. No locks or wait graphs need be maintained. Deleted and updated entries need to be purged if space is a concern.
  31. We have two different means of managing concurrency and serialization of transactions. The Versant method is historically similar to RDBMSs which support row and table locking. Vectorwise's MVCC increases throughput at the expense of data growth and the needed propagation events. If you require strict serialization of transactions, or want to limit growth, the locking model will suit your needs. If analytic speed and concurrent reads are your core concern, MVCC will be faster, at the possible cost of stale data. We are starting to see why Vectorwise is used for analytic, read-heavy reporting and Versant finds itself used for operational processing.
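The snapshot idea behind MVCC can be sketched in a few lines of Java. This is a toy under loud assumptions, not Vectorwise internals: it tracks a single value, treats every write as committed immediately, and uses the transaction ID as the snapshot point. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy multiversion store for one value; an illustration only. */
class MvccDemo {
    private static final class Version {
        final long txId;
        final int value;
        Version(long txId, int value) { this.txId = txId; this.value = value; }
    }

    private final List<Version> versions = new ArrayList<>();
    private long nextTxId = 1;

    /** Starts a transaction; its ID doubles as its snapshot point. */
    long begin() { return nextTxId++; }

    /** A write never overwrites: it appends a version tagged with the writer's ID. */
    void write(long txId, int value) { versions.add(new Version(txId, value)); }

    /** Reads the newest version whose ID is not later than the reader's snapshot. */
    Integer read(long snapshotTxId) {
        Version best = null;
        for (Version v : versions) {
            if (v.txId <= snapshotTxId && (best == null || v.txId >= best.txId)) best = v;
        }
        return best == null ? null : best.value;
    }

    /** Drops versions that no snapshot at or after oldestActiveTxId can see. */
    int purge(long oldestActiveTxId) {
        long keepFrom = Long.MIN_VALUE;
        for (Version v : versions) {
            if (v.txId <= oldestActiveTxId) keepFrom = Math.max(keepFrom, v.txId);
        }
        final long cutoff = keepFrom;
        int before = versions.size();
        versions.removeIf(v -> v.txId < cutoff);
        return before - versions.size();
    }

    public static void main(String[] args) {
        MvccDemo db = new MvccDemo();
        long writer1 = db.begin();
        db.write(writer1, 100);
        long reader = db.begin();           // snapshot taken before the next write
        long writer2 = db.begin();
        db.write(writer2, 200);
        System.out.println("old snapshot reads " + db.read(reader));
        System.out.println("new snapshot reads " + db.read(db.begin()));
        System.out.println("purged versions: " + db.purge(writer2));
    }
}
```

The reader that started before the second write keeps seeing the old value without taking any lock, and the superseded version lingers until `purge` runs, which is the data growth tradeoff described above.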
  32. One major difference between these technologies is in how they physically store data, both on disk and in memory. Of particular interest to me is Vectorwise's columnar approach, which is designed for pure analytical efficiency. This contrasts with the underlying storage model used by Versant, which is similar to what is found in many database systems. Versant uses the older N-ary Storage Model design, but there are some interesting tricks it uses to optimize performance for networked object graphs. Common to most database storage systems are the concepts of volumes and pages. A volume is a collection of pages, and Versant can have as many volumes as needed for the database. A volume is mapped to a file and can be located on anything from raw devices to storage area network (SAN) drives. [DeWitt] [Zukowski] NSM = N-ary Storage Model: a row contains all columns. DSM = Decomposed Storage Model: N attributes are split into N vertical storage elements. PAX = Partition Attributes Across: multiple columns are stored on a page, but attributes are stored vertically. The Vectorwise block size must be set prior to table creation.
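The difference between NSM and DSM is easiest to see side by side. The sketch below is only an in-memory analogy in Java, with hypothetical field names (`id`, `region`, `amount`); it is not how either product lays out pages on disk.

```java
/** Toy contrast of row (NSM) and column (DSM) layouts. */
class LayoutDemo {
    /** NSM: one record holds all attributes of a row. */
    static final class Row {
        final int id;
        final String region;
        final double amount;
        Row(int id, String region, double amount) {
            this.id = id;
            this.region = region;
            this.amount = amount;
        }
    }

    /** DSM: each attribute lives in its own array; position i is row i. */
    static final class Columns {
        final int[] id;
        final String[] region;
        final double[] amount;
        Columns(Row[] rows) {
            id = new int[rows.length];
            region = new String[rows.length];
            amount = new double[rows.length];
            for (int i = 0; i < rows.length; i++) {
                id[i] = rows[i].id;
                region[i] = rows[i].region;
                amount[i] = rows[i].amount;
            }
        }
    }

    /** Aggregating over rows touches every record, including unused attributes. */
    static double sumRows(Row[] rows) {
        double sum = 0;
        for (Row r : rows) sum += r.amount;
        return sum;
    }

    /** Aggregating over a column reads one contiguous array and nothing else. */
    static double sumColumn(Columns c) {
        double sum = 0;
        for (double a : c.amount) sum += a;
        return sum;
    }

    public static void main(String[] args) {
        Row[] rows = {
            new Row(1, "north", 10.0), new Row(2, "south", 20.0), new Row(3, "north", 30.0)
        };
        System.out.println(sumRows(rows) + " == " + sumColumn(new Columns(rows)));
    }
}
```

Both methods return the same total; the point is that the columnar version only ever touches the `amount` values, which is the access pattern an aggregation-heavy report wants.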
  33. Versant does allow for variability on the page: multiple types of objects or variable-length structures. The min/max stats help reduce the number of columnar blocks that need to be evaluated for a query.
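As a rough illustration of how min/max statistics prune work, the following Java sketch keeps a minimum and maximum per block and skips any block whose range cannot satisfy the predicate. The `Block` class and the `value > bound` predicate are assumptions made for the example, not Vectorwise structures.

```java
import java.util.Arrays;

/** Toy min/max block pruning for a single integer column. */
class MinMaxDemo {
    /** A non-empty block of column values with its precomputed statistics. */
    static final class Block {
        final int[] values;
        final int min;
        final int max;
        Block(int... values) {
            this.values = values;
            this.min = Arrays.stream(values).min().getAsInt();
            this.max = Arrays.stream(values).max().getAsInt();
        }
    }

    /** Returns {matches, blocksScanned} for the predicate value > bound. */
    static int[] countGreaterThan(Block[] blocks, int bound) {
        int matches = 0;
        int scanned = 0;
        for (Block b : blocks) {
            if (b.max <= bound) continue;   // no value in this block can qualify
            scanned++;
            for (int v : b.values) {
                if (v > bound) matches++;
            }
        }
        return new int[] {matches, scanned};
    }

    public static void main(String[] args) {
        Block[] column = {
            new Block(1, 5, 9), new Block(10, 12, 15), new Block(20, 25, 30)
        };
        int[] result = countGreaterThan(column, 18);
        System.out.println(result[0] + " matches, " + result[1] + " of "
                + column.length + " blocks scanned");
    }
}
```

With data that arrives roughly sorted, as dates often do, most blocks are ruled out by their statistics alone and never have to be read.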
  34. Compression of data, both on disk and in RAM, reduces the IO bottlenecks that large data systems confront today. By decompressing into the CPU's cache, VW takes advantage of the processor IO. Column structure works really well for compression: similar data is grouped together, allowing VW to pick an optimal compression strategy. Here optimal means not just storage density, but also ease of decompression into the CPU cache for later processing.
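To show why grouping similar values helps, here is a small run-length encoding sketch in Java. Run-length encoding is chosen only because it is simple to read; the notes do not say which schemes Vectorwise actually uses, so treat this as an illustration of the general idea with hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy run-length encoding of an integer column. */
class RleDemo {
    /** Encodes the column as a list of {value, runLength} pairs. */
    static List<int[]> encode(int[] column) {
        List<int[]> runs = new ArrayList<>();
        for (int v : column) {
            int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == v) {
                last[1]++;                      // extend the current run
            } else {
                runs.add(new int[] {v, 1});     // start a new run
            }
        }
        return runs;
    }

    /** Decoding is a pair of tight loops, cheap enough to run close to the CPU. */
    static int[] decode(List<int[]> runs) {
        int length = 0;
        for (int[] run : runs) length += run[1];
        int[] out = new int[length];
        int i = 0;
        for (int[] run : runs) {
            for (int k = 0; k < run[1]; k++) out[i++] = run[0];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] column = {7, 7, 7, 7, 3, 3, 9};
        List<int[]> runs = encode(column);
        System.out.println(column.length + " values stored as " + runs.size() + " runs");
    }
}
```

A column holding a status code or a date compresses to a handful of runs, while the same values interleaved with other attributes in a row layout would not form runs at all.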
  36. Although Versant uses a traditional layout, where objects are located on a given data page, there are some tricks it uses to efficiently locate connected objects. As noted earlier, a volume is a collection of pages mapped to a file. Pages are further broken down into slots used to store object instances. Multiple object instances may be stored on a page and accessed through the object's slot location. Larger objects will span contiguous pages. Page size in Versant is a modest 16K bytes; this is often large enough to hold many objects and still small enough not to waste too much space on deleted objects. Normally, objects of the same type get stored in the same page in the next available slot, but as an optimization it is possible to co-locate a parent and its children on the same page. This extra effort results in extremely efficient object loading when the parent is frequently used with its children.
  37. The LOID is used to identify an object and represent references, but how is it used to locate an object internally? Central to accessing any object is the address translation of the LOID to a physical volume:page:slot triple. This triple identifies the object's location on disk, and the translation is accomplished by a multi-level hash table. It is highly optimized and cached in memory since it is used for accessing every object. On the client side, the red object is already loaded in the client cache. It contains references to two other objects, shown in grey, that are not yet loaded. If the application now calls a method on either of these, the LOID of that object is looked up in the client-side hash table. It has no address, so the LOID is sent to the server, where a lookup is done in the Association Table (AT). That lookup provides the physical object location in the respective data volume. The physical page is loaded into the server cache, and the object is sent back to the client and instantiated in client memory.
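The lookup path just described can be sketched as follows. This is a simplified model in Java, not Versant code: a plain `HashMap` stands in for the multi-level hash table, object state is a string, and all class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of LOID address translation with a client-side cache. */
class LoidDemo {
    /** Physical location of an object: volume, page, and slot. */
    static final class Location {
        final int volume;
        final int page;
        final int slot;
        Location(int volume, int page, int slot) {
            this.volume = volume;
            this.page = page;
            this.slot = slot;
        }
        @Override public String toString() { return volume + ":" + page + ":" + slot; }
    }

    /** Server side: maps each LOID to its location, then reads the object there. */
    static final class Server {
        private final Map<Long, Location> addressTable = new HashMap<>();
        private final Map<String, String> storage = new HashMap<>();
        int lookups = 0;

        void store(long loid, Location where, String state) {
            addressTable.put(loid, where);
            storage.put(where.toString(), state);
        }

        String fetch(long loid) {
            lookups++;
            Location where = addressTable.get(loid);
            if (where == null) throw new IllegalArgumentException("unknown LOID " + loid);
            return storage.get(where.toString());
        }
    }

    /** Client side: goes to the server only when the LOID is not yet cached. */
    static final class Client {
        private final Server server;
        private final Map<Long, String> cache = new HashMap<>();
        Client(Server server) { this.server = server; }

        String deref(long loid) {
            return cache.computeIfAbsent(loid, id -> server.fetch(id));
        }
    }

    public static void main(String[] args) {
        Server server = new Server();
        server.store(42L, new Location(1, 7, 3), "Employee{name=Ada}");
        Client client = new Client(server);
        client.deref(42L);                  // miss: one server round trip
        client.deref(42L);                  // hit: served from the client cache
        System.out.println("server lookups: " + server.lookups);
    }
}
```

Because the reference is a logical ID and not a physical address, the server is free to move an object to another page and only has to update its own table.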
  38. One final point about the LOID in Versant. LOID references are designed to be cross-database references. Here we have an application using 4 objects, but they come from two different databases. This gives the application designer great flexibility in deciding how to partition data. The application simply connects to all the databases involved in the cross-database references. Transactions use a 2-phase commit protocol.
  39. What good is a database without a means to find answers to our particular questions or efficiently service an application's demand for data? Like the other components we've looked at, there are some similarities between these two technologies, but also some big differences. Indexes in Vectorwise are typically not needed. Often, VW is set up so the compressed DB lives entirely in memory, and the automatic page indexes and the redesigned query engine are enough that scanning the data performs well without any index tuning. Versant, on the other hand, allows nearly any attribute or even collection to be indexed. Versant's query engine will then use the index automatically or with hints supplied by the user.
  40. Circa 2003: SQL vs C benchmark on TPC-H. The difference between the database and the custom C program is huge. Why is the overhead of using a database so high, and what's being left on the table? This difference started the X100 project, which set out to reclaim the 100-times loss in performance. We've seen the storage model change for VW, but let's look further at the query processing.
  41. Each level of the data handling was studied for performance loss. Compiler optimizations are easier to take advantage of in smaller units; they often don't get fully exploited in large programs. Modern CPUs have better instruction sets and larger on-chip caches which can be used for vector processing.
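One way to picture the "smaller units" point is to compare a tuple-at-a-time loop with a vector-at-a-time loop. The Java sketch below is an analogy only: the method names and the vector size are hypothetical, and it makes no claim about measured speedups or about how the X100 engine is actually written.

```java
import java.util.function.IntPredicate;

/** Toy contrast of tuple-at-a-time and vector-at-a-time filtering. */
class VectorDemo {
    /** Tuple-at-a-time: one call through an interface for every value. */
    static int countTupleAtATime(int[] column, IntPredicate keep) {
        int n = 0;
        for (int v : column) {
            if (keep.test(v)) n++;
        }
        return n;
    }

    /** A small primitive over one vector: a tight loop with no per-value calls. */
    static int countGreater(int[] column, int from, int to, int bound) {
        int n = 0;
        for (int i = from; i < to; i++) {
            if (column[i] > bound) n++;
        }
        return n;
    }

    /** Vector-at-a-time: the per-call overhead is paid once per batch of values. */
    static int countVectorAtATime(int[] column, int bound, int vectorSize) {
        int n = 0;
        for (int from = 0; from < column.length; from += vectorSize) {
            int to = Math.min(from + vectorSize, column.length);
            n += countGreater(column, from, to, bound);
        }
        return n;
    }

    public static void main(String[] args) {
        int[] column = new int[10];
        for (int i = 0; i < column.length; i++) column[i] = i;
        System.out.println(countTupleAtATime(column, v -> v > 6)
                + " == " + countVectorAtATime(column, 6, 4));
    }
}
```

The small primitive is the kind of unit a compiler can optimize well, and a vector sized to fit in the CPU cache keeps its working data close to the processor.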
  42. The result of the work: some 40x improvement.
  43. The work on Ingres is critical to Vectorwise (ParAccel SMP), as the main interface to both Ingres and Vectorwise is the same. It is not until the optimizer, which processes the SQL query and generates the X100 algebra, that the two components separate. After VW generates the result set, it is the Ingres components that make it available to the application. Ailamaki, DeWitt, Hill, Skounakis – Weaving Relations for Cache Performance.
  44. Taking all the performance features into account for VW query processing. This is great for reporting, where your data isn't changing frequently.
  45. With Versant, queries are typically used to locate the beginning of a graph or top-level objects. Once the starting point is identified, the connected objects are frequently retrieved by the application as required (lazy loading) or automatically with a default fetching. The group loading saves round trips to the server and is much more efficient on the network. On the Versant side, querying is done via OQL or JPQL. This example is JPQL. The Book has a simple collection, Authors, and we want to find an Author of “Smith”. Notice the syntax is a little SQL-like, but we operate directly on the collection Book.authors, using “auth” as a working variable. On execution, the Book extent would be searched for all the books with a Smith author. This would end up scanning all the books and evaluating the Authors collections, returning the object IDs for the matching books. ResultList holds the objects, and the rest of the Java program would process that list.
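The slide's query is not part of these notes, so the following is a reconstruction under assumptions: the JPQL string and the `name` attribute are hypothetical, and the Java code is a plain in-memory scan that mirrors the evaluation described above without calling any JPA or Versant API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy version of scanning the Book extent for books with a given author. */
class QueryDemo {
    /** A JPQL query of the kind described; attribute names are assumed. */
    static final String JPQL =
        "SELECT DISTINCT b FROM Book b JOIN b.authors auth WHERE auth.name = 'Smith'";

    static final class Author {
        final String name;
        Author(String name) { this.name = name; }
    }

    static final class Book {
        final String title;
        final List<Author> authors;
        Book(String title, Author... authors) {
            this.title = title;
            this.authors = Arrays.asList(authors);
        }
    }

    /** Scans every book and evaluates its authors collection, as the notes describe. */
    static List<Book> booksByAuthor(List<Book> extent, String name) {
        List<Book> resultList = new ArrayList<>();
        for (Book b : extent) {
            for (Author auth : b.authors) {
                if (auth.name.equals(name)) {
                    resultList.add(b);
                    break;                  // one matching author is enough
                }
            }
        }
        return resultList;
    }

    public static void main(String[] args) {
        List<Book> extent = Arrays.asList(
            new Book("Columns", new Author("Smith"), new Author("Jones")),
            new Book("Objects", new Author("Lee")),
            new Book("Graphs", new Author("Smith")));
        for (Book b : booksByAuthor(extent, "Smith")) {
            System.out.println(b.title);
        }
    }
}
```

In the database the inner loop follows stored references from each book to its authors, so no join table is involved in evaluating the collection.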
  46. The thing about relationships is that they don't change often. By baking them into the server's data structure and making them cheap to evaluate, Versant avoids join operations, which can be quite costly. If you look at typical ORM code, you see a fair amount of join activity whenever collection classes are involved. Following a few links down a list can end up with a very expensive group of joins, whereas managing the references with LOIDs allows for direct navigation to the object. The server takes advantage of this in query expressions that involve paths or collections, like the example.
  47. Closing Comments. This brings us to the end of our tale, and I hope you enjoyed our time together as much as I did. Each of the components we've examined should have given you insight into the design and tradeoffs made by the different engineering teams. When taken as a whole, they provide a consistent, powerful framework for solving hard real-world problems. Each of these products has thousands of users who rely on it for business-critical applications. The engineers who built those applications made strategic choices for the data management system at the heart of their project.
