Managing “Big Data” Application Complexity using CloudGraph®
Analysis and solutions for problems faced by HBase™ and other columnar data store client applications under the ever-increasing demand for domain model complexity
Scott Cinnamond, TerraMeta Software Inc.
http://cloudgraph.org

Complexity Increases With Added Data Model Entities
(for columnar data store client applications)
[Chart: Complexity vs. # Model Entities / Classes]

Why More App Complexity?
(with Added Data Model Entities)

1. Column Mapping Difficulty
2. Composite Row Key Mapping, Hashing, Salting and Formatting
3. Persistence Code Development, Refactoring and Maintenance

Typical Column Mapping Strategies
• Hard Coded Names Embedded in Source Code
– Not good

• Column Names in Java Constants File(s)
– Better, but still really hard coded
– Feasible with 5-10 entities, 50 attributes
– With 500-1000 entities and 5000+ attributes? Not maintainable

• Custom XML Configuration
– Create a “meta model” using, say, XML Schema and JAXB
– Construct unique names and refer to them in source
– Better, but an application-specific “one off”
– Does not solve “state” management challenges
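
The difference between the constants-file strategy and a metadata-driven one can be sketched in plain Java. This is only an illustration under stated assumptions: the class name, the tiny in-memory "meta model" and the `entity:attribute` qualifier pattern are all hypothetical and are not taken from CloudGraph.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: derive column qualifiers from a small meta model
// instead of hand-maintained constants. Not CloudGraph's actual API.
final class ColumnNameSketch {

    // Constants-file strategy: one hand-written constant per attribute.
    static final String MOLECULE_FORMULA = "molecule:formula";

    // Minimal stand-in for a meta model: entity name -> attribute names.
    static final Map<String, List<String>> MODEL = new LinkedHashMap<>();
    static {
        MODEL.put("molecule", List.of("id", "formula", "title"));
        MODEL.put("atom", List.of("id", "elementType"));
    }

    // Metadata-driven strategy: build the qualifier from the model and
    // fail fast when the entity or attribute is not part of it.
    static String qualifier(String entity, String attribute) {
        List<String> attributes = MODEL.get(entity);
        if (attributes == null || !attributes.contains(attribute)) {
            throw new IllegalArgumentException(
                "Unknown attribute " + entity + "." + attribute);
        }
        return entity + ":" + attribute;
    }

    public static void main(String[] args) {
        System.out.println(qualifier("atom", "elementType"));
    }
}
```

With 5000+ attributes the constants file grows by one line per attribute, while the derived form only grows the model itself.
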
CloudGraph Column Mapping
A Standards Based Approach Using SDO and UML

[Diagram] CloudGraph Stateful Column Key Factories:
• Marshalling
• Row Key Mapping
• Entity ID Mapping
• Sequence Management
• Data Graph “State”

Great, Still How Do We Keep Column Names Entirely Out Of CRUD Source Code?
• Create | Update | Delete: CloudGraph SDO API (Service Data Objects)
• Read (Query): CloudGraph Query DSL (Domain Specific Language)

CloudGraph SDO
Your complex domain model as a (create | update | delete) API
• Drives all Column Mapping Transparently
• Granular Control over Data Graph Edits
• Convenient “Create Entity” Factory Methods
• Change Tracking Including History
• Rich Built In Data Types
• 100% Compile Time Checking
• Supports Multiple Inheritance Models
• Currently Uses PlasmaSDO™
– See http://plasma-sdo.org
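
The "change tracking" bullet above can be illustrated with a small, self-contained sketch. It assumes a generic property-map data object; the class and method names are hypothetical and do not mirror the PlasmaSDO or CloudGraph API. The idea shown is only that a data object remembers the original value of each modified property, so a persistence layer can later write just the changed columns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of change tracking on a data object; names are
// illustrative and are not taken from PlasmaSDO or CloudGraph.
final class TrackedObject {
    private final Map<String, Object> values = new LinkedHashMap<>();
    // Original value per modified property, recorded on first change only.
    private final Map<String, Object> originals = new LinkedHashMap<>();

    void set(String property, Object value) {
        Object previous = values.put(property, value);
        // A null original is legal (newly set property), so test the key.
        if (!originals.containsKey(property)) {
            originals.put(property, previous);
        }
    }

    Object get(String property) {
        return values.get(property);
    }

    // Properties modified since the last commit, in modification order.
    List<String> changedProperties() {
        return new ArrayList<>(originals.keySet());
    }

    Object originalValue(String property) {
        return originals.get(property);
    }

    // Called after a successful write: the current state becomes the baseline.
    void commit() {
        originals.clear();
    }

    public static void main(String[] args) {
        TrackedObject molecule = new TrackedObject();
        molecule.set("formula", "H2O");
        molecule.commit();
        molecule.set("formula", "H2O2");
        System.out.println(molecule.changedProperties()
            + " was " + molecule.originalValue("formula"));
    }
}
```

This sketch does not detect a property being set back to its original value; a fuller implementation would likely compare values before reporting a change.
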
CloudGraph SDO API Example
Uses Chemical Markup Language (CML) 2.4
https://github.com/cloudgraph/cml
CloudGraph Query DSL
Your complex domain model as a query API
• Drives all Column Mapping Transparently
• Intuitive, Almost “Fluent” English Appearance
• Logical Entity, Attribute Names Generated into API
• 100% Compile Time Checking
• Currently Uses PlasmaQuery®
– See http://plasma-query.org
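
To make "logical names generated into the API" and "compile time checking" concrete, here is a minimal sketch of what a generated, typed query class could look like. It is an assumption for illustration only: `MoleculeQuery`, its accessors and the rendered filter syntax are hypothetical and are not the PlasmaQuery API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a generated, typed query class. The names are
// illustrative only and are not taken from PlasmaQuery or CloudGraph.
final class MoleculeQuery {
    private final List<String> predicates = new ArrayList<>();

    // One accessor per logical attribute would be generated from the model,
    // so a misspelled attribute name becomes a compile error.
    StringProperty formula() { return new StringProperty("formula"); }
    StringProperty title() { return new StringProperty("title"); }

    // Typed property: only string comparisons are offered here.
    final class StringProperty {
        private final String name;

        StringProperty(String name) { this.name = name; }

        MoleculeQuery eq(String value) {
            predicates.add(name + " = '" + value + "'");
            return MoleculeQuery.this;
        }

        MoleculeQuery like(String pattern) {
            predicates.add(name + " like '" + pattern + "'");
            return MoleculeQuery.this;
        }
    }

    // Render the logical filter; a real service would translate it into
    // column lookups and scans rather than a string.
    String where() {
        return String.join(" and ", predicates);
    }

    public static void main(String[] args) {
        MoleculeQuery query = new MoleculeQuery()
            .formula().eq("H2O")
            .title().like("wat*");
        System.out.println(query.where());
    }
}
```

Note that no physical column name appears anywhere in the calling code; only logical attribute names do.
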
CloudGraph Query DSL Example
Uses Chemical Markup Language (CML) 2.4

https://github.com/cloudgraph/cml
Why More Complexity?
2.) Composite Row Key Mapping, Hashing and Formatting
• More Model Entities:
– Larger data graphs
– More composite row key fields needed to find graphs
– How to reliably map “deep” into graphs

• Row Key Field Hashing and Formatting
– Critical for HBase partial-key scan API
– Many data type specific idiosyncrasies
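
Why hashing and formatting matter for partial-key scans can be shown with a short sketch. The key layout (a 4-character hash prefix, a `|` delimiter, a zero-padded numeric field) and all names below are assumptions chosen for illustration, not CloudGraph's configuration-driven format. Fixed-width formatting keeps lexicographic order aligned with numeric order, and a deterministic prefix lets a client rebuild the leading part of a key for a prefix scan.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical sketch of composite row key assembly; the layout and names
// are illustrative assumptions, not CloudGraph's row key format.
final class RowKeySketch {
    static final String DELIMITER = "|";

    // Deterministic 4-hex-character prefix derived from the leading field,
    // so the same field value always yields the same key prefix.
    static String hashPrefix(String field) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest(field.getBytes(StandardCharsets.UTF_8));
            return String.format(Locale.ROOT, "%02x%02x",
                digest[0] & 0xff, digest[1] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Zero-pad to a fixed width so that string order equals numeric order.
    static String pad(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative id: " + value);
        }
        return String.format(Locale.ROOT, "%019d", value);
    }

    // Full key: hash prefix, organization, padded id.
    static String compose(String organization, long id) {
        return partialKey(organization) + pad(id);
    }

    // Leading part of the key, usable as the start of a prefix scan.
    static String partialKey(String organization) {
        return hashPrefix(organization) + DELIMITER + organization + DELIMITER;
    }

    public static void main(String[] args) {
        System.out.println(compose("acme", 42));
        System.out.println(partialKey("acme"));
    }
}
```

Without the padding, an id of 10 would sort before an id of 9 as a string, which is one of the data type specific idiosyncrasies a scan has to account for.
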
CloudGraph HBase Composite Row Keys
A Configuration Driven Approach using SDO XPath

[Diagram] CloudGraph Composite Row Keys:
• Hierarchical Row Filters
• Fuzzy Row Filter
• Partial Key Assembly
• Scan Support
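
The fuzzy row filter idea, matching keys where only some positions are known, can be sketched without any HBase dependency. The mask convention below (0 = byte must match, 1 = any byte) follows the one documented for HBase's `FuzzyRowFilter`; treat that as an assumption to check against the HBase version in use. The class and the key layout are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, dependency-free sketch of fuzzy row key matching.
// Mask convention assumed here: 0 = fixed byte, 1 = wildcard byte.
final class FuzzyMatchSketch {

    // True when every fixed position of the pattern equals the row key byte.
    // Bytes of the row key beyond the pattern length are not examined.
    static boolean matches(byte[] rowKey, byte[] pattern, byte[] mask) {
        if (pattern.length != mask.length) {
            throw new IllegalArgumentException("pattern and mask differ in length");
        }
        if (rowKey.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && rowKey[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    static byte[] ascii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Unknown 4-byte hash prefix, then a known "|acme" segment.
        byte[] pattern = ascii("????|acme");
        byte[] mask = {1, 1, 1, 1, 0, 0, 0, 0, 0};
        System.out.println(matches(ascii("a1b2|acme|0000000042"), pattern, mask));
        System.out.println(matches(ascii("a1b2|zeta|0000000042"), pattern, mask));
    }
}
```

This is useful when a leading key field is unknown at query time, a case a plain prefix scan cannot serve.
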
Why More Complexity?
3.) Persistence Code Development, Refactoring and Maintenance

• Small Domain Model (e.g. CML, 164 Entities): 95,000 Lines
• “Average” Custom Domain Model (e.g. 300 Entities): 174,000 Lines

* Example from UML conversion from XML Schema of BIOXSD – see http://bioxsd.org/
** Example from UML adaptation of HL7 POCD/HD000040 Clinical Document
*** Example from UML conversion from XML Schema of Chemical Markup Language 2.4 – see http://xmlcml.org

CloudGraph Code Generation
A contract-first approach in 4 steps
1. Leverage Existing or Create UML Model(s)
– Can be automatically reverse engineered from existing RDBMS Schema
2. Map Repository Namespaces to Service Configurations
3. Define and Map Row Keys To Data Graphs
4. Add CloudGraph and Plasma Maven Artifacts and Generate Code

Resources
• Exchange Model Examples
– https://github.com/cloudgraph/cml
– https://github.com/cloudgraph/bioxsd
– https://github.com/cloudgraph/hl7

• End To End Examples
– https://github.com/cloudgraph/wordnet
– http://wordnet.cloudgraph.org
Status/Legal
• Project Status
– CloudGraph® is currently in private beta testing
– Other services for Cassandra, MongoDB and others are
under analysis
– See http://cloudgraph.org for contact info and other details

• Licensing
– CloudGraph® 0.5.5 Community Edition (CE) is open source
licensed under version 2 of the GNU General Public License

• Trademarks
– CloudGraph® is a registered trademark of TerraMeta
Software LLC
– Java™ is a trademark of Oracle Corporation
– HBase™ is a trademark of Apache Software Foundation
Copyright © TerraMeta Software, Inc – 2012,2013 – All Rights Reserved
References
• BIOXSD – http://bioxsd.org
• Chemical Markup Language (CML) – http://xmlcml.org
• Health Level 7 (HL7) – http://hl7.org
• Apache HBase™ – http://hbase.apache.org
• Apache Cassandra – http://cassandra.apache.org
• MongoDB – http://www.mongodb.org
• PlasmaSDO™ – http://plasma-sdo.org, http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22plasma-sdo%22
