1) The document describes a five-step process for converting a MySQL database containing the Wordnet lexical database to Apache HBase. The steps include modeling the database in UML, generating Java code, mapping the data to HBase tables, migrating the data from MySQL to HBase, and building services and a web application.
2) It provides details on each step, including reverse engineering the Wordnet schema to UML, generating Java code for persistence and queries, configuring row keys and mapping the data model to HBase tables, developing an incremental migration tool, and creating a sample web application for Wordnet queries.
3) The results show Wordnet queries returning related data from HBase in under 200 milliseconds.
1. CloudGraph ®
MySQL to HBase in 5 Steps
Converting MySQL or Oracle databases to Apache HBase™ with on-line
examples using the popular Wordnet® dictionary
Scott Cinnamond – TerraMeta Software Inc.
http://cloudgraph.org
2. What is Wordnet®?
• Large, complex lexical (MySQL) database of
English.
• Nouns, verbs, adjectives and adverbs
grouped into sets of cognitive synonyms
(synsets), each expressing a distinct
concept.
• Synsets are interlinked by means of
conceptual-semantic and lexical relations.
3. HBase Conversion Steps
http://wordnet.cloudgraph.org
1) Model Creation: reverse engineer Wordnet DB
into UML®
2) Code Generation: provision persistence and
query-DSL Java code
3) HBase™ Table Mapping: map data graphs and
row keys to table(s)
4) Data Migration: MySQL to HBase
5) Services / App Creation: build services,
web app
4. 1.) Model Creation
Reverse engineer Wordnet DB into PlasmaSDO™ UML® Model
• Capture entities, properties, data types,
associations, enumerations, comments as UML
• Why UML? Popular standards-based format.
Editable, viewable using standard tools.
Supports enterprise governance processes
• How? Maven build with plasma-maven-plugin
RDB tool (goal:RDB, action:reverse, dialect:mysql)
• Download working example at
https://github.com/cloudgraph/wordnet
6. 2.) Code Generation
Provision SDO persistence and query DSL Java code
• Generate Java API based on Wordnet UML
Model
• Why? One API usable across RDB, HBase and other
CloudGraph services. Compile-time checking for
queries and all persistence logic
• How? Maven build with plasma-maven-plugin
SDO and DSL tools
• See generated API Javadocs on-line at
http://wordnet.cloudgraph.org
7. 3.) HBase™ Table Mapping
Map data graphs and row keys to HBase™ table(s)
• Configure delimited, hashed, salted, formatted,
composite row keys with (xpath) paths into
target data graphs
• Map data graph roots to HBase tables
• Why? Automates row-key creation by extracting
values from anywhere in your
data graphs
• How? CloudGraph Configuration XML. See
https://github.com/cloudgraph/wordnet
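The row-key options listed above (delimited, hashed, salted, formatted, composite) can be illustrated with a minimal plain-Java sketch. This is not CloudGraph code or its configuration format: the class `RowKeySketch`, the `|` delimiter, the 16 salt buckets and the choice of fields (lemma, part of speech, definition) are all assumptions made for illustration, and the field values are passed in directly rather than extracted from a data graph by path.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical helper showing row-key building blocks; not part of CloudGraph. */
final class RowKeySketch {
    private static final char DELIMITER = '|';  // assumed delimiter
    private static final int SALT_BUCKETS = 16; // assumed bucket count

    /** Salted part: a stable two-digit bucket prefix that spreads similar keys apart. */
    static String salt(String value) {
        int bucket = Math.floorMod(value.hashCode(), SALT_BUCKETS);
        return String.format("%02d", bucket);
    }

    /** Hashed part: fixed-width hex string built from the first 4 bytes of an MD5 digest. */
    static String hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest unavailable", e);
        }
    }

    /** Composite key: salt, formatted (lower-cased) lemma, part of speech, hashed definition. */
    static String compositeKey(String lemma, String partOfSpeech, String definition) {
        String formattedLemma = lemma.toLowerCase(Locale.ROOT);
        return salt(formattedLemma) + DELIMITER
                + formattedLemma + DELIMITER
                + partOfSpeech + DELIMITER
                + hash(definition);
    }
}
```

Calling `RowKeySketch.compositeKey("Dog", "n", "a domesticated animal")` yields a four-part key of the form salt|lemma|part of speech|hash. The salt is derived from the lemma itself, so the same word always lands in the same bucket.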
8. 4.) Data Migration
MySQL to HBase
• Create a standalone RDB-to-HBase
migration app that uses the generated
persistence and DSL query API to
incrementally call the CloudGraph HBase and
RDB services
• Why? Wordnet data is large and highly
connected, so it must be incrementally
extracted, inserted and linked
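One way to picture the incremental extract/insert/link idea is a two-pass copy: first move the nodes in small batches, then add the links once both ends exist in the target. The sketch below is only an illustration under that assumption. It uses in-memory maps in place of the RDB and HBase services, and `MigrationSketch`, `Link`, the batch sizes and the commit counter are hypothetical names, not the migration tool from the Wordnet example.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the target store; not CloudGraph code. */
final class MigrationSketch {
    /** A relation row in the source, e.g. a link between two synsets. */
    static final class Link {
        final long fromId;
        final long toId;
        Link(long fromId, long toId) { this.fromId = fromId; this.toId = toId; }
    }

    final Map<Long, String> nodes = new LinkedHashMap<>();    // node id -> text
    final Map<Long, Set<Long>> links = new LinkedHashMap<>(); // node id -> linked ids
    int commits = 0;                                          // batches written so far

    /** Pass 1: copy nodes in fixed-size batches so no batch holds the whole graph. */
    void copyNodes(Map<Long, String> source, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        Map<Long, String> batch = new LinkedHashMap<>();
        for (Map.Entry<Long, String> entry : source.entrySet()) {
            batch.put(entry.getKey(), entry.getValue());
            if (batch.size() == batchSize) flush(batch);
        }
        if (!batch.isEmpty()) flush(batch);
    }

    private void flush(Map<Long, String> batch) {
        nodes.putAll(batch);
        batch.clear();
        commits++;
    }

    /** Pass 2: link nodes whose two ends were already copied; returns the skipped count. */
    int linkNodes(List<Link> sourceLinks, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        int skipped = 0;
        int pending = 0;
        for (Link link : sourceLinks) {
            if (!nodes.containsKey(link.fromId) || !nodes.containsKey(link.toId)) {
                skipped++; // one end is missing from the target; leave it for a later run
                continue;
            }
            links.computeIfAbsent(link.fromId, k -> new LinkedHashSet<>()).add(link.toId);
            if (++pending == batchSize) { commits++; pending = 0; }
        }
        if (pending > 0) commits++;
        return skipped;
    }
}
```

Separating the two passes means a recursive link never points at a node that has not been written yet, at the cost of reading the relation data in a second sweep.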
9. 5.) Services / App Creation
Build services, web app
• Build simple POJO services using
persistence and DSL query API
• Encapsulate Wordnet business logic
• Add adapter/wrapper structures
• Call services from the web app
10. Web App
http://wordnet.cloudgraph.org
• Auto-complete field triggers CloudGraph
HBase to use the HBase fuzzy row filter
API
• Find button returns all semantic and
lexical relations for the selected word,
including descriptions and example
sentences
• Resulting relation graphs typically contain
more than 100 nodes and return in less
than 200 milliseconds
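The idea behind a fuzzy row match, as used by the auto-complete field, can be sketched in plain Java without any HBase dependency. The mask convention follows the one documented for HBase's fuzzy row filter (0 marks a fixed byte, 1 marks a byte that may be anything), but the rest is assumption: `FuzzyRowSketch`, the key layout of a two-character salt followed by `|` and the word, and the client-side list scan are hypothetical. The real filter runs on the region servers and does not scan a Java list.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of fuzzy row matching; not HBase or CloudGraph code. */
final class FuzzyRowSketch {
    /** mask[i] == 0: byte i must equal pattern[i]; mask[i] == 1: any byte is accepted. */
    static boolean matches(byte[] rowKey, byte[] pattern, byte[] mask) {
        if (pattern.length != mask.length) {
            throw new IllegalArgumentException("pattern and mask must have equal length");
        }
        if (rowKey.length < pattern.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && rowKey[i] != pattern[i]) return false;
        }
        return true;
    }

    /** Auto-complete over keys shaped "<2-char salt>|<word>...": the salt is wildcarded. */
    static List<String> complete(List<String> rowKeys, String typedPrefix) {
        byte[] pattern = ("??|" + typedPrefix).getBytes(StandardCharsets.UTF_8);
        byte[] mask = new byte[pattern.length]; // all zeros: every byte fixed by default
        mask[0] = 1;                            // first salt character: any byte
        mask[1] = 1;                            // second salt character: any byte
        List<String> hits = new ArrayList<>();
        for (String key : rowKeys) {
            if (matches(key.getBytes(StandardCharsets.UTF_8), pattern, mask)) hits.add(key);
        }
        return hits;
    }
}
```

With the keys `07|dog|n`, `12|dodge|v` and `03|cat|n`, calling `complete(keys, "do")` keeps the first two: the salt bytes are ignored, and the fixed bytes `|do` must match exactly.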
11. Conclusions
• Complex, highly recursive RDB models
can be easily converted and leveraged in
HBase and future CloudGraph services
• Large lexical data graphs can be returned
in a single query
• Data migration is difficult given the complex
recursive model
12. Resources
• Download the complete CloudGraph Wordnet
example: https://github.com/cloudgraph/wordnet
• Run the example online:
http://wordnet.cloudgraph.org
• Project details, contact information:
http://cloudgraph.org
• Beta Source Repo:
https://github.com/terrameta/cloudgraph
• Production Source Repo (under construction):
https://github.com/cloudgraph
13. Status / Legal
• Project Status
– CloudGraph® is currently under private beta testing
• Licensing
– CloudGraph® 0.5.5 Community Edition (CE) is open source licensed
under version 2 of the GNU General Public License
• Trademarks
– WordNet® is a registered trademark of Princeton University
– Apache HBase™ is a trademark of the Apache Software Foundation
– CloudGraph® is a trademark of TerraMeta Software LLC, TerraMeta
Software Inc.