Marios Trivyzas, Senior Core Engineer, Crate.io
A Database Month event http://www.DBMonth.com/database/migration
This is the dark and terrible tale of a migration from a relational SQL database, an RDBMS like Oracle or Postgres, to NoSQL with MongoDB. The battles fought by those unwilling or unable to learn Mongo's query language. The trials we faced feeding Mongo back through a SQL data warehouse. The tears shed. The lives lost (well, virtual lives lost, lol).
In this session, we will look at how using a scalable SQL database with full text search would have solved all of the most-challenging aspects of SQL-to-NoSQL migration.
Marios is a Senior Core Engineer at Crate.io. He is from Corfu, an island in the north-west of Greece with weather as crazy as Berlin's, and he has been based in Berlin since February 2013.
Marios has 15 years of experience working with C and Java, focusing on low-level Unix system programming and distributed systems. Since his academic years he has been interested in data and has developed a specialty in databases, data stores, and caching; he helps CrateDB simplify and streamline working with machine data.
Remedying the Challenges of Migrating Oracle/Postgres/SQL to MongoDB/NoSQL
1. Migrating from SQL to NoSQL (MongoDB)
I wish I had a time machine and could use
CrateDB
@mtrivizas @crateio
2. @mtrivizas @crateio
Marios (me)
Academic background in Databases & Distributed Systems
~ 15 years experience as backend developer, C / Java
Focused on Data (Databases/Datastores & caching)
Core Developer, 1 year at Crate.io
3. @mtrivizas @crateio
● Migrating from RDBMS to NoSQL (MongoDB)
● Why CrateDB made me wish I had a time machine
(This is not a MongoDB vs CrateDB comparison)
Agenda
5. @mtrivizas @crateio
Migrating to MongoDB - Why - Previous Setup
● Several 2-node clusters of PostgreSQL (each cluster
servicing one project/customer)
● 1 Large Oracle data warehouse
● Performing analytics on Production DBs (schemas
contained tables with data only used for analytics)
● Some data transformed -> injecting to DWH for further
analysis
6. @mtrivizas @crateio
Migrating to MongoDB - Why - Problem
● Unsatisfactory throughput/latency for user requests
● Querying 2 different systems (production db and DWH) for
analytics
● Querying production db for BI purposes impacts performance
dramatically (both BI queries & user requests)
● Ability to perform text searching
7. @mtrivizas @crateio
Migrating to MongoDB - Why - Solution
● Move non-transactional BI data to a NoSQL data store
● Use one large (or a few large) NoSQL cluster which stores
all BI related data / Remove DWH
● Why MongoDB
○ Document data store
○ Performance
○ In house knowledge by other company teams
8. @mtrivizas @crateio
Migrating to MongoDB - Developer’s Pain
● Schema - Docs
○ Referenced docs
■ Multiple queries to retrieve info
■ application processing (no join support)
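To illustrate (table names are hypothetical): a relationship that SQL expresses as a single join must, with referenced documents, be fetched as separate queries and merged in application code.

SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.total > 100;

With references, this becomes one query on orders, a second on customers for the collected ids, and merge logic in the application.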
9. @mtrivizas @crateio
Migrating to MongoDB - Developer’s Pain
● Schema - Docs
○ Embedded
■ Max doc size 2 MB (now 16 MB)
■ Data duplication
■ Data inconsistencies (updated in one embedded doc but not in the other(s))
○ Frequent updates or growing documents lead to fragmentation
■ Running compact is really slow and the DB is almost unusable meanwhile
10. @mtrivizas @crateio
Migrating to MongoDB - Developer’s Pain
● Schema - Indexes
○ If all fields indexed
■ Insert/Update performance drops a lot
■ Indexes must fit in memory
○ Need to carefully choose which fields to index and which
scheme to use (partial, sparse)
○ Keyword search very good / full-text quite slow
11. @mtrivizas @crateio
Migrating to MongoDB - Developer’s Pain
● Schema - Aggregation
○ Some aggregations were really slow -> had to store
several counters in extra collections to solve the
problem
● Write lock per collection
13. @mtrivizas @crateio
Migrating to MongoDB - QA/DevOps/BI Pain
● Problems
○ New query language
○ Write application-side code
● “Solution”: Transform & inject data into an RDBMS Data Warehouse
14. @mtrivizas @crateio
Migrating to MongoDB - DevOps/SysAdmin Pain
Production setup needs:
● DB nodes
● Config servers (at least 3 even if
you only have 2 DB nodes)
● mongos instances
Need for 3 different configurations ->
more “pain” for VM/container images
17. @mtrivizas @crateio
Crate.io
● Founded in 2013
● ~25 people and growing
● Offices in Dornbirn (AT), Berlin (DE), and San Francisco (US)
● Opensource: https://github.com/crate/crate
18. @mtrivizas @crateio
CrateDB - core features
● Simply scalable using shared-nothing architecture
● Uses SQL for DDL/DML/DQL
● Supports dynamic schemas
● Supports full-text, geospatial & time series search
● Eventually consistent
20. @mtrivizas @crateio
CrateDB - Numbers from users
● 100+ node clusters
● 3.2bn inserts/updates per day
● 1M+ inserts per second with 14 nodes
● 4,000 reads per second
21. @mtrivizas @crateio
CrateDB - for All
● SQL (+ some extensions, e.g. sharding) for schema creation
● SQL for data modification
● Power of SQL for data query (Joins, subselects, filtering over
aggregations, etc)
● Atomic row updates - no locking of whole document collections
● Blazing fast and accurate (no approximations) aggregations, up to 29x
faster than PostgreSQL
https://crate.io/a/benchmarking-complex-query-performance-cratedb-postgresql/
● Bulk inserts
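A sketch of the kind of query this unlocks (table and column names are hypothetical): an aggregation filtered over the aggregate itself via HAVING, which MongoDB's aggregation pipeline makes considerably more verbose.

SELECT customer_id, count(*) AS orders, sum(total) AS revenue
FROM orders
GROUP BY customer_id
HAVING sum(total) > 1000
ORDER BY revenue DESC
LIMIT 10;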
22. @mtrivizas @crateio
CrateDB - for Devs
● Every column is indexed by default
● Variety of data types including arrays and objects
● Variety of functions (scalar, conditional, aggregations, casts)
● Blob storage
● Transparent partitioning
● Customizable full text search support (tokenizers, analyzers, filters)
● Dynamic schema
● Generated columns
● Explain execution plan
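A minimal sketch of the full-text support (table name and analyzer choice are illustrative): define a FULLTEXT index with an analyzer at table creation, then query it with the MATCH predicate.

CREATE TABLE articles (
  id long PRIMARY KEY,
  title string,
  body string INDEX USING FULLTEXT WITH (analyzer = 'english')
);

SELECT title FROM articles
WHERE MATCH(body, 'time machine')
ORDER BY _score DESC;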
24. CrateDB - for DevOps/SysAdmins
● Easy cluster setup: All nodes are equal
● Runs on premises and in the cloud: AWS, Azure, GCE
● Easily containerized - Docker, Kubernetes
● Auto-sharding & replication, partitioning
● Scale horizontally? Just add nodes
● Built-in cluster management
● Comes with a web UI
● CLI: Crash
@mtrivizas @crateio
25. CrateDB - for DevOps/SysAdmins
● Import/Export (file system, S3) and also Insert by query!
● Backup/Restore mechanism via snapshots (HDFS also supported as
storage)
● Importing/Exporting to an RDBMS is easy
● Monitor/Manage via SQL
○ Information schema tables
○ sys.cluster, sys.nodes, sys.shards
○ sys.jobs, sys.jobs_log (kill jobs supported)
○ Change runtime settings
@mtrivizas @crateio
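A minimal sketch of the above (paths and table name are hypothetical):

COPY my_table FROM 'file:///data/import/*.json';
COPY my_table TO DIRECTORY '/data/export/';

SELECT table_name, id, "primary", state FROM sys.shards;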
26. CrateDB - Enterprise edition
● JMX Monitoring plugin (integrate with popular monitoring
platforms)
● Host based authentication (HTTP/PG)
● User defined functions (Javascript)
@mtrivizas @crateio
27. CrateDB - Coming soon…
● Enterprise edition
○ Encryption for clients and node-node communication (SSL)
○ User authentication, schema permissions
● Community edition
○ More SQL features (full support for subselects, unions, etc)
@mtrivizas @crateio
28. @mtrivizas @crateio
CrateDB - Considerations
● No join algorithm variations (only nested loop joins currently)
● No auto-increment functionality
● No transactions (only eventually consistent - optimistic locking)
● No stored procedures
● No views
● No full support for subselects (but soon…)
31. @mtrivizas @crateio
CrateDB - Internals - Sharding
CREATE TABLE my_table (
col1 string,
col2 integer,
…
) CLUSTERED BY (col2) INTO 10 SHARDS
WITH (number_of_replicas = 2)
32. @mtrivizas @crateio
CrateDB - Internals - Partitioning
CREATE TABLE parted_table (
id long,
title string,
content string,
day timestamp
)
CLUSTERED BY (title) INTO 4 SHARDS
PARTITIONED BY (day)
WITH (number_of_replicas = 4)
Total shards =
number of partitions x
4 (shards per partition) x
5 (1 primary + 4 replicas)
33. @mtrivizas @crateio
CrateDB - Internals - Partitioning
CREATE TABLE computed_parted_table (
id long,
data double,
created_at timestamp,
month timestamp GENERATED ALWAYS AS date_trunc('month',
created_at)
) PARTITIONED BY (month);
34. @mtrivizas @crateio
CrateDB - Internals - Refresh
CREATE TABLE my_table (
id long primary key,
data double,
...
) ... WITH (refresh_interval = 10000)
REFRESH TABLE my_table;
SELECT * FROM my_table WHERE id = 10
36. @mtrivizas @crateio
CrateDB - Internals - Query execution path
● Parse the query
● Analyze/Plan the query
● Find route to shards
● Use Lucene Reader to get IDs (possibly applying ORDER BY)
● Return results & merge (in a map-reduce fashion)
● Apply limit/offset
● Fetch the rest of the fields
● Evaluate remaining expressions
● Return results