Marios Trivyzas, Senior Core Engineer, Crate.io
A Database Month event http://www.DBMonth.com/database/migration
This is the dark and terrible tale of a migration from a relational SQL database, an RDBMS like Oracle or Postgres, to NoSQL with MongoDB. The battles fought by those unwilling or unable to learn Mongo's query language. The trials we faced feeding Mongo back through a SQL data warehouse. The tears shed. The lives lost (well, virtual lives lost, lol).
In this session, we will look at how using a scalable SQL database with full text search would have solved all of the most-challenging aspects of SQL-to-NoSQL migration.
Marios is a Senior Core Engineer at Crate.io. He is from Corfu, an island in the north-west of Greece with weather as crazy as Berlin's, and he has been based in Berlin since February 2013.
Marios has 15 years of experience working with C and Java, focusing on low-level Unix system programming and distributed systems. Since his academic years he has been interested in data and has developed a specialty in databases, data stores, and caching; he helps CrateDB simplify and streamline working with machine data.
Remedying the Challenges of Migrating Oracle/Postgres/SQL to MongoDB/NoSQL
1. Migrating from SQL to NoSQL (MongoDB)
I wish I had a time machine and could use
CrateDB
@mtrivizas @crateio
2. @mtrivizas @crateio
Marios (me)
Academic background in Databases & Distributed Systems
~ 15 years experience as backend developer, C / Java
Focused on Data (Databases/Datastores & caching)
Core Developer, 1 year at Crate.io
3. @mtrivizas @crateio
● Migrating from RDBMS to NoSQL (MongoDB)
● Why CrateDB made me wish I had a time machine
(This is not a MongoDB vs CrateDB comparison)
Agenda
5. @mtrivizas @crateio
Migrating to MongoDB - Why - Previous Setup
● Several 2-node clusters of PostgreSQL (each cluster
servicing one project/customer)
● 1 Large Oracle data warehouse
● Performing analytics on Production DBs (schemas
contained tables with data only used for analytics)
● Some data transformed -> injecting to DWH for further
analysis
6. @mtrivizas @crateio
Migrating to MongoDB - Why - Problem
● Unsatisfactory throughput/latency for user requests
● Querying 2 different systems (production db and DWH) for
analytics
● Querying production db for BI purposes impacts performance
dramatically (both BI queries & user requests)
● Ability to perform text searching
7. @mtrivizas @crateio
Migrating to MongoDB - Why - Solution
● Move non-transactional BI data to a NoSQL data store
● Use one large (or a few large) NoSQL cluster which stores
all BI related data / Remove DWH
● Why MongoDB
○ Document data store
○ Performance
○ In house knowledge by other company teams
8. @mtrivizas @crateio
Migrating to MongoDB - Developer’s Pain
● Schema - Docs
○ Referenced docs
■ Multiple queries to retrieve info
■ application processing (no join support)
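To illustrate (table names are hypothetical): a relationship that SQL expresses as a single join must, with referenced documents, be fetched as separate queries and merged in application code.

SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.total > 100;

With references, this becomes one query on orders, a second on customers for the collected ids, and merge logic in the application.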
9. @mtrivizas @crateio
Migrating to MongoDB - Developer’s Pain
● Schema - Docs
○ Embedded
■ Max doc size 2 MB (now 16 MB)
■ Data duplication
■ Data inconsistencies (updated in one embedded doc but not in the other(s))
○ Frequent updates or growing documents lead to fragmentation
■ Running compact is really slow and the DB is almost unusable meanwhile
10. @mtrivizas @crateio
Migrating to MongoDB - Developer’s Pain
● Schema - Indexes
○ If all fields indexed
■ Insert/Update performance drops a lot
■ Indexes must fit in memory
○ Need to carefully choose which fields to index and which
scheme to use (partial, sparse)
○ Keyword search very good / full-text quite slow
11. @mtrivizas @crateio
Migrating to MongoDB - Developer’s Pain
● Schema - Aggregation
○ Some aggregations were really slow -> had to store
several counters in extra collections to solve the
problem
● Write lock per collection
13. @mtrivizas @crateio
Migrating to MongoDB - QA/DevOps/BI Pain
● Problems
○ New query language
○ Write application-side code
● “Solution”: Transform & inject data into an RDBMS Data Warehouse
14. @mtrivizas @crateio
Migrating to MongoDB - DevOps/SysAdmin Pain
Production setup needs:
● DB nodes
● Config servers (at least 3 even if
you only have 2 DB nodes)
● mongos instances
Need for 3 different configurations ->
more “pain” for VM/container images
17. @mtrivizas @crateio
Crate.io
● Founded in 2013
● ~25 people and growing
● Offices in Dornbirn (AT), Berlin (DE), and San Francisco (US)
● Opensource: https://github.com/crate/crate
18. @mtrivizas @crateio
CrateDB - core features
● Simply scalable using shared-nothing architecture
● Uses SQL for DDL/DML/DQL
● Supports dynamic schemas
● Supports full-text, geospatial & time series search
● Eventually consistent
20. @mtrivizas @crateio
CrateDB - Numbers from users
● 100+ node clusters
● 3.2bn inserts/updates per day
● 1M+ inserts per second with 14 nodes
● 4,000 reads per second
21. @mtrivizas @crateio
CrateDB - for All
● SQL (+ some extensions, e.g. sharding) for schema creation
● SQL for data modification
● Power of SQL for data query (Joins, subselects, filtering over
aggregations, etc)
● Atomic row updates - no locking of whole document collections
● Blazing fast and accurate (no approximations) aggregations, up to 29x
faster than PostgreSQL
https://crate.io/a/benchmarking-complex-query-performance-cratedb-postgresql/
● Bulk inserts
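A sketch of the kind of query this unlocks (table and column names are hypothetical): an aggregation filtered over the aggregate itself via HAVING, which MongoDB's aggregation pipeline makes considerably more verbose.

SELECT customer_id, count(*) AS orders, sum(total) AS revenue
FROM orders
GROUP BY customer_id
HAVING sum(total) > 1000
ORDER BY revenue DESC
LIMIT 10;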
22. @mtrivizas @crateio
CrateDB - for Devs
● Every column is indexed by default
● Variety of data types including arrays and objects
● Variety of functions (scalar, conditional, aggregations, casts)
● Blob storage
● Transparent partitioning
● Customizable full text search support (tokenizers, analyzers, filters)
● Dynamic schema
● Generated columns
● Explain execution plan
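A minimal sketch of the full-text support (table name and analyzer choice are illustrative): define a FULLTEXT index with an analyzer at table creation, then query it with the MATCH predicate.

CREATE TABLE articles (
  id long PRIMARY KEY,
  title string,
  body string INDEX USING FULLTEXT WITH (analyzer = 'english')
);

SELECT title FROM articles
WHERE MATCH(body, 'time machine')
ORDER BY _score DESC;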
24. CrateDB - for DevOps/SysAdmins
● Easy cluster setup: All nodes are equal
● Runs on premises and in the cloud: AWS, Azure, GCE
● Easily containerized - Docker, Kubernetes
● Auto-sharding & replication, partitioning
● Scale horizontally? Just add nodes
● Built-in cluster management
● Comes with a web UI
● CLI: Crash
@mtrivizas @crateio
25. CrateDB - for DevOps/SysAdmins
● Import/Export (file system, S3) and also Insert by query!
● Backup/Restore mechanism via snapshots (HDFS also supported as
storage)
● Importing/Exporting to an RDBMS is easy
● Monitor/Manage via SQL
○ Information schema tables
○ sys.cluster, sys.nodes, sys.shards
○ sys.jobs, sys.jobs_log (kill jobs supported)
○ Change runtime settings
@mtrivizas @crateio
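A minimal sketch of the above (paths and table name are hypothetical):

COPY my_table FROM 'file:///data/import/*.json';
COPY my_table TO DIRECTORY '/data/export/';

SELECT table_name, id, "primary", state FROM sys.shards;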
26. CrateDB - Enterprise edition
● JMX Monitoring plugin (integrate with popular monitoring
platforms)
● Host based authentication (HTTP/PG)
● User defined functions (Javascript)
@mtrivizas @crateio
27. CrateDB - Coming soon…
● Enterprise edition
○ Encryption for clients and node-node communication (SSL)
○ User authentication, schema permissions
● Community edition
○ More SQL features (full support for subselects, unions, etc)
@mtrivizas @crateio
28. @mtrivizas @crateio
CrateDB - Considerations
● No join algorithm variations (only nested loop joins currently)
● No auto-increment functionality
● No transactions (only eventually consistent - optimistic locking)
● No stored procedures
● No views
● No full support for subselects (but soon…)
31. @mtrivizas @crateio
CrateDB - Internals - Sharding
CREATE TABLE my_table (
col1 string,
col2 integer,
…
) CLUSTERED BY (col2) INTO 10 SHARDS
WITH (number_of_replicas = 2)
32. @mtrivizas @crateio
CrateDB - Internals - Partitioning
CREATE TABLE parted_table (
id long,
title string,
content string,
day timestamp
)
CLUSTERED BY (title) INTO 4 SHARDS
PARTITIONED BY (day)
WITH (number_of_replicas = 4)
Total shards =
number of partitions x
4 (shards per partition) x
5 (1 primary + 4 replicas)
33. @mtrivizas @crateio
CrateDB - Internals - Partitioning
CREATE TABLE computed_parted_table (
id long,
data double,
created_at timestamp,
month timestamp GENERATED ALWAYS AS date_trunc('month',
created_at)
) PARTITIONED BY (month);
34. @mtrivizas @crateio
CrateDB - Internals - Refresh
CREATE TABLE my_table (
id long primary key,
data double,
...
) ... WITH (refresh_interval = 10000)
REFRESH TABLE my_table;
SELECT * FROM my_table WHERE id = 10
36. @mtrivizas @crateio
CrateDB - Internals - Query execution path
● Parse the query
● Analyze/Plan the query
● Find route to shards
● Use Lucene Reader to get IDs (possibly applying ORDER BY)
● Return results & merge (in a map-reduce fashion)
● Apply limit/offset
● Fetch the rest of the fields
● Evaluate remaining expressions
● Return results