Leverage modern data architecture in the big data era. Lecture by Arthur Gimpel @ Elasticsearch{Zone} as part of Big Data Month 2016 by DataZone and Elastic
2. About Me
Name: Arthur Gimpel
Position: Technology Evangelist, Solutions Architect, Trainer
Tech Stack: Elastic Stack, SQL Server, MongoDB, Couchbase, Redis, Kafka, StreamSets, Python, .NET…
Free Time: Motorcycles, Skydiving…
3. Relational Database Management Systems
• The first RDBMSs were introduced in the late 1970s
• They exist in every possible flavor but share one thing: ACID
• They still dominate the database market
4. RDBMS in Theory - ACID
• Atomicity: all-or-nothing approach; a transaction applies completely or not at all
• Consistency: hard state; every transaction moves the whole database from one valid state to another
• Isolation: concurrent transactions cannot interfere with each other
• Durability: every committed transaction is persisted
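The atomicity and consistency bullets above can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` constraint, and the helper names `make_bank`, `transfer`, and `balances` are assumptions made up for this example; they are not part of the talk.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back together with it.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a plain dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling `transfer(conn, "alice", "bob", 500)` on a fresh database returns `False` and leaves both balances untouched, because the failed debit also undoes the credit that had already succeeded inside the same transaction.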
5. ACID Is Not Perfect
• Everything is persisted synchronously, so throughput is limited by I/O performance
• All data is bound to a tabular schema, which is hard to change in big databases
• ACID makes horizontal scaling nearly* impossible
• A complex schema drastically slows down aggregations and queries
6. NoSQL - New Kid in Town
• Distributed / horizontally scalable
• Mostly open source
• Mostly schema-less:
    • Key-Value
    • Document
    • Graph
• Each store serves a specific purpose
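The three schema-less models listed above can be sketched with plain Python structures. This is only an illustration of the shape of the data, not of any particular product; every key, field, and name below is hypothetical.

```python
# Key-value: the store only sees an opaque value behind a key.
key_value = {"user:1:name": "Ada", "user:2:name": "Linus"}

# Document: each record is a self-describing, nested structure, and two
# documents in the same collection may carry different fields.
documents = [
    {"_id": 1, "name": "Ada", "skills": ["math", "engines"]},
    {"_id": 2, "name": "Linus", "employer": {"name": "Example Corp"}},
]

# Graph: relationships are first-class; here a simple adjacency map.
follows = {"Ada": {"Linus"}, "Linus": {"Grace"}, "Grace": set()}


def reachable(graph, start):
    """Return every node reachable from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen
```

A key-value lookup is a single dictionary access, a document query inspects fields that may or may not exist, and a graph query such as `reachable(follows, "Ada")` walks relationships, which is why each model tends to serve a different purpose.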
7. NoSQL - Challenges
• Every data store has its purpose. There is no single solution to all database needs
• NoSQL stores do not implement all of an RDBMS's abilities (CDC, jobs, stored procedures, triggers)
• Every data store has its own languages and APIs. There is no common standard like ANSI SQL
8. NoSQL = Not Only SQL | Polyglot Persistence
9. Elasticsearch
• Search platform and data store based on Apache Lucene
• Supports various search types: filtered, full-text, geographic, aggregation (facet, nested, pipeline), graph
• Distributed: every index is split into shards, each (potentially) residing on a different node
• Document store: JSON
• "Optimistic" schema-less architecture
• Supports replication by nature
• Supports unsupervised machine learning by nature (Prelert, in beta)
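The idea of splitting an index into shards that live on different nodes can be sketched as follows. Elasticsearch's documentation describes routing roughly as a hash of the routing value modulo the number of primary shards; the sketch below uses CRC32 as a stand-in hash and a round-robin node assignment, both of which are simplifying assumptions rather than the engine's actual implementation. The function names are hypothetical.

```python
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Pick a shard by hashing the routing key (CRC32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def place_shards(num_primary_shards: int, nodes: list) -> dict:
    """Spread shards over nodes round-robin (a simplification)."""
    return {
        shard: nodes[shard % len(nodes)]
        for shard in range(num_primary_shards)
    }


def node_for(doc_id: str, num_primary_shards: int, nodes: list) -> str:
    """Resolve the node that would hold the document's primary shard."""
    placement = place_shards(num_primary_shards, nodes)
    return placement[shard_for(doc_id, num_primary_shards)]
```

Because the shard is derived from the document ID alone, any node can compute where a document lives without consulting a central lookup table. The same property explains why the number of primary shards is awkward to change after the fact: a different modulus would send existing IDs to different shards.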
10. Search != SQL Querying
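One way to read this slide's title: a SQL predicate answers "does this row match, yes or no", while a search engine answers "how well does this document match" and orders results by relevance. The toy sketch below contrasts the two. The sample documents and the term-overlap score are assumptions made for illustration; real engines use an inverted index and far more refined scoring.

```python
docs = {
    1: "modern data architecture for big data",
    2: "relational database management systems",
    3: "big data search with a distributed document store",
}


def exact_match(docs, text):
    """SQL-style equality predicate: a row either matches or it does not."""
    return [doc_id for doc_id, body in docs.items() if body == text]


def ranked_search(docs, query):
    """Toy full-text search: score by number of shared terms, best first."""
    terms = set(query.lower().split())
    scores = {
        doc_id: len(terms & set(body.lower().split()))
        for doc_id, body in docs.items()
    }
    hits = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    # Highest score first; ties are broken by document ID for stable output.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
```

For the query `"big data search"`, `exact_match` finds nothing because no document equals that string, while `ranked_search` returns document 3 ahead of document 1 because it shares more terms with the query.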
11. Reference Architecture #1
12. Reference Architecture #2
13. Architecture Comparison
| | Architecture #1 | Architecture #2 |
|---|---|---|
| Data distribution strategy | Data store based | Application based |
| Data distribution component | Data pipeline (StreamSets) | Message queue (Kafka) |
| Implementation team | Data Engineers / DevOps | DevOps / Developers |
| Implementation complexity | Low: data pipeline development | High: data access layer refactor |
| Potential additional licensing | Elasticsearch, StreamSets | None |
| Scalability | Limited to RDBMS scale | Fully scalable regardless of RDBMS |
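The two distribution strategies compared above can be sketched in memory. The sketch assumes that in Architecture #1 the application writes only to the RDBMS and a pipeline copies rows onward, and that in Architecture #2 the application's data access layer publishes each change once to a queue from which every store consumes independently. `EventLog` is a hypothetical stand-in for a message queue topic and does not mirror the Kafka client API; the stores are plain dictionaries.

```python
def pipeline_sync(rdbms: dict, search_index: dict) -> None:
    """Architecture #1 (assumed): a pipeline copies rows out of the RDBMS."""
    for key, row in rdbms.items():
        search_index[key] = row


class EventLog:
    """In-memory stand-in for a message queue topic (not the Kafka API)."""

    def __init__(self):
        self._events = []

    def publish(self, event: dict) -> None:
        self._events.append(event)

    def read_from(self, offset: int) -> list:
        return self._events[offset:]


class StoreConsumer:
    """Applies events to one data store, tracking its own read offset."""

    def __init__(self, log: EventLog, store: dict):
        self.log, self.store, self.offset = log, store, 0

    def poll(self) -> None:
        for event in self.log.read_from(self.offset):
            self.store[event["id"]] = event["doc"]
            self.offset += 1


def save_order(log: EventLog, order_id: int, doc: dict) -> None:
    """Architecture #2 (assumed): the data access layer publishes once."""
    log.publish({"id": order_id, "doc": doc})
```

In the first function every write still has to pass through the RDBMS, which matches the "limited to RDBMS scale" entry. In the second design each consumer advances at its own pace, so stores can be added or scaled independently, at the cost of refactoring the application's data access layer to publish events instead of writing directly.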