Beyond Relational

Beyond Relational:
Applying Big Data Cloud Patterns
Lynn Langit
QCon London – March 2017

About this Workshop
Learning from Cloud (AWS / GCP) Architectures
1. Landscape
2. Data Servers and Services
3. Data Pipelines
4. ETL and Visualization Services
5. End-to-End Patterns
6. Bonus: IoT

Save
ALL
of your Data
(in a Data Lake)

But NOT in HDFS…sorry Cloudera
What is a Cloud-based Data Lake Architecture?

Disruptor: Data Lake on GCP
✘Well-priced, Highly Available, Auto-scaled
✘Simple, One API, NoOps
✘SQL ‘over’ it via Query Services
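
A minimal sketch of the "SQL over files" idea, assuming a raw CSV file sitting in a bucket. Python's built-in sqlite3 is only a local stand-in for a cloud query service here; the table name `lake`, the sample data and the helper `query_raw_csv` are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw file as it might sit in a data lake bucket (CSV text).
RAW_CSV = """device,reading
a,10
b,20
a,30
"""


def query_raw_csv(raw_csv: str, sql: str) -> list:
    """Load CSV text into an in-memory table named 'lake' and run SQL on it."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    conn = sqlite3.connect(":memory:")  # local stand-in, not a cloud service
    conn.execute("CREATE TABLE lake (device TEXT, reading INTEGER)")
    conn.executemany("INSERT INTO lake VALUES (:device, :reading)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


totals = query_raw_csv(
    RAW_CSV,
    "SELECT device, SUM(reading) FROM lake GROUP BY device ORDER BY device",
)
print(totals)  # [('a', 40), ('b', 20)]
```

The files stay as plain files; the query layer is attached only when a question needs answering.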

Practice
Applying Concepts – Data Lake Features (AWS or GCP)

“Finding a path -- the ACTUAL
Cost of…
✘ Saving all Data into a Data Lake
✘ Using newer database technologies
✘ Going beyond Relational

2.
Evaluate Cloud Data Servers & Services

Past -- Workloads and Terms
Workload | Example Products | Other Names
Operational Database | MySQL, SQL Server, Oracle | OLTP
Analytic Database | SSAS, Cognos, SAS | OLAP
NoSQL Database | MongoDB, Redis, Cassandra, Neo4J | Document, K/V, Graph, Column-store
Hadoop | Cloudera, Hortonworks | Big Data

???
The fastest growing cloud-based Big Data products are…

Relational
The fastest growing cloud-based Big Data products are…

“When do I use…?
✘Hadoop / Spark
✘NoSQL
✘Big Relational

“Existing relational databases
to the cloud?
Operational, Data Warehouse

Prerequisite – move / store files in the cloud – prefer Data Lake to HDFS

Cloud Vendors – Files as Data Lake
AWS
✘ S3/Glacier
✘ Most complete offering
✘ Most mature offering
✘ Notable: S3 management
GCP
✘ GCS with 4 types
✘ Lean, mean and cheap
✘ Unified API
✘ Notable: Simple pricing

Serverless
Do you need to manage servers?

AWS Console
20 Data services

GCP Console
11 Data Services

GCP Serverless Data Warehouse
Reference table
Query / Compute
BigQuery
Customer Lists / Reference Data
Export Ad Data
Cloud Storage
Id matching
Cloud Dataflow
Marketing List
DoubleClick
Campaign Manager
Google Analytics
Relevant Users
Cloud Storage
Analysts
DataStudio 360
Dashboards

BigQuery
✘ Mature – based on Dremel (Google)
✘ Import (batch or stream) from GCS++
✘ ANSI SQL queries
✘ Interactive or batch pricing
✘ NoOps / Serverless
Head-to-Head
Data Warehouse -- GCP BigQuery vs. AWS Redshift
Redshift
✘ Mature – big market share
✘ Big relational – ANSI SQL query
✘ PostgreSQL dialect
✘ Cost compare?
✘ Manual partition/scale up
✘ Huge partner ecosystem (ETL, viz)

Disruptor:
AWS Athena
✘Well-priced
✘Highly Available
✘Auto-scaled
✘Simple
✘NoOps
✘Analytics
✘NoSQL

Practice
Applying Concepts – Serverless SQL Queries
(AWS Athena or GCP BigQuery)

BigQuery
✘ Mature – based on Dremel (Google)
✘ Import (batch or stream) from GCS++
✘ Interactive or batch pricing (streaming)
✘ 3rd party vendor integration (ETL, viz)
Head-to-Head
SQL Query -- GCP BigQuery vs. AWS Athena
Athena
✘ New – based on Apache Presto (from FB)
✘ SQL query against S3 data
✘ Cost compare - $ 5/TB scanned
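
A rough sketch of what per-TB-scanned pricing means for a single query, using the $5/TB figure from the slide. The function name, the binary definition of a terabyte and the 200 GB example are assumptions for illustration; check the vendor's current price list before relying on any number.

```python
ATHENA_USD_PER_TB_SCANNED = 5.0  # figure quoted on the slide
BYTES_PER_TB = 1024 ** 4         # assumption: binary terabytes


def scan_cost_usd(bytes_scanned: int,
                  usd_per_tb: float = ATHENA_USD_PER_TB_SCANNED) -> float:
    """Estimate the cost of one query from the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


# Hypothetical query that scans 200 GB of data in S3.
example_cost = scan_cost_usd(200 * 1024 ** 3)
print(f"{example_cost:.2f} USD")
```

Because the charge follows bytes scanned, anything that reduces the scan (fewer columns, partitioned or compressed files) lowers the bill for the same question.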

Disruptor:
GCP Cloud Spanner
✘Well-priced
✘Highly Available
✘Auto-scaled
✘Transactional
✘NoOps
✘Operational
✘RDBMS

Aurora
✘ Mature
✘ MySQL compliant
✘ 3rd party vendor integration (ETL, viz)
Head-to-Head
Relational Database – AWS Aurora vs. GCP Spanner
Spanner
✘ New (beta)
✘ CAP – auto scales, transactionally consistent, auto partitions
✘ Cost compare?

Cloud Vendors
RDBMS Database (Operational or Analytics)
AWS
✘ 14X market share of next
competitor
✘ Most mature offering - partners
✘ Notable: Aurora, Big Relational
RDBMS
GCP
✘ Fastest player
✘ Requires top developers
✘ Notable: BigQuery, MPP
SQL Query as a Service

Cloud Offerings – RDBMS and Files
Category | AWS | Google
Managed RDBMS | RDS Aurora, RDS <Other> | Spanner, Cloud SQL
Data Warehouse | Redshift, Athena* | BigQuery*
NoSQL buckets | S3*, Glacier* | Cloud Storage (Multi-regional, Regional)*, Nearline, Coldline*
*Serverless services

Practice
Applying Concepts – Big Relational

Reasons to use Big Relational Cloud Services
Developers DevOps Cloud Vendors – AWS
Developers DevOps Cloud Vendors – GCP

Reasons to use Big Relational Cloud Services
Developers
✘ Most know RDBMS query patterns
✘ Many know basic administration
DevOps
✘ Most know RDBMS administration
✘ Many know basic RDBMS queries
✘ Many know query optimization
Cloud Vendors - AWS
✘ Aurora – RDBMS up to 64 TB
✘ Redshift - $ 1k USD / 1 TB / year
✘ Rich partner ecosystem – ETL
✘ Integration with AWS products
Developers
✘ Most know coding language patterns
to interact with RDBMS systems
DevOps
✘ Familiar RDBMS security patterns
✘ Familiar auditing
✘ Partner tooling integration
Cloud Vendors - GCP
✘ Spanner / BigQuery – familiar SQL queries
✘ No hassle streaming ingest
✘ No hassle pay-as-you-go
✘ NoOps administration

My past top Big Data Cloud Services

“When do I use…?
✘Hadoop
✘NoSQL
✘Big Relational

225
NoSQL Database Types to Choose From

Let’s review some NoSQL concepts
Key-Value
Redis, Riak, Aerospike
Graph
Neo4j
Document
MongoDB
Wide-Column
Cassandra, HBase

Key Questions - Storage
✘Volume – how much now, what growth rate?
✘Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc.
✘Velocity – batches, streams, both, what ingest rate?
✘Veracity – current state (quality) of data, amount of duplication of data
stores, existence of authoritative (master) data management?

Actual Questions - NoSQL
✘When do you use Redis for caching?
✘When do you use MongoDB or AWS DynamoDB for JSON data?
✘When do you use Neo4j for graph-type data?
✘Why are you using something else, e.g. Cassandra, Riak, etc.?
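
The caching question can be made concrete with a cache-aside sketch. A small in-process class stands in for Redis so the example runs anywhere; `TTLCache`, `slow_lookup` and the 60-second TTL are hypothetical names and values, not part of any real client library.

```python
import time


class TTLCache:
    """In-process stand-in for a key-value cache such as Redis."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)


db_calls = 0


def slow_lookup(user_id):
    """Hypothetical database read that we want to avoid repeating."""
    global db_calls
    db_calls += 1
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(cache, user_id):
    """Cache-aside: try the cache first, fall back to the database."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = slow_lookup(user_id)
    cache.set(key, value)
    return value


cache = TTLCache(ttl_seconds=60)
first = get_user(cache, 42)   # miss: goes to the database
second = get_user(cache, 42)  # hit: served from the cache
print(db_calls)  # 1
```

If most reads look like the second call, a key-value cache in front of the primary store is likely worth its operational cost.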

NoSQL Example
✘Open Source is Free
§ Rapid iteration, innovation
§ Can start up for free (on premise)
§ Can ‘rent’ for cheap or free on the cloud
§ Can use with the command line for free
§ Some vendors offer free online training
§ Ex. www.neo4j.org
✘Not Free
§ Constant releases
§ Can be deceptively hard to set up (time is money)
§ Don’t forget to turn it off if on the cloud!
§ GUI tools, support, training cost $$$
§ Ex. www.neo4j.com

Practice
Applying Concepts - NoSQL

NoSQL Applied
Log Files
• ???
Product Catalogs
• ???
Social Games
• ???
Social aggregators
• ???
Line-of-Business
• ???

NoSQL Applied
Log Files
• Columnstore
• HBase
Product Catalogs
• Key/Value
• Redis
Social Games
• Document
• MongoDB
Social aggregators
• Graph
• Neo4j
Line-of-Business
• RDBMS
• SQL Server
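
The workload-to-store pairings on this slide can be kept as a simple lookup table. This is only a restatement of the slide as data; the dictionary and function names are hypothetical, and real choices depend on the key questions about volume, variety, velocity and veracity.

```python
# Workload -> (store type, example product), copied from the slide.
NOSQL_APPLIED = {
    "log files": ("Columnstore", "HBase"),
    "product catalogs": ("Key/Value", "Redis"),
    "social games": ("Document", "MongoDB"),
    "social aggregators": ("Graph", "Neo4j"),
    "line-of-business": ("RDBMS", "SQL Server"),
}


def suggest_store(workload: str) -> tuple:
    """Return the (store type, example product) pairing for a workload."""
    key = workload.strip().lower()
    if key not in NOSQL_APPLIED:
        raise KeyError(f"no pairing recorded for workload: {workload!r}")
    return NOSQL_APPLIED[key]


print(suggest_store("Social Games"))  # ('Document', 'MongoDB')
```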

More than NoSQL
NoSQL
✘ Non-relational
✘ Can be optimized in-memory
✘ Eventually consistent
✘ Schema on Read
✘ Example: Aerospike
NewSQL
✘ Relational plus more
✘ Often in-memory
✘ Some kind of SQL-layer
✘ Schema on Write
✘ Example: MemSQL
U-SQL
✘ What???
✘ Microsoft’s universal SQL
language
✘ Example: Azure Data Lake
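
A small sketch of the schema-on-write versus schema-on-read distinction named above. The two-field schema and both helper functions are hypothetical; real systems enforce or interpret schemas in far richer ways.

```python
import json

# Hypothetical two-field schema used by both approaches.
REQUIRED = {"id": int, "name": str}


def write_with_schema(table: list, record: dict) -> None:
    """Schema on write: reject a record before it is stored."""
    for field, ftype in REQUIRED.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    table.append(record)


def read_with_schema(raw_lines: list) -> list:
    """Schema on read: store anything, interpret it at query time."""
    out = []
    for line in raw_lines:
        doc = json.loads(line)
        out.append({"id": int(doc.get("id", -1)),
                    "name": str(doc.get("name", ""))})
    return out


table = []
write_with_schema(table, {"id": 1, "name": "a"})  # accepted
try:
    write_with_schema(table, {"id": "2"})         # rejected at write time
except ValueError as err:
    print("rejected:", err)

raw = ['{"id": "2"}', '{"id": 3, "name": "c", "extra": true}']
print(read_with_schema(raw))  # loose records are shaped when read
```

Schema on write pushes the data-quality cost to ingest; schema on read defers it to every query.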

How Best to Store your Data?
Storage Type | Complexity | Scalability | Developer Cost
RDBMS | easy | medium* | low
NoSQL | medium | big | high
*GCP Cloud Spanner is the exception here

Practice
Applying Concepts – Real Cost of Storage Types

Cloud NoSQL Applied – AWS
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business

Cloud NoSQL Applied – AWS
Log Files
• Stream or Hadoop
• Kinesis or EMR
Product Catalogs
• Key/Value
• DynamoDB
Social Games
• Document
• MongoDB
Social aggregators
• Graph
• Neo4j
Line-of-Business
• RDBMS
• RDS

Cloud NoSQL Applied – GCP
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business

Cloud NoSQL Applied – GCP
Log Files
• Pub/Sub
• BigTable
Product Catalogs
• Key/Value
• DataStore
Social Games
• Document
• Spanner
Social aggregators
• Graph
• Neo4j
Line-of-Business
• RDBMS
• Spanner

Cloud Offerings – Big Relational and NoSQL
Category | AWS | Google
Managed RDBMS | RDS Aurora | Spanner, Cloud SQL
Data Warehouse | Redshift or Athena | BigQuery
NoSQL buckets | S3, Glacier | Cloud Storage, Nearline
NoSQL Key-Value, NoSQL Wide Column | DynamoDB | Bigtable, Cloud Datastore
NoSQL Document, NoSQL Graph | MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE

One Vendor’s View

“What is Apache Spark?
✘ In-memory
✘ Distributed
✘ Integrated with other libraries

Spark is Fast, Unified and Usable

Disruptor: Databricks
✘Managed Apache Spark on AWS

Practice
Applying Concepts – Apache Spark via Databricks

Hadoop is becoming Spark…
Volume
✘10 TB or greater to start
✘Growth of 25% YOY
✘Where FROM
✘Where TO
Velocity and Variety
✘Spark over Hive
✘Kafka and Samza
Veracity
✘Pay, train and hire team
✘Top $$$ for talent, IF you can find it
✘WATCH OUT for Cloud Vendors
who promise ‘easy access’
✘Complexity of ecosystem
✘Databricks knows best for Spark
✘Cloudera knows best for on premise
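
The Volume bullets (10 TB to start, 25% growth year over year) imply a compound-growth forecast. A minimal sketch, assuming growth compounds once per year; the function name and the three-year horizon are illustrative choices.

```python
def project_volume_tb(start_tb: float, yearly_growth: float, years: int) -> list:
    """Return projected volume in TB for year 0 through `years`, compounding yearly."""
    return [start_tb * (1 + yearly_growth) ** year for year in range(years + 1)]


# Slide figures: 10 TB starting volume, 25% growth year over year.
projection = project_volume_tb(10, 0.25, 3)
for year, tb in enumerate(projection):
    print(f"year {year}: {tb:.2f} TB")
```

At 25% a year the volume roughly doubles in a little over three years, which matters when sizing clusters or storage tiers up front.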

How Best to Store your Data?
Storage Type | Complexity | Scalability | Developer Cost
RDBMS | easy | medium* | low
Hadoop | hard | huge | very high

Cloud Offerings – Big Data
Category | AWS | Google
Managed RDBMS | RDS Aurora | Spanner, Cloud SQL
Data Warehouse | Redshift or Athena | BigQuery
NoSQL buckets | S3, Glacier | Cloud Storage, Nearline
NoSQL Key-Value, NoSQL Wide Column | DynamoDB | Bigtable, Cloud Datastore
NoSQL Document, NoSQL Graph | MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE
Hadoop / Spark | Elastic MapReduce | DataProc

Real World Big Data -- When do I use what?
RDBMS
65%
NoSQL
30%
Hadoop
5%

ETL is 75% of all Big Data Projects
Surveying, cleaning and loading data is
the majority of the billable time for
new Big Data projects.

3.
Construct Data Pipelines
Build vs. Buy

Kappa Architecture on the Cloud

Pattern
✘How to build optimized cloud-based data pipelines?
-- Cloud-based ETL tools and processes
-- includes load-testing patterns and security practices
-- including connecting between different vendor clouds

Key Questions – Ingestion and ETL
✘Volume – how much and how fast, now and future?
✘Variety – what type(s) of data, any pre-processing needed?
✘Velocity – batches or streaming?
✘Veracity – verification on ingest needed? new data needed?

Actual Questions – Data Ingest and ETL
✘When (at what data volume) do you need AWS Snowball and/or AWS
Direct Connect for ingest?
✘What does ‘near-real time streaming’ exactly mean?
✘When do you need to use a product vs. a library or service for ETL?
✘Which 3rd party partner products are available for the Data services you
are using?
✘What are your actual data quality issues?
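
The first question above (at what volume does network ingest stop being practical) comes down to simple arithmetic. A rough sketch, assuming decimal terabytes and a link that sustains 80% of its rated speed; both assumptions and the function name are illustrative, and no vendor threshold is implied.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_tb: float, link_mbps: float, utilization: float = 0.8) -> float:
    """Estimate days needed to push `data_tb` terabytes over a network link."""
    bits = data_tb * 1e12 * 8                     # decimal TB -> bits
    effective_bps = link_mbps * 1e6 * utilization  # usable bits per second
    return bits / effective_bps / SECONDS_PER_DAY


# Hypothetical: 100 TB over a 1 Gbps link at 80% sustained utilization.
days = transfer_days(100, 1000)
print(f"{days:.1f} days")
```

If the estimate runs to weeks, a physical transfer appliance or a dedicated connection becomes worth pricing out.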

Serverless
Is this a cloud data pipeline pattern now?

Serverless Time Series Analysis
Batch
Storage: BigQuery
Storage: Cloud Storage
Time Series Processing: Cloud Dataflow
Analysis: Cloud Datalab
Storage: Cloud Bigtable*
Processing: Cloud Dataproc
Time Series Files: Cloud Storage
ML: Cloud ML
Streaming
Time Series Streaming: Cloud Pub/Sub
*Note: Use Bigtable with NoSQL workloads of 1 TB or more

DataFlow
✘ Managed Apache Beam
✘ Elegant pipeline API
✘ Beam as a product -> GCP
✘ Scala? Java?
Head-to-Head
ETL – GCP DataFlow vs. GCP DataProc
DataProc
✘ Managed Apache Spark
✘ Large user base
✘ Spark as a product -> Databricks
✘ Cost compare?

“Considering…
✘Initial Load/Transform
✘Data Quality
✘Batch vs. Stream

Pipeline Phases
Phase 0
Eval Current Data - Quality & Quantity
Phase 1
Get New Data - Free or Premium
Phase 2
Build MVP & Forecast volume and growth
Phase 3
Load test at scale
Phase 4
Deploy – secure, audit and monitor

Cloud Big Data Vendors - ETL
AWS
✘ 14X market share of next competitor
✘ Notable: Many, strong ETL Partners
GCP
✘ Fastest player / Serverless
✘ Notable: DataFlow requires
Java or Python developers

How Best to Ingest and ETL your Data?
Storage Type | Complexity | Scalability | Developer Cost
RDBMS | medium | medium | low
Hadoop | hard | huge | very high

Building a Streaming Pipeline
Stream Interval Window
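
A minimal sketch of the interval-window idea: events are grouped into fixed, non-overlapping (tumbling) windows by timestamp and counted per key. The event tuples and the 10-second window are hypothetical; managed services add late-data handling and triggers that this sketch ignores.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key inside fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns {window_start: {key: count}} ordered by window start.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        start = (ts // window_seconds) * window_seconds  # window the event falls in
        windows[start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical event stream: (timestamp in seconds, event type).
events = [(0, "click"), (3, "click"), (7, "view"), (12, "click"), (19, "view")]
result = tumbling_window_counts(events, 10)
print(result)  # {0: {'click': 2, 'view': 1}, 10: {'click': 1, 'view': 1}}
```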

Key Questions - Streaming
✘Volume – how much data now and predicted over next 12 months?
✘Variety – what types of data now and future?
✘Velocity – volume of input data / time now and near future?
✘Veracity – volume of EXISTING data now

Cloud Big Data Vendors - Streaming
AWS
✘ 14X market share of next competitor
✘ Notable: Kinesis/Firehose
GCP
✘ Fastest player
✘ Notable: DataFlow flexible

Cloud Offerings – Data and Pipelines
Category | AWS | Google
Managed RDBMS | RDS Aurora | Spanner / Cloud SQL
Data Warehouse | Redshift / Athena | BigQuery
NoSQL buckets | S3, Glacier | Cloud Storage, Nearline
NoSQL Key-Value, NoSQL Wide Column | DynamoDB | Bigtable, Cloud Datastore
Streaming or ML | Kinesis, AWS Machine Learning | Pub/Sub or DataFlow, Google Machine Learning
NoSQL Document, NoSQL Graph | MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE
Hadoop / Spark | Elastic MapReduce | DataProc
Cloud ETL | Data Pipelines | DataFlow

How Best to Stream your Data?
Approach | Complexity | Scalability | Developer Cost
Micro-Batches | easy | medium | low
Stream Windows | difficult | big | high
Near Real-time | very difficult | huge | high
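
Micro-batching, the easiest row in the table above, can be sketched as cutting a stream into small fixed-size groups that are then processed like ordinary batches. The count-based cut and the function name are assumptions; production systems often cut by time interval instead.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of up to `batch_size` items from any iterable stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


# Hypothetical stream of seven records, processed three at a time.
batches = list(micro_batches(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each batch can be handed to existing batch tooling, which is a large part of why this approach rates as low developer cost.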

Designing Streaming Cloud Data Pipelines
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business

Financial Services > Monte Carlo Simulations
Data Processing
Bespoke Apps: Compute Engine
Storage/Analysis: BigQuery
Storage: Cloud Storage
Dataflow/Beam: Cloud Dataflow
Visualization: Cloud Datalab
Storage: Cloud Bigtable
Hadoop/Spark: Cloud Dataproc

Financial Services > Time Series Analysis
Batch
Storage: BigQuery
Storage: Cloud Storage
Cloud Dataflow
Analysis: Cloud Datalab
Storage: Cloud Bigtable
Processing: Cloud Dataproc
Time Series Files: Cloud Storage
ML: Cloud ML
Streaming: Cloud Pub/Sub
Life Sciences > Variant Analysis
Private Datasets and Public Datasets, each in Cloud Storage: MSSNG Autism, 1000 Genomes, Patient Data, Illumina Platform, Ref Genomes, TCGA
Scientist
High Throughput Genome Sequencers
Analytics
Online Analytics: BigQuery
Batch Analytics: Cloud Dataflow
Lab Notebooks: Cloud Datalab
Data Ingest: Genomics
BAM, FASTQ

Streaming
Big Data > Complex Event Processing
Cloud Apps
Compute Engine
Streaming
Batch
Push to Devices
App Engine
Rules Engine
Cloud Dataflow
Data Analysis
Cloud Datalab
Mobile Devices
Push Notifications
Report & Share
Business Analysis
Cloud Apps
Compute Engine
On-Premises
Databases
On-Premises
Applications
Processed Events
Cloud Bigtable
Events Time Series
Data Warehouse
BigQuery
Execution Results
Streaming
Cloud Pub/Sub
Transactions
Processing
Cloud Dataflow
Transaction Streams
Messaging
Cloud Pub/Sub
Rules Actions
ETL
Cloud Dataflow
Transform Data
Cloud Data
Cloud Storage
Rules Engine
Cloud Dataproc

4.
Select ETL and Visualization
Analytics and Presentation

Pattern
✘How best to Query and Visualize
-- When to use business analytics vs. predictive analytics (machine learning)
-- how best to present data to clients - partner visualization products or roll
your own

Making Sense of Data
Machine
Learning
Reports Presentation

Actual Questions - Query
✘What query skills do the Dev / DevOps teams have?
✘Do we use an ORM?
✘Are transactions needed?
✘Which RDBMS features are we using now? e.g. PL/SQL, indexing…
✘Which language(s) do our analysts use? Excel, R, Python?

Graphs
What is the nature of your questions?

Practice
Applying Concepts – Trying out Cypher on Neo4j
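
To show what a graph-shaped question looks like, here is a plain-Python friends-of-friends lookup. The tiny adjacency dictionary and the names in it are made up; the Cypher in the comment is only a roughly comparable query, with `Person` and `FRIEND` as hypothetical labels.

```python
# Hypothetical social graph: person -> set of people they follow.
GRAPH = {
    "ann": {"bob", "cat"},
    "bob": {"dan"},
    "cat": {"dan", "eve"},
    "dan": set(),
    "eve": set(),
}


def friends_of_friends(graph: dict, person: str) -> set:
    """Return people exactly two hops away, excluding direct friends and self."""
    direct = graph.get(person, set())
    second_hop = set()
    for friend in direct:
        second_hop |= graph.get(friend, set())
    return second_hop - direct - {person}


# A roughly comparable Cypher query (labels are assumptions):
#   MATCH (p:Person {name: 'ann'})-[:FRIEND]->()-[:FRIEND]->(fof)
#   RETURN DISTINCT fof.name
fof = sorted(friends_of_friends(GRAPH, "ann"))
print(fof)  # ['dan', 'eve']
```

When most questions are about hops and paths like this one, a graph store and its query language usually express them more directly than joins do.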

Cloud Big Data Vendors - Query
AWS
✘ 14X market share of next competitor
✘ Notable: Big Relational
GCP
✘ Fastest player
✘ Notable: Flexible, powerful
machine learning

Query Languages
SQL
Everyone knows it
But how well do they know it?
NoSQL Vendor Language
Too many to list
How will you learn it?
Cypher
Query language for graph databases
The future?
ORM
Good, bad or horrible?
Again, how well do they know it?
Spark
Scala or Python (functional-style)
Machine Learning Queries
SciPy, NumPy or Python
R Language
Julia Language
Many more…

Practice
Applying Concepts – Understanding Query Language Costs

How Best to Query your Data?
Columns: Business Analytics | Predictive Analytics | Developer Cost
RDBMS
NoSQL
Hadoop

How Best to Query your Data?
Storage Type | Business Analytics | Predictive Analytics | Developer Cost
RDBMS | easy | medium | low
NoSQL | hard | very hard | very high
Hadoop | hard | hard | very high

Machine Learning aka Predictive Analytics
AWS
✘ ML for developers
✘ GUI-based –
regression/classification only
✘ MXNet for deep neural networks,
works best with GPUs
GCP
✘ ML for ML developers
✘ Scaled or specialized ML models
✘ Many model types – vision, text,
translate, speech, jobs, video
✘ TensorFlow for neural networks, can
use GPUs

Presentation
If you can’t see it, it’s not worth it.

Dashboards
✘ More than KPIs
✘ Mobile
✘ Alerts
✘ Data Stories
Innovation in Data Visualization
Reports
✘ Level of Detail
✘ Meaningful Taxonomies
✘ Fast enough
✘ Drill for Data

D3
The language of Data Visualization

Cloud Big Data Vendors - Visualization
AWS
✘ Notable: Partners &
QuickSight
GCP
✘ BigQuery Partners
✘ Notable: 360 product - New
Dashboards, iPython-style
notebooks

5.
Review End-to-end Patterns
For major cloud vendors

Big Data > Time Series Analysis
Batch
Storage: BigQuery
Storage: Cloud Storage
Cloud Dataflow
Analysis: Cloud Datalab
Storage: Cloud Bigtable*
Processing: Cloud Dataproc
Time Series Files: Cloud Storage
ML: Cloud ML
Streaming: Cloud Pub/Sub

Big Data > Bioinformatics

6.
Cloud IoT Patterns
Internet of Things

Data Generation Device

The IoT Market: $235,000,000,000
By the year: 2017
20 Billion devices, and a lot of users

Internet of Things > Sensor stream ingest and processing
Stages: Ingest, Pipelines, Storage, Analytics, Application & Presentation
Standard Devices: HTTPS
Constrained Devices: Non-TCP, e.g. BLE
Gateway
Services: App Engine, Container Engine, Cloud Storage, Cloud Pub/Sub, Cloud Dataflow, Monitoring, Logging, Cloud Datastore, Cloud Bigtable, BigQuery, Cloud Dataproc, Cloud Datalab, Compute Engine

Cloud Big Data Vendors - IoT
AWS
✘ First to market
✘ Notable: MQTT support,
AWS IoT Rules
GCP
✘ Still in Beta / Android Things
✘ Fastest player
✘ Notable: Android Things,
Weave, Nest

“The Future is Functional”
@LynnLangit
