Beyond Relational:
Applying Big Data Cloud Patterns
Lynn Langit
QCon London – March 2017
About this Workshop
Learning from Cloud (AWS / GCP) Architectures
1. Landscape
2. Data Servers and Services
3. Data Pipelines
4. ETL and Visualization Services
5. End-to-End Patterns
6. Bonus: IoT
1.
Understand the Landscape
History
‘Old’ Big Data
Save ALL of your Data (in a Data Lake)
But NOT in HDFS…sorry Cloudera
What is a Cloud-based Data Lake Architecture?
AWS Data Lake Architecture
Disruptor: Data Lake on GCP
✘Well-priced, Highly Available, Auto-scaled
✘Simple, One API, NoOps
✘SQL ‘over’ it via Query Services
Practice
Applying Concepts – Data Lake Features (AWS or GCP)
“Finding a path -- the ACTUAL
Cost of…
✘ Saving all Data into a Data Lake
✘ Using newer database technologies
✘ Going beyond Relational
2.
Evaluate Cloud Data Servers & Services
Choice… is good, right?
Past -- Workloads and Terms
Workload               Example Products                   Other Names
Operational Database   MySQL, SQL Server, Oracle          OLTP
Analytic Database      SSAS, Cognos, SAS                  OLAP
NoSQL Database         MongoDB, Redis, Cassandra, Neo4j   Document, K/V, Graph, Column-store
Hadoop                 Cloudera, Hortonworks              Big Data
???
The fastest growing cloud-based Big Data products are…
Relational
The fastest growing cloud-based Big Data products are…
“When do I use…?
✘Hadoop / Spark
✘NoSQL
✘Big Relational
“Existing relational databases
to the cloud?
Operational, Data Warehouse
Prerequisite – move / store files in the cloud – prefer Data Lake to HDFS
Cloud Vendors – Files as Data Lake
AWS
✘ S3/Glacier
✘ Most complete offering
✘ Most mature offering
✘ Notable: S3 management
GCP
✘ GCS with 4 types
✘ Lean, mean and cheap
✘ Unified API
✘ Notable: Simple pricing
Serverless
Do you need to manage servers?
AWS Console
20 Data services
GCP Console
11 Data Services
GCP Serverless Data Warehouse
Reference table
Query / Compute
BigQuery
Customer Lists / Reference Data
Export Ad Data
Cloud Storage
Id matching
Cloud Dataflow
Marketing List
DoubleClick
Campaign Manager
Google Analytics
Relevant Users
Cloud Storage
Analysts
DataStudio 360
Dashboards
BigQuery
✘ Mature – based on Dremel (Google)
✘ Import (batch or stream) from GCS++
✘ ANSI SQL queries
✘ Interactive or batch pricing
✘ NoOps / Serverless
Head-to-Head
Data Warehouse -- GCP BigQuery vs. AWS Redshift
Redshift
✘ Mature – big market share
✘ Big MySQL – ANSI SQL query
✘ PostgreSQL dialect
✘ Cost compare?
✘ Manual partition/scale up
✘ Huge partner ecosystem (ETL, viz)
Disruptor:
AWS Athena
✘Well-priced
✘Highly Available
✘Auto-scaled
✘Simple
✘NoOps
✘Analytics
✘NoSQL
Practice
Applying Concepts – Serverless SQL Queries
(AWS Athena or GCP BigQuery)
BigQuery
✘ Mature – based on Dremel (Google)
✘ Import (batch or stream) from GCS++
✘ ANSI SQL queries
✘ Interactive or batch pricing (streaming)
✘ 3rd party vendor integration (ETL, viz)
Head-to-Head
SQL Query -- GCP BigQuery vs. AWS Athena
Athena
✘ New – based on Apache Presto (from FB)
✘ SQL query against S3 data
✘ Cost compare - $ 5/TB scanned
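The per-terabyte price above lends itself to a quick back-of-the-envelope estimate. The following is a minimal sketch, not a billing calculator: the $5/TB figure is the one quoted on the slide, while the function name and the use of decimal terabytes (10^12 bytes) are assumptions made for illustration. Actual billing rules (rounding, minimum charges) may differ.

```python
# Price per TB scanned, as quoted on the slide (2017 figure).
USD_PER_TB_SCANNED = 5.0

# Assumption: 1 TB is treated as 10**12 bytes for this estimate.
BYTES_PER_TB = 10**12


def scan_cost_usd(bytes_scanned: int, usd_per_tb: float = USD_PER_TB_SCANNED) -> float:
    """Estimate the cost of one query from the number of bytes it scans."""
    return (bytes_scanned / BYTES_PER_TB) * usd_per_tb


if __name__ == "__main__":
    # A query that scans 250 GB of S3 data.
    print(f"${scan_cost_usd(250 * 10**9):.2f}")
```

Because cost follows bytes scanned rather than rows returned, anything that reduces the scan (compression, columnar formats, partitioned folders) would lower the estimate proportionally.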
What is Spanner?
Disruptor:
GCP Cloud Spanner
✘Well-priced
✘Highly Available
✘Auto-scaled
✘Transactional
✘NoOps
✘Operational
✘RDBMS
Aurora
✘ Mature
✘ ANSI SQL queries
✘ MySQL compliant
✘ 3rd party vendor integration (ETL, viz)
Head-to-Head
Relational Database – AWS Aurora vs. GCP Spanner
Spanner
✘ New (beta)
✘ ANSI SQL queries
✘ CAP – auto-scales, transactionally consistent, auto-partitions
✘ Cost compare?
Cloud Vendors
RDBMS Database (Operational or Analytics)
AWS
✘ 14X market share of next competitor
✘ Most complete offering
✘ Most mature offering - partners
✘ Notable: Aurora, Big Relational RDBMS
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: BigQuery, MPP SQL Query as a Service
Cloud Offerings – RDBMS and Files
                 AWS                       Google
Managed RDBMS    RDS Aurora, RDS <Other>   Spanner, Cloud SQL
Data Warehouse   Redshift, Athena*         BigQuery*
NoSQL buckets    S3*, Glacier*             Cloud Storage (Multi-regional, Regional)*, Nearline, Coldline*
*Serverless services
Practice
Applying Concepts – Big Relational
Reasons to use Big Relational Cloud Services
Developers DevOps Cloud Vendors – AWS
Developers DevOps Cloud Vendors – GCP
Reasons to use Big Relational Cloud Services
Developers
✘ Most know RDBMS query patterns
✘ Many know basic administration
DevOps
✘ Most know RDBMS administration
✘ Many know basic RDBMS queries
✘ Many know query optimization
Cloud Vendors - AWS
✘ Aurora – RDBMS up to 64 TB
✘ Redshift - $ 1k USD / 1 TB / year
✘ Rich partner ecosystem – ETL
✘ Integration with AWS products
Developers
✘ Most know coding language patterns to interact with RDBMS systems
DevOps
✘ Familiar RDBMS security patterns
✘ Familiar auditing
✘ Partner tooling integration
Cloud Vendors - GCP
✘ Spanner / BigQuery – familiar SQL queries
✘ No hassle streaming ingest
✘ No hassle pay-as-you-go
✘ NoOps administration
My past top Big Data Cloud Services
“When do I use…?
✘Hadoop
✘NoSQL
✘Big Relational
NoSQL Database Types to Choose From
Let’s review some NoSQL concepts
Key-Value
Redis, Riak, Aerospike
Graph
Neo4j
Document
MongoDB
Wide-Column
Cassandra, HBase
Key Questions - Storage
✘Volume – how much now, what growth rate?
✘Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc…
✘Velocity – batches, streams, both, what ingest rate?
✘Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?
Actual Questions - NoSQL
✘When do you use Redis for caching?
✘When do you use MongoDB or AWS DynamoDB for JSON data?
✘When do you use Neo4j for graph-type data?
✘Why are you using something else, e.g. Cassandra, Riak, etc.?
✘Open Source is Free
§ Rapid iteration, innovation
§ Can start up for free (on premise)
§ Can ‘rent’ for cheap or free on the cloud
§ Can use with the command line for free
§ Some vendors offer free online training
§ Ex. www.neo4j.org
✘Not Free
§ Constant releases
§ Can be deceptively hard to set up (time is money)
§ Don’t forget to turn it off if on the cloud!
§ GUI tools, support, training cost $$$
§ Ex. www.neo4j.com
NoSQL Example
Practice
Applying Concepts - NoSQL
NoSQL Applied
Log Files • ???
Product Catalogs • ???
Social Games • ???
Social aggregators • ???
Line-of-Business • ???
NoSQL Applied
Log Files • Columnstore • HBase
Product Catalogs • Key/Value • Redis
Social Games • Document • MongoDB
Social aggregators • Graph • Neo4j
Line-of-Business • RDBMS • SQL Server
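The workload-to-store mapping on this slide can be kept as a small lookup table, which is handy when reviewing an existing system inventory. This is a sketch only: the pairings are copied from the slide, while the dictionary name, the helper function, and the choice to return `None` for unknown workloads are illustrative assumptions.

```python
from typing import Optional, Tuple

# Workload -> (store type, example product), as listed on the slide.
STORE_FOR_WORKLOAD = {
    "log files": ("columnstore", "HBase"),
    "product catalogs": ("key/value", "Redis"),
    "social games": ("document", "MongoDB"),
    "social aggregators": ("graph", "Neo4j"),
    "line-of-business": ("RDBMS", "SQL Server"),
}


def suggest_store(workload: str) -> Optional[Tuple[str, str]]:
    """Return (store type, example product), or None if the workload is not listed."""
    return STORE_FOR_WORKLOAD.get(workload.strip().lower())


if __name__ == "__main__":
    print(suggest_store("Social Games"))
    print(suggest_store("Sensor Telemetry"))  # not on the slide, so None
```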
More than NoSQL
NoSQL
✘ Non-relational
✘ Can be optimized in-memory
✘ Eventually consistent
✘ Schema on Read
✘ Example: Aerospike
NewSQL
✘ Relational plus more
✘ Often in-memory
✘ Some kind of SQL-layer
✘ Schema on Write
✘ Example: MemSQL
U-SQL
✘ What???
✘ Microsoft’s universal SQL language
✘ Example: Azure Data Lake
Focus
How Best to Store your Data?
        Complexity   Scalability   Developer Cost
RDBMS   easy         medium*       low
NoSQL   medium       big           high
*GCP Cloud Spanner is the exception here
Practice
Applying Concepts – Real Cost of Storage Types
Cloud NoSQL Applied – AWS
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
Cloud NoSQL Applied – AWS
Log Files • Stream or Hadoop • Kinesis or EMR
Product Catalogs • Key/Value • DynamoDB
Social Games • Document • MongoDB
Social aggregators • Graph • Neo4j
Line-of-Business • RDBMS • RDS
Cloud NoSQL Applied – GCP
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
Cloud NoSQL Applied – GCP
Log Files • Pub/Sub • BigTable
Product Catalogs • Key/Value • DataStore
Social Games • Document • Spanner
Social aggregators • Graph • Neo4j
Line-of-Business • RDBMS • Spanner
AWS Console
20 Data services
GCP Console
11 Data Services
Cloud Offerings – Big Relational and NoSQL
                                     AWS                            Google
Managed RDBMS                        RDS Aurora                     Spanner, Cloud SQL
Data Warehouse                       Redshift or Athena             BigQuery
NoSQL buckets                        S3, Glacier                    Cloud Storage, Nearline
NoSQL Key-Value, NoSQL Wide Column   DynamoDB                       Bigtable, Cloud Datastore
NoSQL Document, NoSQL Graph          MongoDB on EC2, Neo4j on EC2   MongoDB on GCE, Neo4j on GCE
“When do I use…?
✘Hadoop / Spark
✘NoSQL
✘Big Relational
Size Matters
One Vendor’s View
Where is Hadoop / Spark Used?
“What is Hadoop Spark?
✘ In-memory
✘ Distributed
✘ Integrated with other libraries
Spark is Fast, Unified and Usable
Disruptor: DataBricks
✘Managed Apache Spark on AWS
Practice
Applying Concepts – Apache Spark via Databricks
Hadoop is becoming Spark…
Volume
✘10 TB or greater to start
✘Growth of 25% YOY
✘Where FROM
✘Where TO
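The volume figures above (10 TB to start, 25% year-over-year growth) can be projected forward with simple compound growth. This is a minimal sketch; the function name and the assumption that growth compounds once per year at a constant rate are illustrative, not something the slide specifies.

```python
def projected_volume_tb(start_tb: float, yoy_growth: float, years: int) -> list:
    """Return the projected volume in TB for year 0 through `years`, inclusive."""
    return [start_tb * (1 + yoy_growth) ** year for year in range(years + 1)]


if __name__ == "__main__":
    # Slide figures: 10 TB to start, growing 25% per year.
    for year, tb in enumerate(projected_volume_tb(10, 0.25, 5)):
        print(f"year {year}: {tb:.1f} TB")
```

Running the projection for a few different growth rates gives a quick sense of when a workload crosses a sizing threshold.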
Velocity and Variety
✘Spark over Hive
✘Kafka and Samza
Veracity
✘Pay, train and hire team
✘Top $$$ for talent, IF you can find it
✘WATCH OUT for Cloud Vendors who promise ‘easy access’
✘Complexity of ecosystem
✘DataBricks knows best for Spark
✘Cloudera knows best for on premise
How Best to Store your Data?
         Complexity   Scalability   Developer Cost
RDBMS    easy         medium*       low
NoSQL    medium       big           high
Hadoop   hard         huge          very high
Cloud Offerings – Big Data
                                     AWS                            Google
Managed RDBMS                        RDS Aurora                     Spanner, Cloud SQL
Data Warehouse                       Redshift or Athena             BigQuery
NoSQL buckets                        S3, Glacier                    Cloud Storage, Nearline
NoSQL Key-Value, NoSQL Wide Column   DynamoDB                       Bigtable, Cloud Datastore
NoSQL Document, NoSQL Graph          MongoDB on EC2, Neo4j on EC2   MongoDB on GCE, Neo4j on GCE
Hadoop / Spark                       Elastic MapReduce              DataProc
Real World Big Data -- When do I use what?
RDBMS
65%
NoSQL
30%
Hadoop
5%
ETL is 75% of all Big Data Projects
Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
3.
Construct Data Pipelines
Build vs. Buy
Kappa Architecture on the Cloud
Pattern
✘How to build optimized cloud-based data pipelines?
-- Cloud-based ETL tools and processes
-- includes load-testing patterns and security practices
-- including connecting between different vendor clouds
Key Questions – Ingestion and ETL
✘Volume – how much and how fast, now and future?
✘Variety – what type(s) of data, any pre-processing needed?
✘Velocity – batches or streaming?
✘Veracity – verification on ingest needed? new data needed?
Actual Questions – Data Ingest and ETL
✘When (at what data volume) do you need AWS Snowball and/or AWS Direct Connect for ingest?
✘What does ‘near-real time streaming’ exactly mean?
✘When do you need to use a product vs. a library or service for ETL?
✘Which 3rd party partner products are available for the Data services you are using?
✘What are your actual data quality issues?
Serverless
Is this a cloud data pipeline pattern now?
Serverless Time Series Analysis
Batch Storage
BigQuery
Storage
Cloud Storage
Time Series Processing
Cloud Dataflow
Analysis
Cloud Datalab
Storage
Cloud Bigtable*
Processing
Cloud Dataproc
Time Series Files
Cloud Storage
ML
Cloud ML
Streaming
Time Series Streaming
Cloud Pub/Sub
*Note: Use Bigtable with NoSQL workloads of 1 TB or more
DataFlow
✘ Managed Apache Beam
✘ Elegant pipeline API
✘ Beam as a product -> GCP
✘ Scala? Java?
Head-to-Head
ETL – GCP DataFlow vs. GCP DataProc
DataProc
✘ Managed Apache Spark
✘ Large user base
✘ Spark as a product -> DataBricks
✘ Cost compare?
“Considering…
✘Initial Load/Transform
✘Data Quality
✘Batch vs. Stream
Pipeline Phases
Phase 0
Eval Current Data - Quality & Quantity
Phase 1
Get New Data - Free or Premium
Phase 2
Build MVP & Forecast volume and growth
Phase 3
Load test at scale
Phase 4
Deploy – secure, audit and monitor
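Phase 0 (evaluating the quality and quantity of current data) often starts with a very simple profile: how many rows, how many missing values per field, how many duplicate keys. The sketch below shows one way to do that in plain Python. The function name, the row format (a list of dicts), and the treatment of `None` and empty strings as "missing" are all assumptions made for illustration.

```python
def profile_rows(rows, key_field):
    """Count rows, missing values per field, and repeated keys in a list of dicts."""
    missing = {}
    seen_keys = set()
    duplicate_keys = 0
    for row in rows:
        for field, value in row.items():
            # Assumption: None and "" both count as missing.
            if value is None or value == "":
                missing[field] = missing.get(field, 0) + 1
        key = row.get(key_field)
        if key in seen_keys:
            duplicate_keys += 1
        else:
            seen_keys.add(key)
    return {"rows": len(rows), "missing": missing, "duplicate_keys": duplicate_keys}


if __name__ == "__main__":
    # Hypothetical sample records.
    sample = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": ""},
        {"id": 1, "email": None},
    ]
    print(profile_rows(sample, "id"))
```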
Cloud Big Data Vendors - ETL
AWS
✘ 14X market share of next competitor
✘ Notable: Many, strong ETL Partners
GCP
✘ Lean, mean and cheap
✘ Fastest player / Serverless
✘ Notable: DataFlow requires Java or Python developers
How Best to Ingest and ETL your Data?
         Complexity   Scalability   Developer Cost
RDBMS    medium       medium        low
NoSQL    medium       big           high
Hadoop   hard         huge          very high
“Considering…
✘Initial Load/Transform
✘Data Quality
✘Batch vs. Stream
Building a Streaming Pipeline
Stream Interval Window
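The idea of windowing a stream can be illustrated with the simplest variant, a fixed (tumbling) window: each event falls into exactly one non-overlapping time bucket. The sketch below is plain Python over an in-memory list, not the API of any streaming product; the function name and the `(timestamp, value)` event shape are assumptions.

```python
from collections import defaultdict


def tumbling_windows(events, window_seconds):
    """Group (timestamp, value) events into fixed, non-overlapping windows.

    Returns a dict mapping each window's start time to the values inside it,
    ordered by window start.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        # Integer division snaps the timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start].append(value)
    return dict(sorted(windows.items()))


if __name__ == "__main__":
    # Hypothetical events: (seconds since start, reading).
    events = [(0, 1), (3, 2), (5, 3), (11, 4)]
    print(tumbling_windows(events, 5))
```

Real streaming systems also have to decide what to do with late or out-of-order events, which this sketch ignores.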
Key Questions - Streaming
✘Volume – how much data now and predicted over next 12 months?
✘Variety – what types of data now and future?
✘Velocity – volume of input data / time now and near future?
✘Veracity – volume of EXISTING data now
Cloud Big Data Vendors - Streaming
AWS
✘ 14X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Kinesis/Firehose
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: DataFlow flexible
Cloud Offerings – Data and Pipelines
                                     AWS                             Google
Managed RDBMS                        RDS Aurora                      Spanner / Cloud SQL
Data Warehouse                       Redshift / Athena               BigQuery
NoSQL buckets                        S3, Glacier                     Cloud Storage, Nearline
NoSQL Key-Value, NoSQL Wide Column   DynamoDB                        Bigtable, Cloud Datastore
Streaming or ML                      Kinesis, AWS Machine Learning   Pub/Sub or DataFlow, Google Machine Learning
NoSQL Document, NoSQL Graph          MongoDB on EC2, Neo4j on EC2    MongoDB on GCE, Neo4j on GCE
Hadoop / Spark                       Elastic MapReduce               DataProc
Cloud ETL                            Data Pipelines                  DataFlow
How Best to Stream your Data?
                 Complexity       Scalability   Developer Cost
Micro-Batches    easy             medium        low
Stream Windows   difficult        big           high
Near Real-time   very difficult   huge          high
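Micro-batching, the easiest row in the table, amounts to collecting incoming records into small fixed-size groups and processing each group as a unit. A minimal sketch in plain Python follows; the generator name and the count-based batch boundary are assumptions (many systems batch on a time interval instead).

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield successive lists of up to `batch_size` records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


if __name__ == "__main__":
    for batch in micro_batches(range(5), 2):
        # In a real pipeline each batch would be written to storage in one call.
        print(batch)
```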
Practice
Applying Concepts
Designing Streaming Cloud Data Pipelines
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
Data Processing
Financial Services > Monte Carlo Simulations
Bespoke Apps
Compute Engine
Storage/Analysis
BigQuery
Storage
Cloud Storage
Dataflow/Beam
Cloud Dataflow
Visualization
Cloud Datalab
Storage
Cloud Bigtable
Hadoop/Spark
Cloud Dataproc
Batch
Financial Services > Time Series Analysis
Storage
BigQuery
Storage
Cloud Storage
Time Series Processing
Cloud Dataflow
Analysis
Cloud Datalab
Storage
Cloud Bigtable
Processing
Cloud Dataproc
Time Series Files
Cloud Storage
ML
Cloud ML
Streaming
Time Series Streaming
Cloud Pub/Sub
Private Datasets Public Datasets
Life Sciences > Variant Analysis
MSSNG Autism
Cloud Storage
Scientist
High Throughput Genome Sequencers
1000 Genomes
Cloud Storage
Patient Data
Cloud Storage
Illumina Platform
Cloud Storage
Ref Genomes
Cloud Storage
TCGA
Cloud Storage
Analytics
Online Analytics
BigQuery
Batch Analytics
Cloud Dataflow
Lab Notebooks
Cloud Datalab
Data Ingest
Genomics
BAM
FASTQ
Streaming
Big Data > Complex Event Processing
Cloud Apps
Compute Engine
Streaming
Batch
Push to Devices
App Engine
Rules Engine
Cloud Dataflow
Data Analysis
Cloud Datalab
Mobile Devices
Push Notifications
Report & Share
Business Analysis
Cloud Apps
Compute Engine
On-Premises
Databases
On-Premises
Applications
Processed Events
Cloud Bigtable
Events Time Series
Data Warehouse
BigQuery
Execution Results
Streaming
Cloud Pub/Sub
Transactions
Processing
Cloud Dataflow
Transaction Streams
Messaging
Cloud Pub/Sub
Rules Actions
ETL
Cloud Dataflow
Transform Data
Cloud Data
Cloud Storage
Rules Engine
Cloud Dataproc
4.
Select ETL and Visualization
Analytics and Presentation
Pattern
✘How best to Query and Visualize
-- When to use business analytics vs. predictive analytics (machine learning)
-- how best to present data to clients - partner visualization products or roll your own
Making Sense of Data
Machine
Learning
Reports Presentation
Actual Questions - Query
✘What query skills do the Dev / DevOps teams have?
✘Do we use an ORM?
✘Are transactions needed?
✘Which RDBMS features are we using now? e.g. PL/SQL, indexing…
✘Which language(s) do our analysts use? Excel, R, Python?
Graphs
What is the nature of your questions?
Practice
Applying Concepts – Trying out Cypher on Neo4j
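To get a feel for the kind of question a graph query answers, here is a "friends of friends" lookup written in plain Python over a tiny in-memory graph. Everything in it is hypothetical sample data, and it does not use Neo4j or any driver. A roughly equivalent Cypher query is shown in a comment; the `Person` label, `name` property, and `KNOWS` relationship type there are assumed for illustration.

```python
# Hypothetical sample data: who KNOWS whom (directed relationships).
KNOWS = {
    "Ann": {"Bob", "Cid"},
    "Bob": {"Dee"},
    "Cid": {"Dee", "Eve"},
    "Dee": set(),
    "Eve": {"Ann"},
}


def friends_of_friends(graph, person):
    """People two KNOWS hops away, excluding the person and their direct friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {person}


# Roughly the same question in Cypher (label and property names are assumptions):
#   MATCH (a:Person {name: 'Ann'})-[:KNOWS]->()-[:KNOWS]->(fof)
#   WHERE fof <> a AND NOT (a)-[:KNOWS]->(fof)
#   RETURN DISTINCT fof.name

if __name__ == "__main__":
    print(sorted(friends_of_friends(KNOWS, "Ann")))
```

The pattern-matching form of the Cypher query stays about the same length as the number of hops grows, which is the main appeal compared with chaining self-joins in SQL.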
Cloud Big Data Vendors - Query
AWS
✘ 5X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Big Relational
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: Flexible, powerful machine learning
Query Languages
SQL
Everyone knows it
But how well do they know it?
NoSQL Vendor Language
Too many to list
How will you learn it?
Cypher
Query language for graph databases
The future?
ORM
Good, bad or horrible?
Again, how well do they know it?
Spark
Scala or Python (functional-style)
Machine Learning Queries
SciPy, NumPy or Python
R Language
Julia Language
Many more…
Practice
Applying Concepts – Understanding Query Language Costs
How Best to Query your Data?
         Business Analytics   Predictive Analytics   Developer Cost
RDBMS
NoSQL
Hadoop
How Best to Query your Data?
         Business Analytics   Predictive Analytics   Developer Cost
RDBMS    easy                 medium                 low
NoSQL    hard                 very hard              very high
Hadoop   hard                 hard                   very high
Machine Learning aka Predictive Analytics
AWS
✘ ML for developers
✘ GUI-based – regression/classification only
✘ MXNet for deep neural networks, works best with GPUs
GCP
✘ ML for ML developers
✘ Scaled or specialized ML models
✘ Many model types – vision, text, translate, speech, jobs, video
✘ TensorFlow for neural networks, can use GPUs
Presentation
If you can’t see it, it’s not worth it.
Dashboards
✘ More than KPIs
✘ Mobile
✘ Alerts
✘ Data Stories
Innovation in Data Visualization
Reports
✘ Level of Detail
✘ Meaningful Taxonomies
✘ Fast enough
✘ Drill for Data
D3
The language of Data Visualization
Cloud Big Data Vendors - Visualization
AWS
✘ Most complete offering
✘ Notable: Partners & QuickSight
GCP
✘ BigQuery Partners
✘ Notable: 360 product - New Dashboards, iPython-style notebooks
5.
Review End-to-end Patterns
For major cloud vendors
Big Data > Time Series Analysis
Batch Storage
BigQuery
Storage
Cloud Storage
Time Series Processing
Cloud Dataflow
Analysis
Cloud Datalab
Storage
Cloud Bigtable*
Processing
Cloud Dataproc
Time Series Files
Cloud Storage
ML
Cloud ML
Streaming
Time Series Streaming
Cloud Pub/Sub
*Note: Use Bigtable with NoSQL workloads of 1 TB or more
Big Data > Bioinformatics
*Note: Use Bigtable with NoSQL workloads of 1 TB or more
Azure Patterns
6.
Cloud IoT Patterns
Internet of Things
Data Generation Device
IoT is Big Data Realized
$235,000,000,000: the IoT market
By the year 2017
20 billion devices, and a lot of users
IoT all the Things
Ingest Pipelines
Storage
Analytics
Application &
Presentation
Standard Devices
HTTPS
Constrained Devices
Non-TCP, e.g. BLE
Gateway
Internet of Things > Sensor stream ingest and processing
App Engine
Container Engine
Cloud Storage
Cloud Pub/Sub
Cloud Dataflow
Monitoring
Logging
Cloud Dataflow
Cloud Datastore
Cloud Bigtable
BigQuery
Cloud Dataproc
Cloud Datalab
Compute Engine
IoT
Cloud Big Data Vendors - IoT
AWS
✘ First to market
✘ Most complete offering
✘ Most mature offering
✘ Notable: MQTT support, AWS IoT Rules
GCP
✘ Still in Beta / Android Things
✘ Fastest player
✘ Requires top developers
✘ Notable: Android Things, Weave, Nest
“The Future is Functional”
@LynnLangit
The Next Generation…

Beyond Relational

  • 1.
    Beyond Relational: Applying BigData Cloud Patterns Lynn Langit QCon London – March 2017
  • 2.
    About this Workshop Learningfrom Cloud (AWS / GCP) Architectures 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and VisualizationServices 5. End-to-End Patterns 6. Bonus: IoT
  • 3.
  • 4.
  • 5.
  • 8.
  • 9.
    But NOT inHDFS…sorry Cloudera What is a Cloud-based Data Lake Architecture?
  • 10.
    AWS Data LakeArchitecture
  • 12.
    Disruptor: Data Lakeon GCP ✘Well-priced,Highly Available,Auto-scaled ✘Simple, One API, NoOps ✘SQL ‘over’ it via Query Services
  • 14.
    Practice Applying Concepts –Data Lake Features (AWS or GCP)
  • 15.
    “Finding a path-- the ACTUAL Cost of… ✘ Saving all Data into a Data Lake ✘ Using newer database technologies ✘ Going beyond Relational
  • 16.
    About this Workshop Learningfrom Cloud (AWS / GCP) Architectures 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and VisualizationServices 5. End-to-End Patterns 6. Bonus: IoT
  • 17.
    2. Evaluate Cloud DataSevers & Services
  • 19.
  • 20.
    Past -- Workloadsand Terms Example Products Other Names Operational Database MySQL, SQL Server, Oracle OTLP Analytic Database SSAS, Cognos, SAS, OLAP NoSQL Database MongoDB, Redis, Cassandra, Neo4J Document, K/V, Graph, Column-store Hadoop Cloudera, Hortonworks Big Data
  • 22.
    ??? The fastest growingcloud-basedBig Data products are…
  • 23.
    Relational The fastest growingcloud-basedBig Data products are…
  • 24.
    “When do Iuse…? ✘Hadoop / Spark ✘NoSQL ✘Big Relational
  • 25.
    “Existing relational databases tothe cloud? Operational, Data Warehouse
  • 26.
    Prerequisite – move/ store files in the cloud – prefer Data Lake to HDFS
  • 27.
    Cloud Vendors –Files as Data Lake AWS ✘ S3/Glacier ✘ Most complete offering ✘ Most mature offering ✘ Notable: S3 management GCP ✘ GCS with 4 types ✘ Lean, mean and cheap ✘ Unified API ✘ Notable: Simple pricing
  • 28.
    Serverless Do you needto manage servers?
  • 29.
    Place your screenshothere AWS Console 20 Data services
  • 30.
    Place your screenshothere GCP Console 11 Data Services
  • 31.
    GCP Serverless DataWarehouse Reference table Query / Compute BigQuery Customer Lists / Reference Data Export Ad Data Cloud Storage Id matching Cloud Dataflow Marketing List DoubleClick Campaign Manager Google Analytics Relevant Users Cloud Storage Analysts DataStudio 360 Dashboards
  • 32.
    BigQuery ✘ Mature– basedon Dremel (Google) ✘ Import (batch or stream) from GCS++ ✘ ANSI SQL queries ✘ Interactive or batch pricing ✘ NoOps / Serverless Head-to-Head Data Warehouse -- GCP BigQuery vs. AWS Redshift Redshift ✘ Mature – big market share ✘ Big MySQL – ANSI SQL query ✘ PostgreSQL dialect ✘ Cost compare? ✘ Manual partition/scale up ✘ Huge partner ecosystem (ETL, viz)
  • 33.
  • 34.
    Practice Applying Concepts –Serverless SQL Queries (AWS Aurora or GCP Big Query)
  • 35.
    BigQuery ✘ Mature –based on Dremel (Google) ✘ Import (batch or stream) from GCS++ ✘ ANSI SQL queries ✘ Interactive or batch pricing (streaming) ✘ 3rd party vendor integration (ETL, viz) Head-to-Head SQL Query -- GCP BigQuery vs. AWS Athena Athena ✘ New – based on Apache Presto (from FB) ✘ SQL query against S3 data ✘ Cost compare - $ 5/TB scanned
  • 36.
  • 37.
    Disruptor: GCP Cloud Spanner ✘Well-priced ✘HighlyAvailable ✘Auto-scaled ✘Transactional ✘NoOps ✘Operational ✘RDBMS
  • 38.
    Aurora ✘ Mature ✘ ANSISQL queries ✘ MySQL compliant ✘ 3rd party vendor integration (ETL, viz) Head-to-Head Relational Database – AWS Aurora vs. GCP Spanner Spanner ✘ New (beta) ✘ ANSI SQL queries ✘ CAP – auto scales, transactional consistent, auto partitions ✘ Cost compare?
  • 39.
    Cloud Vendors RDBMS Database(Operational or Analytics) AWS ✘ 14X market share of next competitor ✘ Most complete offering ✘ Most mature offering - partners ✘ Notable: Aurora, Big Relational RDBMS GCP ✘ Lean, mean and cheap ✘ Fastest player ✘ Requires top developers ✘ Notable: BigQuery, MPP SQL Query as a Service
  • 40.
    Cloud Offerings –RDBMS and Files AWS Google Managed RDBMS RDS Aurora RDS <Other> Spanner Cloud SQL Data Warehouse Redshift Athena* BigQuery* NoSQL buckets S3* Glacier* Cloud Storage (Multi-regional, Regional)* Nearline, Coldline* *Serverless services
  • 41.
  • 42.
    Reasons to useBig Relational Cloud Services Developers DevOps Cloud Vendors – AWS Developers DevOps Cloud Vendors – GCP
  • 43.
    Reasons to useBig Relational Cloud Services Developers ✘ Most know RDBMS query patterns ✘ Many know basic administration DevOps ✘ Most know RDBMS administration ✘ Many know basic RDBMS queries ✘ Many know query optimization Cloud Vendors - AWS ✘ Aurora – RDBMS up to 64 TB ✘ Redshift - $ 1k USD / 1 TB / year ✘ Rich partner ecosystem – ETL ✘ Integration with AWS products Developers ✘ Most know coding language patterns to interact with RDBMS systems DevOps ✘ Familiar RDBMS security patterns ✘ Familiar auditing ✘ Partner tooling integration Cloud Vendors - GCP ✘ Spanner / Big Query – familiar SQL queries ✘ No hassle streaming ingest ✘ No hassle pay-as-you-go ✘ NoOps administration
  • 44.
    My past topBig Data Cloud Services
  • 45.
    “When do Iuse…? ✘Hadoop ✘NoSQL ✘Big Relational
  • 46.
  • 47.
    Let’s review someNoSQL concepts Key-Value Redis, Riak, Aerospike Graph Neo4j Document MongoDB Wide-Column Cassandra, HBase
  • 48.
  • 49.
    Key Questions -Storage ✘Volume – how much now, what growth rate? ✘Variety – what type(s) of data? ‘rectangular’,‘graph’,‘k-v’, etc… ✘Velocity – batches, streams,both, what ingest rate? ✘Veracity – current state (quality) of data, amount of duplicationof data stores, existence of authoritative (master) data management?
  • 50.
    Actual Questions -NoSQL ✘When do you use Redis for caching? ✘When do you use MongoDB or AWS DynamoDB for JSON data? ✘When do you use Neo4j for graph-type data? ✘Why are you using somethingelse, i.e. Cassandra, Riak, etc..?
  • 51.
    51 ✘Open Source isFree ✘Not Free § Rapid iteration, innovation § Can start up for free (on premise) § Can ‘rent’ for cheap or free on the cloud § Can use with the command line for free § Some vendors offer free online training § Ex. www.neo4j.org § Constant releases § Can be deceptively hard to set up (time is money) § Don’t forget to turn it off if on the cloud! § GUI tools, support, training cost $$$ § Ex. www.neo4j.com NoSQL Example
  • 52.
  • 53.
    NoSQL Applied Log Files •??? Product Catalogs • ??? Social Games • ??? Social aggregators • ??? Line-of- Business • ???
  • 54.
    NoSQL Applied Log Files •Columnstore • HBase Product Catalogs • Key/Value • Redis Social Games • Document • MongoDB Social aggregators • Graph • Neo4j Line-of- Business • RDBMS • SQL Server
  • 55.
    More than NoSQL NoSQL ✘Non-relational ✘ Can be optimized in-memory ✘ Eventually consistent ✘ Schema on Read ✘ Example: Aerospike NewSQL ✘ Relational plus more ✘ Often in-memory ✘ Some kind of SQL-layer ✘ Schema on Write ✘ Example: MemSQL U-SQL ✘ What??? ✘ Microsoft’s universal SQL language ✘ Example: Azure Data Lake
  • 56.
  • 57.
    How Best toStore your Data? Complexity Scalability Developer Cost RDBMS easy medium* low NoSQL medium big high *GCP Cloud Spanner is the exception here
  • 58.
    Practice Applying Concepts –Real Cost of Storage Types
  • 59.
    Cloud NoSQL Applied– AWS Log Files Product Catalogs Social Games Social aggregators Line-of- Business
  • 60.
    Cloud NoSQL Applied– AWS Log Files • Stream or Hadoop • Kinesis or EMR Product Catalogs • Key/Value • DynamoDB Social Games • Document • MongoDB Social aggregators • Graph • Neo4j Line-of- Business • RDBMS • RDS
  • 61.
    Cloud NoSQL Applied– GCP Log Files Product Catalogs Social Games Social aggregators Line-of- Business
  • 62.
    Cloud NoSQL Applied– GCP Log Files • Pub/Sub • BigTable Product Catalogs • Key/Value • DataStore Social Games • Document • Spanner Social aggregators • Graph • Neo4j Line-of- Business • RDBMS • Spanner
  • 63.
    Place your screenshothere AWS Console 20 Data services
  • 64.
    Place your screenshothere GCP Console 11 Data Services
  • 65.
    Cloud Offerings –Big Relational and NoSQL AWS Google Managed RDBMS RDS Aurora Spanner Cloud SQL Data Warehouse Redshift or Athena BigQuery NoSQL buckets S3 Glacier Cloud Storage Nearline NoSQL Key-Value NoSQL Wide Column DynamoDB Big Table Cloud Datastore NoSQL Document NoSQL Graph MongoDB on EC2 Neo4j on EC2 MongoDB on GCE Neo4j on GCE
  • 66.
    “When do Iuse…? ✘Hadoop / Spark ✘NoSQL ✘Big Relational
  • 67.
  • 69.
    One Vendor’s View Idon’t Want Text here
  • 71.
    Where is Hadoop/ Spark Used?
  • 72.
    “What is HadoopSpark? ✘ In-memory ✘ Distributed ✘ Integrated with other libraries
  • 74.
    Spark is Fast,Unified and Usable
  • 75.
  • 76.
    Practice Applying Concepts –Apache Spark via Databricks
  • 77.
    Hadoop is becomingSpark… Volume ✘10 TB or greater to start ✘Growth of 25% YOY ✘Where FROM ✘Where TO Velocity and Variety ✘Spark over Hive ✘Kafka and Samsa Veracity ✘Pay, train and hire team ✘Top $$$ for talent, IF you can find it ✘WATCH OUT for Cloud Vendors who promise ‘easy access’ ✘Complexity of ecosystem ✘DataBricks knows best for Spark ✘Cloudera knows best for on premise
  • 78.
    How Best toStore your Data? Complexity Scalability Developer Cost RDBMS easy medium* low NoSQL medium big high Hadoop hard huge very high
  • 79.
    Cloud Offerings –Big Data AWS Google Managed RDBMS RDS Aurora Spanner Cloud SQL Data Warehouse Redshift or Athena BigQuery NoSQL buckets S3 Glacier Cloud Storage Nearline NoSQL Key-Value NoSQL Wide Column DynamoDB Big Table Cloud Datastore NoSQL Document NoSQL Graph MongoDB on EC2 Neo4j on EC2 MongoDB on GCE Neo4j on GCE Hadoop / Spark Elastic MapReduce DataProc
  • 80.
    Real World BigData -- When do I use what? RDBMS 65% NoSQL 30% Hadoop 5%
  • 81.
    ETL is 75%of all Big Data Projects Surveying, cleaning and loading data is the majorityof the billable time for new Big Data projects.
  • 82.
    About this Workshop Learningfrom Cloud (AWS / GCP) Architectures 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and VisualizationServices 5. End-to-End Patterns 6. Bonus: IoT
  • 83.
  • 86.
  • 87.
    Pattern ✘How to buildoptimized cloud-based data pipelines? -- Cloud-based ETL tools and processes -- includes load-testingpatterns and securitypractices -- including connectingbetween different vendor clouds
  • 88.
    Key Questions –Ingestion and ETL ✘Volume – how much and how fast,now and future? ✘Variety – what type(s) or data, any pre-processingneeded? ✘Velocity – batches or steaming? ✘Veracity – verificationon ingest needed? new data needed?
  • 89.
    Actual Questions –Data Ingest and ETL ✘When (at what data volume) do you need AWS Snowball and/or AWS Direct Connect for ingest? ✘What does ‘near-real time streaming’ exactlymean? ✘When do you need to use a product vs. a libraryor service for ETL? ✘Which 3rd party partner products are available for the Data services you are using? ✘What are your actual data qualityissues?
  • 90.
    Serverless Is this acloud data pipeline patternnow?
  • 91.
    Serverless Time SeriesAnalysis Batch Storage BigQuery Storage Cloud Storage Time Series Processing Cloud Dataflow Analysis Cloud Datalab Storage Cloud Bigtable* Processing Cloud Dataproc Time Series Files Cloud Storage ML Cloud ML Streaming Time Series Streaming Cloud Pub/Sub *Note: Use Bigtable with NoSQL workloads of 1 TB or more
  • 92.
    DataFlow ✘ Managed ApacheBeam ✘ Elegant pipeline API ✘ Beam as a product -> GCP ✘ Scala? Java? Head-to-Head ETL – GCP DataFlow vs. GCP DataProc DataProc ✘ Managed Apache Spark ✘ Large user base ✘ Spark as a product -> DataBricks ✘ Cost compare?
  • 93.
  • 94.
    Pipeline Phases Phase 0 EvalCurrent Data - Quality & Quantity Phase 1 Get New Data - Free or Premium Phase 2 Build MVP & Forecast volume and growth Phase 3 Load test at scale Phase 4 Deploy – secure, audit and monitor
  • 95.
    Cloud Big DataVendors - ETL AWS ✘ 14X market share of next competitor ✘ Notable: Many, strong ETL Partners GCP ✘ Lean, mean and cheap ✘ Fastest player / Serverless ✘ Notable: DataFlow requires Java or Python developers
  • 96.
    How Best toIngest and ETL your Data? Complexity Scalability Developer Cost RDBMS medium medium low NoSQL medium big high Hadoop hard huge very high
  • 97.
  • 98.
    Building a StreamingPipeline Stream Interval Window
  • 99.
    Key Questions -Streaming ✘Volume – how much data now and predictedover next 12 months? ✘Variety – what types of data now and future? ✘Velocity – volume of input data / time now and near future? ✘Veracity – volume of EXISTING data now
  • 100.
    Cloud Big Data Vendors - Streaming
    AWS
    ✘ 14X market share of next competitor
    ✘ Most complete offering
    ✘ Most mature offering
    ✘ Notable: Kinesis/Firehose
    GCP
    ✘ Lean, mean and cheap
    ✘ Fastest player
    ✘ Requires top developers
    ✘ Notable: DataFlow flexible
  • 101.
    Cloud Offerings – Data and Pipelines
                                          AWS                             Google
    Managed RDBMS                         RDS, Aurora                     Spanner / Cloud SQL
    Data Warehouse                        Redshift / Athena               BigQuery
    NoSQL buckets                         S3, Glacier                     Cloud Storage, Nearline
    NoSQL Key-Value / NoSQL Wide Column   DynamoDB                        Bigtable, Cloud Datastore
    Streaming or ML                       Kinesis, AWS Machine Learning   Pub/Sub or DataFlow, Google Machine Learning
    NoSQL Document / NoSQL Graph          MongoDB on EC2, Neo4j on EC2    MongoDB on GCE, Neo4j on GCE
    Hadoop / Spark                        Elastic MapReduce               DataProc
    Cloud ETL                             Data Pipelines                  DataFlow
  • 102.
    How Best to Stream your Data?
                      Complexity       Scalability   Developer Cost
    Micro-Batches     easy             medium        low
    Stream Windows    difficult        big           high
    Near Real-time    very difficult   huge          high
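    Micro-batching sits at the easy end of this comparison because it reduces streaming to repeated small batch jobs. The sketch below groups an incoming iterator into fixed-size batches; batching by record count rather than by time, and the function name `micro_batches`, are simplifying assumptions for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size records from a stream."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:          # the stream is exhausted
            return
        yield batch


# Seven made-up records processed three at a time; the last batch is short.
batches = list(micro_batches(range(7), batch_size=3))
print(batches)
```

    Each batch can then be handed to ordinary batch code, which is where the lower developer cost comes from; the trade-off is latency of at least one batch.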
  • 103.
  • 104.
    Designing Streaming Cloud Data Pipelines
    Log Files
    Product Catalogs
    Social Games
    Social aggregators
    Line-of-Business
  • 105.
    Financial Services > Monte Carlo Simulations
    Data Processing
    Bespoke Apps: Compute Engine
    Storage/Analysis: BigQuery
    Storage: Cloud Storage
    Dataflow/Beam: Cloud Dataflow
    Visualization: Cloud Datalab
    Storage: Cloud Bigtable
    Hadoop/Spark: Cloud Dataproc
  • 106.
    Financial Services > Time Series Analysis
    Batch
    Storage: BigQuery
    Storage: Cloud Storage
    Time Series Processing: Cloud Dataflow
    Analysis: Cloud Datalab
    Storage: Cloud Bigtable
    Processing: Cloud Dataproc
    Time Series Files: Cloud Storage
    ML: Cloud ML
    Streaming
    Time Series Streaming: Cloud Pub/Sub
  • 107.
    Life Sciences > Variant Analysis
    Private Datasets
    Public Datasets
    MSSNG Autism: Cloud Storage
    Scientist
    High Throughput Genome Sequencers
    1000 Genomes: Cloud Storage
    Patient Data: Cloud Storage
    Illumina Platform: Cloud Storage
    Ref Genomes: Cloud Storage
    TCGA: Cloud Storage
    Analytics
    Online Analytics: BigQuery
    Batch Analytics: Cloud Dataflow
    Lab Notebooks: Cloud Datalab
    Data Ingest: Genomics
    BAM, FASTQ
  • 108.
    Streaming Big Data > Complex Event Processing
    Cloud Apps Compute Engine Streaming Batch Push to Devices App Engine Rules Engine Cloud Dataflow Data Analysis Cloud Datalab Mobile Devices Push Notifications Report & Share Business Analysis Cloud Apps Compute Engine On-Premises Databases On-Premises Applications Processed Events Cloud Bigtable Events Time Series Data Warehouse BigQuery Execution Results Streaming Cloud Pub/Sub Transactions Processing Cloud Dataflow Transaction Streams Messaging Cloud Pub/Sub Rules Actions ETL Cloud Dataflow Transform Data Cloud Data Cloud Storage Rules Engine Cloud Dataproc
  • 109.
    About this Workshop
    Learning from Cloud (AWS / GCP) Architectures
    1. Landscape
    2. Data Servers and Services
    3. Data Pipelines
    4. ETL and Visualization Services
    5. End-to-End Patterns
    6. Bonus: IoT
  • 110.
    4. Select ETL and Visualization
    Analytics and Presentation
  • 111.
    Pattern
    ✘ How best to Query and Visualize
    -- When to use business analytics vs. predictive analytics (machine learning)
    -- How best to present data to clients - partner visualization products or roll your own
  • 112.
    Making Sense of Data
    Machine Learning
    Reports
    Presentation
  • 113.
    Actual Questions - Query
    ✘ What query skills do the Dev / DevOps teams have?
    ✘ Do we use an ORM?
    ✘ Are transactions needed?
    ✘ Which RDBMS features are we using now? e.g. PL/SQL, indexing…
    ✘ Which language(s) do our analysts use? Excel, R, Python?
  • 114.
    Graphs
    What is the nature of your questions?
  • 115.
    Practice
    Applying Concepts – Trying out Cypher on Neo4j
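    Before opening Neo4j, it can help to see what a graph-shaped question looks like in plain code. The sketch below answers "whom do Ann's contacts know?" over an in-memory adjacency map, with a roughly equivalent Cypher query shown as a comment. The names, the `Person` label and the `KNOWS` relationship are all made up for illustration and do not come from the workshop dataset.

```python
from typing import Dict, Set

# A hypothetical directed "knows" graph: person -> people they know.
KNOWS: Dict[str, Set[str]] = {
    "Ann": {"Bob", "Cat"},
    "Bob": {"Dan"},
    "Cat": {"Dan", "Eve"},
    "Dan": set(),
    "Eve": set(),
}


def two_hops(graph: Dict[str, Set[str]], person: str) -> Set[str]:
    """Return everyone reachable in exactly two KNOWS steps from person."""
    reached: Set[str] = set()
    for friend in graph.get(person, set()):
        reached |= graph.get(friend, set())
    return reached


# Roughly equivalent Cypher, assuming a Person label and KNOWS relationship:
#   MATCH (a:Person {name: 'Ann'})-[:KNOWS]->()-[:KNOWS]->(fof)
#   RETURN DISTINCT fof.name
print(sorted(two_hops(KNOWS, "Ann")))
```

    The Cypher version states the path pattern directly, while the Python version has to spell out the traversal; that difference grows with every extra hop.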
  • 117.
    Cloud Big Data Vendors - Query
    AWS
    ✘ 5X market share of next competitor
    ✘ Most complete offering
    ✘ Most mature offering
    ✘ Notable: Big Relational
    GCP
    ✘ Lean, mean and cheap
    ✘ Fastest player
    ✘ Notable: Flexible, powerful machine learning
  • 118.
    Query Languages
    SQL
      Everyone knows it
      But how well do they know it?
    NoSQL Vendor Language
      Too many to list
      How will you learn it?
    Cypher
      Query language for graph databases
      The future?
    ORM
      Good, bad or horrible?
      Again, how well do they know it?
    Spark
      Scala or Python (functional-style)
    Machine Learning Queries
      SciPy, NumPy or Python
      R Language
      Julia Language
      Many more…
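    To compare the declarative and functional styles listed above, the sketch below computes the same per-category total twice: once in SQL (using Python's built-in `sqlite3` as a stand-in for any relational engine) and once with a functional-style fold similar in spirit to what Spark code looks like. The `sales` table and its rows are made-up sample data.

```python
import sqlite3
from functools import reduce

# Hypothetical sample data: (category, amount).
rows = [("books", 12.0), ("games", 30.0), ("books", 8.0),
        ("games", 5.0), ("music", 7.5)]

# Declarative style: say WHAT you want and let the engine plan it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(conn.execute(
    "SELECT category, SUM(amount) FROM sales GROUP BY category"
).fetchall())
conn.close()


# Functional style: fold each row into an accumulator without mutating it.
def add_row(totals, row):
    category, amount = row
    return {**totals, category: totals.get(category, 0.0) + amount}


functional_totals = reduce(add_row, rows, {})

print(sql_totals)
print(functional_totals)
```

    Both produce the same totals; the question from the slide remains which style the team can read, write and debug with confidence.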
  • 119.
    Practice
    Applying Concepts – Understanding Query Language Costs
  • 120.
    How Best to Query your Data?
              Business Analytics   Predictive Analytics   Developer Cost
    RDBMS
    NoSQL
    Hadoop
  • 121.
    How Best to Query your Data?
              Business Analytics   Predictive Analytics   Developer Cost
    RDBMS     easy                 medium                 low
    NoSQL     hard                 very hard              very high
    Hadoop    hard                 hard                   very high
  • 122.
    Machine Learning aka Predictive Analytics
    AWS
    ✘ ML for developers
    ✘ GUI-based – regression/classification only
    ✘ MXNet for deep neural networks, works best with GPUs
    GCP
    ✘ ML for ML developers
    ✘ Scaled or specialized ML models
    ✘ Many model types – vision, text, translate, speech, jobs, video
    ✘ TensorFlow for neural networks, can use GPUs
  • 123.
    Presentation
    If you can’t see it, it’s not worth it.
  • 124.
    Innovation in Data Visualization
    Dashboards
    ✘ More than KPIs
    ✘ Mobile
    ✘ Alerts
    ✘ Data Stories
    Reports
    ✘ Level of Detail
    ✘ Meaningful Taxonomies
    ✘ Fast enough
    ✘ Drill for Data
  • 125.
    D3
    The language of Data Visualization
  • 127.
    Cloud Big Data Vendors - Visualization
    AWS
    ✘ Most complete offering
    ✘ Notable: Partners & QuickSight
    GCP
    ✘ BigQuery Partners
    ✘ Notable: 360 product - New Dashboards, iPython-style notebooks
  • 128.
    About this Workshop
    Learning from Cloud (AWS / GCP) Architectures
    1. Landscape
    2. Data Servers and Services
    3. Data Pipelines
    4. ETL and Visualization Services
    5. End-to-End Patterns
    6. Bonus: IoT
  • 129.
  • 130.
    Big Data > Time Series Analysis
    Batch
    Storage: BigQuery
    Storage: Cloud Storage
    Time Series Processing: Cloud Dataflow
    Analysis: Cloud Datalab
    Storage: Cloud Bigtable*
    Processing: Cloud Dataproc
    Time Series Files: Cloud Storage
    ML: Cloud ML
    Streaming
    Time Series Streaming: Cloud Pub/Sub
    *Note: Use Bigtable with NoSQL workloads of 1 TB or more
  • 131.
    Big Data > Bioinformatics
    *Note: Use Bigtable with NoSQL workloads of 1 TB or more
  • 132.
  • 133.
    About this Workshop
    Learning from Cloud (AWS / GCP) Architectures
    1. Landscape
    2. Data Servers and Services
    3. Data Pipelines
    4. ETL and Visualization Services
    5. End-to-End Patterns
    6. Bonus: IoT
  • 134.
  • 135.
    Data Generation Device
  • 136.
  • 137.
    $235,000,000,000
    The IoT Market
    By the year 2017
    20 Billion devices
    And a lot of users
  • 138.
  • 139.
    Internet of Things > Sensor stream ingest and processing
    Stages: Ingest, Pipelines, Storage, Analytics, Application & Presentation
    Devices: Standard Devices (HTTPS), Constrained Devices (Non-TCP, e.g. BLE), Gateway
    Services: App Engine, Container Engine, Cloud Storage, Cloud Pub/Sub, Cloud Dataflow, Monitoring, Logging, Cloud Dataflow, Cloud Datastore, Cloud Bigtable, BigQuery, Cloud Dataproc, Cloud Datalab, Compute Engine
  • 140.
  • 142.
    Cloud Big Data Vendors - IoT
    AWS
    ✘ First to market
    ✘ Most complete offering
    ✘ Most mature offering
    ✘ Notable: MQTT support, AWS IoT Rules
    GCP
    ✘ Still in Beta / Android Things
    ✘ Fastest player
    ✘ Requires top developers
    ✘ Notable: Android Things, Weave, Nest
  • 143.
    “The Future is Functional”
    @LynnLangit
  • 144.