2. About this Workshop
Learning from Cloud (AWS / GCP) Architectures
1. Landscape
2. Data Servers and Services
3. Data Pipelines
4. ETL and Visualization Services
5. End-to-End Patterns
6. Bonus: IoT
15. Finding a path -- the ACTUAL Cost of…
✘ Saving all Data into a Data Lake
✘ Using newer database technologies
✘ Going beyond Relational
27. Cloud Vendors – Files as Data Lake
AWS
✘ S3/Glacier
✘ Most complete offering
✘ Most mature offering
✘ Notable: S3 management
GCP
✘ GCS with 4 types
✘ Lean, mean and cheap
✘ Unified API
✘ Notable: Simple pricing
38. Head-to-Head: Relational Database – AWS Aurora vs. GCP Spanner
Aurora
✘ Mature
✘ ANSI SQL queries
✘ MySQL compatible
✘ 3rd party vendor integration (ETL, viz)
Spanner
✘ New (beta)
✘ ANSI SQL queries
✘ CAP – auto-scales, transactionally consistent, auto-partitions
✘ Cost compare?
39. Cloud Vendors
RDBMS Database (Operational or Analytics)
AWS
✘ 14X market share of next competitor
✘ Most complete offering
✘ Most mature offering - partners
✘ Notable: Aurora, Big Relational RDBMS
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: BigQuery, MPP SQL Query as a Service
49. Key Questions - Storage
✘ Volume – how much now, what growth rate?
✘ Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc.
✘ Velocity – batches, streams, both, what ingest rate?
✘ Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?
50. Actual Questions - NoSQL
✘ When do you use Redis for caching?
✘ When do you use MongoDB or AWS DynamoDB for JSON data?
✘ When do you use Neo4j for graph-type data?
✘ Why are you using something else, e.g. Cassandra, Riak, etc.?
51. NoSQL Example
✘ Open Source is Free
§ Rapid iteration, innovation
§ Can start up for free (on premise)
§ Can ‘rent’ for cheap or free on the cloud
§ Can use with the command line for free
§ Some vendors offer free online training
§ Ex. www.neo4j.org
✘ Not Free
§ Constant releases
§ Can be deceptively hard to set up (time is money)
§ Don’t forget to turn it off if on the cloud!
§ GUI tools, support, training cost $$$
§ Ex. www.neo4j.com
53. NoSQL Applied
Log Files
• ???
Product Catalogs
• ???
Social Games
• ???
Social aggregators
• ???
Line-of-Business
• ???
54. NoSQL Applied
Log Files
• Columnstore
• HBase
Product Catalogs
• Key/Value
• Redis
Social Games
• Document
• MongoDB
Social aggregators
• Graph
• Neo4j
Line-of-Business
• RDBMS
• SQL Server
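As a minimal sketch, the slide's mapping can be written down as a lookup table. The names `STORE_BY_WORKLOAD` and `suggest_store` are hypothetical, and the pairs are only the slide's rules of thumb, not a sizing or selection tool.

```python
from typing import Tuple

# Workload -> (store type, example product), copied from the slide above.
# The dict and function names are illustrative assumptions.
STORE_BY_WORKLOAD = {
    "log files": ("Columnstore", "HBase"),
    "product catalogs": ("Key/Value", "Redis"),
    "social games": ("Document", "MongoDB"),
    "social aggregators": ("Graph", "Neo4j"),
    "line-of-business": ("RDBMS", "SQL Server"),
}


def suggest_store(workload: str) -> Tuple[str, str]:
    """Return (store type, example product) for a workload named on the slide."""
    key = workload.strip().lower()
    if key not in STORE_BY_WORKLOAD:
        raise KeyError(f"no suggestion on the slide for workload: {workload!r}")
    return STORE_BY_WORKLOAD[key]


print(suggest_store("Log Files"))
print(suggest_store("Line-of-Business"))
```

A workload that is not on the slide raises `KeyError` rather than guessing, which mirrors the earlier question: why are you using something else?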
55. More than NoSQL
NoSQL
✘ Non-relational
✘ Can be optimized in-memory
✘ Eventually consistent
✘ Schema on Read
✘ Example: Aerospike
NewSQL
✘ Relational plus more
✘ Often in-memory
✘ Some kind of SQL-layer
✘ Schema on Write
✘ Example: MemSQL
U-SQL
✘ What???
✘ Microsoft’s universal SQL language
✘ Example: Azure Data Lake
57. How Best to Store your Data?
                  Complexity    Scalability   Developer Cost
RDBMS             easy          medium*       low
NoSQL             medium        big           high
*GCP Cloud Spanner is the exception here
77. Hadoop is becoming Spark…
Volume
✘ 10 TB or greater to start
✘ Growth of 25% YOY
✘ Where FROM
✘ Where TO
Velocity and Variety
✘ Spark over Hive
✘ Kafka and Samza
Veracity
✘ Pay, train and hire team
✘ Top $$$ for talent, IF you can find it
✘ WATCH OUT for Cloud Vendors who promise ‘easy access’
✘ Complexity of ecosystem
✘ DataBricks knows best for Spark
✘ Cloudera knows best for on premise
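The Volume figures above compound quickly. A minimal sketch of the arithmetic, using the slide's 10 TB starting point and 25% year-over-year growth; the five-year horizon and the function name `project_volume_tb` are assumptions for illustration.

```python
def project_volume_tb(start_tb: float, yoy_growth: float, years: int) -> list:
    """Compound a starting volume forward; returns the volume at year 0..years."""
    return [start_tb * (1 + yoy_growth) ** year for year in range(years + 1)]


# 10 TB and 25% YOY come from the slide; 5 years is an assumed horizon.
volumes = project_volume_tb(10.0, 0.25, 5)
for year, tb in enumerate(volumes):
    print(f"year {year}: {tb:.1f} TB")
```

At 25% YOY the volume roughly triples in five years, so the cluster sized for the starting volume is not the cluster you will be paying for later.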
78. How Best to Store your Data?
                  Complexity    Scalability   Developer Cost
RDBMS             easy          medium*       low
NoSQL             medium        big           high
Hadoop            hard          huge          very high
79. Cloud Offerings – Big Data
                                    AWS                             Google
Managed RDBMS                       RDS Aurora                      Spanner, Cloud SQL
Data Warehouse                      Redshift or Athena              BigQuery
NoSQL buckets                       S3, Glacier                     Cloud Storage, Nearline
NoSQL Key-Value, NoSQL Wide Column  DynamoDB                        Bigtable, Cloud Datastore
NoSQL Document, NoSQL Graph         MongoDB on EC2, Neo4j on EC2    MongoDB on GCE, Neo4j on GCE
Hadoop / Spark                      Elastic MapReduce               DataProc
80. Real World Big Data -- When do I use what?
RDBMS: 65%
NoSQL: 30%
Hadoop: 5%
81. ETL is 75% of all Big Data Projects
Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
87. Pattern
✘ How to build optimized cloud-based data pipelines?
-- Cloud-based ETL tools and processes
-- includes load-testing patterns and security practices
-- including connecting between different vendor clouds
88. Key Questions – Ingestion and ETL
✘ Volume – how much and how fast, now and future?
✘ Variety – what type(s) of data, any pre-processing needed?
✘ Velocity – batches or streaming?
✘ Veracity – verification on ingest needed? new data needed?
89. Actual Questions – Data Ingest and ETL
✘ When (at what data volume) do you need AWS Snowball and/or AWS Direct Connect for ingest?
✘ What does ‘near-real time streaming’ exactly mean?
✘ When do you need to use a product vs. a library or service for ETL?
✘ Which 3rd party partner products are available for the Data services you are using?
✘ What are your actual data quality issues?
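For the first question, a back-of-envelope transfer-time estimate is a useful starting point. This is a minimal sketch: the 100 TB volume, the link speeds, and the 80% sustained utilization are all assumptions, and `transfer_days` is a hypothetical helper. It does not encode any vendor threshold for when a physical transfer appliance becomes worthwhile.

```python
def transfer_days(data_tb: float, link_mbps: float, utilization: float = 0.8) -> float:
    """Days needed to move data_tb (decimal terabytes) over a link_mbps link."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    bits = data_tb * 1e12 * 8                             # terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * utilization)      # megabits/s -> bits/s
    return seconds / 86_400                               # seconds -> days


# Assumed scenario: 100 TB to ingest over three candidate link speeds.
for mbps in (100, 1_000, 10_000):
    print(f"100 TB over {mbps} Mbps: {transfer_days(100, mbps):.1f} days")
```

Comparing the result with the project deadline gives a first answer on whether the network alone is enough or whether a dedicated link or a shipped device should be priced out.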
91. Serverless Time Series Analysis
[Architecture diagram; lanes: Batch, Streaming. Components as label (service):]
Storage (BigQuery)
Storage (Cloud Storage)
Time Series Processing (Cloud Dataflow)
Analysis (Cloud Datalab)
Storage (Cloud Bigtable*)
Processing (Cloud Dataproc)
Time Series Files (Cloud Storage)
ML (Cloud ML)
Time Series Streaming (Cloud Pub/Sub)
*Note: Use Bigtable with NoSQL workloads of 1 TB or more
92. Head-to-Head: ETL – GCP DataFlow vs. GCP DataProc
DataFlow
✘ Managed Apache Beam
✘ Elegant pipeline API
✘ Beam as a product -> GCP
✘ Scala? Java?
DataProc
✘ Managed Apache Spark
✘ Large user base
✘ Spark as a product -> DataBricks
✘ Cost compare?
94. Pipeline Phases
Phase 0: Eval Current Data - Quality & Quantity
Phase 1: Get New Data - Free or Premium
Phase 2: Build MVP & Forecast volume and growth
Phase 3: Load test at scale
Phase 4: Deploy – secure, audit and monitor
95. Cloud Big Data Vendors - ETL
AWS
✘ 14X market share of next competitor
✘ Notable: Many, strong ETL Partners
GCP
✘ Lean, mean and cheap
✘ Fastest player / Serverless
✘ Notable: DataFlow requires Java or Python developers
96. How Best to Ingest and ETL your Data?
                  Complexity    Scalability   Developer Cost
RDBMS             medium        medium        low
NoSQL             medium        big           high
Hadoop            hard          huge          very high
99. Key Questions - Streaming
✘ Volume – how much data now and predicted over next 12 months?
✘ Variety – what types of data now and future?
✘ Velocity – volume of input data / time now and near future?
✘ Veracity – volume of EXISTING data now
100. Cloud Big Data Vendors - Streaming
AWS
✘ 14X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Kinesis/Firehose
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: DataFlow flexible
101. Cloud Offerings – Data and Pipelines
                                    AWS                             Google
Managed RDBMS                       RDS Aurora                      Spanner / Cloud SQL
Data Warehouse                      Redshift / Athena               BigQuery
NoSQL buckets                       S3, Glacier                     Cloud Storage, Nearline
NoSQL Key-Value, NoSQL Wide Column  DynamoDB                        Bigtable, Cloud Datastore
Streaming or ML                     Kinesis, AWS Machine Learning   Pub/Sub or DataFlow, Google Machine Learning
NoSQL Document, NoSQL Graph         MongoDB on EC2, Neo4j on EC2    MongoDB on GCE, Neo4j on GCE
Hadoop / Spark                      Elastic MapReduce               DataProc
Cloud ETL                           Data Pipelines                  DataFlow
102. How Best to Stream your Data?
                  Complexity      Scalability     Developer Cost
Micro-Batches     easy            medium          low
Stream Windows    difficult       big             high
Near Real-time    very difficult  huge            high
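One way to see why the table rates stream windows as harder than micro-batches: a micro-batch only slices events by arrival order, while a window groups them by event time and therefore has to cope with out-of-order arrivals. The sketch below is plain in-memory Python with made-up events and hypothetical function names. It is not the API of any streaming service, and it ignores watermarks and late-data policies that real engines need.

```python
from collections import defaultdict


def micro_batches(events, batch_size):
    """Group events by arrival order into fixed-size batches."""
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]


def tumbling_windows(events, window_seconds):
    """Group (event_time, value) pairs into fixed, non-overlapping event-time windows."""
    windows = defaultdict(list)
    for event_time, value in events:
        window_start = (event_time // window_seconds) * window_seconds
        windows[window_start].append(value)
    return dict(sorted(windows.items()))


# Hypothetical (event_time_seconds, value) pairs; the event at t=4 arrives late.
events = [(1, 10), (7, 20), (12, 30), (4, 40), (15, 50)]

batches = micro_batches(events, 2)
windows = tumbling_windows(events, 10)
print(batches)
print(windows)
```

In the micro-batch output the late event lands in whichever batch was open when it arrived; in the windowed output it is placed in the window its timestamp belongs to, which is only possible here because all events are already in memory.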
104. Designing Streaming Cloud Data Pipelines
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
105. Financial Services > Monte Carlo Simulations
[Architecture diagram; lane: Data Processing. Components as label (service):]
Bespoke Apps (Compute Engine)
Storage/Analysis (BigQuery)
Storage (Cloud Storage)
Dataflow/Beam (Cloud Dataflow)
Visualization (Cloud Datalab)
Storage (Cloud Bigtable)
Hadoop/Spark (Cloud Dataproc)
106. Financial Services > Time Series Analysis
[Architecture diagram; lanes: Batch, Streaming. Components as label (service):]
Storage (BigQuery)
Storage (Cloud Storage)
Time Series Processing (Cloud Dataflow)
Analysis (Cloud Datalab)
Storage (Cloud Bigtable)
Processing (Cloud Dataproc)
Time Series Files (Cloud Storage)
ML (Cloud ML)
Time Series Streaming (Cloud Pub/Sub)
108. Big Data > Complex Event Processing
[Architecture diagram; lanes: Streaming, Batch. Components as label (service):]
Cloud Apps (Compute Engine)
Push to Devices (App Engine)
Rules Engine (Cloud Dataflow)
Data Analysis (Cloud Datalab)
Processed Events (Cloud Bigtable)
Data Warehouse (BigQuery)
Streaming (Cloud Pub/Sub)
Processing (Cloud Dataflow)
Messaging (Cloud Pub/Sub)
ETL (Cloud Dataflow)
Cloud Data (Cloud Storage)
Rules Engine (Cloud Dataproc)
Other labels: Mobile Devices, Push Notifications, Report & Share, Business Analysis, On-Premises Databases, On-Premises Applications, Events Time Series, Execution Results, Transactions, Transaction Streams, Rules Actions, Transform Data
111. Pattern
✘ How best to Query and Visualize
-- When to use business analytics vs. predictive analytics (machine learning)
-- how best to present data to clients - partner visualization products or roll your own
113. Actual Questions - Query
✘ What query skills do the Dev / DevOps teams have?
✘ Do we use an ORM?
✘ Are transactions needed?
✘ Which RDBMS features are we using now? e.g. PL/SQL, indexing…
✘ Which language(s) do our analysts use? Excel, R, Python?
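To make "Are transactions needed?" concrete, here is a minimal sketch of what a transaction buys you: two updates that either both happen or neither does. It uses Python's built-in `sqlite3` with an in-memory database; the `accounts` table, the amounts, and the `transfer` helper are assumptions for illustration only.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount between accounts atomically; undo both updates on overdraft."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)        # both updates are kept
failed = transfer(conn, 1, 2, 500)   # both updates are rolled back
balances = dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
print(ok, failed, balances)
```

If the application relies on this kind of all-or-nothing behavior today, that is a requirement to check before moving the workload to an eventually consistent store.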
117. Cloud Big Data Vendors - Query
AWS
✘ 5X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Big Relational
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: Flexible, powerful machine learning
118. Query Languages
SQL
Everyone knows it
But how well do they know it?
NoSQL Vendor Language
Too many to list
How will you learn it?
Cypher
Query language for graph databases
The future?
ORM
Good, bad or horrible?
Again, how well do they know it?
Spark
Scala or Python (functional-style)
Machine Learning Queries
SciPy, NumPy or Python
R Language
Julia Language
Many more…
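As a small illustration of the gap between the SQL and functional-style entries above, the same aggregation can be written both ways. This is a sketch under stated assumptions: the `sales` rows are made up, SQLite stands in for any SQL engine, and plain Python `reduce` stands in for the functional style. It is not Spark code.

```python
import sqlite3
from functools import reduce

# Hypothetical (region, amount) rows.
rows = [("east", 120), ("west", 80), ("east", 30), ("west", 45), ("north", 10)]

# SQL: declare the result and let the engine decide how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)


# Functional style: fold each row into an accumulator of per-region totals.
def add_row(totals, row):
    region, amount = row
    return {**totals, region: totals.get(region, 0) + amount}


functional_totals = reduce(add_row, rows, {})

print(sql_totals)
print(functional_totals)
```

Both produce the same totals; the question for a team is which of the two styles its developers and analysts can read, debug, and tune.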
120. How Best to Query your Data?
          Business Analytics    Predictive Analytics    Developer Cost
RDBMS
NoSQL
Hadoop
121. How Best to Query your Data?
          Business Analytics    Predictive Analytics    Developer Cost
RDBMS     easy                  medium                  low
NoSQL     hard                  very hard               very high
Hadoop    hard                  hard                    very high
122. Machine Learning aka Predictive Analytics
AWS
✘ ML for developers
✘ GUI-based – regression/classification only
✘ MXNet for deep neural networks, works best with GPUs
GCP
✘ ML for ML developers
✘ Scaled or specialized ML models
✘ Many model types – vision, text, translate, speech, jobs, video
✘ TensorFlow for neural networks, can use GPUs
124. Innovation in Data Visualization
Dashboards
✘ More than KPIs
✘ Mobile
✘ Alerts
✘ Data Stories
Reports
✘ Level of Detail
✘ Meaningful Taxonomies
✘ Fast enough
✘ Drill for Data
127. Cloud Big Data Vendors - Visualization
AWS
✘ Most complete offering
✘ Notable: Partners & QuickSight
GCP
✘ BigQuery Partners
✘ Notable: 360 product - New Dashboards, iPython-style notebooks
130. Big Data > Time Series Analysis
[Architecture diagram; lanes: Batch, Streaming. Components as label (service):]
Storage (BigQuery)
Storage (Cloud Storage)
Time Series Processing (Cloud Dataflow)
Analysis (Cloud Datalab)
Storage (Cloud Bigtable*)
Processing (Cloud Dataproc)
Time Series Files (Cloud Storage)
ML (Cloud ML)
Time Series Streaming (Cloud Pub/Sub)
*Note: Use Bigtable with NoSQL workloads of 1 TB or more
131. Big Data > Bioinformatics
*Note: Use Bigtable with NoSQL workloads of 1 TB or more
142. Cloud Big Data Vendors - IoT
AWS
✘ First to market
✘ Most complete offering
✘ Most mature offering
✘ Notable: MQTT support, AWS IoT Rules
GCP
✘ Still in Beta / Android Things
✘ Fastest player
✘ Requires top developers
✘ Notable: Android Things, Weave, Nest