Beyond Relational:
Applying Big Data Cloud Patterns
Lynn Langit
QCon London – March 2017
About this Workshop
Learning from Cloud (AWS / GCP) Architectures
1. Landscape
2. Data Servers and Services
3. Data Pipelines
4. ETL and Visualization Services
5. End-to-End Patterns
6. Bonus: IoT
1.
Understand the Landscape
History
‘Old’ Big Data
Save
ALL
of your Data
(in a Data Lake)
But NOT in HDFS…sorry Cloudera
What is a Cloud-based Data Lake Architecture?
AWS Data Lake Architecture
Disruptor: Data Lake on GCP
✘Well-priced, Highly Available, Auto-scaled
✘Simple, One API, NoOps
✘SQL ‘over’ it via Query Services
Practice
Applying Concepts – Data Lake Features (AWS or GCP)
“Finding a path -- the ACTUAL
Cost of…
✘ Saving all Data into a Data Lake
✘ Using newer database technologies
✘ Going beyond Relational
2.
Evaluate Cloud Data Servers & Services
Choice…
is good,
right?
Past -- Workloads and Terms
Workload | Example Products | Other Names
Operational Database | MySQL, SQL Server, Oracle | OLTP
Analytic Database | SSAS, Cognos, SAS | OLAP
NoSQL Database | MongoDB, Redis, Cassandra, Neo4j | Document, K/V, Graph, Column-store
Hadoop | Cloudera, Hortonworks | Big Data
???
The fastest growing cloud-based Big Data products are…
Relational
“When do I use…?
✘Hadoop / Spark
✘NoSQL
✘Big Relational
“Existing relational databases
to the cloud?
Operational, Data Warehouse
Prerequisite – move / store files in the cloud – prefer Data Lake to HDFS
Cloud Vendors – Files as Data Lake
AWS
✘ S3/Glacier
✘ Most complete offering
✘ Most mature offering
✘ Notable: S3 management
GCP
✘ GCS with 4 types
✘ Lean, mean and cheap
✘ Unified API
✘ Notable: Simple pricing
Serverless
Do you need to manage servers?
AWS Console
20 Data services
GCP Console
11 Data Services
GCP Serverless Data Warehouse
Reference table
Query / Compute
BigQuery
Customer Lists / Reference Data
Export Ad Data
Cloud Storage
Id matching
Cloud Dataflow
Marketing List
DoubleClick
Campaign Manager
Google Analytics
Relevant Users
Cloud Storage
Analysts
DataStudio 360
Dashboards
BigQuery
✘ Mature – based on Dremel (Google)
✘ Import (batch or stream) from GCS++
✘ ANSI SQL queries
✘ Interactive or batch pricing
✘ NoOps / Serverless
Head-to-Head
Data Warehouse -- GCP BigQuery vs. AWS Redshift
Redshift
✘ Mature – big market share
✘ Big relational – ANSI SQL query
✘ PostgreSQL dialect
✘ Cost compare?
✘ Manual partition/scale up
✘ Huge partner ecosystem (ETL, viz)
Disruptor:
AWS Athena
✘Well-priced
✘Highly Available
✘Auto-scaled
✘Simple
✘NoOps
✘Analytics
✘NoSQL
Practice
Applying Concepts – Serverless SQL Queries
(AWS Athena or GCP BigQuery)
BigQuery
✘ Mature – based on Dremel (Google)
✘ Import (batch or stream) from GCS++
✘ ANSI SQL queries
✘ Interactive or batch pricing (streaming)
✘ 3rd party vendor integration (ETL, viz)
Head-to-Head
SQL Query -- GCP BigQuery vs. AWS Athena
Athena
✘ New – based on Presto (open source, from Facebook)
✘ SQL query against S3 data
✘ Cost compare - $ 5/TB scanned
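Worked example for the $5/TB figure: a minimal sketch that turns bytes scanned into an estimated query cost. The function name and the sample sizes are hypothetical, 1 TB is assumed to be 10**12 bytes, and billing details such as minimum charges or rounding are not modeled.

```python
# Price per TB scanned as quoted on the slide; check current pricing before relying on it.
PRICE_PER_TB_SCANNED_USD = 5.0


def scan_cost_usd(bytes_scanned: int,
                  price_per_tb: float = PRICE_PER_TB_SCANNED_USD) -> float:
    """Estimate query cost from bytes scanned (assumes 1 TB = 10**12 bytes)."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / 10**12 * price_per_tb


if __name__ == "__main__":
    full_scan = 2 * 10**12      # hypothetical: query reads a whole 2 TB data set
    pruned_scan = 200 * 10**9   # hypothetical: same query reading 200 GB after pruning
    print(f"full scan:   ${scan_cost_usd(full_scan):.2f}")
    print(f"pruned scan: ${scan_cost_usd(pruned_scan):.2f}")
```

Under this simple model, cost falls in direct proportion to the data scanned, which is why file layout matters for pay-per-scan query services.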
What is Spanner?
Disruptor:
GCP Cloud Spanner
✘Well-priced
✘Highly Available
✘Auto-scaled
✘Transactional
✘NoOps
✘Operational
✘RDBMS
Aurora
✘ Mature
✘ ANSI SQL queries
✘ MySQL compliant
✘ 3rd party vendor integration (ETL, viz)
Head-to-Head
Relational Database – AWS Aurora vs. GCP Spanner
Spanner
✘ New (beta)
✘ ANSI SQL queries
✘ CAP – auto scales, transactional
consistent, auto partitions
✘ Cost compare?
Cloud Vendors
RDBMS Database (Operational or Analytics)
AWS
✘ 14X market share of next
competitor
✘ Most complete offering
✘ Most mature offering - partners
✘ Notable: Aurora, Big Relational
RDBMS
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: BigQuery, MPP
SQL Query as a Service
Cloud Offerings – RDBMS and Files
Service | AWS | Google
Managed RDBMS | RDS Aurora, RDS <Other> | Spanner, Cloud SQL
Data Warehouse | Redshift, Athena* | BigQuery*
NoSQL buckets | S3*, Glacier* | Cloud Storage (Multi-regional, Regional)*, Nearline, Coldline*
*Serverless services
Practice
Applying Concepts – Big Relational
Reasons to use Big Relational Cloud Services
Developers DevOps Cloud Vendors – AWS
Developers DevOps Cloud Vendors – GCP
Reasons to use Big Relational Cloud Services
Developers
✘ Most know RDBMS query patterns
✘ Many know basic administration
DevOps
✘ Most know RDBMS administration
✘ Many know basic RDBMS queries
✘ Many know query optimization
Cloud Vendors - AWS
✘ Aurora – RDBMS up to 64 TB
✘ Redshift - $ 1k USD / 1 TB / year
✘ Rich partner ecosystem – ETL
✘ Integration with AWS products
Developers
✘ Most know coding language patterns
to interact with RDBMS systems
DevOps
✘ Familiar RDBMS security patterns
✘ Familiar auditing
✘ Partner tooling integration
Cloud Vendors - GCP
✘ Spanner / Big Query – familiar SQL queries
✘ No hassle streaming ingest
✘ No hassle pay-as-you-go
✘ NoOps administration
My past top Big Data Cloud Services
“When do I use…?
✘Hadoop
✘NoSQL
✘Big Relational
225 NoSQL Database Types to Choose From
Let’s review some NoSQL concepts
Key-Value
Redis, Riak, Aerospike
Graph
Neo4j
Document
MongoDB
Wide-Column
Cassandra, HBase
“
Key Questions - Storage
✘Volume – how much now, what growth rate?
✘Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc…
✘Velocity – batches, streams, both, what ingest rate?
✘Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?
Actual Questions - NoSQL
✘When do you use Redis for caching?
✘When do you use MongoDB or AWS DynamoDB for JSON data?
✘When do you use Neo4j for graph-type data?
✘Why are you using something else, i.e. Cassandra, Riak, etc..?
✘Open Source is Free
§ Rapid iteration, innovation
§ Can start up for free (on premise)
§ Can ‘rent’ for cheap or free on the cloud
§ Can use with the command line for free
§ Some vendors offer free online training
§ Ex. www.neo4j.org
✘Not Free
§ Constant releases
§ Can be deceptively hard to set up (time is money)
§ Don’t forget to turn it off if on the cloud!
§ GUI tools, support, training cost $$$
§ Ex. www.neo4j.com
NoSQL Example
Practice
Applying Concepts - NoSQL
NoSQL Applied
Log Files
• ???
Product Catalogs
• ???
Social Games
• ???
Social aggregators
• ???
Line-of-Business
• ???
NoSQL Applied
Log Files
• Columnstore
• HBase
Product Catalogs
• Key/Value
• Redis
Social Games
• Document
• MongoDB
Social aggregators
• Graph
• Neo4j
Line-of-Business
• RDBMS
• SQL Server
More than NoSQL
NoSQL
✘ Non-relational
✘ Can be optimized in-memory
✘ Eventually consistent
✘ Schema on Read
✘ Example: Aerospike
NewSQL
✘ Relational plus more
✘ Often in-memory
✘ Some kind of SQL-layer
✘ Schema on Write
✘ Example: MemSQL
U-SQL
✘ What???
✘ Microsoft’s universal SQL
language
✘ Example: Azure Data Lake
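Schema on Read vs. Schema on Write, sketched in plain Python. This is an illustration only: the two-field schema, the function names and the defaulting rules are assumptions made for the example, not the behavior of any specific product named above.

```python
import json

# Hypothetical schema used only for this illustration.
REQUIRED_FIELDS = {"id": int, "name": str}


def write_with_schema(store: list, record: dict) -> None:
    """Schema on write: validate first, reject records that do not fit."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} must be {expected_type.__name__}")
    store.append(record)


def read_with_schema(raw_lines: list) -> list:
    """Schema on read: raw JSON was stored as-is; shape it at query time."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        rows.append({
            "id": int(doc.get("id", -1)),      # coerce, default when missing
            "name": str(doc.get("name", "")),
        })
    return rows


if __name__ == "__main__":
    table = []
    write_with_schema(table, {"id": 1, "name": "alpha"})
    print(table)

    raw = ['{"id": "2", "name": "beta"}', '{"name": "gamma", "extra": true}']
    print(read_with_schema(raw))
```

The trade-off shown here: schema on write pays the validation cost at ingest, schema on read accepts anything and pushes interpretation (and surprises) to query time.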
Focus
How Best to Store your Data?
Storage | Complexity | Scalability | Developer Cost
RDBMS | easy | medium* | low
NoSQL | medium | big | high
*GCP Cloud Spanner is the exception here
Practice
Applying Concepts – Real Cost of Storage Types
Cloud NoSQL Applied – AWS
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
Cloud NoSQL Applied – AWS
Log Files
• Stream or Hadoop
• Kinesis or EMR
Product Catalogs
• Key/Value
• DynamoDB
Social Games
• Document
• MongoDB
Social aggregators
• Graph
• Neo4j
Line-of-Business
• RDBMS
• RDS
Cloud NoSQL Applied – GCP
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
Cloud NoSQL Applied – GCP
Log Files
• Pub/Sub
• BigTable
Product Catalogs
• Key/Value
• DataStore
Social Games
• Document
• Spanner
Social aggregators
• Graph
• Neo4j
Line-of-Business
• RDBMS
• Spanner
Cloud Offerings – Big Relational and NoSQL
Service | AWS | Google
Managed RDBMS | RDS Aurora | Spanner, Cloud SQL
Data Warehouse | Redshift or Athena | BigQuery
NoSQL buckets | S3, Glacier | Cloud Storage, Nearline
NoSQL Key-Value, NoSQL Wide Column | DynamoDB | Big Table, Cloud Datastore
NoSQL Document, NoSQL Graph | MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE
“When do I use…?
✘Hadoop / Spark
✘NoSQL
✘Big Relational
Size Matters
One Vendor’s View
Where is Hadoop / Spark Used?
“What is Hadoop Spark?
✘ In-memory
✘ Distributed
✘ Integrated with other libraries
Spark is Fast, Unified and Usable
Disruptor: DataBricks
✘Managed Apache Spark on AWS
Practice
Applying Concepts – Apache Spark via Databricks
Hadoop is becoming Spark…
Volume
✘10 TB or greater to start
✘Growth of 25% YOY
✘Where FROM
✘Where TO
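A quick way to size the Volume bullets: compound the starting volume by the yearly growth rate. The 10 TB and 25% figures come from the slide; the function names, the 3-year horizon and the 20 TB target are illustrative assumptions.

```python
def projected_volume_tb(start_tb: float, yoy_growth: float, years: int) -> float:
    """Volume after `years` of compound growth at rate `yoy_growth` (0.25 = 25%)."""
    return start_tb * (1.0 + yoy_growth) ** years


def years_to_reach(start_tb: float, target_tb: float, yoy_growth: float) -> int:
    """Whole years of compounding until the volume first reaches the target."""
    if yoy_growth <= 0:
        raise ValueError("yoy_growth must be positive")
    years = 0
    volume = start_tb
    while volume < target_tb:
        volume *= 1.0 + yoy_growth
        years += 1
    return years


if __name__ == "__main__":
    print(f"after 3 years: {projected_volume_tb(10, 0.25, 3):.2f} TB")
    print(f"years until 20 TB: {years_to_reach(10, 20, 0.25)}")
```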
Velocity and Variety
✘Spark over Hive
✘Kafka and Samza
Veracity
✘Pay, train and hire team
✘Top $$$ for talent, IF you can find it
✘WATCH OUT for Cloud Vendors
who promise ‘easy access’
✘Complexity of ecosystem
✘DataBricks knows best for Spark
✘Cloudera knows best for on premise
How Best to Store your Data?
Storage | Complexity | Scalability | Developer Cost
RDBMS | easy | medium* | low
NoSQL | medium | big | high
Hadoop | hard | huge | very high
Cloud Offerings – Big Data
Service | AWS | Google
Managed RDBMS | RDS Aurora | Spanner, Cloud SQL
Data Warehouse | Redshift or Athena | BigQuery
NoSQL buckets | S3, Glacier | Cloud Storage, Nearline
NoSQL Key-Value, NoSQL Wide Column | DynamoDB | Big Table, Cloud Datastore
NoSQL Document, NoSQL Graph | MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE
Hadoop / Spark | Elastic MapReduce | DataProc
Real World Big Data -- When do I use what?
RDBMS
65%
NoSQL
30%
Hadoop
5%
ETL is 75% of all Big Data Projects
Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
3.
Construct Data Pipelines
Build vs. Buy
Kappa Architecture on the Cloud
Pattern
✘How to build optimized cloud-based data pipelines?
-- Cloud-based ETL tools and processes
-- includes load-testing patterns and security practices
-- including connecting between different vendor clouds
Key Questions – Ingestion and ETL
✘Volume – how much and how fast, now and future?
✘Variety – what type(s) of data, any pre-processing needed?
✘Velocity – batches or streaming?
✘Veracity – verification on ingest needed? new data needed?
Actual Questions – Data Ingest and ETL
✘When (at what data volume) do you need AWS Snowball and/or AWS Direct Connect for ingest?
✘What does ‘near-real time streaming’ exactly mean?
✘When do you need to use a product vs. a library or service for ETL?
✘Which 3rd party partner products are available for the Data services you are using?
✘What are your actual data quality issues?
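For the first question above, a back-of-the-envelope estimate of network transfer time helps frame the decision. A minimal sketch: the function name, the 80% link utilization and the 100 TB / 1 Gbps sample are assumptions, and 1 TB is taken as 10**12 bytes. It does not encode any vendor threshold.

```python
def transfer_days(data_tb: float, link_mbps: float, utilization: float = 0.8) -> float:
    """Days needed to move `data_tb` terabytes over a link of `link_mbps` megabits/s.

    `utilization` is the assumed fraction of the link actually available.
    """
    if link_mbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("link_mbps must be > 0 and utilization in (0, 1]")
    total_bits = data_tb * 10**12 * 8
    effective_bits_per_second = link_mbps * 10**6 * utilization
    return total_bits / effective_bits_per_second / 86_400  # seconds per day


if __name__ == "__main__":
    days = transfer_days(data_tb=100, link_mbps=1000)
    print(f"100 TB over 1 Gbps at 80% utilization: {days:.1f} days")
```

If the estimate runs to weeks, a physical transfer appliance or a dedicated connection is usually worth evaluating against the network upload.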
Serverless
Is this a cloud data pipeline pattern now?
Serverless Time Series Analysis
Batch Storage
BigQuery
Storage
Cloud Storage
Time Series Processing
Cloud Dataflow
Analysis
Cloud Datalab
Storage
Cloud Bigtable*
Processing
Cloud Dataproc
Time Series Files
Cloud Storage
ML
Cloud ML
Streaming
Time Series Streaming
Cloud Pub/Sub
*Note: Use Bigtable with
NoSQL workloads of 1 TB or more
DataFlow
✘ Managed Apache Beam
✘ Elegant pipeline API
✘ Beam as a product -> GCP
✘ Scala? Java?
Head-to-Head
ETL – GCP DataFlow vs. GCP DataProc
DataProc
✘ Managed Apache Spark
✘ Large user base
✘ Spark as a product -> DataBricks
✘ Cost compare?
“Considering…
✘Initial Load/Transform
✘Data Quality
✘Batch vs. Stream
Pipeline Phases
Phase 0
Eval Current Data - Quality & Quantity
Phase 1
Get New Data - Free or Premium
Phase 2
Build MVP & Forecast volume and growth
Phase 3
Load test at scale
Phase 4
Deploy – secure, audit and monitor
Cloud Big Data Vendors - ETL
AWS
✘ 14X market share of next
competitor
✘ Notable: Many, strong ETL
Partners
GCP
✘ Lean, mean and cheap
✘ Fastest player / Serverless
✘ Notable: DataFlow requires
Java or Python developers
How Best to Ingest and ETL your Data?
Storage | Complexity | Scalability | Developer Cost
RDBMS | medium | medium | low
NoSQL | medium | big | high
Hadoop | hard | huge | very high
“Considering…
✘Initial Load/Transform
✘Data Quality
✘Batch vs. Stream
Building a Streaming Pipeline
Stream Interval Window
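One common reading of a stream interval window is a fixed (tumbling) window: each event lands in exactly one bucket based on its timestamp. A minimal sketch in plain Python, with no streaming framework involved; the event shape `(timestamp_seconds, key)`, the function name and the 60-second window are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds: int) -> dict:
    """Count events per key inside fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns {window_start_seconds: {key: count}} ordered by window start.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Integer division maps every timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


if __name__ == "__main__":
    sample = [(0, "a"), (5, "a"), (59, "b"), (60, "a"), (125, "b")]
    for start, counts in tumbling_window_counts(sample, 60).items():
        print(start, counts)
```

Real streaming engines add what this sketch leaves out: late or out-of-order events, watermarks, and state that survives restarts.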
Key Questions - Streaming
✘Volume – how much data now and predicted over next 12 months?
✘Variety – what types of data now and future?
✘Velocity – volume of input data / time now and near future?
✘Veracity – volume of EXISTING data now
Cloud Big Data Vendors - Streaming
AWS
✘ 14X market share of next
competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Kinesis/Firehose
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: DataFlow flexible
Cloud Offerings – Data and Pipelines
Service | AWS | Google
Managed RDBMS | RDS Aurora | Spanner / Cloud SQL
Data Warehouse | Redshift / Athena | BigQuery
NoSQL buckets | S3, Glacier | Cloud Storage, Nearline
NoSQL Key-Value, NoSQL Wide Column | DynamoDB | Big Table, Cloud Datastore
Streaming or ML | Kinesis, AWS Machine Learning | Pub/Sub or DataFlow, Google Machine Learning
NoSQL Document, NoSQL Graph | MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE
Hadoop / Spark | Elastic MapReduce | DataProc
Cloud ETL | Data Pipelines | DataFlow
How Best to Stream your Data?
Approach | Complexity | Scalability | Developer Cost
Micro-Batches | easy | medium | low
Stream Windows | difficult | big | high
Near Real-time | very difficult | huge | high
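Micro-batches rate as "easy" partly because the core idea fits in a few lines: cut the incoming records into small fixed-size groups and process each group as an ordinary batch. A minimal sketch; the function name and batch size are illustrative, and a real pipeline would typically also flush on a timer.

```python
from itertools import islice


def micro_batches(records, batch_size: int):
    """Yield successive lists of up to `batch_size` records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in micro_batches(range(7), 3):
        print(batch)
```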
Practice
Applying Concepts
Designing Streaming Cloud Data Pipelines
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
Data Processing
Financial Services > Monte Carlo Simulations
Bespoke Apps
Compute Engine
Storage/Analysis
BigQuery
Storage
Cloud Storage
Dataflow/Beam
Cloud Dataflow
Visualization
Cloud Datalab
Storage
Cloud Bigtable
Hadoop/Spark
Cloud Dataproc
Batch
Financial Services > Time Series Analysis
Storage
BigQuery
Storage
Cloud Storage
Time Series Processing
Cloud Dataflow
Analysis
Cloud Datalab
Storage
Cloud Bigtable
Processing
Cloud Dataproc
Time Series Files
Cloud Storage
ML
Cloud ML
Streaming
Time Series Streaming
Cloud Pub/Sub
Private Datasets Public Datasets
Life Sciences > Variant Analysis
MSSNG Autism
Cloud Storage
Scientist
High
Throughput
Genome
Sequencers
1000 Genomes
Cloud Storage
Patient Data
Cloud Storage
Illumina Platform
Cloud Storage
Ref Genomes
Cloud Storage
TCGA
Cloud Storage
Analytics
Online Analytics
BigQuery
Batch Analytics
Cloud Dataflow
Lab Notebooks
Cloud Datalab
Data Ingest
Genomics
BAM
FASTQ
Streaming
Big Data > Complex Event Processing
Cloud Apps
Compute Engine
Streaming
Batch
Push to Devices
App Engine
Rules Engine
Cloud Dataflow Data Analysis
Cloud Datalab
Mobile Devices
Push Notifications
Report & Share
Business Analysis
Cloud Apps
Compute Engine
On-Premises
Databases
On-Premises
Applications
Processed Events
Cloud Bigtable
Events Time Series
Data Warehouse
BigQuery
Execution Results
Streaming
Cloud Pub/Sub
Transactions
Processing
Cloud Dataflow
Transaction Streams
Messaging
Cloud Pub/Sub
Rules Actions
ETL
Cloud Dataflow
Transform Data
Cloud Data
Cloud Storage
Rules Engine
Cloud Dataproc
4.
Select ETL and Visualization
Analytics and Presentation
Pattern
✘How best to Query and Visualize
-- When to use business analytics vs. predictive analytics (machine learning)
-- how best to present data to clients - partner visualization products or roll your own
Making Sense of Data
Machine
Learning
Reports Presentation
Actual Questions - Query
✘What query skills do the Dev / DevOps teams have?
✘Do we use an ORM?
✘Are transactions needed?
✘Which RDBMS features are we using now? i.e. PL/SQL, indexing…
✘Which language(s) do our analysts use? Excel, R, Python?
Graphs
What is nature of your questions?
Practice
Applying Concepts – Trying out Cypher on Neo4j
Cloud Big Data Vendors - Query
AWS
✘ 5X market share of next
competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Big Relational
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: Flexible, powerful
machine learning
Query Languages
SQL
Everyone knows it
But how well do they know it?
NoSQL Vendor Language
Too many to list
How will you learn it?
Cypher
Query language for graph databases
The future?
ORM
Good, bad or horrible?
Again, how well do they know it?
Spark
Scala or Python (functional-style)
Machine Learning Queries
SciPy, NumPy or Python
R Language
Julia Language
Many more…
Practice
Applying Concepts – Understanding Query Language Costs
How Best to Query your Data?
Business
Analytics
Predictive
Analytics
Developer Cost
RDBMS
NoSQL
Hadoop
How Best to Query your Data?
Storage | Business Analytics | Predictive Analytics | Developer Cost
RDBMS | easy | medium | low
NoSQL | hard | very hard | very high
Hadoop | hard | hard | very high
Machine Learning aka Predictive Analytics
AWS
✘ ML for developers
✘ GUI-based –
regression/classification only
✘ MXNet for deep neural networks,
works best with GPUs
GCP
✘ ML for ML developers
✘ Scaled or specialized ML models
✘ Many model types – vision, text,
translate, speech, jobs, video
✘ TensorFlow for neural networks, can
use GPUs
Presentation
If you can’t see it, it’s not worth it.
Dashboards
✘ More than KPIs
✘ Mobile
✘ Alerts
✘ Data Stories
Innovation in Data Visualization
Reports
✘ Level of Detail
✘ Meaningful Taxonomies
✘ Fast enough
✘ Drill for Data
D3
The language of Data Visualization
Cloud Big Data Vendors - Visualization
AWS
✘ Most complete offering
✘ Notable: Partners &
QuickSight
GCP
✘ Big Query Partners
✘ Notable: 360 product - New
Dashboards, iPython-style
notebooks
5.
Review End-to-end Patterns
For major cloud vendors
Big Data > Time Series Analysis
Batch Storage
BigQuery
Storage
Cloud Storage
Time Series Processing
Cloud Dataflow
Analysis
Cloud Datalab
Storage
Cloud Bigtable*
Processing
Cloud Dataproc
Time Series Files
Cloud Storage
ML
Cloud ML
Streaming
Time Series Streaming
Cloud Pub/Sub
*Note: Use Bigtable with
NoSQL workloads of 1 TB or more
Big Data > Bioinformatics
*Note: Use Bigtable with
NoSQL workloads of 1 TB or more
Azure Patterns
6.
Cloud IoT Patterns
Internet of Things
Data Generation Device
IoT
is
Big Data
Realized
$235,000,000,000 – The IoT Market
2017 – By the year
20 Billion devices – And a lot of users
IoT all the Things
Ingest Pipelines
Storage
Analytics
Application &
Presentation
Standard
Devices
HTTPS
Constrained
Devices
Non-TCP
e.g. BLE
Gateway
Internet of Things > Sensor stream ingest and processing
App
Engine
Container
Engine
Cloud
Storage
Cloud
Pub/Sub
Cloud
Dataflow
Monitoring
Logging
Cloud
Dataflow
Cloud
Datastore
Cloud
Bigtable
BigQuery
Cloud
Dataproc
Cloud
Datalab
Compute
Engine
IoT
Cloud Big Data Vendors - IoT
AWS
✘ First to market
✘ Most complete offering
✘ Most mature offering
✘ Notable: MQTT support,
AWS IoT Rules
GCP
✘ Still in Beta / Android Things
✘ Fastest player
✘ Requires top developers
✘ Notable: Android Things,
Weave, Nest
“The Future is Functional”
@LynnLangit
The Next Generation…

More Related Content

What's hot

Logging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data CollectorLogging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data CollectorCask Data
 
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...Amazon Web Services
 
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)Amazon Web Services
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Amazon Web Services
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.Amazon Web Services
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...Amazon Web Services
 
Bleeding Edge Databases
Bleeding Edge DatabasesBleeding Edge Databases
Bleeding Edge DatabasesLynn Langit
 
Strategic Uses for Cost Efficient Long-Term Cloud Storage
Strategic Uses for Cost Efficient Long-Term Cloud StorageStrategic Uses for Cost Efficient Long-Term Cloud Storage
Strategic Uses for Cost Efficient Long-Term Cloud StorageAmazon Web Services
 
Scaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique VisitorsScaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique VisitorsYelp Engineering
 
Getting Started on Google Cloud Platform
Getting Started on Google Cloud PlatformGetting Started on Google Cloud Platform
Getting Started on Google Cloud PlatformAaron Taylor
 
Batch Processing with Containers on AWS - June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS -  June 2017 AWS Online Tech TalksBatch Processing with Containers on AWS -  June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS - June 2017 AWS Online Tech TalksAmazon Web Services
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Amazon Web Services
 
Cloud computing by Google Cloud Platform - Presentation
Cloud computing by Google Cloud Platform - PresentationCloud computing by Google Cloud Platform - Presentation
Cloud computing by Google Cloud Platform - PresentationTinarivosoaAbaniaina
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!Chris Taylor
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysisAmazon Web Services
 
Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platformLi Gao
 
Supercharging the Value of Your Data with Amazon S3
Supercharging the Value of Your Data with Amazon S3Supercharging the Value of Your Data with Amazon S3
Supercharging the Value of Your Data with Amazon S3Amazon Web Services
 
Next Generation Cloud Computing With Google - RightScale Compute 2013
Next Generation Cloud Computing With Google - RightScale Compute 2013Next Generation Cloud Computing With Google - RightScale Compute 2013
Next Generation Cloud Computing With Google - RightScale Compute 2013RightScale
 

What's hot (20)

Logging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data CollectorLogging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data Collector
 
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
 
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 
Bleeding Edge Databases
Bleeding Edge DatabasesBleeding Edge Databases
Bleeding Edge Databases
 
Strategic Uses for Cost Efficient Long-Term Cloud Storage
Strategic Uses for Cost Efficient Long-Term Cloud StorageStrategic Uses for Cost Efficient Long-Term Cloud Storage
Strategic Uses for Cost Efficient Long-Term Cloud Storage
 
Scaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique VisitorsScaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique Visitors
 
Getting Started on Google Cloud Platform
Getting Started on Google Cloud PlatformGetting Started on Google Cloud Platform
Getting Started on Google Cloud Platform
 
Batch Processing with Containers on AWS - June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS -  June 2017 AWS Online Tech TalksBatch Processing with Containers on AWS -  June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS - June 2017 AWS Online Tech Talks
 
Athena & Glue
Athena & GlueAthena & Glue
Athena & Glue
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
 
Cloud computing by Google Cloud Platform - Presentation
Cloud computing by Google Cloud Platform - PresentationCloud computing by Google Cloud Platform - Presentation
Cloud computing by Google Cloud Platform - Presentation
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
 
Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platform
 
Supercharging the Value of Your Data with Amazon S3
Supercharging the Value of Your Data with Amazon S3Supercharging the Value of Your Data with Amazon S3
Supercharging the Value of Your Data with Amazon S3
 
Next Generation Cloud Computing With Google - RightScale Compute 2013
Next Generation Cloud Computing With Google - RightScale Compute 2013Next Generation Cloud Computing With Google - RightScale Compute 2013
Next Generation Cloud Computing With Google - RightScale Compute 2013
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 

Viewers also liked

Book reviews (sixteen) from AJL Reviews 2011 to 2017
Book reviews (sixteen) from AJL Reviews 2011 to 2017Book reviews (sixteen) from AJL Reviews 2011 to 2017
Book reviews (sixteen) from AJL Reviews 2011 to 2017Nathan Rosen
 
Propuesta Técnica de Diálogo
Propuesta Técnica de DiálogoPropuesta Técnica de Diálogo
Propuesta Técnica de DiálogoCarlos Oblitas
 
Living, Learning, & Working Smart @KKU
Living, Learning, & Working Smart @KKULiving, Learning, & Working Smart @KKU
Living, Learning, & Working Smart @KKUDenpong Soodphakdee
 
Survey Monkey demographic analysis
Survey Monkey demographic analysisSurvey Monkey demographic analysis
Survey Monkey demographic analysisRubio Luis
 
Participantes fecicam
Participantes fecicamParticipantes fecicam
Participantes fecicammiciudadreal
 
Kafka & Couchbase Integration Patterns
Kafka & Couchbase Integration PatternsKafka & Couchbase Integration Patterns
Kafka & Couchbase Integration PatternsManuel Hurtado
 
Продвижение в Instagram для регионального бизнеса
Продвижение в Instagram для регионального бизнесаПродвижение в Instagram для регионального бизнеса
Продвижение в Instagram для регионального бизнесаGusarov Group
 
Diagnostico organizacional
Diagnostico organizacionalDiagnostico organizacional
Diagnostico organizacionalMilton Villalba
 
Chiara giacomini marco plantamura
Chiara giacomini marco plantamuraChiara giacomini marco plantamura
Chiara giacomini marco plantamurasecondag dicambio
 
A Passionate Teacher is a Great Teacher
A Passionate Teacher is a Great TeacherA Passionate Teacher is a Great Teacher
A Passionate Teacher is a Great TeacherJames Kiger
 
7820303 jorge parra
7820303 jorge parra7820303 jorge parra
7820303 jorge parra2334sdf
 

Beyond Relational

  • 1. Beyond Relational: Applying Big Data Cloud Patterns Lynn Langit QCon London – March 2017
  • 2. About this Workshop Learning from Cloud (AWS / GCP) Architectures 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  • 9. But NOT in HDFS…sorry Cloudera What is a Cloud-based Data Lake Architecture?
  • 10. AWS Data Lake Architecture
  • 12. Disruptor: Data Lake on GCP ✘ Well-priced, Highly Available, Auto-scaled ✘ Simple, One API, NoOps ✘ SQL ‘over’ it via Query Services
  • 14. Practice Applying Concepts – Data Lake Features (AWS or GCP)
  • 15. “Finding a path -- the ACTUAL Cost of… ✘ Saving all Data into a Data Lake ✘ Using newer database technologies ✘ Going beyond Relational
  • 16. About this Workshop Learning from Cloud (AWS / GCP) Architectures 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  • 17. 2. Evaluate Cloud Data Servers & Services
  • 20. Past -- Workloads and Terms (Workload: Example Products; Other Names) Operational Database: MySQL, SQL Server, Oracle; OLTP. Analytic Database: SSAS, Cognos, SAS; OLAP. NoSQL Database: MongoDB, Redis, Cassandra, Neo4J; Document, K/V, Graph, Column-store. Hadoop: Cloudera, Hortonworks; Big Data.
  • 22. ??? The fastest growing cloud-based Big Data products are…
  • 23. Relational The fastest growing cloud-based Big Data products are…
  • 24. “When do I use…? ✘Hadoop / Spark ✘NoSQL ✘Big Relational
  • 25. “Existing relational databases to the cloud? Operational, Data Warehouse
  • 26. Prerequisite – move / store files in the cloud – prefer Data Lake to HDFS
  • 27. Cloud Vendors – Files as Data Lake AWS ✘ S3/Glacier ✘ Most complete offering ✘ Most mature offering ✘ Notable: S3 management GCP ✘ GCS with 4 types ✘ Lean, mean and cheap ✘ Unified API ✘ Notable: Simple pricing
  • 28. Serverless Do you need to manage servers?
  • 29. AWS Console: 20 Data Services
  • 30. GCP Console: 11 Data Services
  • 31. GCP Serverless Data Warehouse Reference table Query / Compute BigQuery Customer Lists / Reference Data Export Ad Data Cloud Storage Id matching Cloud Dataflow Marketing List DoubleClick Campaign Manager Google Analytics Relevant Users Cloud Storage Analysts DataStudio 360 Dashboards
  • 32. Head-to-Head Data Warehouse -- GCP BigQuery vs. AWS Redshift. BigQuery: ✘ Mature – based on Dremel (Google) ✘ Import (batch or stream) from GCS++ ✘ ANSI SQL queries ✘ Interactive or batch pricing ✘ NoOps / Serverless. Redshift: ✘ Mature – big market share ✘ Big MySQL – ANSI SQL query ✘ PostgreSQL dialect ✘ Cost compare? ✘ Manual partition/scale up ✘ Huge partner ecosystem (ETL, viz)
  • 34. Practice Applying Concepts – Serverless SQL Queries (AWS Aurora or GCP BigQuery)
  • 35. Head-to-Head SQL Query -- GCP BigQuery vs. AWS Athena. BigQuery: ✘ Mature – based on Dremel (Google) ✘ Import (batch or stream) from GCS++ ✘ ANSI SQL queries ✘ Interactive or batch pricing (streaming) ✘ 3rd party vendor integration (ETL, viz). Athena: ✘ New – based on Presto (from FB) ✘ SQL query against S3 data ✘ Cost compare - $5/TB scanned
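As a rough worked example of the $5/TB-scanned figure on the Athena slide, the cost of a single query can be estimated from the number of bytes it scans. This is only a sketch: the function name, the decimal-terabyte convention (1 TB = 10^12 bytes), and the absence of any billing minimums or rounding are assumptions, not details taken from the slides.

```python
# Rough per-query cost estimate for a scan-priced SQL query service.
# Assumptions (not from the slides): 1 TB = 10**12 bytes, no minimum charge,
# no rounding. Only the $5/TB rate comes from the slide.
BYTES_PER_TB = 10**12


def scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Return the estimated cost in USD for scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


if __name__ == "__main__":
    # A query that scans 200 GB of data stored in S3.
    print(f"${scan_cost_usd(200 * 10**9):.2f}")
```

Because the price depends on bytes scanned rather than on cluster size, anything that reduces the scan (partitioning, columnar formats, selecting fewer columns) lowers the bill in this model.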
  • 37. Disruptor: GCP Cloud Spanner ✘Well-priced ✘Highly Available ✘Auto-scaled ✘Transactional ✘NoOps ✘Operational ✘RDBMS
  • 38. Head-to-Head Relational Database – AWS Aurora vs. GCP Spanner. Aurora: ✘ Mature ✘ ANSI SQL queries ✘ MySQL compatible ✘ 3rd party vendor integration (ETL, viz). Spanner: ✘ New (beta) ✘ ANSI SQL queries ✘ CAP – auto scales, transactionally consistent, auto partitions ✘ Cost compare?
  • 39. Cloud Vendors RDBMS Database (Operational or Analytics) AWS ✘ 14X market share of next competitor ✘ Most complete offering ✘ Most mature offering - partners ✘ Notable: Aurora, Big Relational RDBMS GCP ✘ Lean, mean and cheap ✘ Fastest player ✘ Requires top developers ✘ Notable: BigQuery, MPP SQL Query as a Service
  • 40. Cloud Offerings – RDBMS and Files (AWS vs. Google) Managed RDBMS: AWS RDS Aurora, RDS <Other>; Google Spanner, Cloud SQL. Data Warehouse: AWS Redshift, Athena*; Google BigQuery*. NoSQL buckets: AWS S3*, Glacier*; Google Cloud Storage (Multi-regional, Regional)*, Nearline, Coldline*. *Serverless services
  • 42. Reasons to use Big Relational Cloud Services Developers DevOps Cloud Vendors – AWS Developers DevOps Cloud Vendors – GCP
  • 43. Reasons to use Big Relational Cloud Services Developers ✘ Most know RDBMS query patterns ✘ Many know basic administration DevOps ✘ Most know RDBMS administration ✘ Many know basic RDBMS queries ✘ Many know query optimization Cloud Vendors - AWS ✘ Aurora – RDBMS up to 64 TB ✘ Redshift - $1k USD / 1 TB / year ✘ Rich partner ecosystem – ETL ✘ Integration with AWS products Developers ✘ Most know coding language patterns to interact with RDBMS systems DevOps ✘ Familiar RDBMS security patterns ✘ Familiar auditing ✘ Partner tooling integration Cloud Vendors - GCP ✘ Spanner / BigQuery – familiar SQL queries ✘ No hassle streaming ingest ✘ No hassle pay-as-you-go ✘ NoOps administration
  • 44. My past top Big Data Cloud Services
  • 45. “When do I use…? ✘Hadoop ✘NoSQL ✘Big Relational
  • 46. 225 NoSQL Database Types to Choose From
  • 47. Let’s review some NoSQL concepts Key-Value: Redis, Riak, Aerospike. Graph: Neo4j. Document: MongoDB. Wide-Column: Cassandra, HBase.
  • 49. Key Questions - Storage ✘ Volume – how much now, what growth rate? ✘ Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc… ✘ Velocity – batches, streams, both, what ingest rate? ✘ Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?
  • 50. Actual Questions - NoSQL ✘ When do you use Redis for caching? ✘ When do you use MongoDB or AWS DynamoDB for JSON data? ✘ When do you use Neo4j for graph-type data? ✘ Why are you using something else, e.g. Cassandra, Riak, etc.?
  • 51. NoSQL Example ✘ Open Source is Free § Rapid iteration, innovation § Can start up for free (on premise) § Can ‘rent’ for cheap or free on the cloud § Can use with the command line for free § Some vendors offer free online training § Ex. www.neo4j.org ✘ Not Free § Constant releases § Can be deceptively hard to set up (time is money) § Don’t forget to turn it off if on the cloud! § GUI tools, support, training cost $$$ § Ex. www.neo4j.com
  • 53. NoSQL Applied Log Files • ??? Product Catalogs • ??? Social Games • ??? Social aggregators • ??? Line-of-Business • ???
  • 54. NoSQL Applied Log Files • Columnstore • HBase Product Catalogs • Key/Value • Redis Social Games • Document • MongoDB Social aggregators • Graph • Neo4j Line-of-Business • RDBMS • SQL Server
  • 55. More than NoSQL NoSQL ✘ Non-relational ✘ Can be optimized in-memory ✘ Eventually consistent ✘ Schema on Read ✘ Example: Aerospike NewSQL ✘ Relational plus more ✘ Often in-memory ✘ Some kind of SQL-layer ✘ Schema on Write ✘ Example: MemSQL U-SQL ✘ What??? ✘ Microsoft’s universal SQL language ✘ Example: Azure Data Lake
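The "Schema on Read" versus "Schema on Write" distinction on the slide above can be illustrated with a small sketch. Everything here is hypothetical: the `REQUIRED_FIELDS` schema, the function names, and the use of plain Python lists as stand-ins for a relational table and a document store are assumptions for illustration, not any product's API.

```python
import json

# Hypothetical fixed schema, used only for the schema-on-write path.
REQUIRED_FIELDS = {"id": int, "name": str}


def write_with_schema(table: list, record: dict) -> None:
    """Schema on write: reject a record that does not match the schema."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(
                f"field {field!r} missing or not {expected_type.__name__}"
            )
    table.append(record)


def read_with_schema(raw_lines: list, field: str, default=None) -> list:
    """Schema on read: store anything, impose structure only when querying."""
    return [json.loads(line).get(field, default) for line in raw_lines]


if __name__ == "__main__":
    table = []
    write_with_schema(table, {"id": 1, "name": "widget"})

    # Raw JSON lines are accepted as-is; the second one has no "name" field.
    raw = ['{"id": 1, "name": "widget"}', '{"id": 2, "colour": "red"}']
    print(read_with_schema(raw, "name"))
```

In this sketch the write path fails fast on bad data, while the read path accepts irregular records and pushes the handling of missing fields onto every query.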
  • 56. Focus
  • 57. How Best to Store your Data? (Complexity / Scalability / Developer Cost) RDBMS: easy / medium* / low. NoSQL: medium / big / high. *GCP Cloud Spanner is the exception here
  • 58. Practice Applying Concepts – Real Cost of Storage Types
  • 59. Cloud NoSQL Applied – AWS Log Files Product Catalogs Social Games Social aggregators Line-of-Business
  • 60. Cloud NoSQL Applied – AWS Log Files • Stream or Hadoop • Kinesis or EMR Product Catalogs • Key/Value • DynamoDB Social Games • Document • MongoDB Social aggregators • Graph • Neo4j Line-of-Business • RDBMS • RDS
  • 61. Cloud NoSQL Applied – GCP Log Files Product Catalogs Social Games Social aggregators Line-of-Business
  • 62. Cloud NoSQL Applied – GCP Log Files • Pub/Sub • Bigtable Product Catalogs • Key/Value • Datastore Social Games • Document • Spanner Social aggregators • Graph • Neo4j Line-of-Business • RDBMS • Spanner
  • 63. AWS Console: 20 Data Services
  • 64. GCP Console: 11 Data Services
  • 65. Cloud Offerings – Big Relational and NoSQL AWS Google Managed RDBMS RDS Aurora Spanner Cloud SQL Data Warehouse Redshift or Athena BigQuery NoSQL buckets S3 Glacier Cloud Storage Nearline NoSQL Key-Value NoSQL Wide Column DynamoDB Big Table Cloud Datastore NoSQL Document NoSQL Graph MongoDB on EC2 Neo4j on EC2 MongoDB on GCE Neo4j on GCE
  • 66. “When do I use…? ✘Hadoop / Spark ✘NoSQL ✘Big Relational
  • 69. One Vendor’s View
  • 71. Where is Hadoop / Spark Used?
  • 72. “What is Hadoop Spark? ✘ In-memory ✘ Distributed ✘ Integrated with other libraries
  • 74. Spark is Fast, Unified and Usable
  • 76. Practice Applying Concepts – Apache Spark via Databricks
  • 77. Hadoop is becoming Spark… Volume ✘ 10 TB or greater to start ✘ Growth of 25% YOY ✘ Where FROM ✘ Where TO Velocity and Variety ✘ Spark over Hive ✘ Kafka and Samza Veracity ✘ Pay, train and hire team ✘ Top $$$ for talent, IF you can find it ✘ WATCH OUT for Cloud Vendors who promise ‘easy access’ ✘ Complexity of ecosystem ✘ Databricks knows best for Spark ✘ Cloudera knows best for on premise
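To make the Volume figures on this slide concrete (10 TB to start, 25% year-over-year growth), a compound-growth projection can be sketched as below. The function name and the choice of a five-year horizon are assumptions for illustration; the slide gives only the starting size and the growth rate.

```python
def projected_volume_tb(start_tb: float, yearly_growth: float, years: int) -> float:
    """Compound growth: start_tb * (1 + yearly_growth) ** years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return start_tb * (1 + yearly_growth) ** years


if __name__ == "__main__":
    # Slide figures: 10 TB starting volume, 25% growth per year.
    for year in range(6):
        print(year, round(projected_volume_tb(10, 0.25, year), 2))
```

At this rate the data set roughly doubles about every three years, which is worth knowing when sizing a cluster or estimating storage cost.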
  • 78. How Best to Store your Data? (Complexity / Scalability / Developer Cost) RDBMS: easy / medium* / low; NoSQL: medium / big / high; Hadoop: hard / huge / very high
  • 79. Cloud Offerings – Big Data (AWS | Google) Managed RDBMS: RDS, Aurora | Spanner, Cloud SQL; Data Warehouse: Redshift or Athena | BigQuery; NoSQL buckets: S3, Glacier | Cloud Storage, Nearline; NoSQL Key-Value, NoSQL Wide Column: DynamoDB | Big Table, Cloud Datastore; NoSQL Document, NoSQL Graph: MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE; Hadoop / Spark: Elastic MapReduce | DataProc
  • 80. Real World Big Data -- When do I use what? RDBMS 65% NoSQL 30% Hadoop 5%
  • 81. ETL is 75% of all Big Data Projects. Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
  • 82. About this Workshop Learning from Cloud (AWS / GCP) Architectures 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  • 87. Pattern ✘ How to build optimized cloud-based data pipelines? -- Cloud-based ETL tools and processes -- includes load-testing patterns and security practices -- including connecting between different vendor clouds
  • 88. Key Questions – Ingestion and ETL ✘ Volume – how much and how fast, now and future? ✘ Variety – what type(s) of data, any pre-processing needed? ✘ Velocity – batches or streaming? ✘ Veracity – verification on ingest needed? new data needed?
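The Veracity question (is verification on ingest needed?) can be made concrete with a small sketch. This is a minimal illustration only: the field names `device_id` and `reading` and both checks are hypothetical, not taken from the slides or from any vendor API.

```python
def validate_record(record):
    """Return a list of problems; an empty list means the record is accepted."""
    problems = []
    # Hypothetical rule 1: every record needs a non-empty device_id.
    if not record.get("device_id"):
        problems.append("missing device_id")
    # Hypothetical rule 2: the reading must be a number (bool is excluded).
    reading = record.get("reading")
    if isinstance(reading, bool) or not isinstance(reading, (int, float)):
        problems.append("reading is not numeric")
    return problems


def split_on_ingest(records):
    """Route each record to 'accepted' or 'rejected' based on validation."""
    accepted, rejected = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


incoming = [
    {"device_id": "a1", "reading": 21.5},
    {"device_id": "", "reading": 19.0},
    {"device_id": "b2", "reading": "n/a"},
]
accepted, rejected = split_on_ingest(incoming)
print(len(accepted), "accepted,", len(rejected), "rejected")
```

Keeping the rejected records together with their reasons, rather than dropping them, is one way to get at the actual data quality issues later.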
  • 89. Actual Questions – Data Ingest and ETL ✘ When (at what data volume) do you need AWS Snowball and/or AWS Direct Connect for ingest? ✘ What does ‘near-real time streaming’ exactly mean? ✘ When do you need to use a product vs. a library or service for ETL? ✘ Which 3rd party partner products are available for the Data services you are using? ✘ What are your actual data quality issues?
  • 90. Serverless: Is this a cloud data pipeline pattern now?
  • 91. Serverless Time Series Analysis Batch Storage BigQuery Storage Cloud Storage Time Series Processing Cloud Dataflow Analysis Cloud Datalab Storage Cloud Bigtable* Processing Cloud Dataproc Time Series Files Cloud Storage ML Cloud ML Streaming Time Series Streaming Cloud Pub/Sub *Note: Use Bigtable with NoSQL workloads of 1 TB or more
  • 92. Head-to-Head ETL – GCP DataFlow vs. GCP DataProc. DataFlow ✘ Managed Apache Beam ✘ Elegant pipeline API ✘ Beam as a product -> GCP ✘ Scala? Java? DataProc ✘ Managed Apache Spark ✘ Large user base ✘ Spark as a product -> Databricks ✘ Cost compare?
  • 94. Pipeline Phases Phase 0 Eval Current Data - Quality & Quantity Phase 1 Get New Data - Free or Premium Phase 2 Build MVP & Forecast volume and growth Phase 3 Load test at scale Phase 4 Deploy – secure, audit and monitor
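Phase 2 asks for a forecast of volume and growth. A minimal sketch of that arithmetic, assuming simple compound growth; the 10 TB starting point and 25% year-over-year rate are borrowed from the earlier "Hadoop is becoming Spark" slide purely as illustrative inputs, and the function name is hypothetical.

```python
def forecast_volume(start_tb, yearly_growth, years):
    """Projected volume in TB for year 0..years under compound growth."""
    return [start_tb * (1 + yearly_growth) ** year for year in range(years + 1)]


# Illustrative inputs: 10 TB today, growing 25% per year, over 3 years.
projection = forecast_volume(start_tb=10.0, yearly_growth=0.25, years=3)
for year, tb in enumerate(projection):
    print(f"year {year}: {tb:.2f} TB")
```

A real forecast would also account for new data sources (Phase 1), which tend to add volume in steps rather than smoothly.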
  • 95. Cloud Big Data Vendors - ETL AWS ✘ 14X market share of next competitor ✘ Notable: Many, strong ETL Partners GCP ✘ Lean, mean and cheap ✘ Fastest player / Serverless ✘ Notable: DataFlow requires Java or Python developers
  • 96. How Best to Ingest and ETL your Data? (Complexity / Scalability / Developer Cost) RDBMS: medium / medium / low; NoSQL: medium / big / high; Hadoop: hard / huge / very high
  • 98. Building a Streaming Pipeline Stream Interval Window
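The stream / interval / window idea can be sketched in plain Python. This is an assumption-laden illustration of a fixed (tumbling) window keyed on event time, not the Dataflow, Beam or Kinesis API; the function name and the sample events are hypothetical.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_seconds):
    """Sum event values per fixed, non-overlapping time window.

    events: iterable of (timestamp_seconds, value) pairs.
    Returns a dict mapping each window's start time to the sum of its values.
    """
    sums = defaultdict(float)
    for timestamp, value in events:
        # Integer division assigns each event to the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
    return dict(sums)


# Hypothetical events: (seconds since start, measured value).
events = [(0, 1.0), (12, 2.0), (59, 3.0), (60, 4.0), (125, 5.0)]
window_sums = tumbling_window_sums(events, window_seconds=60)
print(window_sums)
```

Managed services add what this sketch leaves out: late-arriving data, overlapping (sliding) windows, and state that survives worker failures.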
  • 99. Key Questions – Streaming ✘ Volume – how much data now and predicted over next 12 months? ✘ Variety – what types of data now and future? ✘ Velocity – volume of input data / time now and near future? ✘ Veracity – volume of EXISTING data now
  • 100. Cloud Big Data Vendors - Streaming AWS ✘ 14X market share of next competitor ✘ Most complete offering ✘ Most mature offering ✘ Notable: Kinesis/Firehose GCP ✘ Lean, mean and cheap ✘ Fastest player ✘ Requires top developers ✘ Notable: DataFlow flexible
  • 101. Cloud Offerings – Data and Pipelines (AWS | Google) Managed RDBMS: RDS, Aurora | Spanner / Cloud SQL; Data Warehouse: Redshift / Athena | BigQuery; NoSQL buckets: S3, Glacier | Cloud Storage, Nearline; NoSQL Key-Value, NoSQL Wide Column: DynamoDB | Big Table, Cloud Datastore; Streaming or ML: Kinesis, AWS Machine Learning | Pub/Sub or DataFlow, Google Machine Learning; NoSQL Document, NoSQL Graph: MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE; Hadoop / Spark: Elastic MapReduce | DataProc; Cloud ETL: Data Pipelines | DataFlow
  • 102. How Best to Stream your Data? (Complexity / Scalability / Developer Cost) Micro-Batches: easy / medium / low; Stream Windows: difficult / big / high; Near Real-time: very difficult / huge / high
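Micro-batching is rated the easiest option above, and a short sketch suggests why: it needs no notion of event time, only a counter. This is a hypothetical illustration in plain Python, not any vendor's streaming API.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of up to batch_size items from any iterable stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


# Hypothetical stream of seven records, processed three at a time.
batches = list(micro_batches(range(7), batch_size=3))
print(batches)
```

The trade-off is latency: a record waits until its batch fills or the stream ends, which is part of what separates this row from near real-time.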
  • 104. Designing Streaming Cloud Data Pipelines: Log Files; Product Catalogs; Social Games; Social aggregators; Line-of-Business
  • 105. Data Processing Financial Services > Monte Carlo Simulations Bespoke Apps Compute Engine Storage/Analysis BigQuery Storage Cloud Storage Dataflow/Beam Cloud Dataflow Visualization Cloud Datalab Storage Cloud Bigtable Hadoop/Spark Cloud Dataproc
  • 106. Batch Financial Services > Time Series Analysis Storage BigQuery Storage Cloud Storage Time Series Processing Cloud Dataflow Analysis Cloud Datalab Storage Cloud Bigtable Processing Cloud Dataproc Time Series Files Cloud Storage ML Cloud ML Streaming Time Series Streaming Cloud Pub/Sub
  • 107. Private Datasets Public Datasets Life Sciences > Variant Analysis MSSNG Autism Cloud Storage Scientist High Throughput Genome Sequencers 1000 Genomes Cloud Storage Patient Data Cloud Storage Illumina Platform Cloud Storage Ref Genomes Cloud Storage TCGA Cloud Storage Analytics Online Analytics BigQuery Batch Analytics Cloud Dataflow Lab Notebooks Cloud Datalab Data Ingest Genomics BAM FASTQ
  • 108. Streaming Big Data > Complex Event Processing Cloud Apps Compute Engine Streaming Batch Push to Devices App Engine Rules Engine Cloud Dataflow Data Analysis Cloud Datalab Mobile Devices Push Notifications Report & Share Business Analysis Cloud Apps Compute Engine On-Premises Databases On-Premises Applications Processed Events Cloud Bigtable Events Time Series Data Warehouse BigQuery Execution Results Streaming Cloud Pub/Sub Transactions Processing Cloud Dataflow Transaction Streams Messaging Cloud Pub/Sub Rules Actions ETL Cloud Dataflow Transform Data Cloud Data Cloud Storage Rules Engine Cloud Dataproc
  • 109. About this Workshop Learning from Cloud (AWS / GCP) Architectures 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  • 110. 4. Select ETL and Visualization Analytics and Presentation
  • 111. Pattern ✘ How best to Query and Visualize -- When to use business analytics vs. predictive analytics (machine learning) -- how best to present data to clients - partner visualization products or roll your own
  • 112. Making Sense of Data Machine Learning Reports Presentation
  • 113. Actual Questions – Query ✘ What query skills do the Dev / DevOps teams have? ✘ Do we use an ORM? ✘ Are transactions needed? ✘ Which RDBMS features are we using now? i.e. PL/SQL, indexing… ✘ Which language(s) do our analysts use? Excel, R, Python?
  • 114. Graphs: What is the nature of your questions?
  • 115. Practice Applying Concepts – Trying out Cypher on Neo4j
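To show the kind of question a graph query answers, here is a minimal friends-of-friends sketch in plain Python. The names, the `FRIEND` relationship and the `Person` label are hypothetical; the Cypher pattern in the comment is an approximate equivalent for comparison, not something this code executes.

```python
# Hypothetical directed "friend" edges, stored as an adjacency dict.
friends = {
    "Ann": {"Bob", "Cho"},
    "Bob": {"Dee"},
    "Cho": {"Dee", "Eli"},
    "Dee": set(),
    "Eli": set(),
}


def friends_of_friends(graph, person):
    """Return everyone exactly two 'friend' hops away from person."""
    # Roughly the same question in Cypher (assumed labels and names):
    #   MATCH (a:Person {name: 'Ann'})-[:FRIEND]->()-[:FRIEND]->(fof)
    #   RETURN DISTINCT fof.name
    second_hop = set()
    for friend in graph.get(person, set()):
        second_hop |= graph.get(friend, set())
    return second_hop


result = sorted(friends_of_friends(friends, "Ann"))
print(result)
```

In SQL the same question needs a self-join per hop, which is why relationship-heavy questions are the usual argument for a graph database.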
  • 117. Cloud Big Data Vendors - Query AWS ✘ 5X market share of next competitor ✘ Most complete offering ✘ Most mature offering ✘ Notable: Big Relational GCP ✘ Lean, mean and cheap ✘ Fastest player ✘ Notable: Flexible, powerful machine learning
  • 118. Query Languages. SQL: Everyone knows it. But how well do they know it? NoSQL Vendor Language: Too many to list. How will you learn it? Cypher: Query language for graph databases. The future? ORM: Good, bad or horrible? Again, how well do they know it? Spark: Scala or Python (functional-style). Machine Learning Queries: SciPy, NumPy or Python; R Language; Julia Language; Many more…
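As a quick check on "how well do they know it?", the sketch below runs a SQL aggregate through Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical; the point is the difference between filtering rows (`WHERE`) and filtering groups (`HAVING`), a common gap in everyday SQL knowledge.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 120.0), ("acme", 80.0), ("globex", 50.0)],
)

# HAVING filters on the aggregated total, after GROUP BY has run.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS order_count, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY total DESC
    """
).fetchall()
conn.close()
print(rows)
```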
  • 119. Practice Applying Concepts – Understanding Query Language Costs
  • 120. How Best to Query your Data? Business Analytics Predictive Analytics Developer Cost RDBMS NoSQL Hadoop
  • 121. How Best to Query your Data? (Business Analytics / Predictive Analytics / Developer Cost) RDBMS: easy / medium / low; NoSQL: hard / very hard / very high; Hadoop: hard / hard / very high
  • 122. Machine Learning aka Predictive Analytics AWS ✘ ML for developers ✘ GUI-based – regression/classification only ✘ MxNet for deep neural networks, works best with GPUs GCP ✘ ML for ML developers ✘ Scaled or specialized ML models ✘ Many model types – vision, text, translate, speech, jobs, video ✘ TensorFlow for neural networks, can use GPUs
  • 123. Presentation If you can’t see it, it’s not worth it.
  • 124. Innovation in Data Visualization. Dashboards ✘ More than KPIs ✘ Mobile ✘ Alerts ✘ Data Stories. Reports ✘ Level of Detail ✘ Meaningful Taxonomies ✘ Fast enough ✘ Drill for Data
  • 125. D3 The language of Data Visualization
  • 127. Cloud Big Data Vendors – Visualization AWS ✘ Most complete offering ✘ Notable: Partners & QuickSight GCP ✘ BigQuery Partners ✘ Notable: 360 product - New Dashboards, iPython-style notebooks
  • 128. About this Workshop Learning from Cloud (AWS / GCP) Architectures 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  • 129. 5. Review End-to-end Patterns For major cloud vendors
  • 130. Big Data > Time Series Analysis Batch Storage BigQuery Storage Cloud Storage Time Series Processing Cloud Dataflow Analysis Cloud Datalab Storage Cloud Bigtable* Processing Cloud Dataproc Time Series Files Cloud Storage ML Cloud ML Streaming Time Series Streaming Cloud Pub/Sub *Note: Use Bigtable with NoSQL workloads of 1 TB or more
  • 131. Big Data > Bioinformatics *Note: Use Bigtable with NoSQL workloads of 1 TB or more
  • 133. About this Workshop Learning from Cloud (AWS / GCP) Architectures 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  • 135. Data Generation Device
  • 137. $235,000,000,000: The IoT Market. 2017: By the year. 20 Billion devices: And a lot of users
  • 138. IoT all the Things
  • 139. Internet of Things > Sensor stream ingest and processing. Ingest Pipelines Storage Analytics Application & Presentation Standard Devices HTTPS Constrained Devices Non-TCP e.g. BLE Gateway App Engine Container Engine Cloud Storage Cloud Pub/Sub Cloud Dataflow Monitoring Logging Cloud Dataflow Cloud Datastore Cloud Bigtable BigQuery Cloud Dataproc Cloud Datalab Compute Engine
  • 140. IoT
  • 142. Cloud Big Data Vendors - IoT AWS ✘ First to market ✘ Most complete offering ✘ Most mature offering ✘ Notable: MQTT support, AWS IoT Rules GCP ✘ Still in Beta / Android Things ✘ Fastest player ✘ Requires top developers ✘ Notable: Android Things, Weave, Nest
  • 143. “The Future is Functional” @LynnLangit