1
Google Cloud & Data Pipeline
Patterns
@LynnLangit
2
Google Cloud in Australia
Data center here in 2017
3
GCP and Patterns
Developer-first
• Fast, flexible and cheap
• Virtual Machines / GCE
• Storage / GCS
Servers ➡ Containers ➡ Functions
• Data Warehouse
• Internet of Things (IoT)
• Bioinformatics
1. Modern Cloud by Example
2. GCP Data Pipeline Patterns
**And also, something New…
4
Confidential & Proprietary, Google Cloud Platform
Demo – Storage / GCS
5
6
Demo – Virtual Machines / GCE
7
Virtual Machines / GCE
• Fast
• Spin up in seconds
• Tools - SSH, gcloud console
• Flexible
• Custom sizing – slider
• OS variety – Linux or Windows
• Cheap and Simple
• Auto discount for use
• Preemptible
Storage / GCS
• Fast
• Very fast within region
• Tools included
• Flexible
• 4 storage options
• Simple to use / understand
• Cheap
• Pricing by type
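The "Auto discount for use" bullet can be made concrete with a little arithmetic. The sketch below assumes a tiered sustained-use schedule in which each successive quarter of the month is billed at 100%, 80%, 60%, and 40% of the base rate; those tier values are an assumption brought in for illustration and do not appear on the slides, so check current pricing before relying on them.

```python
from fractions import Fraction

# Assumed tier schedule (not from the slides): each quarter of the month
# is billed at a decreasing fraction of the base rate.
TIERS = [Fraction(100, 100), Fraction(80, 100), Fraction(60, 100), Fraction(40, 100)]


def effective_rate_multiplier(usage_fraction):
    """Average multiplier on the base rate for a VM that ran for
    `usage_fraction` (greater than 0, at most 1) of the month."""
    usage = Fraction(usage_fraction)
    if not 0 < usage <= 1:
        raise ValueError("usage_fraction must be in (0, 1]")
    quarter = Fraction(1, 4)
    billed = Fraction(0)
    remaining = usage
    for tier in TIERS:
        # Bill the portion of usage that falls inside this quarter.
        used_in_tier = min(remaining, quarter)
        billed += used_in_tier * tier
        remaining -= used_in_tier
        if remaining == 0:
            break
    return billed / usage


if __name__ == "__main__":
    # Under the assumed tiers, a VM that runs all month pays 7/10 of the base rate.
    print(effective_rate_multiplier(1))
```

Under these assumed tiers, a VM that runs only half the month averages 9/10 of the base rate, which is why the discount is described as automatic: it grows with usage and needs no upfront commitment.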
8
9
Pipeline Architectures
10
Data Warehousing
11
Big Data > Data Warehouse
Reference table
Query / Compute
BigQuery
Customer Lists / Reference Data
Export Ad Data
Cloud Storage
Id matching
Cloud Dataflow
Marketing List
DoubleClick Campaign Manager
Google Analytics
Relevant Users
Cloud Storage
Analysts
Data Studio 360
Dashboards
12
Demo – BigQuery
13
Batch
Streaming
Big Data > Log Processing
Log Storage
Cloud Storage
Log Streaming
Cloud Pub/Sub
Log Analytics
BigQuery
Log Processing
Cloud Dataflow
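The point of this pattern is that one processing step serves both the batch path (log files in Cloud Storage) and the streaming path (messages from Cloud Pub/Sub) before results land in BigQuery. A minimal sketch of that idea follows, using plain Python stand-ins rather than the Dataflow / Apache Beam API; the log line layout (`timestamp level message`) and all function names are hypothetical.

```python
from collections import Counter


def parse_log_line(line):
    """Parse a hypothetical 'timestamp level message' log line into a row."""
    timestamp, level, message = line.strip().split(" ", 2)
    return {"timestamp": timestamp, "level": level, "message": message}


def process(lines):
    """Shared processing step: parse each line, skipping malformed ones.

    Works on any iterable, so the same code handles a finite batch
    (a list of file lines) or an unbounded stream (a generator).
    """
    for line in lines:
        try:
            yield parse_log_line(line)
        except ValueError:
            continue  # line had fewer than three fields


def count_by_level(rows):
    """Toy analytics step standing in for a query over the output table."""
    return Counter(row["level"] for row in rows)


if __name__ == "__main__":
    # Stand-in for log files read from storage (batch).
    batch_source = [
        "2017-01-01T00:00:00Z ERROR disk full",
        "2017-01-01T00:00:01Z INFO request served",
    ]
    # Stand-in for messages arriving one at a time (streaming).
    stream_source = iter(["2017-01-01T00:00:02Z WARN slow response"])

    print(count_by_level(process(batch_source)))
    print(count_by_level(process(stream_source)))
```

In a real pipeline the two sources would be connectors and the shared step would be a pipeline transform; the sketch only shows why a single transform can be reused across both modes.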
14
Cloud Dataflow / Apache Beam
15
Big Data > Time Series Analysis
Batch Storage
BigQuery
Storage
Cloud Storage
Time Series Processing
Cloud Dataflow
Analysis
Cloud Datalab
Storage
Cloud Bigtable*
Processing
Cloud Dataproc
Time Series Files
Cloud Storage
ML
Cloud ML
Streaming
Time Series Streaming
Cloud Pub/Sub
*Note: Use Bigtable with NoSQL workloads of 1 TB or more
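The "Time Series Processing" step in this diagram typically groups timestamped points into windows before analysis. The sketch below shows fixed (tumbling) windows with a per-window mean in plain Python; it is an illustration of the windowing idea only, not Dataflow code, and the function name and the `(epoch_seconds, value)` input shape are assumptions.

```python
from collections import defaultdict


def fixed_window_means(points, window_seconds):
    """Group (epoch_seconds, value) points into fixed windows and
    return {window_start: mean_value}, ordered by window start."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its window.
        window_start = ts - (ts % window_seconds)
        buckets[window_start].append(value)
    return {
        start: sum(values) / len(values)
        for start, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    readings = [(0, 1.0), (30, 3.0), (60, 10.0), (119, 20.0)]
    # Two one-minute windows: [0, 60) and [60, 120).
    print(fixed_window_means(readings, 60))
```

The same function works for the batch path (points read from files) and, applied to each arriving micro-batch, approximates the streaming path; late or out-of-order data is deliberately ignored here.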
16
Streaming
Big Data > Complex Event Processing
Cloud Apps
Compute Engine
Streaming
Batch
Push to Devices
App Engine
Rules Engine
Cloud Dataflow Data Analysis
Cloud Datalab
Mobile Devices
Push Notifications
Report & Share
Business Analysis
Cloud Apps
Compute Engine
On-Premises Databases
On-Premises Applications
Processed Events
Cloud Bigtable
Events Time Series
Data Warehouse
BigQuery
Execution Results
Streaming
Cloud Pub/Sub
Transactions
Processing
Cloud Dataflow
Transaction Streams
Messaging
Cloud Pub/Sub
Rules Actions
ETL
Cloud Dataflow
Transform Data
Cloud Data
Cloud Storage
Rules Engine
Cloud Dataproc
17
Files
• Cloud Storage
Compute
• BigQuery
• Cloud Dataflow
Other
• 3rd party ETL
• 3rd party dashboards
Core Products for Data Warehousing
More on BigQuery…
• Interactive or Batch query
• ANSI SQL compliant
• Cost control - Purchase ‘slots’
• NoOps Data Warehouse
18
Big Relational
19
What is Spanner?
20
Demo – Cloud Spanner
21
Internet of Things
22
Internet of Things > MQTT
IoT Warehouse
BigQuery
IoT Application
App Engine
Stream Analytics
Cloud Dataflow
IoT Topic
Cloud Pub/Sub
MQTT
Devices
Auto-scaled Broker Tier
Custom MQTT broker
MQTT Broker
Compute Engine
RabbitMQ
Cloud Load Balancing
23
Ingest Pipelines
Storage
Analytics
Application & Presentation
Standard Devices
HTTPS
Constrained Devices
Non-TCP, e.g. BLE
Gateway
Internet of Things > Sensor stream ingest and processing
App Engine
Container Engine
Cloud Storage
Cloud Pub/Sub
Cloud Dataflow
Monitoring
Logging
Cloud Dataflow
Cloud Datastore
Cloud Bigtable
BigQuery
Cloud Dataproc
Cloud Datalab
Compute Engine
24
Retail > Beacons and Targeted Marketing
Events
Cloud Bigtable
Proximity Events
Analytics
BigQuery
Data Warehouse
Messaging
Cloud Pub/Sub
Proximity Streams
Processing
Cloud Dataflow
Stream Processing
Notifications
App Engine
Push to Devices
Mobile-Push Notifications
Office Business Systems
Beacons
Proximity Notifications
Messaging
Cloud Pub/Sub
Queued Notifications
25
Files & Storage
• Cloud Storage
• Cloud Bigtable
Compute & Ingest
• Cloud Pub/Sub
• BigQuery
• Cloud Dataflow
Core Products for IoT
26
Demo – Machine Learning
27
Bioinformatics
28
Patient
Analytics
Life Sciences > Patient Monitoring
Analytics
Process Data
Prediction API
Ingest
Cloud Pub/Sub
Storage
Cloud Bigtable
Alerts
Notifications
Cloud Pub/Sub
Health Care Professional
Patient Monitors (pulse, blood sugar, exercise)
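The flow in this diagram is: monitor readings are ingested, analyzed, and turned into alerts for a health care professional. The sketch below illustrates only the "reading in, alert out" shape. It swaps the Prediction API step shown on the slide for a fixed range check, and every name and limit in it is hypothetical and chosen for illustration, not clinical use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    patient_id: str
    metric: str
    value: float


# Hypothetical alert ranges, for illustration only (not from the slides).
LIMITS = {
    "pulse": (40.0, 130.0),
    "blood_sugar": (70.0, 180.0),
}


def alerts_for(readings):
    """Return one alert message per reading that falls outside its range.

    Metrics without a configured range (e.g. exercise) are passed over.
    """
    alerts = []
    for reading in readings:
        limits = LIMITS.get(reading.metric)
        if limits is None:
            continue
        low, high = limits
        if reading.value < low or reading.value > high:
            alerts.append(
                f"{reading.patient_id}: {reading.metric}={reading.value} "
                f"outside [{low}, {high}]"
            )
    return alerts


if __name__ == "__main__":
    incoming = [
        Reading("p1", "pulse", 150.0),
        Reading("p1", "pulse", 80.0),
        Reading("p2", "exercise", 30.0),
    ]
    for message in alerts_for(incoming):
        print(message)  # in the diagram, this would be published as a notification
```

In the architecture on the slide, the list returned by `alerts_for` corresponds to messages published to the notifications topic, while the raw readings are stored separately for later analysis.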
29
Private Datasets
Public Datasets
Life Sciences > Variant Analysis
MSSNG Autism
Cloud Storage
Scientist
High Throughput Genome Sequencers
1000 Genomes
Cloud Storage
Patient Data
Cloud Storage
Illumina Platform
Cloud Storage
Ref Genomes
Cloud Storage
TCGA
Cloud Storage
Analytics
Online Analytics
BigQuery
Batch Analytics
Cloud Dataflow
Lab Notebooks
Cloud Datalab
Data Ingest
Genomics
BAM
FASTQ
30
Ingest
Elastic Cluster
Storage
Analytics
Life Sciences > Genomics, Secondary Analysis
Carrier Interconnect
High Throughput Genome Sequencers
Scientist
Raw Datafiles
Cloud Storage
Processed Data
Cloud Storage
Metadata
Cloud SQL
Lab notebooks
Cloud Datalab
HPC Cluster
Compute Engine
10 Nodes
Ingest Server
Compute Engine
Online Analytics
BigQuery
Cloud Load Balancing
Cloud Network
31
• Cloud Storage
• BigQuery
• Compute Engine
• Cloud Dataflow
• Public datasets on GCP
Core Products for Bioinformatics
33
“The Future is Functional”
@LynnLangit

Editor's Notes

  • #20 https://cloud.google.com/spanner/ https://research.google.com/pubs/pub45855.html https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
  • #34 Icon and sample diagrams landing page https://cloud.google.com/icons