1
3rd Generation Data Platform
William Vambenepe
Lead Product Manager for Big Data services
Google Cloud
@vambenepe
2
Street name
Street number
Street View
Sign
Business facade
Sign
Business name
Traffic light
Traffic sign
Street number
Analytics as an IT project
Gen 1: On-prem or colocation
Gen 2: Virtualized datacenters
User managed, user configured, user maintained
Analytics as a service
Fully managed infra, auto-optimized, pay for just what you need
4
Managed Data Services - Focus on Insight vs Infrastructure
PB+ Scale, No-Ops, Batch & Streaming of Data
Insights/Analytics
Resource Provisioning
Performance Tuning
Monitoring
Reliability
Deployment & Configuration
Handling Growing Scale
Utilization Improvements
Insights/Analytics
7
15 Years of Tackling Big Data Problems
Timeline: 2002, 2004, 2005, 2006, 2008, 2010, 2012, 2014, 2015, 2016
Google Papers: GFS, MapReduce, Bigtable, Dremel, FlumeJava, Spanner, MillWheel, Dataflow, TensorFlow
Open Source
Google Cloud Products: BigQuery, Pub/Sub, Dataflow, Bigtable, ML
8
Serverless platform, auto-optimized usage, across the entire data lifecycle
Stages: Capture, Process, Store, Analyze, Use
Services: Google Analytics Premium, Cloud Pub/Sub, Google Stackdriver, Firebase, Storage Transfer Service, Cloud Dataflow, Cloud Dataproc, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), BigQuery Analytics, Cloud Datalab, Cloud ML
Processing modes: Stream, Batch
Outputs: Real-time analytics, Real-time dashboard, Real-time alerts
Users: Data Scientists, Business Analysts
10
Cloud Pub/Sub
Fast, reliable event delivery. Serverless, autoscaling, pay for what you use.
12
Processing Time vs. Event Time
13
Beam Model: Asking the Right Questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
The Beam Model: What is Being Computed?
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
The Beam Model: Where in Event Time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
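As a rough illustration of the "where in event time" question, the following plain-Java sketch assigns each element to a two-minute fixed window by its event timestamp and then sums per key inside each window. It does not use the Beam SDK; the `Event` class, the millisecond timestamps, and the sample data are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of fixed event-time windows plus a per-key sum. Not the Beam API. */
class FixedWindowSketch {
    /** Two-minute windows, matching the duration used on the slide. */
    static final long WINDOW_MILLIS = 2 * 60 * 1000L;

    /** Hypothetical element type: a keyed value stamped with the time it happened. */
    static final class Event {
        final String key;
        final int value;
        final long eventTimeMillis;

        Event(String key, int value, long eventTimeMillis) {
            this.key = key;
            this.value = value;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    /** Returns the start of the fixed window that contains the given event time. */
    static long windowStart(long eventTimeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, WINDOW_MILLIS);
    }

    /** Sums values per window start and key; arrival order does not affect the result. */
    static Map<Long, Map<String, Integer>> sumPerKeyPerWindow(List<Event> input) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (Event e : input) {
            result.computeIfAbsent(windowStart(e.eventTimeMillis), w -> new TreeMap<>())
                  .merge(e.key, e.value, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> input = Arrays.asList(
            new Event("alice", 5, 30_000L),    // happened at 0:30, window starting at 0
            new Event("alice", 3, 130_000L),   // happened at 2:10, window starting at 120000
            new Event("bob", 7, 90_000L),      // happened at 1:30, window starting at 0
            new Event("alice", 4, 110_000L));  // arrives last, yet belongs to the window at 0
        System.out.println(sumPerKeyPerWindow(input));
        // prints {0={alice=9, bob=7}, 120000={alice=3}}
    }
}
```

The last element arrives after one from a later window but is still counted in the first window, which is the point of grouping by event time rather than arrival time.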
The Beam Model: When in Processing Time?
// AtWatermark() is the talk's shorthand, not a literal Beam SDK method name.
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
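To make the "when in processing time" question concrete, here is a minimal plain-Java sketch of a watermark trigger: values are buffered per event-time window, and a window's sum is emitted once the watermark passes the end of that window. It does not use the Beam SDK. The class, its method names, the explicitly supplied watermark values, and the choice to drop data for windows that already fired are all assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of a watermark trigger over fixed windows. Not the Beam API. */
class WatermarkTriggerSketch {
    static final long WINDOW_MILLIS = 2 * 60 * 1000L;

    /** Window start -> running sum, for windows that have not fired yet. */
    private final TreeMap<Long, Integer> openWindows = new TreeMap<>();
    private long watermark = Long.MIN_VALUE;

    /** Buffers a value in its event-time window; returns false if that window already fired. */
    boolean onElement(long eventTimeMillis, int value) {
        long start = eventTimeMillis - Math.floorMod(eventTimeMillis, WINDOW_MILLIS);
        if (start + WINDOW_MILLIS <= watermark) {
            return false; // late data is dropped in this sketch
        }
        openWindows.merge(start, value, Integer::sum);
        return true;
    }

    /** Advances the watermark and emits "windowStart=sum" for every window it has passed. */
    List<String> advanceWatermark(long newWatermark) {
        watermark = Math.max(watermark, newWatermark);
        List<String> fired = new ArrayList<>();
        while (!openWindows.isEmpty() && openWindows.firstKey() + WINDOW_MILLIS <= watermark) {
            Map.Entry<Long, Integer> done = openWindows.pollFirstEntry();
            fired.add(done.getKey() + "=" + done.getValue());
        }
        return fired;
    }

    public static void main(String[] args) {
        WatermarkTriggerSketch trigger = new WatermarkTriggerSketch();
        trigger.onElement(30_000L, 5);
        trigger.onElement(130_000L, 3);
        System.out.println(trigger.advanceWatermark(100_000L)); // prints []
        trigger.onElement(110_000L, 4);
        System.out.println(trigger.advanceWatermark(120_000L)); // prints [0=9]
        System.out.println(trigger.onElement(60_000L, 1));      // prints false
        System.out.println(trigger.advanceWatermark(240_000L)); // prints [120000=3]
    }
}
```

Nothing is emitted until the watermark reaches the end of a window, so results are delayed in processing time but each window yields a single answer.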
The Beam Model: How Do Refinements Relate?
// AtWatermark(), AtPeriod() and AtCount() are the talk's shorthand, not literal Beam SDK method names.
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
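The "how do refinements relate" question can be sketched in plain Java for a single key and a single window. Each trigger firing produces a pane. With accumulating panes, each pane reports the total seen so far; the discarding mode, shown only for contrast, reports just the values that arrived since the previous firing. This is not the Beam SDK, and the method name and sample values are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of how successive panes of one window relate. Not the Beam API. */
class PaneRefinementSketch {
    /**
     * Computes the sum emitted at each trigger firing.
     *
     * @param newValuesPerFiring values that arrived between consecutive firings
     * @param accumulating true if each pane includes everything seen so far,
     *                     false if each pane holds only the newly arrived values
     */
    static List<Integer> firedSums(List<List<Integer>> newValuesPerFiring, boolean accumulating) {
        List<Integer> panes = new ArrayList<>();
        int running = 0;
        for (List<Integer> batch : newValuesPerFiring) {
            int delta = 0;
            for (int v : batch) {
                delta += v;
            }
            running += delta;
            panes.add(accumulating ? running : delta);
        }
        return panes;
    }

    public static void main(String[] args) {
        List<List<Integer>> firings = Arrays.asList(
            Arrays.asList(5),     // early firing, before the watermark
            Arrays.asList(7, 3),  // on-time firing, at the watermark
            Arrays.asList(4));    // late firing, after one late element
        System.out.println(firedSums(firings, true));  // prints [5, 15, 19]
        System.out.println(firedSums(firings, false)); // prints [5, 10, 4]
    }
}
```

With accumulating panes a consumer can simply overwrite the previous result with the latest one, whereas discarding panes would have to be added up downstream.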
22
Beam = Batch + Stream
Apache Beam (incubating)
Cloud Dataflow
Based on Apache Beam. Pipelines are portable to your favorite runtime.
23
Compute and Storage
Unbounded
Bounded
Resource Management
Resource Auto-scaler
Dynamic Work Rebalancer
Work Scheduler
Monitoring
Log Collection
Graph Optimization
Auto-Healing
Intelligent Watermarking
SOURCE
SINK
Cloud Dataflow
Serverless, autoscaling, pay for what you use.
25
BigQuery
Now with full support for Enterprise Data Warehouse
SQL
Flat-rate Pricing
Standard SQL
ODBC Connector
DML (Beta)
Identity and Access Management
27
Cloud Dataproc
Ready-to-use Spark and Hadoop clusters in 90 seconds
Native Spark and Hadoop: Run Spark and Hadoop applications out of the box without modification.
Cloud Integrated: Integrated with Cloud Storage, Cloud Logging, Cloud Monitoring, and more.
Minute-by-Minute Billing: While active, Dataproc clusters are billed minute-by-minute.
Preemptible VMs: Dataproc clusters can make use of low-cost preemptible Compute Engine VMs.
28
Cloud Dataproc
Ready-to-use Spark and Hadoop clusters in 90 seconds
Anytime Scaling: Manually scale clusters up or down based on need, even when jobs are running.
Developer Tools: REST API and integration with Google Cloud SDK for rapid development.
Easily Configured: Select between multiple Spark and Hadoop versions; configure properties easily.
Initialization Actions: Execute scripts on cluster creation to quickly customize and configure clusters.
29
Cloud Dataproc
Demonstrably more cost-effective
Source: Michael Li & Ariel M'ndange-Pfupfu on O’Reilly: https://www.oreilly.com/ideas/spark-comparison-aws-vs-gcp

BDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
