Data @ Google
Use the same tools Google uses to manage its Big Data
Eric Anderson - Product Manager, @ericmander
2
[Street View image annotated with: street name, street number, signs, business facade, business name, traffic light, traffic sign]
3
Managed Data Services - Focus on Insight vs Infrastructure
PB+ Scale, No-Ops, Batch & Streaming of Data
[Diagram: with managed services, time spent on resource provisioning, performance tuning, monitoring, reliability, deployment & configuration, handling growing scale, and utilization improvements shifts to insights/analytics]
4
15 Years of Tackling Big Data Problems
[Timeline, 2002–2016: Google papers — GFS, MapReduce, BigTable, Dremel, Flume Java, Spanner, Millwheel, Dataflow, TensorFlow — alongside the open-source projects and Google Cloud products they inspired]
5
15 Years of Tackling Big Data Problems
[Same 2002–2016 timeline graphic (build of the previous slide)]
6
“Google is living a few years in the future and sending the rest of us messages.”
— Doug Cutting, Hadoop co-creator
7
15 Years of Tackling Big Data Problems
[Timeline, 2002–2016: Google papers — GFS, MapReduce, BigTable, Dremel, Flume Java, Spanner, Millwheel, Dataflow, TensorFlow, ML — mapped to the corresponding open-source projects and Google Cloud products: BigQuery, Cloud Pub/Sub, Cloud Dataflow, Cloud Bigtable]
8
[Architecture diagram, Capture → Process → Store → Analyze → Use:
Capture: Cloud Pub/Sub, Google Analytics Premium, Firebase, Storage Transfer Service, Google Stackdriver
Process: Cloud Dataflow (stream and batch), Cloud Dataproc
Store: BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files)
Analyze: BigQuery Analytics, Cloud Dataflow, Cloud ML
Use: Cloud Datalab, real-time dashboards, real-time alerts, data scientists, business analysts]
Serverless platform, auto-optimized usage, across the entire data lifecycle
9
[Capture → Process → Store → Analyze → Use architecture diagram repeated from slide 8]
Serverless platform, auto-optimized usage, across the entire data lifecycle
10
Cloud Pub/Sub
Fast, reliable event delivery. Serverless, autoscaling, pay for what you use.
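As a rough illustration of the publish side (not from the deck), here is a minimal sketch using the google-cloud-pubsub Python client; the project ID, topic name, and payload are placeholders.

```python
# Sketch: publishing an event with the google-cloud-pubsub client.
# Project ID, topic name, and payload are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "clickstream-events")

# publish() is asynchronous; the returned future resolves to the message ID
# once Pub/Sub has durably accepted the message.
future = publisher.publish(
    topic_path,
    data=b'{"event": "page_view", "page": "/pricing"}',
    origin="web",  # optional string attributes travel with the message
)
print("Published message", future.result())
```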
11
Out-of-order data
The hard part of stream processing
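To make this concrete (a hedged sketch, not from the deck): in the Dataflow/Beam model, records are grouped by the event-time windows they belong to, and the watermark plus an allowed-lateness setting decide how long to wait for stragglers. The topic, window size, and lateness values below are illustrative.

```python
# Sketch: event-time windowing with allowed lateness in Apache Beam (Python SDK).
# The topic, window size, and lateness values are illustrative.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    _ = (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/clicks")
        # Group by the time the event happened, not the time it arrived.
        # Data up to 10 minutes behind the watermark still lands in the
        # window it belongs to instead of being dropped or miscounted.
        | "Window" >> beam.WindowInto(
            FixedWindows(60),
            trigger=AfterWatermark(),
            allowed_lateness=600,
            accumulation_mode=AccumulationMode.DISCARDING)
        | "Count" >> beam.combiners.Count.Globally().without_defaults()
    )
```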
12
Cloud Dataflow
Unified batch/stream pipelines.
13
Beam = Batch + Stream
Apache Beam (incubating)
Cloud Dataflow
Based on Apache Beam. Pipelines are portable to your favorite runtime.
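A minimal word-count sketch of that portability, assuming illustrative project, bucket, and region values: the pipeline code stays the same and only the runner option changes.

```python
# Sketch: the same Beam pipeline targeting different runners.
# Project, bucket, and region values are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(runner="DirectRunner"):
    options = PipelineOptions(
        runner=runner,                      # "DirectRunner", "DataflowRunner", ...
        project="my-project",               # used by the Dataflow runner
        temp_location="gs://my-bucket/tmp",
        region="us-central1",
    )
    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromText("gs://my-bucket/input/*.txt")
         | beam.FlatMap(lambda line: line.split())
         | beam.combiners.Count.PerElement()
         | beam.Map(lambda kv: f"{kv[0]}\t{kv[1]}")
         | beam.io.WriteToText("gs://my-bucket/output/wordcount"))

# run()                    # run locally for development and tests
# run("DataflowRunner")    # run as a managed job on Cloud Dataflow
```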
14
[Diagram: sources and sinks connect through Cloud Dataflow, which manages compute and storage for bounded and unbounded data and provides resource management, resource auto-scaling, dynamic work rebalancing, work scheduling, monitoring, log collection, graph optimization, auto-healing, and intelligent watermarking]
Cloud Dataflow
Serverless, autoscaling, pay for what you use.
15
“We are very excited about the productivity benefits offered by Cloud Dataflow and Cloud Pub/Sub. It took half a day to rewrite something that had previously taken over six months to build using Spark.”
— Paul Clarke, Director of Technology, Ocado
“From traditional batch processing to rock-solid event delivery to the nearly magical abilities of BigQuery, building on Google’s data infrastructure provides us with a significant advantage where it matters the most.”
— Nicholas Harteau, VP of Engineering and Infrastructure, Spotify
16
17
[Capture → Process → Store → Analyze → Use architecture diagram repeated from slide 8]
Serverless platform, auto-optimized usage, across the entire data lifecycle
18
Cloud Dataproc
Ready-to-use Spark and Hadoop clusters in 90 seconds

Native Spark and Hadoop: run Spark and Hadoop applications out of the box without modification.
Cloud integrated: works with Cloud Storage, Cloud Logging, Cloud Monitoring, and more.
Minute-by-minute billing: while active, Dataproc clusters are billed minute by minute.
Preemptible VMs: Dataproc clusters can make use of low-cost preemptible Compute Engine VMs.
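For illustration (bucket paths are placeholders), a PySpark job like the one below runs unmodified on a Dataproc cluster and can read and write Cloud Storage directly; it would typically be submitted with `gcloud dataproc jobs submit pyspark`.

```python
# wordcount.py - a minimal PySpark job that runs unmodified on Dataproc.
# The gs:// paths are placeholders; Dataproc clusters read and write
# Cloud Storage directly through the GCS connector.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

counts = (sc.textFile("gs://my-bucket/input/*.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("gs://my-bucket/output/wordcount")
sc.stop()
```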
19
Cloud Dataproc
Demonstrably more cost-effective
Source: Michael Li & Ariel M'ndange-Pfupfu on O'Reilly: https://www.oreilly.com/ideas/spark-comparison-aws-vs-gcp
20
[Capture → Process → Store → Analyze → Use architecture diagram repeated from slide 8]
Serverless platform, auto-optimized usage, across the entire data lifecycle
21
Google Cloud Bigtable
Overview: the Consolidated Audit Trail (CAT), a data repository of all equities and options orders, quotes, and events.
Challenges: process the CAT and organize 100 billion market events into an “order lifecycle” within a 4-hour window; store 6 years (~30 PB) of data; use Cloud Bigtable to process and run queries while tolerating volume increases.
Scale: 6 billion market events written per hour, with bursts of 10 billion per hour · 1.7 gigabytes per second · 6–10 terabytes per hour.
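As a rough sketch of what individual writes look like with the Cloud Bigtable Python client (project, instance, table, and column-family names are placeholders; an ingest at CAT scale would batch mutations, typically from Dataflow):

```python
# Sketch: writing a market event with the Cloud Bigtable Python client.
# Project, instance, table, and column-family names are placeholders.
import time
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("market-events")
table = instance.table("order-lifecycle")

# Keying rows by order ID plus event time keeps an order's lifecycle in one
# contiguous range, so it can be read back with a single row-range scan.
row_key = b"ORD#0000123#%d" % int(time.time() * 1e6)
row = table.direct_row(row_key)
row.set_cell("events", b"type", b"NEW_ORDER")
row.set_cell("events", b"symbol", b"GOOG")
row.commit()
```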
23
[Capture → Process → Store → Analyze → Use architecture diagram repeated from slide 8]
Serverless platform, auto-optimized usage, across the entire data lifecycle
24
– Mattias P Johansson, Software Engineer, Spotify

“With Google Cloud Platform, we benefitted by having a virtual supercomputer on demand, without having to deal with all the usual space, power, cooling and networking issues. Just a few years ago, we would have needed to use the largest supercomputers on the planet to do what we’re now able to do with Google.”
– Mark Johnson, CEO, Descartes Labs

“Right at the start of the partnership we were able to reduce time to insight from 96 hours to 30 minutes by using BigQuery.”
– Gary Sanders, Head of Digital Analytics, Lloyds Banking Group

“Everyone involved unanimously picked GCP. It came down to this: we believe the core technology is better.”
– Peter Bakkam, Platform Lead, Quizlet

Do you feel this way about your Data Warehouse?
25
BigQuery
Now with full support for the enterprise data warehouse:
● Standard SQL
● DML (Beta)
● ODBC connector
● Flat-rate pricing
● Identity and Access Management
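A small sketch of a Standard SQL query through the BigQuery Python client; the project ID is a placeholder and the table is a public sample dataset.

```python
# Sketch: running a Standard SQL query with the BigQuery Python client.
# The project ID is a placeholder; the table is a BigQuery public dataset.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

# Standard SQL is the default in recent client versions; set it explicitly here.
job_config = bigquery.QueryJobConfig(use_legacy_sql=False)
for row in client.query(query, job_config=job_config).result():
    print(row.name, row.total)
```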
26
Use your own data to train models
[Diagram: develop, model, and test in Cloud Datalab; data in Cloud Storage and Google BigQuery; preprocess with Cloud Dataflow; train and predict with Cloud Machine Learning]
Features:
● Fully Managed NewSQL database service with relational
semantics and global consistency
● Global replication options for low-latency reads across the
globe
● Consistent transactions across millions of rows
Use Cases:
● Large application workloads with very high write volumes or
large datasets (3 TB+)
● Geographically distributed control planes with global
consistency guarantees
Pricing:
● Pay per hour of node compute for throughput
● Pay per GB/month of data stored
Cloud Spanner
Google Cloud Platform
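A minimal sketch of an atomic read-write transaction with the Cloud Spanner Python client; the instance, database, table, and column names are placeholders.

```python
# Sketch: an atomic read-write transaction with the Cloud Spanner Python client.
# Instance, database, table, and column names are placeholders.
from google.cloud import spanner

client = spanner.Client(project="my-project")
instance = client.instance("orders-instance")
database = instance.database("orders-db")

def record_order(transaction):
    # Everything inside this function commits atomically and with external
    # consistency, even against a globally replicated database.
    results = transaction.execute_sql(
        "SELECT COUNT(*) FROM Orders WHERE CustomerId = @cid",
        params={"cid": "c-42"},
        param_types={"cid": spanner.param_types.STRING},
    )
    order_count = list(results)[0][0]

    transaction.insert(
        table="Orders",
        columns=("OrderId", "CustomerId", "OrderNumber"),
        values=[("o-1001", "c-42", order_count + 1)],
    )

database.run_in_transaction(record_order)
```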

Editor's Notes

• #3 A lot of the information is in pictures: 23 billion words in Wikipedia, 40 billion textual lines in Street View. Make the point that we didn't realize how valuable the pictures were originally; later we revisited them and extracted all this additional value. Willing to bet everyone in the audience has similarly valuable data.
• #17 What really tipped the scales towards Google for Spotify was their experience with Google's data platform and tools: “Good infrastructure isn't just about keeping things up and running, it's about making all of our teams more efficient and more effective, and Google's data stack does that for us in spades. From traditional batch processing with Dataproc, to rock-solid event delivery with Pub/Sub, to the nearly magical abilities of BigQuery, building on Google's data infrastructure provides us with a significant advantage where it matters the most. We have a large and complex backend, so this is a large and complex project that will take us some time to complete. We're looking forward to sharing our experiences with you as we go, so watch our engineering blog for more information on what we learn, build and break along the way. We're pretty excited about our Googley future and hope you'll find it interesting too.”