Critical Breakthroughs and Technical Challenges in Big Data Driven Innovation
Paolo Spreafico
Head of EMEA Data Solution Engineers, Google Cloud Platform
Google Cloud Platform 2
Google’s Mission
Organize the world’s information and make it universally accessible and useful.
#cloudconf2016
By 2020, there will be 8 billion connected smartphones, 2X more than today, and 32 billion connected “IoT” devices, 6X more than today.
Sources: Boston Consulting Group, The Mobile Revolution: How Mobile Technologies Drive a Trillion-Dollar Impact; IDC, 2015
Building what’s next 6
10x increase in data (4ZB to 45ZB)
35B connected devices
40% of data “touched” by the cloud
Source: IDC
Organisation
Data Questions
Technology
Data is key (among others)
“Companies in the top third of their industry in the use of data-driven decision making were, on average, 5% more productive and 6% more profitable than their competitors.”
Andrew McAfee and Erik Brynjolfsson, MIT
What does Cloud 3.0 look like?
Cloud 1.0: Colocation. Your kit, someone else’s building. Yours to manage.
Cloud 2.0: Today's Cloud: Virtualized Data Centers. Standard virtual kit, for rent. Still yours to manage.
Cloud 3.0: True, on-demand cloud. An actual, global elastic cloud. Invest your energy in great apps.
Other labels on the slide: Storage, Processing, Memory, Network; Single-node computing; “Some assembly required”; Automation; Messaging, Big Data, Containers, NoSQL; Google Cloud Platform Vision
http://googleasiapacific.blogspot.se/2015/06/growing-our-data-center-in-singapore.html
For the past 17 years, Google has
been building out the fastest, most
powerful, highest quality cloud
infrastructure on the planet.
Our Network
Edge locations in virtually every country in the world
77 peering locations
10+ Years of Tackling Big Data Problems
[Timeline figure, 2002 to 2015]
Google Papers: GFS, MapReduce, BigTable, Dremel, Flume Java, Millwheel, PubSub
Open Source: Apache Beam, Tensorflow
Google Cloud Products: BigQuery, Pub/Sub, Dataflow, Bigtable
Google’s Data Services for everyone
Confidential + Proprietary
The Google Cloud data toolbox
Storage and Databases
● Cloud Storage
● Cloud SQL
● Cloud Bigtable
● Cloud Datastore
Big Data and Analytics
● BigQuery
● Cloud Pub/Sub
● Cloud Dataflow
● Cloud Dataproc
● Cloud Datalab
Machine Learning
● Cloud Machine Learning
● Cloud Translate API
● Cloud Vision API
● Cloud Speech API
A serverless big data stack that scales automatically
A common configuration: draw conclusions
[Diagram. Inputs: events, metrics, etc. (stream); raw logs, files, assets, Google Analytics data etc. (batch). Outputs: applications and reports, Cloud Datalab, visualization and BI, co-workers.]
Complexities of Big Data Processing
Typical Big Data Processing (Time to Understanding):
● Programming
● Resource provisioning
● Performance tuning
● Monitoring
● Reliability
● Deployment & configuration
● Handling growing scale
● Utilization improvements
Spend Time on ‘What’ not ‘How’
Big Data Processing with Google Cloud Platform (Time to Understanding):
● Programming
● More time to dig into your data
Cloud 3.0 Big Data Lifecycle
Stages: Capture, Store, Process (batch and stream), Analyze, Use
● Capture: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub
● Store: BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files)
● Process: Cloud Dataflow, Cloud Dataproc
● Analyze: BigQuery Analytics (SQL), Cloud Datalab, real-time analytics (Cloud Dataflow, Cloud ML)
● Use: data scientists, analysts, smart apps, real-time dashboard, real-time alerts, Data Studio
● Also shown: Cloud Monitoring, Catalog & Data Lifecycle Automation
Emerging Big Data Challenges
● Real-time data ingestion
● Machine learning at scale
● Batch or streaming?
● Analytics at the speed of thought
Breakthrough #1: Batch or Streaming? Why do you have to choose?
“We don’t really use MapReduce anymore”
Urs Hölzle, SVP Technical Infrastructure, Google
A common configuration: capturing input
● Events, metrics, etc.: Cloud Pub/Sub. Reliable, many-to-many, asynchronous messaging
● Raw logs, files, assets, Google Analytics data etc.: Cloud Storage. Powerful, simple and cost-effective object storage
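To make "many-to-many, asynchronous messaging" concrete, here is a minimal in-memory sketch in Python. It is not the Cloud Pub/Sub client library: the Topic class, the subscription names and the message fields are all hypothetical, and delivery guarantees, acknowledgements and retries are deliberately left out.

```python
from collections import deque


class Topic:
    """Hypothetical in-memory stand-in for a topic with independent subscriptions."""

    def __init__(self, name):
        self.name = name
        # subscription name -> queue of messages not yet pulled by that subscriber
        self._subscriptions = {}

    def subscribe(self, subscription):
        """Register a subscription; it only sees messages published afterwards."""
        self._subscriptions.setdefault(subscription, deque())

    def publish(self, message):
        """Fan the message out to every subscription without waiting for consumers."""
        for queue in self._subscriptions.values():
            queue.append(message)

    def pull(self, subscription, max_messages=10):
        """Return up to max_messages pending messages for one subscription."""
        queue = self._subscriptions[subscription]
        batch = []
        while queue and len(batch) < max_messages:
            batch.append(queue.popleft())
        return batch


def demo():
    topic = Topic("events")
    topic.subscribe("stream-processing")
    topic.subscribe("archive")
    # Two independent publishers write to the same topic.
    topic.publish({"source": "web", "metric": "clicks", "value": 3})
    topic.publish({"source": "mobile", "metric": "clicks", "value": 5})
    # Each subscription receives every message, at its own pace.
    return topic.pull("stream-processing"), topic.pull("archive", max_messages=1)
```

Publishers never block on subscribers, and each subscription drains its own queue, which is the decoupling the slide describes.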
A common configuration: process and transform
● Cloud Dataflow: data processing engine for batch and stream processing. Takes events, metrics, etc. (stream) and raw logs, files, assets, Google Analytics data etc. (batch)
● Cloud Dataproc: managed Spark and Hadoop (batch)
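The "batch or streaming, why choose?" idea can be sketched in plain Python: one transform, written once, applied to a finite list (batch) and to a generator that yields events as they arrive (stream). This is only an illustration, not the Cloud Dataflow or Apache Beam API. The function name, the event shape and the 60-second window are assumptions, and the sketch assumes events arrive in timestamp order, which a real streaming engine cannot rely on.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, key); hypothetical event shape


def count_per_window(
    events: Iterable[Event], window_seconds: int = 60
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts per key) for fixed, non-overlapping windows.

    Works on any iterable, so the same logic serves a list (batch) or a
    generator (stream). Assumes events arrive in timestamp order.
    """
    current_start = None
    counts: Dict[str, int] = {}
    for timestamp, key in events:
        start = timestamp - timestamp % window_seconds
        if current_start is not None and start != current_start:
            # The window is complete: emit it and start a new one.
            yield current_start, counts
            counts = {}
        current_start = start
        counts[key] = counts.get(key, 0) + 1
    if current_start is not None:
        yield current_start, counts


def event_stream() -> Iterator[Event]:
    """Stand-in for an unbounded source; here it simply yields a few events."""
    for event in [(0, "a"), (10, "b"), (59, "a"), (61, "a"), (125, "b")]:
        yield event


batch_result = list(count_per_window(list(event_stream())))
stream_result = list(count_per_window(event_stream()))
```

Both calls produce the same windows, and in the streaming case each window is emitted as soon as the first event of the next window is seen.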
A common configuration: analyze and store
● BigQuery: extremely fast and cheap on-demand analytics engine
● Bigtable: high performance NoSQL database for large workloads
Inputs arrive as stream and batch: events, metrics, etc.; raw logs, files, assets, Google Analytics data etc.
A common configuration: draw conclusions
[Diagram. Inputs: events, metrics, etc. (stream); raw logs, files, assets, Google Analytics data etc. (batch). Outputs: applications and reports, Cloud Datalab, visualization and BI, co-workers.]
Breakthrough #2: Real-time data ingestion (and at scale)
Google confidential │ Do not distribute
Overview:
Data to process: data in the Consolidated Audit Trail (CAT), a data repository of all equities and options orders, quotes, and events
Challenges:
● Process the CAT and organize 100 billion market events into an “order lifecycle” in a 4 hour window
● Store 6 years (~30PB) of data
Cloud Bigtable to process and run queries and tolerate volume increases
Figures shown on the slide:
● 6 billion market events written per hour
● 1.7 GB per second
● 6 TB per hour
● Bursts: 10 billion written per hour, 1.7 gigabytes per second, 10 terabytes per hour
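As a quick sanity check on the figures above, the per-second and per-hour numbers can be related with a few lines of Python. This is a back-of-the-envelope sketch. It assumes decimal units (1 TB = 1000 GB), and the helper names are made up for illustration.

```python
SECONDS_PER_HOUR = 3600
GB_PER_TB = 1000  # assumption: decimal units


def per_second(per_hour: float) -> float:
    """Convert an hourly rate into a per-second rate."""
    return per_hour / SECONDS_PER_HOUR


def tb_per_hour(gb_per_second: float) -> float:
    """Convert GB per second into TB per hour."""
    return gb_per_second * SECONDS_PER_HOUR / GB_PER_TB


# 6 billion events per hour, expressed per second.
events_per_second = per_second(6e9)

# 1.7 GB per second, expressed in TB per hour.
sustained_tb_per_hour = tb_per_hour(1.7)

# 10 TB per hour (the burst figure), expressed in GB per second.
burst_gb_per_second = per_second(10 * GB_PER_TB)

# 100 billion events in a 4 hour window, expressed per hour.
window_events_per_hour = 100e9 / 4
```

Under these assumptions, 1.7 GB per second works out to about 6.1 TB per hour, which matches the 6 TB figure. 6 billion events per hour is roughly 1.7 million events per second. The 10 TB per hour burst figure corresponds to roughly 2.8 GB per second.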
https://www.youtube.com/watch?v=fqOpaCS117Q
Breakthrough #3: Analytics at the speed of thought (and at scale)
Google BigQuery makes complex data analysis simple
● Scales automatically
● No setup or administration
● Stream up to 100,000 rows per second
● Easily integrates with third-party software
Google BigQuery Performance Example
Running an inefficient regular expression over 100 billion rows in less than 60 seconds
Source: https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query
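A scan like this is only feasible when the work is spread over many workers. The toy Python sketch below shows the scatter-gather shape of such a query: split the rows into shards, count regex matches per shard, then sum the partial counts. It is an assumption-laden illustration, not how BigQuery is implemented. The function names, the shard count and the sample rows are hypothetical, and Python threads here show the structure rather than a real speed-up.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def count_matches(shard, pattern):
    """Leaf worker: count the rows in one shard that match the pattern."""
    regex = re.compile(pattern)
    return sum(1 for row in shard if regex.search(row))


def parallel_regex_count(rows, pattern, num_shards=4):
    """Scatter rows across shards, scan each shard, then gather the partial counts."""
    shards = [rows[i::num_shards] for i in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        partial_counts = pool.map(count_matches, shards, [pattern] * num_shards)
        return sum(partial_counts)


sample_rows = ["G00gle", "google", "goooogle", "bing"]
match_count = parallel_regex_count(sample_rows, r"g[o0]+gle")
```

At the slide's stated rate, 100 billion rows in under 60 seconds means more than about 1.6 billion rows scanned per second in aggregate.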
Before: 1000-core Hadoop cluster = 2.5 hours
After: making ad hoc queries with BigQuery < 5 min
● 500+ Games
● Hundreds of Analysts
● Terabytes of Data Daily
Google BigQuery
The Power of Google Dremel for everyone
[Diagram labels: Storage, Compute, Fast Ingest, Query, Terabit Network]
“Right at the start of the partnership we were able to reduce time to insight from 96 hours to 30 minutes by using BigQuery, allowing us to react in real time to customer needs and provide better service.”
Gary Sanders, Head of the bank's digital analytics function
https://www.finextra.com/newsarticle/28566/lloyds-partners-google-on-data-analytics
Breakthrough #4: Machine learning for everyone
"Machine learning is a core, transformative way by which we're rethinking everything we're doing … we're thoughtfully applying it across all our products, be it search, ads, YouTube or Play."
Applications that can see, hear and understand
TensorFlow
● Deep Learning technology currently powering over 100 Google services
● Generalizable to vision, sound, text, video and other data
● Runs on CPUs or GPUs, desktop, server, or mobile computing platforms
● Distributed via Apache 2.0 OSS license
Use your own data to train models
What Cloud Machine Learning Can Do
● Fully managed service
● Train using a custom TensorFlow graph
● Batch and online predictions, at scale
● Integrated Datalab experience
● Regression and classification tasks
Fully trained, easy to use Machine Learning models
● Cloud Translate API
● Cloud Vision API
● Cloud Speech API
Cloud Vision API
● Label Detection
● Landmark Detection
● OCR
● Logo Detection
● Face Detection
● Explicit Content Detection
{"landmarkAnnotations": [{
"description": "Arc de Triomphe",
"locations": [{"latLng": {
"latitude": 48.873667,
"longitude": 2.295134}}],
"score": 0.94231218
}]}
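A response of this shape can be read with the Python standard library alone. The sketch below assumes a well-formed version of the slide's snippet, with the annotation wrapped in an object and straight quotes throughout. The function name and the min_score threshold are illustrative choices rather than part of the Vision API, and a real API response may wrap this payload in further fields.

```python
import json

# Well-formed version of the landmark snippet shown on the slide (assumption:
# each annotation is a JSON object inside the landmarkAnnotations array).
SAMPLE_RESPONSE = """
{"landmarkAnnotations": [{
  "description": "Arc de Triomphe",
  "locations": [{"latLng": {
    "latitude": 48.873667,
    "longitude": 2.295134}}],
  "score": 0.94231218
}]}
"""


def landmarks(response_text, min_score=0.5):
    """Return (description, latitude, longitude, score) for confident detections."""
    data = json.loads(response_text)
    results = []
    for annotation in data.get("landmarkAnnotations", []):
        score = annotation.get("score", 0.0)
        if score < min_score:
            continue  # skip low-confidence detections
        for location in annotation.get("locations", []):
            lat_lng = location["latLng"]
            results.append(
                (annotation["description"], lat_lng["latitude"], lat_lng["longitude"], score)
            )
    return results


detected = landmarks(SAMPLE_RESPONSE)
```

Filtering on the score field is one simple way to keep only detections the model is reasonably sure about.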
Cloud Speech API
● Recognizes over 80 languages and variants
● Can return text in real time
● Highly accurate, even in noisy environments
● Access from any device
● Powered by Google’s machine learning
Speech API Demo
“What are you sinking about?”
https://www.google.com/intl/en/chrome/demos/speech.html
Machine Learning Use Cases
Structured Data
Classification/Regression
● Customer Churn Analysis
● Product Diagnostics
● Forecasting
Recommendation
● Content Personalization
● Product X-Sells/Up-sells
Anomaly Detection
● Fraud Detection
● Asset Sensor Diagnostics
● Log Metric Anomalies
Unstructured Data
Image Analytics
● Identify damaged shipments
● Explicit Content Classification
Text Analytics
● Call Center log analysis
● Language Identification
● Topic Classification
● Sentiment Analysis
cloud.google.com
Google’s Approach to Cloud Security & Compliance
● Tens of thousands of custom-built, homogeneous systems
● Dozens of datacenters for redundancy
● Data encryption in transit and at rest
● Secure software development process
● External security verifications
● 500+ security engineers
● 160+ academic research papers on security
● Vulnerability Reward Program
We store our own data in this environment
Certifications

Product    SSAE-16 SOC 1  SSAE-16 SOC 2  SSAE-16 SOC 3  ISO 27001  HIPAA (BAA)  PCI DSS v3.0  FISMA             FedRamp
GAE        Complete       Complete       Complete       Complete   H2 15        Complete      FISMA (Moderate)  H2 15
GCS        Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
GCE        Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Datastore  Complete       Complete       Complete       Complete   H2 15        Complete      n/a               H2 15
BigQuery   Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Cloud SQL  Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Genomics   Complete       Complete       Complete       Complete   Complete     n/a           n/a               H2 15
Apps       Complete       Complete       Complete       Complete   Complete     n/a           GAFG only         H2 15
Demo: Predicting the NYSE daily outcome
https://cloud.google.com/solutions/machine-learning-with-financial-time-series-data
Get more info: Google Cloud for Financial Services
https://cloud.google.com/solutions/finserv/
