Big Data On Google Cloud
Tu Pham - IO extended 2017
CTO @ Dyno
A Data-as-a-Service company
Technologies: Java, Python, all kinds of databases, and cloud
platforms from Google, AWS, and Azure.
Interests: Cloud computing / architecture, technology
evolution, distributed systems.
Husband, Father, GDE, Open source contributor.
Tu Pham
Photo: Lars Kruse, Aarhus Universitet
About Dyno:

- Tech marketing & digital agency
For the past 17 years, Google has been building out the world’s fastest,
most powerful, highest quality cloud infrastructure on the planet.
Images by Connie Zhou
Google Cloud Platform is built on the same infrastructure that powers Google.
Images	by	Connie	Zhou
Google’s	Platform
“[Google's] ability to build, organize, and operate a huge network of
servers and fiber-optic cables with an efficiency and speed that rocks
physics on its heels. This is what makes Google Google: its physical
network, its thousands of fiber miles, and those many thousands of servers
that, in aggregate, add up to the mother of all clouds.”
- Wired
77
Peering locations
Yes,	We	Can	Power	that
Web Mobile Storage	&	Database
Big	Data Highly	Scalable	System	 Data	Mining	
Cloud	Platform
Google Cloud Platform
Organize the world’s information and make it universally accessible and useful.
Google’s Mission
Source: Boston Consulting Group: The Mobile Revolution: How Mobile Technologies Drive a Trillion-Dollar Impact
IDC, 2015
By 2020, there will be 8 billion connected smartphones, 2X more than today.
And 32 billion connected “IoT” devices, 6X more than today.
Exploring	the	Cloud
IaaS: Infrastructure-as-a-Service
PaaS: Platform-as-a-Service
SaaS: Software-as-a-Service
Google	Cloud	
Platform
Google	Compute	Engine
• Flexible Infrastructure
  • Custom VM Size
  • Online Disk Resizing
• Network
  • Internal Network
  • Firewall
  • Load Balancing
  • External IP Address
• Billing
  • Sustained Usage Discounts
  • Preemptible VM
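
The billing bullets can be made concrete with a small calculation. The sketch below assumes the tiered sustained use schedule that Google published for Compute Engine around the time of this talk, in which each successive quarter of the month is billed at 100%, 80%, 60%, and 40% of the base rate. The tier values and the function name are illustrative assumptions, not figures from the slides.

```python
# Assumed price multipliers for each successive quarter of the month.
TIER_RATES = (1.00, 0.80, 0.60, 0.40)


def effective_rate(usage_fraction):
    """Return the average price multiplier for a VM that ran for
    `usage_fraction` (0.0 to 1.0) of the month."""
    if not 0.0 <= usage_fraction <= 1.0:
        raise ValueError("usage_fraction must be between 0 and 1")
    if usage_fraction == 0.0:
        return 1.0  # nothing billed, so no discount applies
    billed = 0.0
    tier_width = 1.0 / len(TIER_RATES)
    for i, rate in enumerate(TIER_RATES):
        tier_start = i * tier_width
        # Portion of this tier that the VM actually used.
        used_in_tier = min(max(usage_fraction - tier_start, 0.0), tier_width)
        billed += used_in_tier * rate
    return billed / usage_fraction


if __name__ == "__main__":
    for fraction in (0.25, 0.5, 1.0):
        print(f"{fraction:.0%} of month -> {effective_rate(fraction):.0%} of base rate")
```

Under these assumed tiers, a VM that runs for the whole month pays about 70% of the base rate, which is where the often-quoted "up to 30%" discount comes from.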
App	Engine
•	Fully	Managed	Platform	
• Popular	Programming	Language	Support	
• Flexible	and	Scalable	Application	Storage	
• Auto-scaling	
• Versioning	and	Traffic	Splitting	
• Local	Developer	Tools	
•	Third-party	Frameworks	and	Extensions
Cloud Pub/Sub
• Global Presence
• Flexible Delivery Options
  • Pull
  • Push
• Data Reliability
• Flow Control
• Data Security And Protection
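
The difference between pull and push delivery, and the role of acknowledgements in data reliability, can be sketched with an in-memory stand-in. This is not the Cloud Pub/Sub client API: the `Topic` class and all of its methods are hypothetical, and the model is simplified to one pull subscription plus any number of push endpoints.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a topic; not the Cloud Pub/Sub client API."""

    def __init__(self):
        self._push_endpoints = []   # callables invoked on every publish
        self._pull_queue = deque()  # messages waiting for the pull subscriber
        self._unacked = {}          # message id -> payload, delivered but not acked
        self._next_id = 0

    def subscribe_push(self, endpoint):
        self._push_endpoints.append(endpoint)

    def publish(self, payload):
        msg_id = self._next_id
        self._next_id += 1
        self._pull_queue.append((msg_id, payload))
        for endpoint in self._push_endpoints:
            endpoint(payload)       # push: the service calls the subscriber
        return msg_id

    def pull(self, max_messages=1):
        """Pull: the subscriber asks for messages and must ack them later.
        `max_messages` acts as a simple form of flow control."""
        batch = []
        while self._pull_queue and len(batch) < max_messages:
            msg_id, payload = self._pull_queue.popleft()
            self._unacked[msg_id] = payload
            batch.append((msg_id, payload))
        return batch

    def ack(self, msg_id):
        self._unacked.pop(msg_id, None)

    def redeliver_unacked(self):
        """Queue unacknowledged messages again for another delivery attempt."""
        for msg_id, payload in sorted(self._unacked.items()):
            self._pull_queue.append((msg_id, payload))
        self._unacked.clear()
```

A message that is pulled but never acknowledged is delivered again, which is the basic mechanism behind at-least-once delivery.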
Cloud Dataflow
• Reliable & Consistent Processing
• Unified Programming Model
• Intelligent Work Scheduling
• Auto Scaling
• Monitoring
• Open Source
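
The idea behind a unified programming model is that one transform serves both batch and streaming input. The sketch below illustrates that idea in plain Python with fixed time windows. It is not the Dataflow or Apache Beam API; `windowed_counts` and `event_stream` are hypothetical names, and unlike a real streaming runner this function only returns once the input is exhausted.

```python
from collections import Counter


def windowed_counts(events, window_seconds=60):
    """Count events per key within fixed, non-overlapping time windows.

    `events` is any iterable of (timestamp_seconds, key) pairs, so a
    finite list (batch) and a generator (stream) go through the same code.
    Returns a dict mapping window start time to a Counter of keys.
    """
    windows = {}
    for timestamp, key in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[key] += 1
    return windows


def event_stream():
    """Hypothetical source that yields events one at a time."""
    yield (5, "click")
    yield (42, "view")
    yield (61, "click")
    yield (119, "click")
```

Calling `windowed_counts(list(event_stream()))` and `windowed_counts(event_stream())` gives the same result, because the transform does not care whether its input is bounded.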
Cloud Storage
• Versioning
• Static Sites
• Resumable Transfers
• Object Change Notifications
• TB scale
Cloud	SQL
• Fully	managed	
• Ease	of	Use	
• Highly	Reliable	
• Flexible	Charging	
• Security,	Availability,	Durability	
• Easy Migration & Data Portability
• Optimized MySQL versions
Google BigQuery
• Fully	Managed	Big	Data	Analytics	Service	
• Support	SQL		
• Fast	
• Scalable	
• Flexible	and	Familiar	
• Security	and	Reliability	
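
To show the kind of familiar SQL that such a service accepts, here is a small aggregation query. Running against BigQuery needs a project and credentials, so this sketch executes the query on an in-memory SQLite database from the Python standard library as a stand-in. The `pageviews` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# A plain GROUP BY aggregation: total views per language, largest first.
QUERY = """
SELECT language, SUM(views) AS total_views
FROM pageviews
GROUP BY language
ORDER BY total_views DESC
"""


def run_demo():
    """Create a tiny hypothetical table, run QUERY, and return the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pageviews (language TEXT, title TEXT, views INTEGER)"
    )
    conn.executemany(
        "INSERT INTO pageviews VALUES (?, ?, ?)",
        [("en", "Google", 120), ("vi", "Google", 45), ("en", "Cloud", 30)],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

The query text itself uses only standard SQL constructs; only the execution engine is swapped out here.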
Cloud Dataproc
• Includes
  • Apache Hadoop
  • Apache Pig
  • Apache Hive
  • Apache Spark
• Fast And Scalable Data Processing
• Flexible Virtual Machines
• Resizable Cluster
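
The processing model that Hadoop popularised, and that the tools listed above build on, can be sketched as a word count with explicit map, shuffle, and reduce steps. This is plain Python for illustration only, not the Hadoop or Spark API, and all function names are hypothetical. On a real cluster each phase runs in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

For example, `word_count(["big data", "Big Query"])` counts "big" twice because the map step lowercases each word before emitting it.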
Cloud Datalab
• Powerful	Data	Exploration	
• Scalable	
• Data	Management	
• Visualization	
• Open	Source	(Jupyter)	
Google’s Data Services for everyone
A common configuration: draw conclusions
[Figure: inputs are events, metrics, etc. (stream) and raw logs, files, assets, Google Analytics data, etc. (batch); outputs are Cloud Datalab, visualization and BI, applications and reports, and co-workers.]
Confidential +Proprietary
A	serverless big	data	stack	that	
scales	automatically
10+	Years	of	Tackling	Big	Data			Problems
[Figure: timeline with axis years 2002, 2004, 2005, 2006, 2008, 2010, 2012, 2014, 2015.
Google papers: GFS, MapReduce, BigTable, Dremel, Flume Java, Millwheel, PubSub.
Open source: Apache Beam, TensorFlow.
Google Cloud products: BigQuery, Pub/Sub, Dataflow, Bigtable.]
Transform Data into Actions
[Figure: stages for turning data into actions.
Sources: mobile apps; sensors and devices; web apps.
Stages: data ingestion (messaging, logs); databases and storage; data preparation and processing (stream processing, batch processing, data preparation); analytics; exploration and collaboration; advanced analytics and intelligence (development environment for machine learning, pre-trained machine learning models).
Other labels: relational, key-value, document, SQL, wide column, object, federated query, data catalog, data exploration, data visualization.
Users: developers; data scientists; business analysts.]
Transform Data into Actions
Sources: Mobile apps, Sensors and devices, Web apps
Data Ingestion: Cloud Pub/Sub, App Engine
Databases/Storage: Cloud SQL, Cloud Bigtable, Cloud Datastore, Cloud Storage
Data Preparation & Processing: Cloud Dataflow, Cloud Dataproc
Analytics: Google BigQuery, Google Analytics 360, Cloud Dataproc, Google Drive
Exploration & Collaboration: Google BigQuery, Cloud Datalab, Google Analytics 360, Cloud Dataproc
Advanced Analytics & Intelligence: Cloud Machine Learning, Translate API, Vision API, Speech API
Users: Developers, Data scientists, Business analysts
Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.
Google Cloud Dataproc
Traditional Spark and Hadoop clusters
Google Cloud Dataproc
Google Cloud Dataproc - under the hood
[Figure: Dataproc jobs (Spark, PySpark, Spark SQL, MapReduce, Pig, Hive) run as applications on the Dataproc cluster, which consists of Spark & Hadoop OSS and the Cloud Dataproc Agent and connects to Google Cloud services and other GCP products. Remaining labels: Dataproc Jobs, Features, Data Outputs.]
Easy, fast, cost-effective
Fast
Things take seconds to minutes, not hours or weeks
Easy
Be an expert with your data, not your data infrastructure
Cost-effective
Pay for exactly what you use
Running Hadoop on Google Cloud
[Figure: four ways to run Hadoop compared side by side: on premise, vendor Hadoop, bdutil (free OSS toolkit), and Dataproc (managed Hadoop). Each column stacks the same layers: Custom Code, Monitoring/Health, Dev Integration, Scaling (Manual Scaling in one column), Job Submission, GCP Connectivity, Deployment, and Creation. Each layer is marked as either Google managed or customer managed.]
Cloud Dataproc - integrated
Cloud Dataproc is natively integrated with several Google Cloud Platform
products as part of an integrated data platform.
Storage
Operations
Data
Where Cloud Dataproc fits into GCP
Google Bigtable (HBase)
Google BigQuery (Analytics, Data warehouse)
Stackdriver Logging (Logging Ops.)
Google Cloud Dataflow (Batch/Stream Processing)
Google Cloud Storage (HCFS/HDFS)
Stackdriver Monitoring (Monitoring)
Scales automatically
No setup or administration
Stream up to 100,000 rows/sec
Easily integrates with third-party software
Google BigQuery
makes	complex	data	analysis	simple
Google BigQuery Performance Example
Running an inefficient regular expression over 100 billion rows in
less than 60 seconds
Source: https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query
Google	BigQuery
The	Power	of	Google	Dremel	for	everyone
Storage Compute
Fast Ingest
Query
Terabit Network
Before: 1000-core Hadoop cluster = 2.5 hours
After: making ad hoc queries with BigQuery < 5 min
● 500+	Games
● Hundreds	of	Analysts	
● Terabytes	of	Data	Daily
“Right at the start of the partnership we were able to reduce time to
insight from 96 hours to 30 minutes by using BigQuery, allowing us to
react in real time to customer needs and provide better service.”
Gary Sanders
Head of the bank's digital analytics function
https://www.finextra.com/newsarticle/28566/lloyds-partners-google-on-data-analytics
Big Data Challenges At Dyno
- Multi-TB data warehouse
- Raw input: more than 100 GB of new raw data per day (structured
& unstructured)
- 65 online data sources
- Unlimited offline data sources
- Facing data quality problems every day
- Moving from user information & behavior to user interest &
intention
- Managing a high-performance, cost-effective system
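
A quick back-of-the-envelope estimate shows why the daily intake matters for warehouse sizing. The sketch below uses the 100 GB per day figure from the list as a lower bound; the decimal conversion of 1000 GB per TB and the function name are assumptions, and the result ignores compression, retention policies, and derived tables.

```python
GB_PER_DAY = 100       # lower bound taken from the list above
DAYS_PER_YEAR = 365
GB_PER_TB = 1000       # decimal terabytes (assumption)


def yearly_growth_tb(gb_per_day=GB_PER_DAY):
    """Return the raw data added per year, in TB, for a given daily intake."""
    return gb_per_day * DAYS_PER_YEAR / GB_PER_TB


if __name__ == "__main__":
    print(f"At least {yearly_growth_tb():.1f} TB of new raw data per year")
```

At this rate the raw input alone adds at least 36.5 TB per year, which is consistent with describing the warehouse as multi-TB.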
JOIN THE FLIGHT - WE ARE HIRING
IO Extended 2017
Twitter: @phamptu
Email: tu@dyno.vn
Frontend Developer: goo.gl/EY8RvV
Backend Developer: goo.gl/BnmmK6
