Serverless Data Architecture at scale on Google Cloud Platform

Serverless Data Architecture at scale on
Google Cloud Platform
Lorenzo Ridi
Machine Learning/Data Science Meetup
Rome, 02-02-2017

I’ve been a
Research Fellow @UniFI
I am a
Software Engineer @Noovle
I am a
Qualified Developer
I am a
Authorized Trainer
Hi, I’m Lorenzo!

Google’s Mission
Organize the world’s information and make
it universally accessible and useful.
“
”

2002 2004 2006 2008 2010 2012 2014 2016
GFS
MapReduce TensorFlow
BigTable
Dremel
Colossus
Flume
Megastore
Spanner
Millwheel
PubSub
F1
Google’s Data Research

2002 2004 2006 2008 2010 2012 2014 2016
ML
PubSub
DataFlow
DataStore
DataFlow
Cloud Storage
BigQuery
BigTable
DataProc
Cloud Storage
Google’s Data Products

GA
Cloud
Natural
Language
BetaGAGA
Cloud
Speech
Cloud
Translate
Cloud
Vision
Stay tuned...
Fully trained ML models from Google Cloud that allow a general developer to take
advantage of rich machine learning capabilities with simple REST based services.
Pre-Trained Machine Learning Models

tensorflow.org
github.com/tensorflow
Open Source Software Library for
Machine Learning.
Cloud Machine Learning
Managed service that enables
you to easily build machine
learning models, that work on any
type of data, of any size.
Use your own data to train models

Cracking Black Friday
Adding Machine Learning to a serverless
data analysis pipeline

Black Friday (ˈblæk fraɪdɪ)
noun
The day following Thanksgiving Day in the
United States. Since 1932, it has been
regarded as the beginning of the Christmas
shopping season.

Black Friday in the US
2012 - 2016
source: Google Trends, November 23rd 2016

Black Friday in Italy
2012 - 2016
source: Google Trends, November 23rd 2016

What are we doing
Processing
+ analytics
Tweets about
black friday
insights

Pub/Sub
Container
Engine
(Kubernetes)
How we’re gonna do it

What is Google Cloud Pub/Sub?
● Google Cloud Pub/Sub is a
fully-managed real-time
messaging service.
○ Guaranteed delivery
■ “At least once” semantics
○ Reliable at scale
■ Messages are replicated in
different zones

From Twitter to Pub/Sub
$ gcloud beta pubsub topics create blackfridaytweets
Created topic [blackfridaytweets].
SHELL

?
Pub/Sub Topic
Subscription A
Subscription B
Subscription C
Consumer A
Consumer B
Consumer C

● Simple Python application using the TweePy library
# somewhere in the code, track a given set of keywords
stream = Stream(auth, listener)
stream.filter(track=['blackfriday', [...]])
[...]
# somewhere else, write messages to Pub/Sub
for line in data_lines:
pub = base64.urlsafe_b64encode(line)
messages.append({'data': pub})
body = {'messages': messages}
resp = client.projects().topics().publish(
topic='blackfridaytweets', body=body).execute(num_retries=NUM_RETRIES)
PYTHON

App
+
Libs

VM
App
+
Libs

App
+
Libs
Container

App
+
Libs
Container
FROM google/python
RUN pip install --upgrade pip
RUN pip install pyopenssl ndg-httpsclient pyasn1
RUN pip install tweepy
RUN pip install --upgrade google-api-python-client
RUN pip install python-dateutil
ADD twitter-to-pubsub.py /twitter-to-pubsub.py
ADD utils.py /utils.py
CMD python twitter-to-pubsub.py
DOCKERFILE

App
+
Libs
Container Pod

What is Kubernetes (K8S)?
● An orchestration tool for managing a
cluster of containers across multiple
hosts
○ Scaling, rolling upgrades, A/B testing, etc.
● Declarative – not procedural
○ Auto-scales and self-heals to desired
state
● Supports multiple container runtimes,
currently Docker and CoreOS Rkt
● Open-source: github.com/kubernetes

App
+
Libs
Container Pod
apiVersion: v1
kind: ReplicationController
metadata:
[...]
Spec:
replicas: 1
template:
metadata:
labels:
name: twitter-stream
spec:
containers:
- name: twitter-to-pubsub
image: gcr.io/codemotion-2016-demo/pubsub_pipeline
env:
- name: PUBSUB_TOPIC
value: ...
YAML

App
+
Libs
Container Pod Node

Node
Pod A Pod B

Node 1
Node 2

$ gcloud container clusters create codemotion-2016-demo-cluster
Creating cluster cluster-1...done.
Created [...projects/codemotion-2016-demo/.../clusters/codemotion-2016-demo-cluster].
$ gcloud container clusters get-credentials codemotion-2016-demo-cluster
Fetching cluster endpoint and auth data.
kubeconfig entry generated for cluster-1.
$ kubectl create -f ~/git/kube-pubsub-bq/pubsub/twitter-stream.yaml
replicationcontroller “twitter-stream” created.
SHELL

Pub/Sub
Kubernetes

Pub/Sub
Kubernetes
Dataflow

Pub/Sub
Kubernetes
Dataflow
BigQuery

What is Google Cloud Dataflow?
● Cloud Dataflow is a collection
of open source SDKs to
implement parallel processing
pipelines.
○ same programming model for
streaming and batch pipelines
● Cloud Dataflow is a managed
service to run parallel
processing pipelines on

What is Google BigQuery?
● Google BigQuery is a fully-
managed Analytic Data
Warehouse solution allowing
real-time analysis of Petabyte-
scale datasets.
● Enterprise-grade features
○ Batch and streaming (100K
rows/sec) data ingestion
○ JDBC/ODBC connectors
○ Rich SQL-2011-compliant query
language
○ Supports updates and deletes
new!
new!

From Pub/Sub to BigQuery
Pub/Sub Topic
Subscription
Read tweets
from
Pub/Sub
Format
tweets for
BigQuery
Write tweets
on BigQuery
BigQuery
Table
Dataflow Pipeline

● A Dataflow pipeline is a Java program.
// TwitterProcessor.java
public static void main(String[] args) {
Pipeline p = Pipeline.create();
PCollection<String> tweets = p.apply(PubsubIO.Read.topic("...blackfridaytweets"));
PCollection<TableRow> formattedTweets = tweets.apply(ParDo.of(new DoFormat()));
formattedTweets.apply(BigQueryIO.Write.to(tableReference));
p.run();
}
JAVA

● A Dataflow pipeline is a Java program.
// Do Function (to be used within a ParDo)
private static final class DoFormat extends DoFn<String, TableRow> {
private static final long serialVersionUID = 1L;
@Override
public void processElement(DoFn<String, TableRow>.ProcessContext c) {
c.output(createTableRow(c.element()));
}
}
// Helper method
private static TableRow createTableRow(String tweet) throws IOException {
return JacksonFactory.getDefaultInstance().fromString(tweet, TableRow.class);
}
JAVA

● Use Maven to build, deploy or update the Pipeline.
$ mvn compile exec:java -Dexec.mainClass=it.noovle.dataflow.TwitterProcessor
-Dexec.args="--streaming"
[...]
INFO: To cancel the job using the 'gcloud' tool, run:
> gcloud alpha dataflow jobs --project=codemotion-2016-demo cancel 2016-11-
19_15_49_53-5264074060979116717
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18.131s
[INFO] Finished at: Sun Nov 20 00:49:54 CET 2016
[INFO] Final Memory: 28M/362M
[INFO] ------------------------------------------------------------------------
SHELL

● You can monitor your pipelines from Cloud Console.

● Data start flowing into BigQuery tables. You can run queries
from the CLI or the Web Interface.

Pub/Sub
Kubernetes
Dataflow
BigQuery
Data
Studio

Pub/Sub
Kubernetes
Dataflow
BigQuery
Natural
Language
API
Data
Studio

Sentiment Analysis with Natural Language API
Polarity: [-1,1]
Magnitude: [0,+inf)
Text

Polarity: [-1,1]
Magnitude: [0,+inf)
Text
sentiment = polarity x magnitude

Pub/Sub Topic
Read tweets
from
Pub/Sub
Write tweets
on BigQuery BigQuery
Tables
Dataflow Pipeline
Filter and
Evaluate
sentiment
Format
tweets for
BigQuery
Write tweets
on BigQuery
Format
tweets for
BigQuery

● We just add the additional necessary steps.
public static void main(String[] args) {
Pipeline p = Pipeline.create();
PCollection<String> tweets = p.apply(PubsubIO.Read.topic("...blackfridaytweets"));
PCollection<String> sentTweets = tweets.apply(ParDo.of(new DoFilterAndProcess()));
PCollection<TableRow> formSentTweets = sentTweets.apply(ParDo.of(new DoFormat()));
formSentTweets.apply(BigQueryIO.Write.to(sentTableReference));
PCollection<TableRow> formattedTweets = tweets.apply(ParDo.of(new DoFormat()));
formattedTweets.apply(BigQueryIO.Write.to(tableReference));
p.run();
}
JAVA
PCollection<String> sentTweets = tweets.apply(ParDo.of(new DoFilterAndProcess()));
PCollection<TableRow> formSentTweets = sentTweets.apply(ParDo.of(new DoFormat()));
formSentTweets.apply(BigQueryIO.Write.to(sentTableReference));

● The update process preserves all in-flight data.
$ mvn compile exec:java -Dexec.mainClass=it.noovle.dataflow.TwitterProcessor
-Dexec.args="--streaming --update --jobName=twitterprocessor-lorenzo-1107222550"
[...]
INFO: To cancel the job using the 'gcloud' tool, run:
> gcloud alpha dataflow jobs --project=codemotion-2016-demo cancel 2016-11-
19_15_49_53-5264074060979116717
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18.131s
[INFO] Finished at: Sun Nov 20 00:49:54 CET 2016
[INFO] Final Memory: 28M/362M
[INFO] ------------------------------------------------------------------------
SHELL

Pub/Sub
Kubernetes
Dataflow
BigQuery
Data
Studio
We did it!
Natural
Language
API

Polarity: -1.0
Magnitude: 1.5
Polarity: -1.0
Magnitude: 2.1

Thank you!
lorenzo.ridi@noovle.it

Serverless Data Architecture at scale on Google Cloud Platform

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Serverless Data Architecture at scale on Google Cloud Platform

Similar to Serverless Data Architecture at scale on Google Cloud Platform (20)

More from MeetupDataScienceRoma

More from MeetupDataScienceRoma (20)

Recently uploaded

Recently uploaded (20)

Serverless Data Architecture at scale on Google Cloud Platform

Editor's Notes