Serverless Data Architecture at scale on Google Cloud Platform - Lorenzo Ridi - Codemotion Milan 2016

Let's say you are in charge of the design of a data processing architecture. You finally managed to deploy and tune the "ideal" configuration, but data are like weeds: they just KEEP GROWING! So, eventually, you have to add new pieces, or even start over and redesign everything. Sound familiar? The cloud can be a solution to this data architect's nightmare. In this talk we will build an end-to-end Serverless, No-Ops, scalable and reliable data solution, based on Google Cloud Platform.

  1. 1. Serverless Data Architecture at scale on Google Cloud Platform Lorenzo Ridi MILAN 25-26 NOVEMBER 2016
  2. 2. What’s the date today?
  3. 3. Black Friday (ˈblæk fraɪdɪ) noun The day following Thanksgiving Day in the United States. Since 1932, it has been regarded as the beginning of the Christmas shopping season.
  4. 4. Black Friday in the US 2012 - 2016 source: Google Trends, November 23rd 2016
  5. 5. Black Friday in Italy 2012 - 2016 source: Google Trends, November 23rd 2016
  6. 6. What are we doing: Tweets about Black Friday → Processing + analytics → Insights
  7. 7. How we’re gonna do it
  8. 8. How we’re gonna do it
  9. 9. How we’re gonna do it: Pub/Sub, Container Engine (Kubernetes)
  10. 10. What is Google Cloud Pub/Sub? ● Google Cloud Pub/Sub is a fully-managed real-time messaging service. ○ Guaranteed delivery ■ “At least once” semantics ○ Reliable at scale ■ Messages are replicated in different zones
  11. 11. From Twitter to Pub/Sub
      $ gcloud beta pubsub topics create blackfridaytweets
      Created topic [blackfridaytweets].
  12. 12. From Twitter to Pub/Sub: ? → Pub/Sub Topic → Subscriptions A, B, C → Consumers A, B, C
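(The consumer side is not shown in the slides. As a rough sketch of the subscription model above, this is how a pull consumer could look with today's google-cloud-pubsub Python client; the project ID and subscription name are placeholders, and the deck's own publisher code on the next slide uses the older googleapiclient style instead.)
      # Minimal Pub/Sub pull consumer sketch (placeholder project and subscription names).
      from google.cloud import pubsub_v1

      project_id = "codemotion-2016-demo"
      subscription_id = "blackfridaytweets-sub"   # a subscription attached to the topic above

      subscriber = pubsub_v1.SubscriberClient()
      subscription_path = subscriber.subscription_path(project_id, subscription_id)

      def callback(message):
          # "At least once" delivery: the same tweet may arrive more than once,
          # so processing should be idempotent. Ack only after successful handling.
          print(message.data.decode("utf-8"))
          message.ack()

      streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
      try:
          streaming_pull.result(timeout=30)   # block for a while; omit the timeout to run forever
      except Exception:
          streaming_pull.cancel()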
  13. 13. From Twitter to Pub/Sub ● Simple Python application using the TweePy library
      # somewhere in the code, track a given set of keywords
      stream = Stream(auth, listener)
      stream.filter(track=['blackfriday', [...]])
      [...]
      # somewhere else, write messages to Pub/Sub
      for line in data_lines:
          pub = base64.urlsafe_b64encode(line)
          messages.append({'data': pub})
      body = {'messages': messages}
      resp = client.projects().topics().publish(
          topic='blackfridaytweets', body=body).execute(num_retries=NUM_RETRIES)
  14. 14. From Twitter to Pub/Sub App + Libs
  15. 15. From Twitter to Pub/Sub: App + Libs in a VM
  16. 16. From Twitter to Pub/Sub: App + Libs in a VM
  17. 17. From Twitter to Pub/Sub App + Libs Container
  18. 18. From Twitter to Pub/Sub App + Libs Container
      FROM google/python
      RUN pip install --upgrade pip
      RUN pip install pyopenssl ndg-httpsclient pyasn1
      RUN pip install tweepy
      RUN pip install --upgrade google-api-python-client
      RUN pip install python-dateutil
      ADD twitter-to-pubsub.py /twitter-to-pubsub.py
      ADD utils.py /utils.py
      CMD python twitter-to-pubsub.py
  19. 19. From Twitter to Pub/Sub App + Libs Container
  20. 20. From Twitter to Pub/Sub App + Libs Container Pod
  21. 21. What is Kubernetes (K8S)? ● An orchestration tool for managing a cluster of containers across multiple hosts ○ Scaling, rolling upgrades, A/B testing, etc. ● Declarative – not procedural ○ Auto-scales and self-heals to desired state ● Supports multiple container runtimes, currently Docker and CoreOS Rkt ● Open-source: github.com/kubernetes
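(As an aside, not from the talk: once the ReplicationController on the next slide is running, the declarative model can also be observed from code. A small sketch with the official Kubernetes Python client, assuming `pip install kubernetes` and the kubeconfig written by `gcloud container clusters get-credentials`; namespace and label match the YAML that follows.)
      # Sketch: list the pods Kubernetes keeps running for the twitter-stream controller.
      from kubernetes import client, config

      config.load_kube_config()            # reuse the local kubeconfig credentials
      v1 = client.CoreV1Api()

      pods = v1.list_namespaced_pod("default", label_selector="name=twitter-stream")
      for pod in pods.items:
          # If a pod dies, the ReplicationController recreates it to match `replicas`.
          print(pod.metadata.name, pod.status.phase)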
  22. 22. From Twitter to Pub/Sub App + Libs Container Pod
      apiVersion: v1
      kind: ReplicationController
      metadata:
        [...]
      spec:
        replicas: 1
        template:
          metadata:
            labels:
              name: twitter-stream
          spec:
            containers:
            - name: twitter-to-pubsub
              image: gcr.io/codemotion-2016-demo/pubsub_pipeline
              env:
              - name: PUBSUB_TOPIC
                value: ...
  23. 23. From Twitter to Pub/Sub App + Libs Container Pod
  24. 24. From Twitter to Pub/Sub App + Libs Container Pod Node
  25. 25. From Twitter to Pub/Sub: a Node running Pod A and Pod B
  26. 26. From Twitter to Pub/Sub Node 1 Node 2
  27. 27. From Twitter to Pub/Sub
      $ gcloud container clusters create codemotion-2016-demo-cluster
      Creating cluster cluster-1...done.
      Created [...projects/codemotion-2016-demo/.../clusters/codemotion-2016-demo-cluster].
      $ gcloud container clusters get-credentials codemotion-2016-demo-cluster
      Fetching cluster endpoint and auth data.
      kubeconfig entry generated for cluster-1.
      $ kubectl create -f ~/git/kube-pubsub-bq/pubsub/twitter-stream.yaml
      replicationcontroller "twitter-stream" created
  28. 28. How we’re gonna do it: Pub/Sub, Kubernetes
  29. 29. How we’re gonna do it: Pub/Sub, Kubernetes, Dataflow
  30. 30. How we’re gonna do it: Pub/Sub, Kubernetes, Dataflow, BigQuery
  31. 31. What is Google Cloud Dataflow? ● Cloud Dataflow is a collection of open source SDKs to implement parallel processing pipelines. ○ same programming model for streaming and batch pipelines ● Cloud Dataflow is a managed service to run parallel processing pipelines on Google Cloud Platform
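(The talk implements the pipeline with the Java SDK, shown on the next slides. Purely to illustrate the programming model, here is a rough equivalent sketched with the Apache Beam Python SDK, which Dataflow also runs; topic and table names are placeholders, and the output table is assumed to exist.)
      # Sketch: the same pipeline shape with the Beam Python SDK (placeholder names).
      import json
      import apache_beam as beam
      from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

      options = PipelineOptions()
      options.view_as(StandardOptions).streaming = True   # the same code runs in batch with a bounded source

      with beam.Pipeline(options=options) as p:
          (p
           | "ReadTweets" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/blackfridaytweets")
           | "FormatRows" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
           | "WriteToBQ"  >> beam.io.WriteToBigQuery(
                 "my-project:tweets.blackfriday",
                 create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                 write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))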
  32. 32. What is Google BigQuery? ● Google BigQuery is a fully-managed Analytic Data Warehouse solution allowing real-time analysis of petabyte-scale datasets. ● Enterprise-grade features ○ Batch and streaming (100K rows/sec) data ingestion ○ JDBC/ODBC connectors ○ Rich SQL-2011-compliant query language ○ Supports updates and deletes
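(Not part of the talk, where rows arrive through Dataflow: just to make the streaming-ingestion bullet concrete, this is how rows can be streamed into an existing table with the google-cloud-bigquery Python client; the table ID and field names are assumptions.)
      # Sketch: stream a couple of rows into an existing BigQuery table (placeholder table ID).
      from google.cloud import bigquery

      client = bigquery.Client()
      table_id = "codemotion-2016-demo.tweets.blackfriday"   # "project.dataset.table"

      rows = [
          {"text": "Best #blackfriday deal ever!", "created_at": "2016-11-25T10:00:00"},
          {"text": "Queues everywhere...", "created_at": "2016-11-25T10:01:00"},
      ]

      errors = client.insert_rows_json(table_id, rows)   # streaming insert; returns row-level errors
      if errors:
          print("Rows with problems:", errors)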
  33. 33. From Pub/Sub to BigQuery: a Dataflow pipeline reads tweets from a Pub/Sub subscription on the topic, formats them for BigQuery, and writes them to a BigQuery table.
  34. 34. From Pub/Sub to BigQuery ● A Dataflow pipeline is a Java program.
      // TwitterProcessor.java
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        PCollection<String> tweets = p.apply(PubsubIO.Read.topic("...blackfridaytweets"));
        PCollection<TableRow> formattedTweets = tweets.apply(ParDo.of(new DoFormat()));
        formattedTweets.apply(BigQueryIO.Write.to(tableReference));
        p.run();
      }
  35. 35. From Pub/Sub to BigQuery ● A Dataflow pipeline is a Java program.
      // TwitterProcessor.java
      // Do Function (to be used within a ParDo)
      private static final class DoFormat extends DoFn<String, TableRow> {
        private static final long serialVersionUID = 1L;
        @Override
        public void processElement(DoFn<String, TableRow>.ProcessContext c) throws IOException {
          c.output(createTableRow(c.element()));
        }
      }
      // Helper method
      private static TableRow createTableRow(String tweet) throws IOException {
        return JacksonFactory.getDefaultInstance().fromString(tweet, TableRow.class);
      }
  36. 36. From Pub/Sub to BigQuery ● Use Maven to build, deploy or update the Pipeline.
      $ mvn compile exec:java -Dexec.mainClass=it.noovle.dataflow.TwitterProcessor -Dexec.args="--streaming"
      [...]
      INFO: To cancel the job using the 'gcloud' tool, run:
      > gcloud alpha dataflow jobs --project=codemotion-2016-demo cancel 2016-11-19_15_49_53-5264074060979116717
      [INFO] ------------------------------------------------------------------------
      [INFO] BUILD SUCCESS
      [INFO] ------------------------------------------------------------------------
      [INFO] Total time: 18.131s
      [INFO] Finished at: Sun Nov 20 00:49:54 CET 2016
      [INFO] Final Memory: 28M/362M
      [INFO] ------------------------------------------------------------------------
  37. 37. From Pub/Sub to BigQuery ● You can monitor your pipelines from Cloud Console.
  38. 38. From Pub/Sub to BigQuery ● Data start flowing into BigQuery tables. You can run queries from the CLI or the Web Interface.
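(Besides the CLI and the web interface, the tables can also be queried programmatically. A small sketch with the google-cloud-bigquery Python client; dataset, table and column names are assumptions, not the ones used in the demo.)
      # Sketch: aggregate the streamed tweets per hour (placeholder dataset/column names).
      from google.cloud import bigquery

      client = bigquery.Client(project="codemotion-2016-demo")

      query = """
          SELECT TIMESTAMP_TRUNC(created_at, HOUR) AS hour,
                 COUNT(*) AS tweets_per_hour
          FROM `codemotion-2016-demo.tweets.blackfriday`
          GROUP BY hour
          ORDER BY hour
      """

      for row in client.query(query).result():
          print(row.hour, row.tweets_per_hour)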
  39. 39. How we’re gonna do it: Pub/Sub, Kubernetes, Dataflow, BigQuery
  40. 40. How we’re gonna do it: Pub/Sub, Kubernetes, Dataflow, BigQuery, Data Studio
  41. 41. How we’re gonna do it: Pub/Sub, Kubernetes, Dataflow, BigQuery, Data Studio
  42. 42. How we’re gonna do it: Pub/Sub, Kubernetes, Dataflow, BigQuery, Data Studio, Natural Language API
  43. 43. Sentiment Analysis with Natural Language API: Text → Polarity: [-1,1], Magnitude: [0,+inf)
  44. 44. Sentiment Analysis with Natural Language API: Text → Polarity: [-1,1], Magnitude: [0,+inf); sentiment = polarity x magnitude
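(The slides show only the resulting scores. As a hedged sketch of the API call itself, using the current google-cloud-language Python client, which names the polarity value `score`; the tweet text is made up.)
      # Sketch: score a single tweet with the Natural Language API.
      from google.cloud import language_v1

      client = language_v1.LanguageServiceClient()

      text = "I can't believe these #blackfriday queues, what a nightmare"
      document = language_v1.Document(content=text, type_=language_v1.Document.Type.PLAIN_TEXT)

      result = client.analyze_sentiment(request={"document": document}).document_sentiment
      sentiment = result.score * result.magnitude   # the polarity x magnitude combination from the slide
      print(result.score, result.magnitude, sentiment)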
  45. 45. Sentiment Analysis with Natural Language API: the Dataflow pipeline reads tweets from the Pub/Sub topic and splits into two branches: one formats the raw tweets for BigQuery and writes them; the other filters tweets, evaluates their sentiment, formats them, and writes them to a second BigQuery table.
  46. 46. From Pub/Sub to BigQuery ● We just add the necessary additional steps.
      // TwitterProcessor.java
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        PCollection<String> tweets = p.apply(PubsubIO.Read.topic("...blackfridaytweets"));
        PCollection<String> sentTweets = tweets.apply(ParDo.of(new DoFilterAndProcess()));
        PCollection<TableRow> formSentTweets = sentTweets.apply(ParDo.of(new DoFormat()));
        formSentTweets.apply(BigQueryIO.Write.to(sentTableReference));
        PCollection<TableRow> formattedTweets = tweets.apply(ParDo.of(new DoFormat()));
        formattedTweets.apply(BigQueryIO.Write.to(tableReference));
        p.run();
      }
  47. 47. From Pub/Sub to BigQuery ● The update process preserves all in-flight data.
      $ mvn compile exec:java -Dexec.mainClass=it.noovle.dataflow.TwitterProcessor -Dexec.args="--streaming --update --jobName=twitterprocessor-lorenzo-1107222550"
      [...]
      INFO: To cancel the job using the 'gcloud' tool, run:
      > gcloud alpha dataflow jobs --project=codemotion-2016-demo cancel 2016-11-19_15_49_53-5264074060979116717
      [INFO] ------------------------------------------------------------------------
      [INFO] BUILD SUCCESS
      [INFO] ------------------------------------------------------------------------
      [INFO] Total time: 18.131s
      [INFO] Finished at: Sun Nov 20 00:49:54 CET 2016
      [INFO] Final Memory: 28M/362M
      [INFO] ------------------------------------------------------------------------
  48. 48. From Pub/Sub to BigQuery
  49. 49. We did it! Pub/Sub, Kubernetes, Dataflow, BigQuery, Data Studio, Natural Language API
  50. 50. We did it! Pub/Sub, Kubernetes, Dataflow, BigQuery, Data Studio, Natural Language API
  51. 51. Live demo
  52. 52. Polarity: -1.0 Magnitude: 1.5 Polarity: -1.0 Magnitude: 2.1
  53. 53. Thank you!
