Big Data Best Practices on GCP

Learn all the Best Practices for Big Data and Analytics using GCP

Published in: Technology

  1. Big Data Best Practices: Real Time Analytics. Lior Hipsh, 10/7/17. This document contains Confidential and Proprietary Information of AllCloud Ltd. that may not be redistributed or disclosed, at any time, to any third party, without AllCloud prior written consent. © 2017, AllCloud Ltd. All rights reserved.
  2. AllCloud in a Nutshell (www.allcloud.io)
     ● 9 years of cloud experience
     ● 1500+ successful deployments
     ● 1000+ customers
     ● 3 operating centers
  3. Agenda
     ● Big Data Introduction
     ● Real Time Analytics
     ● GCP DataFlow
  4. Big Data
     ● Volume: DWHs and storage, shard-based DBs (NoSQL & SQL), parallel processing of unstructured data
     ● Velocity: real-time analytics, quick response, and a reduced DWH
     ● Variety: schemaless data for flexibility of the data (document DB) and flexibility of relations (graph DB)
  5. Data...
  6. ...can be big...
  7. ...really, really big... (timeline: Tuesday, Wednesday, Thursday)
  8. ...maybe infinitely big... (timeline: 8:00 to 14:00)
  9. ...with unknown delays. (timeline: 8:00 to 14:00, with events stamped 8:00 repeated along it)
  10. “Historical” Pattern
      ● Diagram: structured data, ingestion (transport, capture), high-volume store, ETL (SQL), DWH, BI
      ● ETL steps create OLAP cubes and any other processed, digested data
  11. Today's Common Pattern
      ● Diagram: unstructured & structured data, high volume (batch), ingestion (transport, capture), data processing (batch), ETL, DWH/SQL, BI; data labels: transformed data, analytical data
      ● Multi-step (piped) processing: it is best to pass temporary data via the transport
      ● Multi-step (piped) processing may also be required on digested data for additional analysis
      ● Analytical data processing is typically done with MapReduce-style engines such as Spark or Hadoop, over files or NoSQL
      ● ETL can also be done with MapReduce but is mostly done with ETL tools
  12. Real Time Analytics: Simplest
      ● Diagram: ingestion, data processing (streaming / rule-based engine / CEP), BI (visualization + small DB), action; data label: analytical data
      ● Real time vs. batch: latency on the order of 2-3 seconds or less
      ● Data may not be ETL'd to a DWH after the analytics have been produced
      ● Is batch just real time with a skew parameter of 1h?
  13. Real Time Analytics: In Practice
      ● Diagram: ingestion (capture), data processing, database (digested output), BI
      ● In memory: MapReduce, SQL in-memory DB, NoSQL in-memory (e.g. Redis), transport/queues
      ● Rules access memory
  14. GCP Simple Pattern
      ● Diagram: Pubsub, DataFlow, BigQuery, Cloud SQL, Bigtable, BI tool (e.g. Tableau)
      ● Multi-step (piped) processing
      ● Case where the processed analytics output is still high volume
  15. Real Time Analytics with File Archive
      ● Diagram: ingestion, data processing, in memory, database, BI, low-cost bucket
  16. Real Time Analytics with Retroactive Batch Processing
      ● Diagram: ingestion, data processing (streaming), database, BI, low-cost bucket, data processing (batch)
  17. Real Time Analytics with DWH for “Offline”
      ● Diagram: ingestion, data processing, SQL database, BI, ETL, DWH; data labels: raw data, analytical data
  18. On GCP
      ● Diagram: Pubsub, DataFlow (streaming), BigQuery (3 months of raw data), Cloud SQL (digested data), BI tool (e.g. Tableau), low-cost bucket (full history); data label: analytical data
      ● Multi-step (piped) processing
      ● BigQuery supports internal aging, which can remove the need for the low-cost bucket
  19. CEP over GCP Stack
      ● Stages: Capture, Store, Process (stream and batch), Analyze, Use
      ● Components: Google Analytics Premium, Cloud Pub/Sub, Firebase, Google Stackdriver, Storage Transfer Service, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), Cloud Dataflow, Cloud Dataproc, BigQuery Analytics, Cloud Datalab, Cloud ML
      ● Users and uses: data scientists, business analysts, real-time analytics, real-time dashboard, real-time alerts
  20. Real Time Analytics Stack: Architecture Decisions to Make
      ● Ingestion: message bus, files, or database. Options: Pubsub, Kafka, bucket, HBase, HDFS, Bigtable, etc.
      ● Data processing: SQL, rules, or programmatic. Options: GCP DataFlow (Apache Beam), Apache Flink, Spark Streaming, Drools, SQLstream, Tibco Streaming Analytics, IBM Streams
      ● Data processing: a shared batch & real-time pipe, or separate pipes
      ● DWH: columnar DWH, or a low-cost SQL DB (if possible). Options: BigQuery, IBM Netezza, Vertica, InfoBright, Teradata
      ● BI: OLAP or report generator. Options: Tableau, Looker, Data Studio (free…), BO
  21. Data Processing: CEP (Complex Event Processing) Engines
      ● GCP DataFlow: programmatic (Apache Beam), Python/Java. The same code framework is used for both batch and real-time processing.
      ● Apache Flink: programmatic
      ● Spark Streaming: micro-batches
      ● Kafka Streams: programmatic
      ● Drools (JBoss)
      ● SQLstream (SQL rules)
      ● Esper (SQL-like “EPL”, Event Processing Language)
      ● Cisco Stream Analytics (SQL)
  22. Real Time Pipes: Technology Decision Guide
      ● Pipe logic should be able to use data from outside the streams (external DB)
      ● Pipe code should be testable outside the cloud as well (and cloud agnostic)
      ● Day 0 decision: do we need the real-time pipe for batch as well?
      ● Extensibility to unmanaged pipes, e.g. C++ code that performs one of the steps
      ● Ecosystem/libraries, i.e. does the pipe need scientific libraries or ML as well?
  23. Beam = Batch + Stream
      ● Apache Beam (incubating)
      ● Cloud Dataflow: based on Apache Beam. Pipelines are portable to your favorite runtime.
  24. Where might you use Cloud Dataflow? (Google Cloud Platform, Confidential & Proprietary)
      ● ETL: movement, filtering, enrichment, shaping
      ● Analysis: reduction, batch computation, continuous computation
      ● Orchestration: composition, external orchestration, simulation
  25. “Windowing”
      ● Data is typically an infinite time series.
      ● Each event must be checked for a rule match while using historical data from the last X minutes.
      ● The framework works by defining windows, mainly sliding windows.
      ● Windows can be tied to arrival time or to a custom event time.
      ● Watermarks + triggers enable robust completeness.
  26. Dataflow -> Apache Beam
      ● Batch and real time become the two end points of a scale of processing definitions: the delay-from-real-time “factor” is a parameter in the processing code.
      ● By default, with no parameter set, processing is batch.
      ● Full control (per processing of a data collection) over windowing and the time shift from event to processing.
      ● Full streaming control.
      ● Python or Java.
      ● Open source. Can run in the cloud or at home.
      ● Code can run on Spark or Flink.
      ● Dynamic work rebalancing.
  27. Multi-Pipe Flows
      ● The data processing engine should support stream processing (pushing/routing output to the next stream/pipe).
      ● Multi-step processing should be supported without going via the transport.
      ● Monitoring is a must.
      ● Recovery is a must.
      ● Auto-scaling (cloud…) of each step. Assume peaks.
      ● Cross-cloud and hybrid.
  28. Ingestion Guide
      ● Use a bucket as the “first line” if possible. This can work in many systems, including some IoT, where devices can upload short batches rather than a single event at a time.
      ● React to files by moving them to Pubsub; this flattens the peaks issue.
      ● Invest time in sharding design (good on any sharded system…). No need on GCP! (There are Partitions in Beam, but they are for application logic needs.)
  29. Thank you!
  30. Backup slides
  31. Scenario
  32. Java 7 Implementation
          Pipeline p = Pipeline.create(
              OptionsBuilder.RunOnService(true, false));
          // Read the raw dump as text, one String per line.
          PCollection<String> rawData = p.begin().apply(TextIO.Read
              .from(OptionsBuilder.GCS_RAWDUMP_URI));
          // Parse each raw line into a PlaybackEvent.
          PCollection<PlaybackEvent> events = rawData.apply(
              new ParseTransform());
          // Apply three independent transforms to the same parsed events.
          events.apply(new ArchiveTransform());
          events.apply(new SessionAnalysisTransform());
          events.apply(new AssetTransform());
          p.run();
  33. Cloud Pub/Sub
      ● Fast, reliable event delivery.
      ● Serverless, autoscaling, pay for what you use.
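
The “Windowing” slide says the framework works by defining windows, mainly sliding windows. As a rough illustration of what that means, here is a minimal sketch in plain Java (no Beam dependency) that maps one event timestamp to every sliding window containing it. The class name `SlidingWindowSketch`, the method `windowStarts`, and the 60-second window sliding every 30 seconds are assumptions made for this example; they are not taken from the deck or from the Beam API, which provides its own windowing transforms.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, for illustration only: assigns a timestamp to sliding windows. */
final class SlidingWindowSketch {

    /**
     * Returns the start times of every window [start, start + size) that contains
     * the timestamp, where windows begin at multiples of period. All values are
     * in the same unit (seconds in the example below).
     */
    static List<Long> windowStarts(long timestamp, long size, long period) {
        List<Long> starts = new ArrayList<>();
        // Latest window start at or before the timestamp.
        long lastStart = Math.floorDiv(timestamp, period) * period;
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > timestamp - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // A 60-second window that slides every 30 seconds.
        System.out.println(windowStarts(125, 60, 30)); // prints [120, 90]
        // Setting size == period gives fixed (non-overlapping) windows.
        System.out.println(windowStarts(125, 60, 60)); // prints [120]
    }
}
```

An event at second 125 falls into the windows starting at 120 and 90, so a rule evaluated per window sees that event together with roughly the last minute of history.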

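The deck notes that data arrives “with unknown delays” and that watermarks plus triggers enable completeness. The sketch below shows one simplified way to think about a watermark, in plain Java: track the highest event time seen and subtract an assumed maximum delay. The class `WatermarkSketch`, its methods, and the fixed `maxDelay` bound are assumptions for illustration; stream runners estimate watermarks in their own ways, and this is not how Beam or Dataflow implements them.

```java
/** Hypothetical heuristic watermark, for illustration only. */
final class WatermarkSketch {
    private final long maxDelay;                  // assumed bound on out-of-order delay
    private long maxEventTime = Long.MIN_VALUE;   // highest event time seen so far

    WatermarkSketch(long maxDelay) {
        this.maxDelay = maxDelay;
    }

    /** Current estimate: no event older than this is expected to arrive. */
    long watermark() {
        return maxEventTime == Long.MIN_VALUE ? Long.MIN_VALUE : maxEventTime - maxDelay;
    }

    /** Records an event; returns true if it arrived behind the watermark (late). */
    boolean observe(long eventTime) {
        boolean late = eventTime < watermark();
        maxEventTime = Math.max(maxEventTime, eventTime);
        return late;
    }

    /** A window [start, end) is treated as complete once the watermark reaches its end. */
    boolean isComplete(long windowEnd) {
        return watermark() >= windowEnd;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(10);
        System.out.println(w.observe(100));   // prints false; watermark is now 90
        System.out.println(w.observe(95));    // prints false; within the allowed delay
        System.out.println(w.observe(85));    // prints true; behind the watermark
        System.out.println(w.isComplete(90)); // prints true
    }
}
```

A trigger would then decide what to do at those moments, for example emit a window's result when `isComplete` turns true and emit a correction if a late event shows up afterwards.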
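The “Dataflow -> Apache Beam” slide describes batch and real time as two end points of one scale, controlled by a parameter in the processing code. A minimal plain-Java sketch of that idea follows: the same counting logic is run with different window sizes, and a single all-covering window behaves like a batch job. The class `UnifiedCountSketch`, the method `countPerWindow`, and the sample timestamps are assumptions for this example, not code from the deck or the Beam SDK.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical example: one counting routine, parameterized by window size. */
final class UnifiedCountSketch {

    /** Counts events per fixed window; keys are window start times. */
    static Map<Long, Integer> countPerWindow(List<Long> eventTimes, long windowSize) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimes) {
            long windowStart = Math.floorDiv(t, windowSize) * windowSize;
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> eventTimes = Arrays.asList(5L, 65L, 70L, 3600L); // seconds

        // Near real time: one-minute windows.
        System.out.println(countPerWindow(eventTimes, 60));             // prints {0=1, 60=2, 3600=1}
        // Coarser: one-hour windows.
        System.out.println(countPerWindow(eventTimes, 3600));           // prints {0=3, 3600=1}
        // Batch-like: one window that covers every non-negative timestamp.
        System.out.println(countPerWindow(eventTimes, Long.MAX_VALUE)); // prints {0=4}
    }
}
```

Only the window size changes between the three calls, which is the sense in which the slide treats batch as the default, unparameterized case.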