[EDIT IN MASTER] Presentation Title | Date
This document contains Confidential and Proprietary Information of AllCloud Ltd. that may not be redistributed or disclosed, at any time, to any third party, without AllCloud prior written consent. © 2017, AllCloud Ltd. All rights reserved.
Big Data Best Practices
Real Time Analytics
Lior Hipsh
10/7/17
www.allcloud.io
AllCloud in a Nutshell
● 9 years of cloud experience
● 1500+ successful deployments
● 1000+ customers
● 3 operating centers
Agenda
Big Data Introduction
Real Time Analytics
GCP DataFlow
Big Data
Volume - DWHs and storage, shard-based DBs (NoSQL & SQL), parallel processing of unstructured data
Velocity - Real-time analytics, quick response & reduced DWH
Variety - Schemaless: flexibility of the data (document DB); flexibility of relations (graph DB)
Data...
...can be big...
...really, really big... (figure: events grouped by day - Tuesday, Wednesday, Thursday)
...maybe infinitely big... (figure: hourly timeline from 8:00 to 14:00)
...with unknown delays. (figure: same timeline; events that occurred at 8:00 arrive at different later times)
“Historical” Pattern
High-volume store of structured data
Flow: Ingestion (transport, capture) -> ETL (SQL) -> DWH -> BI
Structured data. ETL steps created the OLAP cubes and any other processed, digested data.
Today's Common Pattern
High volume (batch), unstructured & structured data
Flow: Ingestion (transport, capture) -> Data Processing (batch) -> ETL -> DWH/SQL -> BI
Arrow labels in the diagram: analytical data, transformed data
Multi-step/pipe processing: best to pass temporary data via the transport.
Multi-step/pipe processing may also be required on digested data for additional analysis.
Analytics data processing is typically done with MapReduce-style engines such as Spark or Hadoop, over files or NoSQL.
ETL can also be done with MapReduce, but is mostly done with ETL tools.
Real Time Analytics: Simplest
Flow: Ingestion -> Data Processing (streaming / rule-based engine / CEP) -> BI (visualization + small DB), Action
Arrow label in the diagram: analytical data
Real time vs. batch: latency on the order of 2-3 seconds and below.
Data may not be ETL'd to a DWH after the analytics have been produced.
Is batch just real time with a skew parameter of 1h?...
Real Time Analytics: In practice
Flow: Ingestion (capture) -> Data Processing -> Database (digested output) -> BI
In memory (the rules access memory):
● MapReduce
● SQL in-memory DB
● NoSQL in-memory DB (e.g. Redis)
● Transport/queues
GCP Simple Pattern
Flow: Pub/Sub -> Dataflow (multi-step/pipe processing) -> BigQuery / Cloud SQL -> BI tool (e.g. Tableau)
Bigtable: for the case where the processed analytics output is still high volume.
Real Time Analytics with File Archive
Flow: Ingestion -> Data Processing (in memory) -> Database -> BI
Archive: low-cost bucket
Real Time Analytics w/ Retroactive Batch Processing
Streaming flow: Ingestion -> Data Processing (streaming) -> Database -> BI
Batch flow: Low-cost bucket -> Data Processing (batch) -> Database
Real Time Analytics with DWH for “offline”
Real-time flow: Ingestion -> Data Processing -> SQL Database (analytical data) -> BI
Offline flow: raw data -> ETL -> DWH
On GCP
Flow: Pub/Sub -> Dataflow (streaming, multi-step/pipe processing) -> Cloud SQL (digested, analytical data) -> BI tool (e.g. Tableau)
BigQuery: 3 months of raw data
Low-cost bucket: full history
BigQuery supports internal aging, which can save the need for the low-cost bucket.
CEP over GCP Stack
[Figure: GCP reference architecture. Stages: Capture, Store, Process (stream and batch), Analyze, Use.]
Components shown: Google Analytics Premium, Cloud Pub/Sub, Google Stackdriver, Firebase, Storage Transfer Service, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), Cloud Dataflow (stream and batch), Cloud Dataproc, BigQuery Analytics, Cloud Datalab, Cloud ML.
Use: data scientists, business analysts, real-time analytics, real-time dashboards, real-time alerts.
Real Time Analytics Stack: architecture decisions to make
Ingestion - message bus, files, or database: Pub/Sub, Kafka, bucket, HBase, HDFS, Bigtable, etc.
Data Processing - SQL rules or programmatic; shared batch & real-time pipe, or separate: GCP Dataflow (Apache Beam), Apache Flink, Spark Streaming, Drools, SQLstream, TIBCO Streaming Analytics, IBM Streams
DWH - columnar DWH, or a low-cost SQL DB if possible: BigQuery, IBM Netezza, Vertica, Infobright, Teradata
BI - OLAP, report generator: Tableau, Looker, Data Studio (free…), BO
Data Processing - CEP engines
CEP = Complex Event Processing
GCP Dataflow - programmatic (Apache Beam), Python/Java. The same code framework is used for both batch and real-time processing.
Apache Flink - programmatic.
Spark Streaming - micro-batches.
Kafka Streams - programmatic.
Drools (JBoss)
SQLstream (SQL rules)
Esper (SQL-like “EPL”, Event Processing Language)
Cisco Stream Analytics (SQL)
Real Time Pipes Technology decision guide
Pipe logic should be able to use data from outside the streams (external DB).
Pipe code should be testable outside the cloud as well (and be cloud agnostic).
Day 0 decision: do we need the real-time pipe to run on batch as well?
Extensibility to unmanaged pipes, e.g. C++ code that performs one of the steps.
Ecosystem/libraries: does the pipe need scientific or ML libraries as well?
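The first two points of the guide (using data from outside the stream, and keeping pipe code testable outside the cloud) can be illustrated with a minimal sketch. It assumes a made-up event format ("userId,action") and a hypothetical class name, EnrichStepSketch; neither comes from the deck. The external lookup is injected as a plain function, so the same step can be exercised locally with an in-memory map and wired to a real database in production.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustration only: one pipe step whose external lookup is injected. */
class EnrichStepSketch {

    /**
     * Builds a step that appends a country to each "userId,action" event.
     * countryByUser stands in for the external DB; it may return null.
     */
    static Function<String, String> enrichWith(Function<String, String> countryByUser) {
        return event -> {
            String userId = event.split(",")[0];
            String country = countryByUser.apply(userId);
            return event + "," + (country == null ? "unknown" : country);
        };
    }

    public static void main(String[] args) {
        // Local test double for the external DB (hypothetical data).
        Map<String, String> fakeDb = new HashMap<>();
        fakeDb.put("u1", "IL");

        Function<String, String> step = enrichWith(fakeDb::get);
        System.out.println(step.apply("u1,play"));
        System.out.println(step.apply("u2,stop"));
    }
}
```

Because the step is an ordinary function, it has no dependency on any cloud runtime and can be covered by a plain unit test.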
Beam = Batch + Stream
Apache Beam (incubating)
Cloud Dataflow
Based on Apache Beam. Pipelines are portable to your favorite runtime.
Where might you use Cloud Dataflow?
ETL: Movement, Filtering, Enrichment, Shaping
Analysis: Reduction, Batch computation, Continuous computation
Orchestration: Composition, External orchestration, Simulation
“Windowing”
Data is typically an infinite time series.
Rule matches need to be checked per event, using historical data from the last X minutes.
The framework works by defining windows, mainly sliding windows.
Windows can be tied to arrival time or to a custom event time.
Watermarks + triggers enable robust handling of completeness.
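To make the sliding-window idea concrete, here is a minimal sketch in plain Java. It is not the Dataflow or Beam API; the class name SlidingWindowSketch, the method name, and the half-open window convention [start, start + size) are assumptions made for illustration. Given an event time, a window size and a slide period, it returns the start of every window that contains the event.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Illustration only: assigns an event time to every sliding window that contains it. */
class SlidingWindowSketch {

    /**
     * Returns the start times of all windows [start, start + size) containing eventTime.
     * Window starts are aligned to multiples of the period since the epoch.
     */
    static List<Instant> windowStartsFor(Instant eventTime, Duration size, Duration period) {
        long t = eventTime.toEpochMilli();
        long sizeMs = size.toMillis();
        long periodMs = period.toMillis();

        // Latest aligned window start that is not after the event.
        long lastStart = t - Math.floorMod(t, periodMs);

        List<Instant> starts = new ArrayList<>();
        // Walk backwards while the window starting at 'start' still covers t.
        for (long start = lastStart; start > t - sizeMs; start -= periodMs) {
            starts.add(Instant.ofEpochMilli(start));
        }
        return starts;
    }

    public static void main(String[] args) {
        // Hypothetical event at 08:07 with 10-minute windows sliding every 5 minutes.
        Instant event = Instant.parse("2017-01-01T08:07:00Z");
        System.out.println(windowStartsFor(event, Duration.ofMinutes(10), Duration.ofMinutes(5)));
    }
}
```

With a 10-minute size and a 5-minute period, each event lands in two overlapping windows, which is what lets a rule see "the last X minutes" at every slide. Whether the timestamp passed in is the arrival time or a custom event time is the caller's choice, matching the distinction made above.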
Dataflow -> Apache Beam
Batch and real time become two end points on a scale of processing definitions: the delay from real time is a parameter in the processing code.
By default, with no parameters, processing is batch.
Full control (per processing of a data collection) over the windowing and the time shift from event to processing.
Full streaming control.
Python or Java.
Open source. Can run in the cloud or at home.
Code can run on Spark or Flink.
Dynamic work rebalancing.
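The idea that batch and real time are two ends of one parameter can be sketched without any framework. The following plain-Java example is an illustration under stated assumptions, not Beam code: the class WindowedCountSketch and its method are hypothetical, and event times are plain non-negative epoch milliseconds. The same counting logic runs for any window length; a very large window puts every event in one group, which behaves like a batch job. In Beam itself the corresponding choice is the windowing strategy applied to a collection, where the default is a single global window.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: one aggregation, parameterized by window length. */
class WindowedCountSketch {

    /**
     * Counts events per fixed window of windowMs milliseconds, keyed by window start.
     * Passing Long.MAX_VALUE places all (non-negative) timestamps in a single window.
     */
    static Map<Long, Integer> countPerWindow(List<Long> eventTimesMs, long windowMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMs) {
            long windowStart = t - Math.floorMod(t, windowMs);
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical event times in milliseconds.
        List<Long> events = Arrays.asList(1_000L, 2_000L, 61_000L, 125_000L);

        // "Streaming-like": one result per minute.
        System.out.println(countPerWindow(events, 60_000L));
        // "Batch-like": one result for everything.
        System.out.println(countPerWindow(events, Long.MAX_VALUE));
    }
}
```

The processing code is identical in both calls; only the window parameter changes, which is the point the slide makes about Beam.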
Multi Pipes Flows
The data processing engine should support stream processing (pushing/routing output to the next stream/pipe).
Multi-step processing should be possible without going via the transport.
Monitoring is a must.
Recovery is a must.
Auto-scaling (cloud…) of each step. Assume peaks.
Cross-cloud and hybrid deployments.
Ingestion guide
Use a bucket as the “first line” if possible.
This works in many systems, including some IoT, where devices can upload short batches rather than a single event at a time.
React to files by moving them to Pub/Sub; this flattens peaks.
Invest time in sharding design (good for any sharded system…).
No need in GCP! (There are partitions in Beam, but they serve application logic needs.)
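The bucket-first pattern can be sketched as follows. This is a plain-Java illustration, not the Cloud Storage or Pub/Sub client API: an in-memory BlockingQueue stands in for the topic, the file format (one event per line) is assumed, and the class and method names are hypothetical. A device uploads a short batch as one file; a reaction step splits it into single messages; the downstream step then pulls at its own pace, so a burst of uploads becomes queue depth rather than a processing peak.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustration only: react to an uploaded file by publishing its events to a queue. */
class BucketToQueueSketch {

    /** Splits one uploaded batch file (one event per line) into individual messages. */
    static int publishFile(String fileContents, BlockingQueue<String> topic) {
        int published = 0;
        for (String line : fileContents.split("\n")) {
            String event = line.trim();
            if (!event.isEmpty()) {
                topic.add(event);
                published++;
            }
        }
        return published;
    }

    /** Pulls at most maxBatch messages, the way a downstream step drains at its own pace. */
    static List<String> pull(BlockingQueue<String> topic, int maxBatch) {
        List<String> batch = new ArrayList<>();
        topic.drainTo(batch, maxBatch);
        return batch;
    }

    public static void main(String[] args) {
        BlockingQueue<String> topic = new LinkedBlockingQueue<>();
        // Hypothetical file uploaded by a device: three events and a blank line.
        int n = publishFile("e1\ne2\n\ne3\n", topic);
        System.out.println("published " + n);
        System.out.println(pull(topic, 2));
        System.out.println(pull(topic, 2));
    }
}
```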
Thank you!
Backup slides
Scenario
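// Imports assume the Dataflow SDK for Java 1.x, which matches the TextIO.Read.from call style.
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.values.PCollection;

// OptionsBuilder, PlaybackEvent and the *Transform classes belong to the example
// application; they are not shown in the deck.
Pipeline p = Pipeline.create(
    OptionsBuilder.RunOnService(true, false));
// Read the raw dump from Cloud Storage, one String per line.
PCollection<String> rawData = p.begin().apply(TextIO.Read
    .from(OptionsBuilder.GCS_RAWDUMP_URI));
// Parse each line into a PlaybackEvent.
PCollection<PlaybackEvent> events = rawData.apply(
    new ParseTransform());
// Fan out: the same parsed collection feeds three independent branches.
events.apply(new ArchiveTransform());
events.apply(new SessionAnalysisTransform());
events.apply(new AssetTransform());
p.run();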
Java 7 Implementation
Cloud Pub/Sub
Fast, reliable, event delivery. Serverless, autoscaling, pay for what you use.

Big Data Best Practices on GCP


Editor's Notes

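  • #3 Who uses Dataflow today?
  • #4 Who uses Dataflow today?
  • #6 Here are gaming logs. Each square represents an event where a user scored some points for their team.
  • #7 The game gets popular.
  • #8 Start organizing it into a repeated structure.
  • #9 The repetitive structure is just a cheap way of representing an infinite data source. Game logs are continuous. Distributed systems can cause ambiguity...
  • #10 Let's look at some points that were scored at 8am. <animate> The red score happened at 8am and was received quickly. <animate> The yellow score also happened at 8am but was received at 8:30 due to network congestion. <animate> The green element was hours late: this was someone playing in airplane mode on a plane, and it had to wait for the plane to land. So now we've got an unordered, infinite data set. How do we process it...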
  • #9 repetitive structure just a cheap way of representing an infinite data source. game logs are continuous distributed systems can cause ambiguity...
  • #10 Lets look at some points that were scored at 8am <animate> red score 8am, received quickly <animate> yellow score also happened at 8am, received at 8:30 due to network congestion <animate> green element was hours late. this was someone playing in airplane mode on the plane. had to wait for it to land. so now we’ve got an unordered, infinite data set, how do we process it...