GCP at Vente-Exclusive
Big Data on Google Cloud
Flash sales?
We reserve a limited supply of products at a supplier
We prepare them for selling
You buy some of the products on offer
We order only what was purchased
And we ship it!
Process: brands
Sourcing → Pre-Sales → Marketing → Sales → Logistics
Process: prepare the sales
Process: communicate
Process: sales started
Process: ship it
Data
● Avoiding returns
● Better product offering
● More relevant communication
● Cost efficient advertising
● Reporting to our suppliers
● Forecasting of logistics workforce
● Estimating of stock purchases
● Investigate where to diversify
● More detailed customer insights
● Improving multi-channel conversion
● Much much more …
Crunching Power
Dark Ages
Hadoop rollout
Database
SQL Server Cloud Storage Compute Engine
Hadoop: uploading denormalized
Hadoop: nightly processing
Hadoop: download to SQL
Hadoop: complete setup
Hadoop: DataProc service
SQL Server Cloud Storage DataProc
Pig Example
-- Load the user list (Avro) and the postal-code-to-province lookup (CSV).
ul = LOAD '${users}' USING AvroStorage();
province = LOAD 'gs://bucket/bi_ProvincePostalCode.csv'
    USING CSVExcelStorage() AS (country_id:int, postal_code:chararray, province:chararray);
-- Keep only the fields needed downstream.
ul = FOREACH ul GENERATE user_id AS user_id, ... vp_member_id AS vp_member_id;
-- Replicated (map-side) join: the small province table is held in memory.
ulp = JOIN ul BY (country_id, postal_code) LEFT OUTER, province BY (country_id, postal_code) USING 'replicated';
-- Stamp every row with the bucket date, then write the result as JSON.
ulp = FOREACH ulp GENERATE
    ISOToBQ('${year}-${month}-${day}T00:00:00Z') AS bucket_date,
    user_id AS user_id, … loyalty_segment AS loyalty_segment,
    vp_member_id AS vp_member_id;
STORE ulp INTO '${out}' USING JsonStorage();
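For readers who do not know Pig, the same steps can be sketched in plain Python: a left outer join of the users against a small in-memory province lookup, plus a bucket date on every row. This is only an illustration, not the production job. The helper name `enrich_users` and the sample records are hypothetical, and the `bucket_date` format is an assumption because the output of the `ISOToBQ` function is not shown in the slides.

```python
from datetime import date


def enrich_users(users, provinces, bucket_day):
    """Left outer join users to provinces on (country_id, postal_code)."""
    # Like Pig's 'replicated' join, the small table is held in memory.
    lookup = {
        (p["country_id"], p["postal_code"]): p["province"] for p in provinces
    }
    enriched = []
    for user in users:
        row = dict(user)
        # Left outer join: users without a match keep a null province.
        row["province"] = lookup.get((user["country_id"], user["postal_code"]))
        # Assumed format; the real ISOToBQ output is not shown in the deck.
        row["bucket_date"] = bucket_day.isoformat() + "T00:00:00Z"
        enriched.append(row)
    return enriched


# Hypothetical sample data.
users = [
    {"user_id": 1, "country_id": 1, "postal_code": "1000"},
    {"user_id": 2, "country_id": 1, "postal_code": "9999"},
]
provinces = [{"country_id": 1, "postal_code": "1000", "province": "Brussels"}]
rows = enrich_users(users, provinces, date(2016, 1, 31))
```

The second user has no matching postal code, so it stays in the output with an empty province, which is the behaviour a LEFT OUTER join gives.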
DataProc Console
Hadoop Learnings
● Cloud Storage is a life saver
○ Reliable
○ Always accessible
○ Not gone when you delete the cluster (unlike HDFS)
○ Well worth the speed tradeoff
● On demand cluster
○ Saves Compute cost
● DataProc
○ Cluster up in about 1 minute
○ Better per job overview and logging
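The on-demand pattern above (create a cluster, run the job, delete the cluster, keep all data in Cloud Storage) could be scripted roughly as follows. This sketch only builds the `gcloud dataproc` command lines and does not run them. The cluster name, region, and script path are hypothetical placeholders, and the deck does not say how the cluster lifecycle was actually automated.

```python
def on_demand_job_commands(cluster, region, pig_script, params):
    """Return the gcloud commands for a create -> run -> delete cycle."""
    # Pig parameters are passed as one comma-separated KEY=VALUE list.
    param_arg = ",".join(f"{key}={value}" for key, value in sorted(params.items()))
    return [
        # 1. Start a short-lived cluster.
        ["gcloud", "dataproc", "clusters", "create", cluster, "--region", region],
        # 2. Run the Pig script; input and output live in Cloud Storage.
        ["gcloud", "dataproc", "jobs", "submit", "pig", "--cluster", cluster,
         "--region", region, "--file", pig_script, "--params", param_arg],
        # 3. Delete the cluster; nothing is lost because no data is in HDFS.
        ["gcloud", "dataproc", "clusters", "delete", cluster,
         "--region", region, "--quiet"],
    ]


# Hypothetical names, for illustration only.
commands = on_demand_job_commands(
    "nightly-etl", "europe-west1", "gs://bucket/users.pig",
    {"year": "2016", "month": "01", "day": "31"},
)
for command in commands:
    print(" ".join(command))
```

Each list could be handed to `subprocess.run(command, check=True)` in order, so that a failed job stops the sequence before the next step.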
BigQuery rollout
BigQuery: interactive queries
BigQuery: preparing for import
Cloud Storage BigQuery DataProc
BigQuery: importing
BigQuery: exporting to storage
BigQuery: reuse in Hadoop
BigQuery: download to SQL
Cloud Storage BigQuery SQL Server
BigQuery: big picture
BigQuery: connector
Cloud Storage BigQuery DataProc
BigQuery Learnings
● User centric
○ Don’t have to be a Data Scientist or Engineer
○ Think more about the datasets and table layout
○ Harder to change, because it’s used throughout the company
● Cost
○ One shots are never a problem
○ Integrations can incur hidden costs
● Bucketization
○ Buckets per day
○ Monthly or yearly buckets if the data is too small
○ Create views for users
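One way to picture the bucketization bullets is a table per day with a view on top for end users. The sketch below assumes a `prefix_YYYYMMDD` table naming convention and uses BigQuery's wildcard-table syntax (`_TABLE_SUFFIX`) for the view. Both are assumptions: the slides do not show the naming scheme or the query dialect that was used, and the dataset and table names are hypothetical.

```python
from datetime import date, timedelta


def daily_table(prefix, day):
    """Name of the table holding one day's bucket, e.g. events_20160131."""
    return f"{prefix}_{day:%Y%m%d}"


def tables_between(prefix, start, end):
    """All daily table names from start to end, inclusive."""
    days = (end - start).days
    return [daily_table(prefix, start + timedelta(days=n)) for n in range(days + 1)]


def last_days_view_sql(dataset, prefix, days):
    """SQL for a user-facing view over the most recent daily buckets."""
    return (
        f"SELECT * FROM `{dataset}.{prefix}_*` "
        f"WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', "
        f"DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY))"
    )


# Hypothetical dataset and table prefix.
print(tables_between("events", date(2016, 1, 30), date(2016, 2, 1)))
print(last_days_view_sql("bi", "events", 7))
```

Users then query the view and never need to know how the underlying buckets are laid out, while each query only scans the days it needs.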
Interlude
Age of Enlightenment
Orchestration
Orchestration: foundation of big data
Visualisation
Visualisation: tableau
Visualisation: integration in BackOffice
Integration Learnings
● Making it visible for the business
● Keep an eye on hidden cost
● Training is key
Realtime
Space Age
Cloud PubSub
PubSub: collecting user behaviour
Cloud PubSub
App Engine
PubSub: publishing events on-premises
PubSub: 3rd-party integration
PubSub: publish via App Engine
PubSub: ingress
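As a rough illustration of the ingress path, a user-behaviour event can be serialized to bytes and handed to a publisher together with a few string attributes. The event fields, the topic name, and the helper names below are hypothetical; the slides do not show the actual message schema. The `publish` argument is a stand-in callable. With the google-cloud-pubsub client library, `PublisherClient.publish(topic, data, **attrs)` has the same shape.

```python
import json
from datetime import datetime, timezone


def build_event(event_type, user_id, payload, now=None):
    """Return (data, attributes) for one user-behaviour event."""
    now = now or datetime.now(timezone.utc)
    body = {
        "type": event_type,
        "user_id": user_id,
        "timestamp": now.isoformat(),
        "payload": payload,
    }
    # Pub/Sub message data is bytes; attributes are string key/value pairs.
    data = json.dumps(body, sort_keys=True).encode("utf-8")
    return data, {"type": event_type}


def publish_event(publish, topic, event_type, user_id, payload, now=None):
    """Serialize the event and pass it to the given publish callable."""
    data, attributes = build_event(event_type, user_id, payload, now)
    return publish(topic, data, **attributes)


# Stand-in publisher that records messages instead of sending them.
sent = []


def fake_publish(topic, data, **attributes):
    sent.append((topic, data, attributes))


publish_event(
    fake_publish, "projects/example/topics/user-events",
    "product_view", 42, {"product_id": "abc"},
    now=datetime(2016, 1, 31, 12, 0, tzinfo=timezone.utc),
)
```

Putting the event type in an attribute lets subscribers filter or route messages without parsing the JSON body.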
Cloud DataFlow
DataFlow: PubSub source
DataFlow (streaming) DataFlow (batch) BigQuery
DataFlow: streaming processing
DataFlow: streaming BigQuery inserts
DataFlow: real time dashboards
DataFlow: ingress backup
DataFlow: unified model
DataFlow: restore or new flow
DataFlow (batch) DataFlow (streaming) BigQuery
DataFlow: big real time picture
Real-time Learnings
● DataFlow is key
○ Unified model
○ Powerful semantics for streaming
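The idea behind the unified model can be sketched without any Dataflow code: if events are grouped by the time they happened rather than the time they arrived, the same function gives the same answer for a replayed backup (batch) and for a live feed (streaming). The plain-Python example below is an illustration of that idea only, not the Dataflow API. The event tuples and the 60-second window size are hypothetical.

```python
from collections import Counter


def window_start(timestamp, size):
    """Start of the fixed window (in seconds) that contains the timestamp."""
    return timestamp - (timestamp % size)


def count_per_window(events, size=60):
    """Count events per (window start, event type), using event time."""
    counts = Counter()
    for timestamp, event_type in events:
        counts[(window_start(timestamp, size), event_type)] += 1
    return dict(counts)


# Batch: a bounded list, for example events restored from a backup.
batch = [(0, "view"), (30, "view"), (61, "buy"), (75, "view")]


def live_stream():
    """Streaming stand-in: the same events, arriving out of order."""
    yield (61, "buy")
    yield (0, "view")
    yield (75, "view")
    yield (30, "view")


batch_counts = count_per_window(batch)
stream_counts = count_per_window(live_stream())
```

Both calls produce identical counts even though the events arrive in a different order. A real streaming engine also has to decide when a window is complete enough to emit, which this sketch leaves out.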
Complete picture
Fitting it all together
Current State of Affairs
Q&A
No state secrets

Google Cloud Platform at Vente-Exclusive.com