Simply Business’
Data Platform
By Dani Solà
1. Introductions
2. Some context
3. Data platform evolution
4. Cool stuff we’ve done
5. Lessons learned
6. Peeking into the future
7. References
Table of contents
1. Introductions
Nice to meet you
Hello! I’m Dani :)
This is Simply Business
● Largest UK business insurance provider
● Over 450,000 policy holders
● Using BML, tech and data to disrupt the business insurance market
● Acquired in 2016 (£120M) and again by Travelers in 2017 (£402M)
● #1 best company to work for in 2015 and 2016, among other awards
● Certified B Corporation since 2017
2. Context, context, co...!
Is everything
Mission:
To enable Simply Business to
create value through data
Data Environment - The 5Vs
● ⏬ Low volume: ~1M events/day
● High variety: nearly 100 event types and growing
● High velocity: sub-second for apps that need it
● ⏫ High veracity: using strong schemas for most data points
● ⏫ High value: as a data-driven company, all departments use data on a daily basis
Data and Analytics team values
● Simplicity: simple is easier to maintain and understand (it’s hard!)
● Adaptability: data tools and techniques change very fast, don’t fight it
● Empowerment and self-serve: we provide a platform that makes the easy things easy
● Pioneering: we push the boundaries of what’s possible with data
Data Platform Capabilities
● KPIs and MI: obviously
● Product Analytics: understand how our products perform
● Customer Analytics: understand how our customers behave
● Experimentation Tools: to test all our assumptions
● Data Integration: bringing all our data in one place
● Customer Comms: it’s very data intensive
● Machine Learning: because understanding the present is not enough!
Analytics usage
3. Data platform evolution
“Change is the only constant” - A data engineer
The batch days: 2014-2015
Team: 2-3 data platform engineers
Tech:
● Vanilla Snowplow Analytics for the event pipeline, which ran on EMR
● Homegrown Change Data Capture (CDC) pipeline to flatten MongoDB collections (see the flattening sketch below)
● Looker for web and product analytics, SQL Server for top-level KPIs
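The flattening step is the heart of such a CDC pipeline: nested MongoDB documents have to become flat, column-friendly rows before they can be loaded into Redshift. Below is a minimal, hedged sketch of that transformation in plain Scala; the document shape and field names are invented, and this is not the actual Simply Business code.

```scala
// Hedged sketch: flatten a nested MongoDB-style document into dot-notation
// columns, the basic transformation a CDC pipeline needs before loading into
// a columnar warehouse. Types and field names are illustrative only.
object FlattenDocument {
  def flatten(doc: Map[String, Any], prefix: String = ""): Map[String, Any] =
    doc.flatMap {
      case (key, nested: Map[_, _]) =>
        flatten(nested.asInstanceOf[Map[String, Any]], s"$prefix$key.")
      case (key, value) =>
        Map(s"$prefix$key" -> value)
    }

  def main(args: Array[String]): Unit = {
    val policy = Map(
      "_id" -> "abc123",
      "holder" -> Map("name" -> "Jane", "postcode" -> "EC1A 1BB"),
      "premium" -> 250.0
    )
    // => Map(_id -> abc123, holder.name -> Jane, holder.postcode -> EC1A 1BB, premium -> 250.0)
    println(flatten(policy))
  }
}
```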
[Architecture diagram: Sources → Ingest → Process → Store → Serve. The Website, MongoDB, Adwords, Email and other sources feed an Event Collector, the Change Data Capture pipeline and a Batch Importer; an hourly Scalding job on EMR processes events into S3 and Redshift, data modelling runs as cron jobs, and a Batch Exporter serves the results.]
NRT first steps: 2016-2017
Team: 3-4 data platform engineers
Changes:
● We added an NRT pipeline to expose event data back to transactional apps
● We used Kinesis as the message bus; we didn’t want to manage anything ourselves
● The data is stored in MongoDB for real-time access (see the sketch below)
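As an illustration of how the NRT leg can be wired, here is a hedged Spark Streaming sketch that reads from Kinesis in 4-second micro-batches; the application, stream and region names are placeholders and the MongoDB upsert is left as a comment.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

object EventStreamJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("event-stream")
    val ssc  = new StreamingContext(conf, Seconds(4))   // 4s micro-batches, as on the slide

    // Hypothetical stream/app names; KinesisUtils ships with spark-streaming-kinesis-asl
    val events = KinesisUtils.createStream(
      ssc, "event-stream-app", "enriched-events",
      "https://kinesis.eu-west-1.amazonaws.com", "eu-west-1",
      InitialPositionInStream.LATEST, Seconds(4), StorageLevel.MEMORY_AND_DISK_2)

    events
      .map(bytes => new String(bytes, "UTF-8"))          // each record is a serialised event
      .foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // upsert into MongoDB here so transactional apps can read it (client code omitted)
          records.foreach(println)
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```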
[Architecture diagram: as above, with the addition of a Spark Streaming job (4 s batches) that writes events into MongoDB, from which an API serves them back to transactional apps.]
Current pipeline: 2017-2018
Team: 4-5 data platform engineers
Tech:
● We have gone NRT by default; there’s no batch layer
● We’ve introduced Airflow for batch job orchestration
● We’ve got rid of S3 to comply with GDPR without having to fiddle with files (see the sketch below)
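To illustrate the "no files" idea, here is a hedged sketch of a data modelling batch job that reads and writes the warehouse directly over JDBC; the URL, table names and credentials are placeholders, and plain JDBC is shown only for simplicity (a bulk loader would be preferable at volume).

```scala
import org.apache.spark.sql.SparkSession

object QuotesDailyModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("quotes-daily-model").getOrCreate()

    // Placeholder connection details; in practice these come from configuration.
    // Redshift speaks the Postgres wire protocol, so the stock driver works here.
    val jdbcUrl = "jdbc:postgresql://example.eu-west-1.redshift.amazonaws.com:5439/analytics"
    def table(name: String) =
      spark.read.format("jdbc")
        .option("url", jdbcUrl)
        .option("dbtable", name)
        .option("user", sys.env("DB_USER"))
        .option("password", sys.env("DB_PASSWORD"))
        .load()

    val quotes = table("events_quote_started")   // hypothetical source table

    // Trivial illustrative model: quotes per day, written straight back to the
    // warehouse -- no intermediate S3 files to track down for GDPR erasure requests
    quotes
      .groupBy("quote_date")
      .count()
      .write.format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "model_quotes_daily")
      .option("user", sys.env("DB_USER"))
      .option("password", sys.env("DB_PASSWORD"))
      .mode("overwrite")
      .save()

    spark.stop()
  }
}
```

In this setup a job like the above would be scheduled by Airflow (for example as a spark-submit task) rather than by cron.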
[Architecture diagram: Sources → Ingest → Process → Store → Serve. The same sources and ingestion as before; Spark Streaming jobs process events in 3-minute batches into Redshift and in 4 s batches into MongoDB, data modelling is orchestrated by Airflow, and the results are served through a Batch Exporter and an API. S3 is gone.]
Potential changes in the near future
Migrate from Spark Streaming to Kafka Streams:
● Streaming-native API, much more powerful than Spark’s
● No need for external storage for stateful operations
● No need for a YARN or Mesos cluster; any JVM app can have a streaming component
● Can expose APIs to other services! (see the sketch below)
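For comparison, a minimal Kafka Streams topology in the Scala DSL is sketched below; the application id, topics and aggregation are illustrative, not a planned Simply Business topology. The count lives in an embedded, changelog-backed RocksDB store, so no external storage is needed, and interactive queries would let the same JVM expose that state over an HTTP API.

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

object EventCountsApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-counts")     // hypothetical app id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // Count events per type; the count is kept in a local RocksDB store that is
  // backed by a Kafka changelog topic, so no external database is needed
  builder
    .stream[String, String]("enriched-events")    // hypothetical topic, keyed by event type
    .groupByKey
    .count()
    .toStream
    .to("event-counts-by-type")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```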
Potential changes in the near future
Migrate from Redshift to Snowflake:
● Decoupling storage from processing
● Handles semi-structured data natively
● Lets us isolate workloads much better
● Near-instant scaling, including stopping the cluster when no one is using it
● Infinite storage!
Potential changes in the near future
Migrate from EMR to Databricks for Spark batch jobs:
● Would allow us to have a dedicated cluster per app
● Easier to upgrade to newer Spark versions
● No cluster maintenance required; clusters are transient
[Architecture diagram (potential future): same sources and ingestion; Kafka Streams replaces Spark Streaming, Snowflake replaces Redshift, data modelling stays on Airflow, and results are served through a Batch Exporter and a Kafka Streams app exposing an API.]
4. Cool stuff we’ve done
Not everything is infrastructure!
Full Contact - A Kafka Streams App
Full Contact is the brain behind the decisions related to calling Simply Business
customers and prospects. It decides:
● If we need to call someone
● The reason to call someone
● The importance of a call (priority)
● When to make the call (scheduling)
Visualization made with https://zz85.github.io/kafka-streams-viz/
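Purely as an illustration of the kind of output Full Contact produces, here is a toy decision function; the types, fields and rules are invented, not the real logic, which runs as a Kafka Streams topology over the event streams.

```scala
import java.time.Instant

// Illustrative types only -- not the real Full Contact model
final case class Prospect(id: String, quoteStarted: Boolean, purchased: Boolean,
                          optedOutOfCalls: Boolean, lastSeen: Instant)

final case class CallDecision(prospectId: String, reason: String,
                              priority: Int, callAfter: Instant)

object CallDecider {
  /** Decide if, why, how urgently and when to call a prospect. */
  def decide(p: Prospect, now: Instant): Option[CallDecision] =
    if (p.optedOutOfCalls || p.purchased) None          // never call these
    else if (p.quoteStarted)
      Some(CallDecision(p.id, reason = "abandoned quote",
                        priority = 1, callAfter = now.plusSeconds(30 * 60)))
    else
      Some(CallDecision(p.id, reason = "follow-up",
                        priority = 3, callAfter = now.plusSeconds(24 * 3600)))
}
```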
Visitor graphs analysis
We used GraphFrames to understand customer behaviour (a sketch of the approach follows below). We learned:
● Cross-device customer behaviour
● How people refer Simply Business to their friends
● That we have some brokers that buy on behalf of customers
Visualization made with gephi.org
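A hedged sketch of the approach: visitors become vertices, shared identifiers (a login, a quote reference, an email click) become edges, and connected components group the devices and sessions that belong to one person. The table and column names below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object VisitorGraph {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("visitor-graph").getOrCreate()
    import spark.implicits._

    // Hypothetical inputs: one row per visitor cookie, and one row per pair of
    // cookies linked by a shared identifier (login, quote reference, email click)
    val vertices = Seq(("v1", "desktop"), ("v2", "mobile"), ("v3", "desktop"))
      .toDF("id", "device")
    val edges = Seq(("v1", "v2", "login"), ("v2", "v3", "quote_ref"))
      .toDF("src", "dst", "link_type")

    val graph = GraphFrame(vertices, edges)

    // Connected components: each component is one person seen across devices
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")
    val people = graph.connectedComponents.run()
    people.groupBy("component").count().show()

    spark.stop()
  }
}
```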
Lead scoring
● We developed a lead scoring algorithm using AdaBoost which, based on customer behaviour, predicts how likely a customer is to convert (an illustrative sketch follows below)
● This approach notably improved retargeting efficiency
● We are now developing a streaming version using LightGBM to plug into Full Contact and improve call centre efficiency
● We can tune it so we don’t bother people who we think aren’t interested in buying at all
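Roughly, the batch version looks like the Spark ML pipeline sketched below. Note the stand-in: AdaBoost is not available in Spark ML, so a gradient-boosted tree classifier is used here purely for illustration, and the feature and table names are invented.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object LeadScoringTrain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lead-scoring-train").getOrCreate()

    // Hypothetical training set: one row per visitor with behavioural features
    // and a label saying whether they eventually bought a policy
    val training = spark.read.parquet("visitor_features")   // placeholder source

    val assembler = new VectorAssembler()
      .setInputCols(Array("pages_viewed", "quote_started", "days_since_first_visit"))
      .setOutputCol("features")

    val classifier = new GBTClassifier()      // stand-in for AdaBoost / LightGBM
      .setLabelCol("converted")
      .setFeaturesCol("features")
      .setMaxIter(50)

    val model = new Pipeline().setStages(Array(assembler, classifier)).fit(training)

    // Score current leads; the probability column feeds retargeting and, in the
    // streaming version, the Full Contact call prioritisation
    model.transform(spark.read.parquet("current_leads"))
      .select("lead_id", "probability")
      .show()

    spark.stop()
  }
}
```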
5. Lessons learned
Remember, these are our lessons
Distributed FS aren’t for everyone
Distributed FS have a set of properties that in many cases aren’t unique or all that useful:
● Immutability: really cool until you need to mutate data
● Distributed: there are many options for distributed storage
● Schema-less data ingestion: you need to know what you are storing, especially if it contains PII
● Files: do you really want to manage files?
● Other quirks: eventual consistency (S3), managing backups (HDFS), ...
Schemas everywhere!
Schemas are key to:
● Enforce data quality across multiple systems, right when the data is created (see the toy validator below)
● Allow multiple groups of people to talk and collaborate around data
● Make the data discoverable
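As a toy illustration of validating data right where it is created, here is a tiny hand-rolled Scala validator (Scala 2.12+ for the right-biased Either); a real pipeline would use JSON Schema or Avro rather than this, and the event shape is invented.

```scala
// Toy sketch: validate an event against its expected shape at creation time,
// instead of discovering bad data downstream. The event type is invented.
final case class QuoteStarted(quoteId: String, trade: String, timestamp: Long)

object QuoteStarted {
  def fromFields(fields: Map[String, String]): Either[String, QuoteStarted] =
    for {
      quoteId <- fields.get("quote_id").toRight("missing quote_id")
      trade   <- fields.get("trade").filter(_.nonEmpty).toRight("missing trade")
      ts      <- fields.get("timestamp").flatMap(s => scala.util.Try(s.toLong).toOption)
                   .toRight("timestamp is not a number")
    } yield QuoteStarted(quoteId, trade, ts)
}

object SchemaDemo extends App {
  // A producer rejects (or dead-letters) events that fail validation
  println(QuoteStarted.fromFields(Map("quote_id" -> "q-1", "trade" -> "plumber", "timestamp" -> "1527600000")))
  println(QuoteStarted.fromFields(Map("quote_id" -> "q-2", "trade" -> "")))
}
```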
Plan for flexibility and agility
Using the right tools, or our love-hate relationship with SQL:
● It’s great for querying, testing stuff and hacking things together quickly
● Not so good for building complex logic: lots of repetition and difficult to test
Make your architecture loosely coupled so that you can change bits at a time:
● Use Kafka to decouple real-time applications
● Use S3/HDFS/DB to decouple batch applications
6. Peeking into the future
Will probably get it wrong
Size doesn’t matter, so let’s go big
● Setting up and using “big data” tools is getting easier and easier
● Cloud providers and vendors host them for you
● Most tools are fine with small data volumes and scale horizontally
● CPU, storage and network are getting cheaper faster than (our) data needs grow
● Examples:
○ Spark: from a local notebook to processing petabytes
○ Kafka Streams: useful regardless of volume
Machine learning is commoditized
● Everyone is giving their algorithms away for free: TensorFlow, Keras, MLflow, …
● Cloud providers even provide infrastructure to train and serve models
● Invest in the things that will make a difference:
○ Skills
○ Data
Data and analytics are transactional
● Long gone are the days when data warehousing was done overnight and isolated
from the transactional systems
● Many products require real-time, reliable access to data systems:
○ Visible: Twitter reactions, bank account spending, ...
○ Invisible: marketing warehouses, transportation, recommenders, ...
The best is yet to come
● Data is one of the most effective competitive advantages; everyone will invest in it
● Data will be used to self-optimize pretty much everything that can be optimized
● Data-centric ways of thinking about software engineering:
○ Software changes constantly, but data survives much longer
○ Event-driven architectures and microservices
● Make sure you learn how to teach machines :)
7. References
Learning from the best
References
● The Art of Platform Thinking - ThoughtWorks
● Sharing is Caring: Multi-tenancy in Distributed Data Systems - Jay Kreps
● Machine Learning: The High-Interest Credit Card of Technical Debt - Google
● Ways to think about machine learning - Benedict Evans
Questions?