SlideShare a Scribd company logo
www.scout24.com
The Scout24 Data Platform
A Technical Deep Dive
Munich | March 20, 2019 |Christian Dietze, Olalekan Elesin, Raffael Dzikowski
5
Core Geographies
and an overall presence
in 18 countries
80m
Household Reach
2
Major Household Brand Names
Scout24 AG
• MDAX
• € 531.7 million revenue (2018)
Our technical evolution
Production
Database
Monolith app
Data Warehouse Data
Warehouse
Microservice
MicroserviceMicroservice
Microservice
MicroserviceMicroservice
Microservice
MicroserviceMicroservice
Microservice
MicroserviceMicroservice
Microservice
Microservice
Persistence
Persistence
PersistencePersistence
Persistence
PersistencePersistence
Our data warehouse was a bottle neck
Scout24 wants to become a truly data-driven company
Fast & easy data-driven
product development…
…supported by
Data & Analytics
Scout24 wants to become a truly data-driven company
Everywhere in the company... ...without bloating up
Data & Analytics
Our solution:
Build an internal “platform” for data
What is the Scout24 Data “Platform”?
We think of our Data Platform as a Product
• Just like AWS, Salesforce, etc. – the platform is a generic layer upon
which Scout24’s products can be built
• BUT, we have a very, very small number of customers.
• That means, product teams get personalized support and there is lots
of opportunity for collaboration.
We don’t dictate anything.
We just try to make certain things easier by offering a “paved path”
Product teams are fully empowered to make their own choices about
what is the best use of their resources.
“In almost all cases, we will
not mandate that internal
team use these platforms
and services—these
platform teams will have
to win over and satisfy
their internal customers,
sometimes even
competing with external
vendors.”
Guiding principle of the platform
• Autonomy for producers and consumers
Self-service Analytics
Self-service Data Ingestion
Self-service ETL
Data Landscape Manifesto
Data Landscape Manifesto
Data Landscape Manifesto
Data Landscape Manifesto
Data Landscape Manifesto
Data Landscape Manifesto
Data Landscape Manifesto
Data Landscape Manifesto
Data Landscape Manifesto
Data Landscape Manifesto – Purpose
Best Practices
Guidelines
Alignment
Principles
Responsi-
bilities
Data Landscape Manifesto – Purpose
Watch the Video | Read the Blog Post on Medium
Necessary Cultural Changes during Migration
from Data Warehouse to Cloud Data Platform
Dr. Sean Gustafson, Scout24 AG
Data Festival 2018
10,000 Foot Architecture Overview
Personal
Analytics Cluster
Datalake
AWS Cloud
Amazon
Aurora
Microservices
OneScout Hive Metastore
Data Center
Core DB
DataWario
Firehose /
Kafka / Rest
API
PM / Leader / Analyst
Data Scientist / Analyst
Engineer
Engineer
10,000 Foot Architecture Overview
Personal
Analytics Cluster
Datalake
AWS Cloud
Amazon
Aurora
Microservices
OneScout Hive Metastore
Data Center
Core DB
DataWario
Firehose /
Kafka / Rest
API
PM / Leader / Analyst
Data Scientist / Analyst
Engineer
Engineer
Section #3
Self-Service Analytics Section #1
Self-Service Ingestion
Data Producer /
Engineer
PM / Leader /
Analyst
Section #2
Self-Service ETL
Engineer
Self-Service Ingestion
Multi-Account Setting
Product
Account
Product
Account
…
Product
Account
Product
Account
Product
Account
…
Product
Account
Data Lake Account
ImmobilienScout24
Data Lake Account
AutoScout24
Data Lake Bucket Types
Regular Restricted Personal RestrictedRegular Personal
Raw Processed
Visibility Level
Data Lake Access Roles
Data Lake Account
ImmobilienScout24
Data Lake Account
AutoScout24
Regular DL Bucket Access Role
Restricted DL Bucket Access Roles
Personal DL Bucket Access Roles
Data Lake Ingestion Goals
Microservices
Ingestion
Goals
Batch Support
Streaming Support
Rest APIScalability
Ingestion Options
Amazon
Kinesis Data
Firehose
Kafka Connect
Hadoop Rest API
{ }
Ingestion Options
Amazon
Kinesis Data
Firehose
Kafka Connect
Hadoop Rest API
{ }
Simple
Config
Simple
Config
Configuration Free
Kinesis Data Firehose Ingestion Mechanism
Firehose Ingestion Architecture
Data Lake Account
Amazon
Kinesis Data
Firehose
Producer to
Firehose Role
Producer Account
Data Producer
Firehose to
Datalake Role
Datalake Bucket
Firehose Ingestion Architecture
Data Lake Account
Amazon
Kinesis Data
Firehose
Producer to
Firehose Role
Producer Account
Data Producer
Firehose to
Datalake Role
Datalake Bucket
STS Assume
Role
STS Assume
Role
Firehose Ingestion Architecture
Data Lake Account
Amazon
Kinesis Data
Firehose
Producer to
Firehose Role
Producer Account
Data Producer
Firehose to
Datalake Role
Datalake Bucket
STS Assume
Role
STS Assume
Role
Send Data
Firehose Ingestion Architecture
Data Lake Account
Amazon
Kinesis Data
Firehose
Producer to
Firehose Role
Producer Account
Data Producer
Firehose to
Datalake Role
Datalake Bucket
STS Assume
Role
STS Assume
Role
Send Data
Write Data
Kafka Connect Ingestion Mechanism
Kafka Connect
Kafka Connect
Elasticsearch
Amazon S3
ActiveMQ
Cassandra
Kafka
RDMBS
…
Kafka Connect
Elasticsearch
Amazon S3
ActiveMQ
Cassandra
Kafka
RDMBS
…
Kafka
Connect
Cluster
Kafka Connect
Elasticsearch
Amazon S3
ActiveMQ
Cassandra
Kafka
RDMBS
…
Kafka
Connect
Cluster
Read Data from
Topic(s)
Kafka Connect
Elasticsearch
Amazon S3
ActiveMQ
Cassandra
Kafka
RDMBS
…
Kafka
Connect
Cluster
Read Data from
Topic(s)
Write
Data
Scout24 Infinity Cluster
Amazon ECS
Infinity Service
Scout24 Infinity Cluster
Amazon ECS
Infinity Service
Simple
Config
Scout24 Infinity Cluster
Amazon ECS
Infinity Service
Simple
Config
Central Logging to
Elasticsearch
Monitoring in Datadog
Managed by Scout24
Cloud Platform
Engineering
Get the Slide Deck
To Infinity and Beyond Handling
Heterogenous Container Clusters in AWS
Christine Trahe, Scout24 AG
AWS Summit Berlin 2019
Kafka Connect on Infinity
Simple
Config
(Infinity)
Amazon ECS
Infinity Service
Kafka Connect on Infinity
Kafka Connect on Infinity
Kafka Connect Service
Infinity Service
Kafka Connect on Infinity
Simple
Config
(Kafka Connect)
Kafka Connect Service
Infinity Service
Kafka Connect Deployment
Write
Data
Read Data from
Topic(s)
Kafka
Connect
Cluster
Amazon S3
Elasticsearch
ActiveMQ
Cassandra
Kafka
RDMBS
…
Kafka Connect Deployment
Write
Data
Read Data from
Topic(s)
Kafka
Connect
Cluster
Amazon S3
Kafka Connect Deployment
Write
Data
Read Data from
Topic(s)
Amazon S3
Kafka Connect Deployment
Write
Data
Read Data from
Topic(s)
Amazon S3
Kafka
Connect
Service
Kafka Connect Deployment
Kafka Account Infinity Account Datalake Account
Datalake Bucket
Kafka Connect
Ingestion Role
EC2 Instances
Kafka Cluster
Amazon ECS
Infinity Cluster
Kafka Connect Deployment
Kafka Account Infinity Account Datalake Account
Datalake Bucket
Kafka Connect
Ingestion Role
EC2 Instances
Kafka Cluster
Kafka Connect
Container
Amazon ECS
Kafka Connect Service
(On Infinity)
Kafka Connect Deployment
Kafka Account Infinity Account Datalake Account
Datalake Bucket
Kafka Connect
Ingestion Role
EC2 Instances
Kafka Cluster
STS Assume
Role
Kafka Connect
Container
Amazon ECS
Kafka Connect Service
(On Infinity)
Kafka Connect Deployment
Kafka Account Infinity Account Datalake Account
Datalake Bucket
Kafka Connect
Ingestion Role
EC2 Instances
Kafka Cluster
STS Assume
Role
Kafka Connect
Container
Amazon ECS
Kafka Connect Service
(On Infinity)
Read Topic
Kafka Connect Deployment
Kafka Account Infinity Account Datalake Account
Datalake Bucket
Kafka Connect
Ingestion Role
EC2 Instances
Kafka Cluster
STS Assume
Role
Kafka Connect
Container
Amazon ECS
Kafka Connect Service
(On Infinity)
Read Topic
Write Topic
Hadoop Rest API Ingestion Mechanism
Hadoop Rest API – A Motivation
Hadoop Rest API
{ }
Unified Set of
Ingestion Operations
Performs Event
Validation
Adds Reception
Timestamp Field
Stable User-Facing
Endpoints, Varying
Backend
Hadoop Rest API – Operation Overview
Hadoop Rest API
{ }
Add
Event
Batch
Add
Single
Event
PUT
PUT
Automatic
Partitioning by
Ingestion Date
Persisting to Disk
Initially, Periodically
Flushing to Target
Store
Also Publishing to
Kinesis Stream
Hadoop Rest API – Architecture Evolution (Phase 1)
AWS Cloud Data Center
Load
Balancer
HRA Service Generation 1
Microservice
Datalake
Hadoop Rest API – Architecture Evolution (Phase 1)
AWS Cloud Data Center
Load
Balancer
HRA Service Generation 1
Microservice
Datalake
Send Events
Hadoop Rest API – Architecture Evolution (Phase 1)
AWS Cloud Data Center
Load
Balancer
HRA Service Generation 1
Microservice
Datalake
Send Events
Ingest
Hadoop Rest API – Architecture Evolution (Phase 2)
AWS Cloud Data Center
Load
Balancer
HRA Service Generation 1
Microservice
Datalake
Ingest
Send Events
Ingest
Hadoop Rest API – Architecture Evolution (Phase 2)
AWS Cloud Data Center
Load
Balancer
HRA Service Generation 1
Microservice
Datalake
Ingest
Send Events
Hadoop Rest API – Architecture Evolution (Phase 3)
AWS Cloud Data Center
Microservice
HRA Service Generation 2
Infinity Service
Datalake
Hadoop Rest API – Architecture Evolution (Phase 3)
AWS Cloud Data Center
Microservice
HRA Service Generation 2
Infinity Service
Datalake
Send Events
Hadoop Rest API – Architecture Evolution (Phase 3)
AWS Cloud Data Center
Microservice
HRA Service Generation 2
Infinity Service
Datalake
Send Events
Ingest
Self-Service ETL
Self Service ETL Requirements
Data Landscape Manifesto Principle #5: Data consumers are responsible for the definition and visualization of
metrics and for driving the implementation and maintenance of these metrics.
èData Consumers get a lot responsibility
èThe Data Platform must empower users to fulfill this responsibility
èIt must be easy to develop, test and maintain data processes
Multi-Account setup: Data processes must run in data consumer’s account
èNo centrally run processing tool
èAWS Managed service is preferred
AWS Data Pipeline
Team Account
Extract
Transform
Load
ETL runs in team account
AWS Data Pipeline
Team Account
Extract
Transform
Load
Teams run ETL in their accounts It‘s easy …. Or is it?
DataWario
to the rescue
DataWario
1. Client for AWS Data Pipeline
2. Transforms YAML into JSON format understood by Data Pipeline
3. Integrates with the Scout24 account setup and sets sensible configuration defaults
4. Augments Data Pipeline with additional features
5. Makes testing and debugging faster
DataWario – Architecture
Poll for Tasks
Start
Post Pipeline
Definition
Upload Artifacts
Any Scout24 Account
Data Center
Client with
DataWario
Server
Artifact Bucket
EC2 Instance
• Java Application, distributed as a JAR
• Shell wrapper (dw) for easy usage from the
command line
Deployment of a pipeline
1. User issues dw deploy
2. DataWario
1. uploads artifacts from user‘s machine to S3
2. generates JSON definition from YAML
3. creates data pipeline
3. AWS Data Pipeline runs the defined steps
Sensible defaults
EMR
release?
EC2
instance
type?
IAM role?
Subnet ID?
Sensible defaults
EMR
release?
EC2
instance
type?
IAM role?
Subnet ID?
emr:
emr-5.21.0
m5.xlarge
DataWarioResourceRole
subnet-12345
Augmenting data pipeline
ID Make Model Price Mileage
1 VW Golf 22000 15000
2 BMW 320d 35000 40000
„Built-In“ Copy Activity
1,VW,Golf,22000,15000
2,BMW,320d,35000,40000
• CSV (or TSV)
• No schema
Augmenting data pipeline
ID Make Model Price Mileage
1 VW Golf 22000 15000
2 BMW 320d 35000 40000
„Built-In“ Copy Activity
• CSV (or TSV)
• No schema
„Custom“ Copy Activity
+
1,VW,Golf,22000,15000
2,BMW,320d,35000,40000
Speed up pipeline development
starts
runs
fails
takes ca. 10 min + job runtime
dw deploy
Typical development cycle
Speed up test and debug cycle
Without test cluster
starts
runs
fails
takes ca. 10 min + job runtime
With test cluster
dw test-cluster
create
dw deploy
-- test-cluster
uses
fails
Typical development cycle
takes job runtime
dw deploy
runs
DataWario brought us from here …
photo by Dirk van der Made (https://commons.wikimedia.org/wiki/File:DirkvdM_cloudforest-
jungle.jpg), „DirkvdM cloudforest-jungle“, https://creativecommons.org/licenses/by-
sa/3.0/legalcode
to here
So, how do we get here?
Paving the path further
• We built a lot of pipelines, so did our users
• We see repetitive tasks emerge in these pipelines
• Reimplementing solutions over and over again is waste
• We don‘t like waste
è Let‘s build reusable tools for common tasks
Rob Pike; Brian W. Kernighan (October 1984). "Program Design in the UNIX Environment"
Much of the power of the UNIX operating system comes from a style of
program design that makes programs easy to use and, more important, easy
to combine with other programs. This style has been called the use of
software tools, and depends more on how the programs fit into the
programming environment and how they can be used with other programs
than on how they are designed internally. [...] This style was based on the use
of tools: using programs separately or in combination to get a job done,
rather than doing it by hand, by monolithic self-sufficient subsystems, or by
special-purpose, one-time programs.
Title | Your name
92
Tools that do one thing, and do that well
• Advantages:
− Easy to configure, only a few settings
− Easy to test, because there are less branches to cover
− Easy to debug, output of each tool can be checked independently
• Drawbacks:
− More intermittent data written
− longer runtimes
Event snapshotter
• Typical questions: What was the headline of a listing on Feb 22, 2016? Was it published? How about March 1?
• Input data: Change events for an entity (e.g. listing)
− User changed headline of a listing, number of pictures
− Listing got published/unpublished
• Snapshotter
1. takes snapshot from the day before
2. merges all change events from the day, keeps last
3. writes new snapshot for day
• Simple config (input and output path, name of ID and timestamp column, output path, partition schema)
• No business logic, agnostic of type of entity
Event Snapshotter Example
{"id": 1, "changed": "2019-01-01T15:31:00.000+01:00", "headline": „nice appartment", "published": false}
{"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice appartment", "published": false}
{"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice appartment", "published": true}
{"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice appartment", "published": true}
January 1, 2019 (s3://snapshotBucket/snapshot_day=2019-01-01/):
{"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice appartment", "published": false}
January 2, 2019 (s3://snapshotBucket/snapshot_day=2019-01-02/):
{"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice appartment", "published": true}
{"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice appartment", "published": true}
Aggregator
• Typical Questions:
− How many published listings did we have on Feb 1 2018?
− What‘s the daily average number of pageviews per car make?
• Input data:
− Interaction events of an entity ( e.g. page viewed, email sent, number called)
− output of Event Snapshotter
• Simple config (input and output paths, timestamp column, grouping column, aggregation method and column)
Spark Utils
• Collection of simple tools (usually one class per tool)
• Can be used
− standalone in a pipeline step
− as a dependency of your code
• Examples:
− Create a Hive table on-top of existing data
− Convert data from one Format to another (e.g. from JSON to Parquet for faster querying)
− Flatten nested structures
− publish data quality metrics to CloudWatch
− Materializing views
Current State / Outlook
• Data Pipeline with DataWario widely used throughout the company
• Challenges:
− Data Pipeline not getting much love from AWS (not deprecated, but not getting many new features)
− Coordination between data pipelines only rudimentary
• Possible solutions:
− AWS Step Functions in combination with AWS Glue
− Introduction of a Workflow tool (e.g. AirFlow)
Self-Service Analytics
Query Challenges
What’s Ahead
Unlock the Datalake for Scout24’s
Toolset and Users with Different
Skillsets
Data Analysis for Various User
Groups
Provide a Timely and Accurate
Update of the Metadata Layer
What’s Ahead
OneScout Hive Metastore
Unlock the Datalake for Scout24’s
Toolset and Users with Different
Skillsets
Data Analysis for Various User
Groups
Provide a Timely and Accurate
Update of the Metadata Layer
What’s Ahead
Personal Analytics ClusterOneScout Hive Metastore
Unlock the Datalake for Scout24’s
Toolset and Users with Different
Skillsets
Data Analysis for Various User
Groups
Provide a Timely and Accurate
Update of the Metadata Layer
What’s Ahead
Automatic Hive Partition DetectionPersonal Analytics ClusterOneScout Hive Metastore
Unlock the Datalake for Scout24’s
Toolset and Users with Different
Skillsets
Data Analysis for Various User
Groups
Provide a Timely and Accurate
Update of the Metadata Layer
OneScout Hive Metastore – A Schematic View
Personal Analytics
Cluster
Hive Tables and Presto Views
Datalake
OneScout Hive Metastore – Recap of Ecosystem
Product
Account
Product
Account
…
Product
Account
Product
Account
Product
Account
…
Product
Account
Data Lake Account
ImmobilienScout24
Data Lake Account
AutoScout24
The Scout24 Hive Metastore Proxy – A Motivation
Goal: One Long-Running Metastore
DB every EMR connects to
Ideal Situation
Solution for Multi-Account-Setting
The Scout24 Hive Metastore Proxy – A Motivation
Goal: One Long-Running Metastore
DB every EMR connects to
Ideal Situation
Solution for Multi-Account-Setting
Amazon
Aurora
Amazon EMR
Amazon EMRAmazon EMR
Amazon EMR
Central Datalake Account
The Scout24 Hive Metastore Proxy – A Motivation
Goal: One Long-Running Metastore
DB every EMR connects to
Ideal Situation
Solution for Multi-Account-Setting
Amazon
Aurora
Amazon EMR
Amazon EMRAmazon EMR
Amazon EMR
Central Datalake Account
Connect Connect
Goal: One Long-Running Metastore
DB every EMR connects to
Ideal Situation
Solution for Multi-Account-Setting
Amazon
Aurora
Amazon EMR
Amazon EMRAmazon EMR
Amazon EMR
Central Datalake Account
Connect Connect
Read Data Read Data
The Scout24 Hive Metastore Proxy – A Motivation
Goal: One Long-Running Metastore
DB every EMR connects to
Ideal Situation
Solution for Multi-Account-Setting
Amazon
Aurora
Amazon EMR
Amazon EMRAmazon EMR
Amazon EMR
Central Datalake Account
Connect Connect
Read Data Read Data
Amazon EMR
Proxy Instance
Amazon
Aurora
Proxy Instance
Product Account Product Account
…
Metastore Account
Amazon EMR
The Scout24 Hive Metastore Proxy – A Motivation
Goal: One Long-Running Metastore
DB every EMR connects to
Ideal Situation
Solution for Multi-Account-Setting
Amazon
Aurora
Amazon EMR
Amazon EMRAmazon EMR
Amazon EMR
Central Datalake Account
Connect Connect
Read Data Read Data
Amazon EMR
Proxy Instance
Amazon
Aurora
Proxy Instance
Product Account Product Account
…
Metastore Account
Amazon EMR
ConnectConnect
The Scout24 Hive Metastore Proxy – A Motivation
Goal: One Long-Running Metastore
DB every EMR connects to
Ideal Situation
Solution for Multi-Account-Setting
Amazon
Aurora
Amazon EMR
Amazon EMRAmazon EMR
Amazon EMR
Central Datalake Account
Connect Connect
Read Data Read Data
Amazon EMR
Proxy Instance
Amazon
Aurora
Proxy Instance
Product Account Product Account
…
Metastore Account
Amazon EMR
ConnectConnect
Forward
Request
Forward
Request
The Scout24 Hive Metastore Proxy – A Motivation
Automated Partition Detection – A Motivation
Personal Analytics
Cluster
Hive Tables and Presto Views
Datalake
Automated Partition Detection – A Motivation
Personal Analytics
Cluster
Hive Tables and Presto Views
Datalake
Partitioned Table
Automated Partition Detection – A Motivation
Personal Analytics
Cluster
Hive Tables and Presto Views
Datalake
Partitioned Table
Data Ingestion
Automated Partition Detection – A Motivation
Personal Analytics
Cluster
Hive Tables and Presto Views
Datalake
Partitioned Table
Data Ingestion
Automated Partition Detection – A Motivation
Personal Analytics
Cluster
Hive Tables and Presto Views
Datalake
Partitioned Table
Data Ingestion
Table Access
Automated Partition Detection – A Motivation
Personal Analytics
Cluster
Hive Tables and Presto Views
Datalake
Partitioned Table
Data Ingestion
Table Access
Automated Partition Detection – A Motivation
Personal Analytics
Cluster
Hive Tables and Presto Views
Datalake
Partitioned Table
Data Ingestion
Table Access
Automatic Partition Detection
Automated Partition Detection – A Motivation
Personal Analytics
Cluster
Hive Tables and Presto Views
Datalake
Partitioned Table
Data Ingestion
Table Access
Automatic Partition Detection
Partition Detection Architecture
Metastore Account
VPC (in eu-west-1) Subnet
Datalake Account
Amazon
CloudWatch
AWS
CloudTrail
Datalake
Aggregate Events by Key
Filter Bucket Notifications
Partition DetectorPartition Detection
Events Topic
Amazon
Aurora
Metadata Item
Creator
Create Table Events
Topic
Table Trigger
Partition Info
Table
Partition Detection Architecture
Metastore Account
VPC (in eu-west-1) Subnet
Datalake Account
Amazon
CloudWatch
AWS
CloudTrail
Datalake
Aggregate Events by Key
Filter Bucket Notifications
Partition DetectorPartition Detection
Events Topic
Amazon
Aurora
Metadata Item
Creator
Create Table Events
Topic
Table Trigger
Partition Info
Table
Trigger
Publish
Partition Detection Architecture
Metastore Account
VPC (in eu-west-1) Subnet
Datalake Account
Amazon
CloudWatch
AWS
CloudTrail
Datalake
Aggregate Events by Key
Filter Bucket Notifications
Partition DetectorPartition Detection
Events Topic
Amazon
Aurora
Metadata Item
Creator
Create Table Events
Topic
Table Trigger
Partition Info
Table
Partition Detection Architecture
Metastore Account
VPC (in eu-west-1) Subnet
Datalake Account
Amazon
CloudWatch
AWS
CloudTrail
Datalake
Aggregate Events by Key
Filter Bucket Notifications
Partition DetectorPartition Detection
Events Topic
Amazon
Aurora
Metadata Item
Creator
Create Table Events
Topic
Table Trigger
Partition Info
Table
Data Events
Partition Detection Architecture
Metastore Account
VPC (in eu-west-1) Subnet
Datalake Account
Amazon
CloudWatch
AWS
CloudTrail
Datalake
Aggregate Events by Key
Filter Bucket Notifications
Partition DetectorPartition Detection
Events Topic
Amazon
Aurora
Metadata Item
Creator
Create Table Events
Topic
Table Trigger
Partition Info
Table
Data Events
Transfer to CloudWatch
Logs
Partition Detection Architecture
Metastore Account
VPC (in eu-west-1) Subnet
Datalake Account
Amazon
CloudWatch
AWS
CloudTrail
Datalake
Aggregate Events by Key
Filter Bucket Notifications
Partition DetectorPartition Detection
Events Topic
Amazon
Aurora
Metadata Item
Creator
Create Table Events
Topic
Table Trigger
Partition Info
Table
Data Events
Transfer to CloudWatch
Logs
Apply Rule
Partition Detection Architecture
Metastore Account
VPC (in eu-west-1) Subnet
Datalake Account
Amazon
CloudWatch
AWS
CloudTrail
Datalake
Aggregate Events by Key
Filter Bucket Notifications
Partition DetectorPartition Detection
Events Topic
Amazon
Aurora
Metadata Item
Creator
Create Table Events
Topic
Table Trigger
Partition Info
Table
Data Events
Transfer to CloudWatch
Logs
Apply Rule
Send to Partition
Detector
Partition Detection Architecture
Metastore Account
VPC (in eu-west-1) Subnet
Datalake Account
Amazon
CloudWatch
AWS
CloudTrail
Datalake
Aggregate Events by Key
Filter Bucket Notifications
Partition DetectorPartition Detection
Events Topic
Amazon
Aurora
Metadata Item
Creator
Create Table Events
Topic
Table Trigger
Partition Info
Table
Data Events
Transfer to CloudWatch
Logs
Apply Rule
Send to Partition
Detector
Get Bucket
Notification
Partition Detection Architecture
Metastore Account
VPC (in eu-west-1) Subnet
Datalake Account
Amazon
CloudWatch
AWS
CloudTrail
Datalake
Aggregate Events by Key
Filter Bucket Notifications
Partition DetectorPartition Detection
Events Topic
Amazon
Aurora
Metadata Item
Creator
Create Table Events
Topic
Table Trigger
Partition Info
Table
Data Events
Transfer to CloudWatch
Logs
Apply Rule
Send to Partition
Detector
Get Bucket
Notification Add Partition
Get Partition
Info
Personal Analytics Cluster
The Personal Analytics Cluster – An Overview
Amazon EMR
Personal Analytics
Cluster
AWS
CloudFormation
The Personal Analytics Cluster – An Overview
The Personal Analytics Cluster – An Overview
Amazon EMR
Personal Analytics
Cluster
AWS
CloudFormation
The Personal Analytics Cluster – An Overview
Amazon EMR
Personal Analytics
Cluster
Simple
Config
AWS
CloudFormation
The Personal Analytics Cluster – An Overview
Amazon EMR
Personal Analytics
Cluster
Simple
Config
Easy Access via Web
Interface
Zeppelin and Jupyter
Notebook Restore
OneClick Deployment
Managed Scaling and
Shutdown
Support for Pre-baked
AMIs and Configs
AWS
CloudFormation
The Personal Analytics Cluster – Configuration Setup
• Cluster Name (default is usually username)
• AWS Account Name
• Market segment
• Use Case
• Shutdown time
• Instance Type
• Bid Price
• Key Pair
• EBS Volume Size
• Tear Down Existing Cluster (optional)
$ ./deploy-cluster --cluster-name <> 
--account-name <> 
--market-segment <> 
--usecase <> 
--shutdown-time <> 
--instance-type <> 
--bid-price <> 
--key-pair <> 
--ebs-volume-size <> 
--teardown-existing <>
The Personal Analytics Cluster – Configuration Setup
The Personal Analytics Cluster – Data Lake Access
Regular Restricted Personal RestrictedRegular Personal
Raw Processed
Visibility Level
Data Lake Account
ImmobilienScout24
Data Lake Account
AutoScout24
Regular DL Bucket Access Role
Restricted DL Bucket Access Roles
Personal DL Bucket Access Roles
The Personal Analytics Cluster – Data Lake Access
• Managed with AWS EMR FS Security Configuration
• Integrates with EMR Applications internally; Spark, Presto, etc
• Easy to define
• Maps a defined IAM role to S3 bucket being accessed
The Personal Analytics Cluster – Data Lake Access
Scenario One:
Non-restricted Data Lake Access
The Personal Analytics Cluster – Data Lake Access
Amazon EMR
PAC
Segment
Account
AWS Cloud
emrfs-site.xml
AWS Cloud
Regular Restricted Personal
Non-restricted
Role
Restricted Role
Personal Role
my_data = spark.read.parquet(“s3://our-regular-datalake-bucket/some-data”)
Data Lake
Account
The Personal Analytics Cluster – Data Lake Access
Amazon EMR
PAC
Segment
Account
AWS Cloud
emrfs-site.xml
my_data = spark.read.parquet(“s3://our-regular-datalake-bucket/some-data”)
AWS Cloud
Regular Restricted Personal
EMR-FS
Role-to-Bucket
Mapping
Non-restricted
Role
Restricted Role
Personal Role
Data Lake
Account
The Personal Analytics Cluster – Data Lake Access
Amazon EMR
PAC
Segment
Account
AWS Cloud
Non-restricted
Role
Restricted Role
Personal Role
emrfs-site.xml
my_data = spark.read.parquet(“s3://our-regular-datalake-bucket/some-data”)
AWS Cloud
Regular Restricted Personal
EMR-FS
Role-to-Bucket
Mapping
STS Assume Role
Data Lake
Account
The Personal Analytics Cluster – Data Lake Access
Amazon EMR
PAC
Segment
Account
AWS Cloud
Non-restricted
Role
Restricted Role
Personal Role
emrfs-site.xml
my_data = spark.read.parquet(“s3://our-regular-datalake-bucket/some-data”)
AWS Cloud
Regular Restricted Personal
EMR-FS
Role-to-Bucket
Mapping
STS Assume Role
Data Lake
Account
The Personal Analytics Cluster – Data Lake Access
Scenario Two:
Restricted/Personal Data Lake Access
The Personal Analytics Cluster – Data Lake Access
Amazon EMR
PAC
Segment
Account
AWS Cloud
emrfs-site.xml
my_data = spark.read.parquet(“s3://our-restricted-datalake-bucket/some-data”)
AWS Cloud
Regular Restricted Personal
EMR-FS
Role-to-Bucket
Mapping
Restricted Role
Non-restricted
Role
Personal Role
Data Lake
Account
The Personal Analytics Cluster – Data Lake Access
Amazon EMR
PAC
Segment
Account
AWS Cloud
emrfs-site.xml
my_data = spark.read.parquet(“s3://our-restricted-datalake-bucket/some-data”)
AWS Cloud
Regular Restricted Personal
EMR-FS
Role-to-Bucket
Mapping
STS Assume Role
Restricted Role
Non-restricted
Role
Personal Role
Data Lake
Account
The Personal Analytics Cluster – Data Lake Access
Amazon EMR
PAC
Segment A
Account
AWS Cloud
Restricted Role
Non-restricted
Role
Personal Role
emrfs-site.xml
my_data = spark.read.parquet(“s3://our-restricted-datalake-bucket/some-data”)
AWS Cloud
Regular Restricted Personal
EMR-FS
Role-to-Bucket
Mapping
STS Assume Role
Data Lake
Account
The Personal Analytics Cluster – Data Lake Access
Amazon EMR
PAC
Segment B
Account
AWS Cloud
Restricted Role
Non-restricted
Role
Personal Role
emrfs-site.xml
my_data = spark.read.parquet(“s3://our-restricted-datalake-bucket/some-data”)
AWS Cloud
Regular Restricted Personal
EMR-FS
Role-to-Bucket
Mapping
STS Assume Role
Data Lake
Account
Data Platform SQL Access
Our Journey to Presto
Our Journey to Presto
On EMR
Cross Account Support
(OneScout Hive
Metastore)
Leverages Datalake
Access Roles (EMRFS)
Scheduled Scaling
Configurations
Fits our GDPR Concept
(multiple isolated
Clusters)
Our Journey to Presto
Average Presto Queries per Day
360Queries per Day
Number of Queries
We hope to throw out
most of the custom components
we build.
Our data warehouse was a bottle neck
Our data platform holds nothing back
Christian Dietze Olalekan Elesin Raffael Dzikowski
Senior Data Engineer Data Engineer Senior Data Engineer
Scout24 Group Scout24 Group Scout24 Group
Thank you for your
attention!

More Related Content

What's hot

Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Digital banking on AWS
Digital banking on AWSDigital banking on AWS
Digital banking on AWS
Pham Anh Vu
 
ENT203-Building a Solid Business Case for Cloud Migration.pdf
ENT203-Building a Solid Business Case for Cloud Migration.pdfENT203-Building a Solid Business Case for Cloud Migration.pdf
ENT203-Building a Solid Business Case for Cloud Migration.pdf
Amazon Web Services
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at Scale
DATAVERSITY
 
Cloud Operating Model Design
Cloud Operating Model DesignCloud Operating Model Design
Cloud Operating Model Design
Joseph Schwartz
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra
 
With events to a modern integration architecture
With events to a modern integration architectureWith events to a modern integration architecture
With events to a modern integration architecture
confluent
 
Rise with SAP
Rise with SAPRise with SAP
Rise with SAP
AGSanePLDTCompany
 
Platform & Application Modernization
Platform & Application ModernizationPlatform & Application Modernization
Platform & Application Modernization
JK Tech
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
ConfluentInc1
 
Accenture Cloud Platform: Control, Manage and Govern the Enterprise Cloud
Accenture Cloud Platform: Control, Manage and Govern the Enterprise CloudAccenture Cloud Platform: Control, Manage and Govern the Enterprise Cloud
Accenture Cloud Platform: Control, Manage and Govern the Enterprise Cloud
accenture
 
What’s new in OpenText Extended ECM & OpenText Content Suite Release 16 EP6
What’s new in OpenText Extended ECM & OpenText Content Suite Release 16 EP6 What’s new in OpenText Extended ECM & OpenText Content Suite Release 16 EP6
What’s new in OpenText Extended ECM & OpenText Content Suite Release 16 EP6
OpenText
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business Goals
DATAVERSITY
 
Managed Feature Store for Machine Learning
Managed Feature Store for Machine LearningManaged Feature Store for Machine Learning
Managed Feature Store for Machine Learning
Logical Clocks
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data Flows
Thomas Sykes
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Cathrine Wilhelmsen
 
Accenture-Cloud-Data-Migration-POV-Final.pdf
Accenture-Cloud-Data-Migration-POV-Final.pdfAccenture-Cloud-Data-Migration-POV-Final.pdf
Accenture-Cloud-Data-Migration-POV-Final.pdf
Rajvir Kaushal
 
Lift & Shift to Azure
Lift & Shift to AzureLift & Shift to Azure
Lift & Shift to Azure
Jochen Kirstätter
 
A Comparison of Cloud based ERP Systems
A Comparison of Cloud based ERP SystemsA Comparison of Cloud based ERP Systems
A Comparison of Cloud based ERP SystemsNakul Patel
 

What's hot (20)

Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Digital banking on AWS
Digital banking on AWSDigital banking on AWS
Digital banking on AWS
 
ENT203-Building a Solid Business Case for Cloud Migration.pdf
ENT203-Building a Solid Business Case for Cloud Migration.pdfENT203-Building a Solid Business Case for Cloud Migration.pdf
ENT203-Building a Solid Business Case for Cloud Migration.pdf
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at Scale
 
Cloud Operating Model Design
Cloud Operating Model DesignCloud Operating Model Design
Cloud Operating Model Design
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
With events to a modern integration architecture
With events to a modern integration architectureWith events to a modern integration architecture
With events to a modern integration architecture
 
Rise with SAP
Rise with SAPRise with SAP
Rise with SAP
 
Platform & Application Modernization
Platform & Application ModernizationPlatform & Application Modernization
Platform & Application Modernization
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
 
Accenture Cloud Platform: Control, Manage and Govern the Enterprise Cloud
Accenture Cloud Platform: Control, Manage and Govern the Enterprise CloudAccenture Cloud Platform: Control, Manage and Govern the Enterprise Cloud
Accenture Cloud Platform: Control, Manage and Govern the Enterprise Cloud
 
What’s new in OpenText Extended ECM & OpenText Content Suite Release 16 EP6
What’s new in OpenText Extended ECM & OpenText Content Suite Release 16 EP6 What’s new in OpenText Extended ECM & OpenText Content Suite Release 16 EP6
What’s new in OpenText Extended ECM & OpenText Content Suite Release 16 EP6
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business Goals
 
Managed Feature Store for Machine Learning
Managed Feature Store for Machine LearningManaged Feature Store for Machine Learning
Managed Feature Store for Machine Learning
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data Flows
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
Accenture-Cloud-Data-Migration-POV-Final.pdf
Accenture-Cloud-Data-Migration-POV-Final.pdfAccenture-Cloud-Data-Migration-POV-Final.pdf
Accenture-Cloud-Data-Migration-POV-Final.pdf
 
Lift & Shift to Azure
Lift & Shift to AzureLift & Shift to Azure
Lift & Shift to Azure
 
A Comparison of Cloud based ERP Systems
A Comparison of Cloud based ERP SystemsA Comparison of Cloud based ERP Systems
A Comparison of Cloud based ERP Systems
 

Similar to The Scout24 Data Platform (A Technical Deep Dive)

AWS Startup Insights Kuala Lumpur
AWS Startup Insights Kuala LumpurAWS Startup Insights Kuala Lumpur
AWS Startup Insights Kuala Lumpur
Amazon Web Services
 
Vancouver keynote - AWS Innovate - Sam Elmalak
Vancouver keynote - AWS Innovate - Sam ElmalakVancouver keynote - AWS Innovate - Sam Elmalak
Vancouver keynote - AWS Innovate - Sam Elmalak
Amazon Web Services
 
Operating Microservices at Hyperscale — Tech in Asia PDC 2019
Operating Microservices at Hyperscale — Tech in Asia PDC 2019Operating Microservices at Hyperscale — Tech in Asia PDC 2019
Operating Microservices at Hyperscale — Tech in Asia PDC 2019
Donnie Prakoso
 
20141021 AWS Cloud Taekwon - Big Data on AWS
20141021 AWS Cloud Taekwon - Big Data on AWS20141021 AWS Cloud Taekwon - Big Data on AWS
20141021 AWS Cloud Taekwon - Big Data on AWS
Amazon Web Services Korea
 
Slides: Proven Strategies for Hybrid Cloud Computing with Mainframes — From A...
Slides: Proven Strategies for Hybrid Cloud Computing with Mainframes — From A...Slides: Proven Strategies for Hybrid Cloud Computing with Mainframes — From A...
Slides: Proven Strategies for Hybrid Cloud Computing with Mainframes — From A...
DATAVERSITY
 
AWS Startup Insights Singapore
AWS Startup Insights SingaporeAWS Startup Insights Singapore
AWS Startup Insights Singapore
Amazon Web Services
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDB
confluent
 
Opening Keynote - AWS Summit SG 2017
Opening Keynote - AWS Summit SG 2017Opening Keynote - AWS Summit SG 2017
Opening Keynote - AWS Summit SG 2017
Amazon Web Services
 
Opening Keynote - AWS Summit SG 2017
Opening Keynote - AWS Summit SG 2017Opening Keynote - AWS Summit SG 2017
Opening Keynote - AWS Summit SG 2017
Amazon Web Services
 
AWSome Day Manchester 2105 - Intro/Close
AWSome Day Manchester 2105 - Intro/CloseAWSome Day Manchester 2105 - Intro/Close
AWSome Day Manchester 2105 - Intro/Close
Ian Massingham
 
Intro Presentation at AWS AWSome Day London September 2015
Intro Presentation at AWS AWSome Day London September 2015Intro Presentation at AWS AWSome Day London September 2015
Intro Presentation at AWS AWSome Day London September 2015
Ian Massingham
 
Semplificare l'analisi dei dati con architetture "Serverless": architetture e...
Semplificare l'analisi dei dati con architetture "Serverless": architetture e...Semplificare l'analisi dei dati con architetture "Serverless": architetture e...
Semplificare l'analisi dei dati con architetture "Serverless": architetture e...
Amazon Web Services
 
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
Amazon Web Services
 
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
Amazon Web Services
 
Big Data on AWS
Big Data on AWSBig Data on AWS
Big Data on AWS
Amazon Web Services
 
AWS Summit Atlanta Keynote
AWS Summit Atlanta KeynoteAWS Summit Atlanta Keynote
AWS Summit Atlanta Keynote
Kristana Kane
 
Confluent_AWS_ImmersionDay_Q42023.pdf
Confluent_AWS_ImmersionDay_Q42023.pdfConfluent_AWS_ImmersionDay_Q42023.pdf
Confluent_AWS_ImmersionDay_Q42023.pdf
Ahmed791434
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
confluent
 
AWS AWSome Day London October 2015
AWS AWSome Day London October 2015 AWS AWSome Day London October 2015
AWS AWSome Day London October 2015
Ian Massingham
 
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and LogstashKeeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Amazon Web Services
 

Similar to The Scout24 Data Platform (A Technical Deep Dive) (20)

AWS Startup Insights Kuala Lumpur
AWS Startup Insights Kuala LumpurAWS Startup Insights Kuala Lumpur
AWS Startup Insights Kuala Lumpur
 
Vancouver keynote - AWS Innovate - Sam Elmalak
Vancouver keynote - AWS Innovate - Sam ElmalakVancouver keynote - AWS Innovate - Sam Elmalak
Vancouver keynote - AWS Innovate - Sam Elmalak
 
Operating Microservices at Hyperscale — Tech in Asia PDC 2019
Operating Microservices at Hyperscale — Tech in Asia PDC 2019Operating Microservices at Hyperscale — Tech in Asia PDC 2019
Operating Microservices at Hyperscale — Tech in Asia PDC 2019
 
20141021 AWS Cloud Taekwon - Big Data on AWS
20141021 AWS Cloud Taekwon - Big Data on AWS20141021 AWS Cloud Taekwon - Big Data on AWS
20141021 AWS Cloud Taekwon - Big Data on AWS
 
Slides: Proven Strategies for Hybrid Cloud Computing with Mainframes — From A...
Slides: Proven Strategies for Hybrid Cloud Computing with Mainframes — From A...Slides: Proven Strategies for Hybrid Cloud Computing with Mainframes — From A...
Slides: Proven Strategies for Hybrid Cloud Computing with Mainframes — From A...
 
AWS Startup Insights Singapore
AWS Startup Insights SingaporeAWS Startup Insights Singapore
AWS Startup Insights Singapore
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDB
 
Opening Keynote - AWS Summit SG 2017
Opening Keynote - AWS Summit SG 2017Opening Keynote - AWS Summit SG 2017
Opening Keynote - AWS Summit SG 2017
 
Opening Keynote - AWS Summit SG 2017
Opening Keynote - AWS Summit SG 2017Opening Keynote - AWS Summit SG 2017
Opening Keynote - AWS Summit SG 2017
 
AWSome Day Manchester 2105 - Intro/Close
AWSome Day Manchester 2105 - Intro/CloseAWSome Day Manchester 2105 - Intro/Close
AWSome Day Manchester 2105 - Intro/Close
 
Intro Presentation at AWS AWSome Day London September 2015
Intro Presentation at AWS AWSome Day London September 2015Intro Presentation at AWS AWSome Day London September 2015
Intro Presentation at AWS AWSome Day London September 2015
 
Semplificare l'analisi dei dati con architetture "Serverless": architetture e...
Semplificare l'analisi dei dati con architetture "Serverless": architetture e...Semplificare l'analisi dei dati con architetture "Serverless": architetture e...
Semplificare l'analisi dei dati con architetture "Serverless": architetture e...
 
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
 
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
 
Big Data on AWS
Big Data on AWSBig Data on AWS
Big Data on AWS
 
AWS Summit Atlanta Keynote
AWS Summit Atlanta KeynoteAWS Summit Atlanta Keynote
AWS Summit Atlanta Keynote
 
Confluent_AWS_ImmersionDay_Q42023.pdf
Confluent_AWS_ImmersionDay_Q42023.pdfConfluent_AWS_ImmersionDay_Q42023.pdf
Confluent_AWS_ImmersionDay_Q42023.pdf
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
AWS AWSome Day London October 2015
AWS AWSome Day London October 2015 AWS AWSome Day London October 2015
AWS AWSome Day London October 2015
 
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and LogstashKeeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
 

Recently uploaded

Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
AnirbanRoy608946
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 

Recently uploaded (20)

Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 

The Scout24 Data Platform (A Technical Deep Dive)

  • 1. www.scout24.com The Scout24 Data Platform A Technical Deep Dive Munich | March 20, 2019 |Christian Dietze, Olalekan Elesin, Raffael Dzikowski
  • 2. 5 Core Geographies and an overall presence in 18 countries 80m Household Reach 2 Major Household Brand Names Scout24 AG • MDAX • € 531.7 million revenue (2018)
  • 3. Our technical evolution Production Database Monolith app Data Warehouse Data Warehouse Microservice MicroserviceMicroservice Microservice MicroserviceMicroservice Microservice MicroserviceMicroservice Microservice MicroserviceMicroservice Microservice Microservice Persistence Persistence PersistencePersistence Persistence PersistencePersistence
  • 4. Our data warehouse was a bottle neck
  • 5. Scout24 wants to become a truly data-driven company Fast & easy data-driven product development… …supported by Data & Analytics
  • 6. Scout24 wants to become a truly data-driven company Everywhere in the company... ...without bloating up Data & Analytics
  • 7. Our solution: Build an internal “platform” for data
  • 8. What is the Scout24 Data “Platform”?
  • 9. We think of our Data Platform as a Product • Just like AWS, Salesforce, etc. – the platform is a generic layer upon which Scout24’s products can be built • BUT, we have a very, very small number of customers. • That means, product teams get personalized support and there is lots of opportunity for collaboration.
  • 10. We don’t dictate anything. We just try to make certain things easier by offering a “paved path” Product teams are fully empowered to make their own choices about what is the best use of their resources.
  • 11. “In almost all cases, we will not mandate that internal team use these platforms and services—these platform teams will have to win over and satisfy their internal customers, sometimes even competing with external vendors.”
  • 12. Guiding principle of the platform • Autonomy for producers and consumers Self-service Analytics Self-service Data Ingestion Self-service ETL
  • 22. Data Landscape Manifesto – Purpose Best Practices Guidelines Alignment Principles Responsi- bilities
  • 24. Watch the Video | Read the Blog Post on Medium Necessary Cultural Changes during Migration from Data Warehouse to Cloud Data Platform Dr. Sean Gustafson, Scout24 AG Data Festival 2018
  • 25. 10,000 Foot Architecture Overview Personal Analytics Cluster Datalake AWS Cloud Amazon Aurora Microservices OneScout Hive Metastore Data Center Core DB DataWario Firehose / Kafka / Rest API PM / Leader / Analyst Data Scientist / Analyst Engineer Engineer
  • 26. 10,000 Foot Architecture Overview Personal Analytics Cluster Datalake AWS Cloud Amazon Aurora Microservices OneScout Hive Metastore Data Center Core DB DataWario Firehose / Kafka / Rest API PM / Leader / Analyst Data Scientist / Analyst Engineer Engineer Section #3 Self-Service Analytics Section #1 Self-Service Ingestion Data Producer / Engineer PM / Leader / Analyst Section #2 Self-Service ETL Engineer
  • 28.
  • 30. Data Lake Bucket Types Regular Restricted Personal RestrictedRegular Personal Raw Processed Visibility Level
  • 31. Data Lake Access Roles Data Lake Account ImmobilienScout24 Data Lake Account AutoScout24 Regular DL Bucket Access Role Restricted DL Bucket Access Roles Personal DL Bucket Access Roles
  • 32. Data Lake Ingestion Goals Microservices Ingestion Goals Batch Support Streaming Support Rest APIScalability
  • 34. Ingestion Options Amazon Kinesis Data Firehose Kafka Connect Hadoop Rest API { } Simple Config Simple Config Configuration Free
  • 35. Kinesis Data Firehose Ingestion Mechanism
  • 36. Firehose Ingestion Architecture Data Lake Account Amazon Kinesis Data Firehose Producer to Firehose Role Producer Account Data Producer Firehose to Datalake Role Datalake Bucket
  • 37. Firehose Ingestion Architecture Data Lake Account Amazon Kinesis Data Firehose Producer to Firehose Role Producer Account Data Producer Firehose to Datalake Role Datalake Bucket STS Assume Role STS Assume Role
  • 38. Firehose Ingestion Architecture Data Lake Account Amazon Kinesis Data Firehose Producer to Firehose Role Producer Account Data Producer Firehose to Datalake Role Datalake Bucket STS Assume Role STS Assume Role Send Data
  • 39. Firehose Ingestion Architecture Data Lake Account Amazon Kinesis Data Firehose Producer to Firehose Role Producer Account Data Producer Firehose to Datalake Role Datalake Bucket STS Assume Role STS Assume Role Send Data Write Data
  • 46. Scout24 Infinity Cluster Amazon ECS Infinity Service
  • 47. Scout24 Infinity Cluster Amazon ECS Infinity Service Simple Config
  • 48. Scout24 Infinity Cluster Amazon ECS Infinity Service Simple Config Central Logging to Elasticsearch Monitoring in Datadog Managed by Scout24 Cloud Platform Engineering
  • 49. Get the Slide Deck To Infinity and Beyond Handling Heterogenous Container Clusters in AWS Christine Trahe, Scout24 AG AWS Summit Berlin 2019
  • 50. Kafka Connect on Infinity Simple Config (Infinity) Amazon ECS Infinity Service
  • 51. Kafka Connect on Infinity
  • 52. Kafka Connect on Infinity Kafka Connect Service Infinity Service
  • 53. Kafka Connect on Infinity Simple Config (Kafka Connect) Kafka Connect Service Infinity Service
  • 54. Kafka Connect Deployment Write Data Read Data from Topic(s) Kafka Connect Cluster Amazon S3 Elasticsearch ActiveMQ Cassandra Kafka RDMBS …
  • 55. Kafka Connect Deployment Write Data Read Data from Topic(s) Kafka Connect Cluster Amazon S3
  • 56. Kafka Connect Deployment Write Data Read Data from Topic(s) Amazon S3
  • 57. Kafka Connect Deployment Write Data Read Data from Topic(s) Amazon S3 Kafka Connect Service
  • 58. Kafka Connect Deployment Kafka Account Infinity Account Datalake Account Datalake Bucket Kafka Connect Ingestion Role EC2 Instances Kafka Cluster Amazon ECS Infinity Cluster
  • 59. Kafka Connect Deployment Kafka Account Infinity Account Datalake Account Datalake Bucket Kafka Connect Ingestion Role EC2 Instances Kafka Cluster Kafka Connect Container Amazon ECS Kafka Connect Service (On Infinity)
  • 60. Kafka Connect Deployment Kafka Account Infinity Account Datalake Account Datalake Bucket Kafka Connect Ingestion Role EC2 Instances Kafka Cluster STS Assume Role Kafka Connect Container Amazon ECS Kafka Connect Service (On Infinity)
  • 61. Kafka Connect Deployment Kafka Account Infinity Account Datalake Account Datalake Bucket Kafka Connect Ingestion Role EC2 Instances Kafka Cluster STS Assume Role Kafka Connect Container Amazon ECS Kafka Connect Service (On Infinity) Read Topic
  • 62. Kafka Connect Deployment Kafka Account Infinity Account Datalake Account Datalake Bucket Kafka Connect Ingestion Role EC2 Instances Kafka Cluster STS Assume Role Kafka Connect Container Amazon ECS Kafka Connect Service (On Infinity) Read Topic Write Topic
  • 63. Hadoop Rest API Ingestion Mechanism
  • 64. Hadoop Rest API – A Motivation Hadoop Rest API { } Unified Set of Ingestion Operations Performs Event Validation Adds Reception Timestamp Field Stable User-Facing Endpoints, Varying Backend
  • 65. Hadoop Rest API – Operation Overview Hadoop Rest API { } Add Event Batch Add Single Event PUT PUT Automatic Partitioning by Ingestion Date Persisting to Disk Initially, Periodically Flushing to Target Store Also Publishing to Kinesis Stream
  • 66. Hadoop Rest API – Architecture Evolution (Phase 1) AWS Cloud Data Center Load Balancer HRA Service Generation 1 Microservice Datalake
  • 67. Hadoop Rest API – Architecture Evolution (Phase 1) AWS Cloud Data Center Load Balancer HRA Service Generation 1 Microservice Datalake Send Events
  • 68. Hadoop Rest API – Architecture Evolution (Phase 1) AWS Cloud Data Center Load Balancer HRA Service Generation 1 Microservice Datalake Send Events Ingest
  • 69. Hadoop Rest API – Architecture Evolution (Phase 2) AWS Cloud Data Center Load Balancer HRA Service Generation 1 Microservice Datalake Ingest Send Events Ingest
  • 70. Hadoop Rest API – Architecture Evolution (Phase 2) AWS Cloud Data Center Load Balancer HRA Service Generation 1 Microservice Datalake Ingest Send Events
  • 71. Hadoop Rest API – Architecture Evolution (Phase 3) AWS Cloud Data Center Microservice HRA Service Generation 2 Infinity Service Datalake
  • 72. Hadoop Rest API – Architecture Evolution (Phase 3) AWS Cloud Data Center Microservice HRA Service Generation 2 Infinity Service Datalake Send Events
  • 73. Hadoop Rest API – Architecture Evolution (Phase 3) AWS Cloud Data Center Microservice HRA Service Generation 2 Infinity Service Datalake Send Events Ingest
  • 75. Self Service ETL Requirements Data Landscape Manifesto Principle #5: Data consumers are responsible for the definition and visualization of metrics and for driving the implementation and maintenance of these metrics. èData Consumers get a lot responsibility èThe Data Platform must empower users to fulfill this responsibility èIt must be easy to develop, test and maintain data processes Multi-Account setup: Data processes must run in data consumer’s account èNo centrally run processing tool èAWS Managed service is preferred
  • 76. AWS Data Pipeline Team Account Extract Transform Load ETL runs in team account
  • 77. AWS Data Pipeline Team Account Extract Transform Load Teams run ETL in their accounts It‘s easy …. Or is it?
  • 78.
  • 80. DataWario 1. Client for AWS Data Pipeline 2. Transforms YAML into JSON format understood by Data Pipeline 3. Integrates with the Scout24 account setup and sets sensible configuration defaults 4. Augments Data Pipeline with additional features 5. Makes testing and debugging faster
  • 81. DataWario – Architecture Poll for Tasks Start Post Pipeline Definition Upload Artifacts Any Scout24 Account Data Center Client with DataWario Server Artifact Bucket EC2 Instance • Java Application, distributed as a JAR • Shell wrapper (dw) for easy usage from the command line Deployment of a pipeline 1. User issues dw deploy 2. DataWario 1. uploads artifacts from user‘s machine to S3 2. generates JSON definition from YAML 3. creates data pipeline 3. AWS Data Pipeline runs the defined steps
  • 83. Sensible defaults EMR release? EC2 instance type? IAM role? Subnet ID? emr: emr-5.21.0 m5.xlarge DataWarioResourceRole subnet-12345
  • 84. Augmenting data pipeline ID Make Model Price Mileage 1 VW Golf 22000 15000 2 BMW 320d 35000 40000 „Built-In“ Copy Activity 1,VW,Golf,22000,15000 2,BMW,320d,35000,40000 • CSV (or TSV) • No schema
  • 85. Augmenting data pipeline ID Make Model Price Mileage 1 VW Golf 22000 15000 2 BMW 320d 35000 40000 „Built-In“ Copy Activity • CSV (or TSV) • No schema „Custom“ Copy Activity + 1,VW,Golf,22000,15000 2,BMW,320d,35000,40000
  • 86. Speed up pipeline development starts runs fails takes ca. 10 min + job runtime dw deploy Typical development cycle
  • 87. Speed up test and debug cycle Without test cluster starts runs fails takes ca. 10 min + job runtime With test cluster dw test-cluster create dw deploy -- test-cluster uses fails Typical development cycle takes job runtime dw deploy runs
  • 88. DataWario brought us from here … photo by Dirk van der Made (https://commons.wikimedia.org/wiki/File:DirkvdM_cloudforest- jungle.jpg), „DirkvdM cloudforest-jungle“, https://creativecommons.org/licenses/by- sa/3.0/legalcode
  • 90. So, how do we get here?
  • 91. Paving the path further • We built a lot of pipelines, so did our users • We see repetitive tasks emerge in these pipelines • Reimplementing solutions over and over again is waste • We don‘t like waste è Let‘s build reusable tools for common tasks
  • 92. Rob Pike; Brian W. Kernighan (October 1984). "Program Design in the UNIX Environment" Much of the power of the UNIX operating system comes from a style of program design that makes programs easy to use and, more important, easy to combine with other programs. This style has been called the use of software tools, and depends more on how the programs fit into the programming environment and how they can be used with other programs than on how they are designed internally. [...] This style was based on the use of tools: using programs separately or in combination to get a job done, rather than doing it by hand, by monolithic self-sufficient subsystems, or by special-purpose, one-time programs. Title | Your name 92
  • 93. Tools that do one thing, and do that well • Advantages: − Easy to configure, only a few settings − Easy to test, because there are less branches to cover − Easy to debug, output of each tool can be checked independently • Drawbacks: − More intermittent data written − longer runtimes
  • 94. Event snapshotter • Typical questions: What was the headline of a listing on Feb 22, 2016? Was it published? How about March 1? • Input data: Change events for an entity (e.g. listing) − User changed headline of a listing, number of pictures − Listing got published/unpublished • Snapshotter 1. takes snapshot from the day before 2. merges all change events from the day, keeps last 3. writes new snapshot for day • Simple config (input and output path, name of ID and timestamp column, output path, partition schema) • No business logic, agnostic of type of entity
  • 95. Event Snapshotter Example {"id": 1, "changed": "2019-01-01T15:31:00.000+01:00", "headline": „nice appartment", "published": false} {"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice appartment", "published": false} {"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice appartment", "published": true} {"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice appartment", "published": true} January 1, 2019 (s3://snapshotBucket/snapshot_day=2019-01-01/): {"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice appartment", "published": false} January 2, 2019 (s3://snapshotBucket/snapshot_day=2019-01-02/): {"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice appartment", "published": true} {"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice appartment", "published": true}
  • 96. Aggregator • Typical Questions: − How many published listings did we have on Feb 1 2018? − What‘s the daily average number of pageviews per car make? • Input data: − Interaction events of an entity ( e.g. page viewed, email sent, number called) − output of Event Snapshotter • Simple config (input and output paths, timestamp column, grouping column, aggregation method and column)
  • 97. Spark Utils • Collection of simple tools (usually one class per tool) • Can be used − standalone in a pipeline step − as a dependency of your code • Examples: − Create a Hive table on-top of existing data − Convert data from one Format to another (e.g. from JSON to Parquet for faster querying) − Flatten nested structures − publish data quality metrics to CloudWatch − Materializing views
  • 98. Current State / Outlook • Data Pipeline with DataWario widely used throughout the company • Challenges: − Data Pipeline not getting much love from AWS (not deprecated, but not getting many new features) − Coordination between data pipelines only rudimentary • Possible solutions: − AWS Step Functions in combination with AWS Glue − Introduction of a Workflow tool (e.g. AirFlow)
  • 101. What’s Ahead Unlock the Datalake for Scout24’s Toolset and Users with Different Skillsets Data Analysis for Various User Groups Provide a Timely and Accurate Update of the Metadata Layer
  • 102. What’s Ahead OneScout Hive Metastore Unlock the Datalake for Scout24’s Toolset and Users with Different Skillsets Data Analysis for Various User Groups Provide a Timely and Accurate Update of the Metadata Layer
  • 103. What’s Ahead Personal Analytics ClusterOneScout Hive Metastore Unlock the Datalake for Scout24’s Toolset and Users with Different Skillsets Data Analysis for Various User Groups Provide a Timely and Accurate Update of the Metadata Layer
  • 104. What’s Ahead Automatic Hive Partition DetectionPersonal Analytics ClusterOneScout Hive Metastore Unlock the Datalake for Scout24’s Toolset and Users with Different Skillsets Data Analysis for Various User Groups Provide a Timely and Accurate Update of the Metadata Layer
  • 105. OneScout Hive Metastore – A Schematic View Personal Analytics Cluster Hive Tables and Presto Views Datalake
  • 106. OneScout Hive Metastore – Recap of Ecosystem Product Account Product Account … Product Account Product Account Product Account … Product Account Data Lake Account ImmobilienScout24 Data Lake Account AutoScout24
  • 107. The Scout24 Hive Metastore Proxy – A Motivation Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting
  • 108. The Scout24 Hive Metastore Proxy – A Motivation Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMRAmazon EMR Amazon EMR Central Datalake Account
  • 109. The Scout24 Hive Metastore Proxy – A Motivation Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMRAmazon EMR Amazon EMR Central Datalake Account Connect Connect
  • 110. Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMRAmazon EMR Amazon EMR Central Datalake Account Connect Connect Read Data Read Data The Scout24 Hive Metastore Proxy – A Motivation
  • 111. Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMRAmazon EMR Amazon EMR Central Datalake Account Connect Connect Read Data Read Data Amazon EMR Proxy Instance Amazon Aurora Proxy Instance Product Account Product Account … Metastore Account Amazon EMR The Scout24 Hive Metastore Proxy – A Motivation
  • 112. Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMRAmazon EMR Amazon EMR Central Datalake Account Connect Connect Read Data Read Data Amazon EMR Proxy Instance Amazon Aurora Proxy Instance Product Account Product Account … Metastore Account Amazon EMR ConnectConnect The Scout24 Hive Metastore Proxy – A Motivation
  • 113. Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMRAmazon EMR Amazon EMR Central Datalake Account Connect Connect Read Data Read Data Amazon EMR Proxy Instance Amazon Aurora Proxy Instance Product Account Product Account … Metastore Account Amazon EMR ConnectConnect Forward Request Forward Request The Scout24 Hive Metastore Proxy – A Motivation
  • 114. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake
  • 115. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table
  • 116. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion
  • 117. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion
  • 118. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion Table Access
  • 119. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion Table Access
  • 120. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion Table Access Automatic Partition Detection
  • 121. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion Table Access Automatic Partition Detection
  • 122. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition DetectorPartition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table
  • 123. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition DetectorPartition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Trigger Publish
  • 124. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition DetectorPartition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table
  • 125. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition DetectorPartition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events
  • 126. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition DetectorPartition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events Transfer to CloudWatch Logs
  • 127. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition DetectorPartition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events Transfer to CloudWatch Logs Apply Rule
  • 128. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition DetectorPartition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events Transfer to CloudWatch Logs Apply Rule Send to Partition Detector
  • 129. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition DetectorPartition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events Transfer to CloudWatch Logs Apply Rule Send to Partition Detector Get Bucket Notification
  • 130. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition DetectorPartition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events Transfer to CloudWatch Logs Apply Rule Send to Partition Detector Get Bucket Notification Add Partition Get Partition Info
  • 132. The Personal Analytics Cluster – An Overview Amazon EMR Personal Analytics Cluster AWS CloudFormation
  • 133. The Personal Analytics Cluster – An Overview
  • 134. The Personal Analytics Cluster – An Overview Amazon EMR Personal Analytics Cluster AWS CloudFormation
  • 135. The Personal Analytics Cluster – An Overview Amazon EMR Personal Analytics Cluster Simple Config AWS CloudFormation
  • 136. The Personal Analytics Cluster – An Overview Amazon EMR Personal Analytics Cluster Simple Config Easy Access via Web Interface Zeppelin and Jupyter Notebook Restore OneClick Deployment Managed Scaling and Shutdown Support for Pre-baked AMIs and Configs AWS CloudFormation
  • 137. The Personal Analytics Cluster – Configuration Setup • Cluster Name (default is usually username) • AWS Account Name • Market segment • Use Case • Shutdown time • Instance Type • Bid Price • Key Pair • EBS Volume Size • Tear Down Existing Cluster (optional)
  • 138. $ ./deploy-cluster --cluster-name <> --account-name <> --market-segment <> --usecase <> --shutdown-time <> --instance-type <> --bid-price <> --key-pair <> --ebs-volume-size <> --teardown-existing <> The Personal Analytics Cluster – Configuration Setup
  • 139.
  • 140. The Personal Analytics Cluster – Data Lake Access Regular Restricted Personal RestrictedRegular Personal Raw Processed Visibility Level Data Lake Account ImmobilienScout24 Data Lake Account AutoScout24 Regular DL Bucket Access Role Restricted DL Bucket Access Roles Personal DL Bucket Access Roles
  • 141. The Personal Analytics Cluster – Data Lake Access • Managed with AWS EMR FS Security Configuration • Integrates with EMR Applications internally; Spark, Presto, etc • Easy to define • Maps a defined IAM role to S3 bucket being accessed
  • 142. The Personal Analytics Cluster – Data Lake Access Scenario One: Non-restricted Data Lake Access
  • 143. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud emrfs-site.xml AWS Cloud Regular Restricted Personal Non-restricted Role Restricted Role Personal Role my_data = spark.read.parquet(“s3://our-regular-datalake-bucket/some-data”) Data Lake Account
  • 144. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud emrfs-site.xml my_data = spark.read.parquet(“s3://our-regular-datalake-bucket/some-data”) AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping Non-restricted Role Restricted Role Personal Role Data Lake Account
  • 145. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud Non-restricted Role Restricted Role Personal Role emrfs-site.xml my_data = spark.read.parquet(“s3://our-regular-datalake-bucket/some-data”) AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping STS Assume Role Data Lake Account
  • 146. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud Non-restricted Role Restricted Role Personal Role emrfs-site.xml my_data = spark.read.parquet(“s3://our-regular-datalake-bucket/some-data”) AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping STS Assume Role Data Lake Account
  • 147. The Personal Analytics Cluster – Data Lake Access Scenario Two: Restricted/Personal Data Lake Access
  • 148. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud emrfs-site.xml my_data = spark.read.parquet(“s3://our-restricted-datalake-bucket/some-data”) AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping Restricted Role Non-restricted Role Personal Role Data Lake Account
  • 149. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud emrfs-site.xml my_data = spark.read.parquet(“s3://our-restricted-datalake-bucket/some-data”) AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping STS Assume Role Restricted Role Non-restricted Role Personal Role Data Lake Account
  • 150. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment A Account AWS Cloud Restricted Role Non-restricted Role Personal Role emrfs-site.xml my_data = spark.read.parquet(“s3://our-restricted-datalake-bucket/some-data”) AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping STS Assume Role Data Lake Account
  • 151. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment B Account AWS Cloud Restricted Role Non-restricted Role Personal Role emrfs-site.xml my_data = spark.read.parquet(“s3://our-restricted-datalake-bucket/some-data”) AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping STS Assume Role Data Lake Account
  • 152. Data Platform SQL Access
  • 153. Our Journey to Presto
  • 154. Our Journey to Presto On EMR Cross Account Support (OneScout Hive Metastore) Leverages Datalake Access Roles (EMRFS) Scheduled Scaling Configurations Fits our GDPR Concept (multiple isolated Clusters)
  • 155. Our Journey to Presto Average Presto Queries per Day 360Queries per Day Number of Queries
  • 156. We hope to throw out most of the custom components we build.
  • 157. Our data warehouse was a bottle neck
  • 158. Our data platform holds nothing back
  • 159. Christian Dietze Olalekan Elesin Raffael Dzikowski Senior Data Engineer Data Engineer Senior Data Engineer Scout24 Group Scout24 Group Scout24 Group Thank you for your attention!