The document provides an overview of the Scout24 Data Platform and its evolution towards becoming a truly data-driven company. Some key points:
- Scout24 operates various household brands across 18 countries with 80 million household reach.
- Historically, Scout24's technical architecture included a monolithic application and data warehouse that acted as a bottleneck.
- To address this, Scout24 built an internal "data platform" consisting of a microservices architecture, data lake, self-service analytics, and data ingestion tools to enable fast, easy product development supported by data and analytics.
- The data platform is thought of as a product in itself that provides generic layers upon which Scout24's products can be built.
1. www.scout24.com
The Scout24 Data Platform – A Technical Deep Dive
Munich | March 20, 2019 | Christian Dietze, Olalekan Elesin, Raffael Dzikowski
2. Scout24 at a glance: 5 core geographies and an overall presence in 18 countries; 80m household reach; 2 major household brand names. Scout24 AG: MDAX-listed, € 531.7 million revenue (2018).
3. Our technical evolution (diagram): from a monolithic app backed by a single production database and a central data warehouse, to many microservices, each with its own persistence.
9. We think of our Data Platform as a Product
• Just like AWS, Salesforce, etc., the platform is a generic layer upon which Scout24's products can be built
• BUT, we have a very, very small number of customers
• That means product teams get personalized support and there is lots of opportunity for collaboration
10. We don't dictate anything. We just try to make certain things easier by offering a "paved path". Product teams are fully empowered to make their own choices about what is the best use of their resources.
11. "In almost all cases, we will not mandate that internal teams use these platforms and services—these platform teams will have to win over and satisfy their internal customers, sometimes even competing with external vendors."
12. Guiding principle of the platform: autonomy for producers and consumers.
• Self-service Analytics
• Self-service Data Ingestion
• Self-service ETL
24. Watch the Video | Read the Blog Post on Medium: "Necessary Cultural Changes during Migration from Data Warehouse to Cloud Data Platform", Dr. Sean Gustafson, Scout24 AG, Data Festival 2018.
25.–26. 10,000 Foot Architecture Overview (diagram, repeated with the talk's sections overlaid): microservices and the data center's core DB feed the Datalake in the AWS Cloud via Firehose / Kafka / REST API; DataWario drives ETL; the OneScout Hive Metastore (Amazon Aurora) and Personal Analytics Clusters serve PMs, leaders, analysts, data scientists, and engineers. The talk covers this in three sections: Section #1 Self-Service Ingestion (data producers / engineers), Section #2 Self-Service ETL (engineers), Section #3 Self-Service Analytics (PMs / leaders / analysts).
30. Data Lake Bucket Types (diagram): two layers, Raw and Processed, each with three visibility levels: Regular, Restricted, and Personal.
31. Data Lake Access Roles (diagram): each data lake account (ImmobilienScout24 and AutoScout24) exposes a Regular DL bucket access role plus Restricted and Personal DL bucket access roles.
32. Data Lake Ingestion Goals (diagram): microservices must be able to ingest into the data lake with batch support, streaming support, a REST API, and scalability.
36.–39. Firehose Ingestion Architecture (built up over four slides): a data producer in a producer account assumes the "Producer to Firehose" role via STS and sends data to Amazon Kinesis Data Firehose in the data lake account; Firehose in turn assumes the "Firehose to Datalake" role via STS and writes the data to the datalake bucket.
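A minimal sketch of the producer side of this flow, assuming hypothetical role ARNs, account IDs, and a delivery stream name (the deck names none of these); the STS and Firehose calls are standard boto3:

import json
import boto3

# Hypothetical identifiers -- the deck does not show real ARNs or stream names.
PRODUCER_TO_FIREHOSE_ROLE = "arn:aws:iam::111111111111:role/producer-to-firehose"
DELIVERY_STREAM = "datalake-ingestion-stream"

# Step 1: the data producer assumes the "Producer to Firehose" role via STS.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn=PRODUCER_TO_FIREHOSE_ROLE,
    RoleSessionName="producer-ingestion",
)["Credentials"]

# Step 2: send data to Kinesis Data Firehose in the data lake account
# using the temporary credentials from the assumed role.
firehose = boto3.client(
    "firehose",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
event = {"id": 1, "event_type": "listing_published"}
firehose.put_record(
    DeliveryStreamName=DELIVERY_STREAM,
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
# Step 3 happens inside AWS: Firehose assumes the "Firehose to Datalake"
# role and writes the buffered records into the datalake bucket.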
48. Scout24 Infinity Cluster (diagram): an Infinity Service on Amazon ECS driven by a simple config, with central logging to Elasticsearch and monitoring in Datadog; managed by Scout24 Cloud Platform Engineering.
49. Get the Slide Deck: "To Infinity and Beyond – Handling Heterogeneous Container Clusters in AWS", Christine Trahe, Scout24 AG, AWS Summit Berlin 2019.
50. Kafka Connect on Infinity (diagram): Kafka Connect deployed as an Infinity Service on Amazon ECS with a simple Infinity config.
64. Hadoop Rest API – A Motivation:
• Unified set of ingestion operations
• Performs event validation
• Adds a reception timestamp field
• Stable user-facing endpoints, varying backend
65. Hadoop Rest API – Operation Overview: two PUT operations, add single event and add event batch. Behind the API: automatic partitioning by ingestion date, persisting to disk initially, periodically flushing to the target store, and also publishing to a Kinesis stream.
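A sketch of what a producer-side call might look like, assuming a hypothetical base URL and endpoint paths (the deck only states that single events and event batches are sent via PUT):

import requests

# Hypothetical endpoint and payload shape -- not the service's real routes.
HRA_BASE = "https://hadoop-rest-api.example.com"

# Add a single event.
single_event = {"id": 42, "event_type": "page_viewed"}
resp = requests.put(f"{HRA_BASE}/events/page_viewed", json=single_event)
resp.raise_for_status()

# Add an event batch.
event_batch = [
    {"id": 43, "event_type": "page_viewed"},
    {"id": 44, "event_type": "page_viewed"},
]
resp = requests.put(f"{HRA_BASE}/events/page_viewed/batch", json=event_batch)
resp.raise_for_status()
# Server side (per the slide): events are validated, a reception timestamp
# field is added, data is partitioned by ingestion date, persisted to disk
# first, periodically flushed to the target store, and also published to a
# Kinesis stream.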
66.–68. Hadoop Rest API – Architecture Evolution, Phase 1 (built up over three slides): microservices in the AWS Cloud and in the data center send events through a load balancer to HRA Service Generation 1, which ingests them into the Datalake.
69.–70. Phase 2: the same topology, with the ingestion paths into the Datalake highlighted alongside the event flow.
71.–73. Phase 3: microservices send events directly to HRA Service Generation 2, running as an Infinity Service, which ingests them into the Datalake.
75. Self Service ETL Requirements
Data Landscape Manifesto Principle #5: Data consumers are responsible for the definition and visualization of metrics and for driving the implementation and maintenance of these metrics.
→ Data consumers get a lot of responsibility
→ The Data Platform must empower users to fulfill this responsibility
→ It must be easy to develop, test and maintain data processes
Multi-account setup: data processes must run in the data consumer's account
→ No centrally run processing tool
→ An AWS managed service is preferred
80. DataWario
1. Client for AWS Data Pipeline
2. Transforms YAML into JSON format understood by Data Pipeline
3. Integrates with the Scout24 account setup and sets sensible configuration defaults
4. Augments Data Pipeline with additional features
5. Makes testing and debugging faster
81. DataWario – Architecture (diagram): a client with DataWario, in any Scout24 account or the data center, uploads artifacts to an artifact bucket and posts the pipeline definition; AWS Data Pipeline starts EC2 instances that poll for tasks. DataWario itself is a Java application, distributed as a JAR, with a shell wrapper (dw) for easy usage from the command line.
Deployment of a pipeline:
1. User issues dw deploy
2. DataWario uploads artifacts from the user's machine to S3, generates the JSON definition from YAML, and creates the data pipeline
3. AWS Data Pipeline runs the defined steps
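A sketch of what such a deployment does under the hood, using the real AWS Data Pipeline API via boto3; the YAML shape and pipeline contents are invented for illustration and are not DataWario's actual schema:

import boto3
import yaml  # PyYAML

# Hypothetical YAML shape -- DataWario's real format is not shown in the deck.
PIPELINE_YAML = """
name: car-listings-daily
objects:
  - id: Default
    name: Default
    fields:
      scheduleType: ondemand
      failureAndRerunMode: CASCADE
"""

def to_pipeline_objects(definition):
    # Translate the YAML structure into the JSON shape the AWS Data Pipeline
    # API expects (id/name plus a list of key/stringValue fields).
    return [
        {
            "id": obj["id"],
            "name": obj["name"],
            "fields": [
                {"key": k, "stringValue": str(v)}
                for k, v in obj.get("fields", {}).items()
            ],
        }
        for obj in definition["objects"]
    ]

definition = yaml.safe_load(PIPELINE_YAML)
dp = boto3.client("datapipeline")
pipeline_id = dp.create_pipeline(
    name=definition["name"], uniqueId=definition["name"]
)["pipelineId"]
dp.put_pipeline_definition(
    pipelineId=pipeline_id, pipelineObjects=to_pipeline_objects(definition)
)
dp.activate_pipeline(pipelineId=pipeline_id)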
84.–85. Augmenting data pipeline: given a source table

ID | Make | Model | Price | Mileage
1  | VW   | Golf  | 22000 | 15000
2  | BMW  | 320d  | 35000 | 40000

the "built-in" copy activity emits bare CSV (or TSV) with no schema:

1,VW,Golf,22000,15000
2,BMW,320d,35000,40000

The "custom" copy activity (slide 85) produces the same copied data together with an accompanying schema.
86. Speed up pipeline development. Typical development cycle: dw deploy starts the pipeline, the job runs, the job fails; each iteration takes ca. 10 min + job runtime.
87. Speed up the test and debug cycle. Without a test cluster, the typical development cycle is as on the previous slide: dw deploy starts the pipeline, it runs, it fails, and each iteration takes ca. 10 min + job runtime. With a test cluster: dw test-cluster create brings up a cluster once, then dw deploy --test-cluster reuses it, so a failing iteration takes only the job runtime.
88. DataWario brought us from here … (photo by Dirk van der Made, "DirkvdM cloudforest-jungle", https://commons.wikimedia.org/wiki/File:DirkvdM_cloudforest-jungle.jpg, licensed CC BY-SA 3.0: https://creativecommons.org/licenses/by-sa/3.0/legalcode)
91. Paving the path further
• We built a lot of pipelines, so did our users
• We see repetitive tasks emerge in these pipelines
• Reimplementing solutions over and over again is waste
• We don't like waste
→ Let's build reusable tools for common tasks
92. Rob Pike; Brian W. Kernighan (October 1984). "Program Design in the UNIX Environment":
"Much of the power of the UNIX operating system comes from a style of program design that makes programs easy to use and, more important, easy to combine with other programs. This style has been called the use of software tools, and depends more on how the programs fit into the programming environment and how they can be used with other programs than on how they are designed internally. [...] This style was based on the use of tools: using programs separately or in combination to get a job done, rather than doing it by hand, by monolithic self-sufficient subsystems, or by special-purpose, one-time programs."
93. Tools that do one thing, and do that well
• Advantages:
− Easy to configure, only a few settings
− Easy to test, because there are fewer branches to cover
− Easy to debug, the output of each tool can be checked independently
• Drawbacks:
− More intermediate data written
− Longer runtimes
94. Event snapshotter
• Typical questions: What was the headline of a listing on Feb 22, 2016? Was it published? How about March 1?
• Input data: change events for an entity (e.g. a listing)
− User changed the headline of a listing, or the number of pictures
− Listing got published/unpublished
• Snapshotter
1. takes the snapshot from the day before
2. merges in all change events from the day, keeping the last state per entity
3. writes a new snapshot for the day
• Simple config (input and output paths, name of the ID and timestamp columns, partition schema)
• No business logic, agnostic of the type of entity
95. Event Snapshotter Example
Input change events:
{"id": 1, "changed": "2019-01-01T15:31:00.000+01:00", "headline": "nice apartment", "published": false}
{"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice apartment", "published": false}
{"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice apartment", "published": true}
{"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice apartment", "published": true}
Snapshot for January 1, 2019 (s3://snapshotBucket/snapshot_day=2019-01-01/):
{"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice apartment", "published": false}
Snapshot for January 2, 2019 (s3://snapshotBucket/snapshot_day=2019-01-02/):
{"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice apartment", "published": true}
{"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice apartment", "published": true}
96. Aggregator
• Typical questions:
− How many published listings did we have on Feb 1, 2018?
− What's the daily average number of pageviews per car make?
• Input data:
− Interaction events of an entity (e.g. page viewed, email sent, number called)
− Output of the Event Snapshotter
• Simple config (input and output paths, timestamp column, grouping column, aggregation method and column)
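As an illustration rather than the real tool, the first question reduces to a filter plus an aggregation over a daily snapshot; the paths and column names are assumptions carried over from the snapshotter example:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aggregator").getOrCreate()

# Hypothetical input: a snapshot produced by the Event Snapshotter.
snapshot = spark.read.json("s3://snapshotBucket/snapshot_day=2019-01-02/")

# "How many published listings did we have on this day?"
published_count = (
    snapshot.where(F.col("published"))
    .agg(F.count("id").alias("published_listings"))
)
published_count.show()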
97. Spark Utils
• Collection of simple tools (usually one class per tool)
• Can be used
− standalone in a pipeline step
− as a dependency of your code
• Examples:
− Create a Hive table on top of existing data
− Convert data from one format to another (e.g. from JSON to Parquet for faster querying)
− Flatten nested structures
− Publish data quality metrics to CloudWatch
− Materialize views
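The format-conversion tool, for instance, can be sketched as a tiny standalone Spark job; the paths are placeholders, and a real tool would take them as pipeline-step arguments:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Convert raw JSON events to Parquet for faster querying.
events = spark.read.json("s3://datalake-raw/events/day=2019-01-02/")
events.write.mode("overwrite").parquet(
    "s3://datalake-processed/events/day=2019-01-02/"
)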
98. Current State / Outlook
• Data Pipeline with DataWario is widely used throughout the company
• Challenges:
− Data Pipeline is not getting much love from AWS (not deprecated, but not getting many new features)
− Coordination between data pipelines is only rudimentary
• Possible solutions:
− AWS Step Functions in combination with AWS Glue
− Introduction of a workflow tool (e.g. Airflow)
101.–104. What's Ahead (built up over four slides): three goals, each paired with a platform component:
• Unlock the Datalake for Scout24's toolset and users with different skillsets → OneScout Hive Metastore
• Data analysis for various user groups → Personal Analytics Cluster
• Provide a timely and accurate update of the metadata layer → Automatic Hive Partition Detection
105. OneScout Hive Metastore – A Schematic View (diagram): the Personal Analytics Cluster queries Hive tables and Presto views that the metastore defines on top of the Datalake.
106. OneScout Hive Metastore – Recap of Ecosystem (diagram): many product accounts sit alongside the two data lake accounts, ImmobilienScout24 and AutoScout24.
107.–113. The Scout24 Hive Metastore Proxy – A Motivation (built up over seven slides). Goal: one long-running metastore DB that every EMR cluster connects to. Ideal situation: all EMR clusters run in the central datalake account, connect directly to the Amazon Aurora metastore, and read data from the lake. Solution for the multi-account setting: each product account runs a proxy instance; EMR clusters connect to the proxy in their own account, and the proxies forward requests to the Aurora metastore in a dedicated metastore account.
134.–136. The Personal Analytics Cluster – An Overview (built up over three slides): an Amazon EMR cluster deployed via AWS CloudFormation from a simple config, offering easy access via a web interface, Zeppelin and Jupyter notebook restore, one-click deployment, managed scaling and shutdown, and support for pre-baked AMIs and configs.
137. The Personal Analytics Cluster – Configuration Setup
• Cluster Name (default is usually username)
• AWS Account Name
• Market segment
• Use Case
• Shutdown time
• Instance Type
• Bid Price
• Key Pair
• EBS Volume Size
• Tear Down Existing Cluster (optional)
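A sketch of how such a configuration might look; every key name and value below is hypothetical and only mirrors the parameters listed above, since the deck does not show the actual config format:

# Hypothetical PAC configuration mirroring the slide's parameter list.
pac_config = {
    "cluster_name": "jdoe",               # default is usually the username
    "aws_account_name": "is24-segment-a",
    "market_segment": "ImmobilienScout24",
    "use_case": "ad-hoc-analytics",
    "shutdown_time": "20:00",             # managed shutdown
    "instance_type": "m5.2xlarge",
    "bid_price": "0.25",                  # spot bid
    "key_pair": "jdoe-keypair",
    "ebs_volume_size_gb": 100,
    "tear_down_existing_cluster": False,  # optional
}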
140. The Personal Analytics Cluster – Data Lake Access (diagram): recap of the bucket types (Raw and Processed layers, each with Regular, Restricted, and Personal visibility levels) and of the Regular, Restricted, and Personal DL bucket access roles in the ImmobilienScout24 and AutoScout24 data lake accounts.
141. The Personal Analytics Cluster – Data Lake Access
• Managed with an AWS EMRFS security configuration
• Integrates internally with EMR applications: Spark, Presto, etc.
• Easy to define
• Maps a defined IAM role to the S3 bucket being accessed
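A sketch of such a role-to-bucket mapping, created as an EMR security configuration with EMRFS authorization via boto3; the role ARNs and bucket names are hypothetical, while the configuration shape follows AWS's documented EMRFS role-mapping format:

import json
import boto3

# Hypothetical role ARNs and buckets -- the mapping structure itself is the
# standard EMRFS RoleMappings block of an EMR security configuration.
security_config = {
    "AuthorizationConfiguration": {
        "EmrFsConfiguration": {
            "RoleMappings": [
                {
                    "Role": "arn:aws:iam::111111111111:role/regular-dl-access",
                    "IdentifierType": "Prefix",
                    "Identifiers": ["s3://our-regular-datalake-bucket/"],
                },
                {
                    "Role": "arn:aws:iam::111111111111:role/restricted-dl-access",
                    "IdentifierType": "Prefix",
                    "Identifiers": ["s3://our-restricted-datalake-bucket/"],
                },
            ]
        }
    }
}

emr = boto3.client("emr")
emr.create_security_configuration(
    Name="pac-datalake-access",
    SecurityConfiguration=json.dumps(security_config),
)
# When a Spark or Presto job on the cluster touches a mapped S3 prefix,
# EMRFS assumes the corresponding IAM role via STS before the access.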
142. The Personal Analytics Cluster – Data Lake Access. Scenario One: non-restricted data lake access.
143.–146. Scenario One (built up over four slides): a Spark job on the PAC in a segment account runs
my_data = spark.read.parquet("s3://our-regular-datalake-bucket/some-data")
The EMRFS role-to-bucket mapping in emrfs-site.xml resolves the regular bucket to the non-restricted role; EMRFS assumes that role via STS, and the read against the Regular bucket in the data lake account succeeds.
147. The Personal Analytics Cluster – Data Lake Access. Scenario Two: restricted/personal data lake access.
148.–151. Scenario Two (built up over four slides): the same Spark read against
my_data = spark.read.parquet("s3://our-restricted-datalake-bucket/some-data")
resolves to the restricted role via the EMRFS role-to-bucket mapping, and EMRFS attempts the STS assume-role call; the diagrams contrast a PAC in a Segment A account with a PAC in a Segment B account issuing the same restricted read, with access governed by which segment is entitled to assume the restricted role.
159. Christian Dietze (Senior Data Engineer, Scout24 Group), Olalekan Elesin (Data Engineer, Scout24 Group), Raffael Dzikowski (Senior Data Engineer, Scout24 Group).
Thank you for your attention!