2. ABOUT ME
• Leading DWH @ AOL (Vidible division)
• Major expertise: Big Data and Enterprise
• Co-founder of Odessa JUG
• Passionate follower of Scala
• Associate professor at ONPU
8. THE WORLD OF BIG DATA
DATA MANAGEMENT
§ INGESTION & ETL: flexible data pipelines
§ INTEGRATION: multiple 3rd-party sources
§ WAREHOUSING: efficient data organization
DATA ANALYSIS
§ REPORTING: organizing data into informational summaries
§ DATA ANALYTICS: finding meaningful correlations in the data
§ DATA MINING: extracting new knowledge
§ DATA SCIENCE: insights, models & predictions, machine learning
§ VISUALISATION: getting insights
INFRASTRUCTURE
§ RELIABLE SERVICES: private vs. public clouds, quick scale out/down, instant deployments, efficient maintenance
§ MONITORING: big picture and total control of every service, metrics and alerts
30. CHECKPOINT MANAGEMENT
• On startup, all our applications use custom logic to pick a checkpointing strategy
• Side effect: having offsets in Kafka allows generic monitoring of any Spark Streaming application we run
• We also use Consul to force a particular strategy for the next run (latest, kafka, spark); see the sketch below
[Flow chart: major application loop. Fetch the previous app version from Consul; if the version changed, use Kafka offsets, otherwise use Spark checkpoints; then read a batch from Kafka, process it, and commit offsets back to Kafka.]
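A minimal sketch of that strategy choice in Scala. ConsulClient and the key names here are illustrative assumptions, not the production code; only the decision logic follows the slide.

  // Thin stand-in for the real Consul integration (assumption).
  trait ConsulClient {
    def get(key: String): Option[String]
  }

  sealed trait OffsetStrategy
  case object SparkCheckpoints extends OffsetStrategy
  case object KafkaOffsets     extends OffsetStrategy
  case object Latest           extends OffsetStrategy

  def chooseStrategy(consul: ConsulClient, currentVersion: String): OffsetStrategy =
    consul.get("app/forced-strategy") match {
      // An operator can force a strategy for the next run via Consul.
      case Some("latest") => Latest
      case Some("kafka")  => KafkaOffsets
      case Some("spark")  => SparkCheckpoints
      case _ =>
        // Spark checkpoints are bound to the serialized application code,
        // so reuse them only when the app version has not changed;
        // otherwise fall back to the offsets committed in Kafka.
        if (consul.get("app/previous-version").contains(currentVersion))
          SparkCheckpoints
        else
          KafkaOffsets
    }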
31. WILD ENRICHMENT
• Decoupled enrichment system
• Every enricher accepts an Avro GenericRecord
• Some enrichers require dimensions (e.g. billing)
• We use a Couchbase cluster to keep the dimensions we need
• Enrichers may interact with Couchbase
• We update Couchbase data on a streaming basis
• We use aggressive caching on the Spark side
[Diagram: Kafka → Spark running a chain of enrichers ↔ Couchbase cluster]
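A hedged sketch of the enricher contract this implies. The Enricher trait, the BillingEnricher, and the cache are assumptions for illustration; the slide only states that every enricher takes an Avro GenericRecord and that dimension lookups hit Couchbase with aggressive Spark-side caching.

  import org.apache.avro.generic.GenericRecord
  import scala.collection.concurrent.TrieMap

  // Common contract: each enricher transforms a GenericRecord.
  trait Enricher {
    def enrich(record: GenericRecord): GenericRecord
  }

  // Example of a dimension-backed enricher. `lookupDimension` stands in
  // for a Couchbase read; the local cache avoids hitting Couchbase for
  // every record in a batch. Assumes the record schema already has the
  // "accountId" and "billingPlan" fields.
  class BillingEnricher(lookupDimension: String => Option[String]) extends Enricher {
    private val cache = TrieMap.empty[String, Option[String]]

    override def enrich(record: GenericRecord): GenericRecord = {
      val accountId = String.valueOf(record.get("accountId"))
      val plan      = cache.getOrElseUpdate(accountId, lookupDimension(accountId))
      plan.foreach(p => record.put("billingPlan", p))
      record
    }
  }

  // Decoupling in practice: enrichers compose into a pipeline.
  def pipeline(enrichers: Seq[Enricher])(record: GenericRecord): GenericRecord =
    enrichers.foldLeft(record)((r, e) => e.enrich(r))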
38. SHARD AND OTHERS
• Spark writes to a column-store table
• The key is the batch identifier
• The shard key is an auto-increment value; other options caused huge data skew
• Colocated deployment scheme: child aggregator and leaf on the same machine
• Increased the connection pool on child aggregators; the default value caused locks
• Retention policy: 24 hours of data
• Removed the colocation logic from the Spark MemSQL connector
[Diagram: Spark writes data into MemSQL alongside dimension tables]
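For illustration, a sketch of the table layout these bullets imply, as MemSQL DDL embedded in Scala. Table and column names are assumptions, and exact DDL support (notably AUTO_INCREMENT on columnstore tables) varies by MemSQL version; treat this as a shape, not the production schema.

  // Hypothetical DDL: a clustered columnstore table keyed by batch,
  // sharded on an auto-increment id to avoid the data skew the slide
  // mentions with other shard-key choices.
  val createEventsTable: String =
    """CREATE TABLE events (
      |  id       BIGINT AUTO_INCREMENT,  -- shard key: even distribution
      |  batch_id BIGINT NOT NULL,        -- batch identifier used as the key
      |  payload  JSON,
      |  KEY (batch_id) USING CLUSTERED COLUMNSTORE,
      |  SHARD KEY (id)
      |)""".stripMargin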
46. CDH vs EMR
CDH                                        | EMR
Cannot scale out/in on demand              | Can scale out/in on demand
No extra cost (for the community license)  | Extra ~30% on top of EC2 costs
Adding machines requires restarting YARN   | No YARN restart
Easy configuration management via CM       | Limited configuration available during EMR creation
Classic YARN cluster                       | Usual YARN under the hood, but imposes an EMR-driven way to deploy apps
Single CDH per region                      | EMR cluster on demand as the unit of clustering
56. MATH BEHIND DECISIONS
• If enrichment is CPU-intensive, align the number of RDD partitions to (E * C)
• If possible, align the number of Kafka topic partitions to (E * C)
• Before writing to MemSQL, repartition to (M * 4)
• Do not be afraid to repartition/coalesce between these stages if needed
• Choose MemSQL and Kafka cluster sizes respecting the USE method (Utilization, Saturation, Errors) and the corresponding responsibility zone
E – # of executors
C – # of vcores per executor
M – # of machines in the MemSQL cluster
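A worked example of the sizing rules above. The cluster numbers are illustrative assumptions, not figures from the talk.

  val executors      = 16 // E: number of executors
  val vcoresPerExec  = 4  // C: vcores per executor
  val memsqlMachines = 8  // M: machines in the MemSQL cluster

  // CPU-intensive enrichment: one partition per available core.
  val enrichPartitions = executors * vcoresPerExec // E * C = 64

  // Write stage: 4 partitions per MemSQL machine.
  val writePartitions = memsqlMachines * 4 // M * 4 = 32

  // In a job this becomes explicit repartitioning between stages, e.g.:
  //   records.repartition(enrichPartitions)  // before enrichment
  //          .map(enrich)
  //          .coalesce(writePartitions)      // before the MemSQL write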