2. ABOUT ME
• Leading DWH @ AOL (Vidible division)
• Major expertise: Big Data and Enterprise
• Co-founder of Odessa JUG
• Passionate follower of Scala
• Associate professor at ONPU
8. THE WORLD OF BIG DATA
DATA MANAGEMENT
§ INGESTION & ETL: flexible data pipelines
§ INTEGRATION: multiple 3rd-party sources
§ WAREHOUSING: efficient data organization
DATA ANALYSIS
§ REPORTING: organizing data into informational summaries
§ DATA ANALYTICS: finding meaningful correlations in the data
§ DATA MINING: extracting new knowledge
§ DATA SCIENCE: insights, models & predictions, machine learning
§ VISUALISATION: getting insights
INFRASTRUCTURE
§ RELIABLE SERVICES: private vs. public clouds, quick scale out/down, instant deployments, efficient maintenance
§ MONITORING: big picture and total control of every service, metrics and alerts
30. CHECKPOINT MANAGEMENT
• On startup, all our applications use custom logic to pick a checkpointing strategy
• Side effect: having offsets in Kafka allows generic monitoring of any Spark Streaming application we run
• We also use Consul to force a particular strategy for the next run (latest, kafka, spark); see the sketch below
[Flow chart: major application loop. Fetch the previous app version from Consul; if the version changed, use Kafka offsets, otherwise use Spark checkpoints; then read a batch from Kafka, process it, and commit offsets back to Kafka.]
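A minimal sketch of that strategy choice in Scala. ConsulClient and the key names here are illustrative assumptions, not the production code; only the decision logic follows the slide.

  // Thin stand-in for the real Consul integration (assumption).
  trait ConsulClient {
    def get(key: String): Option[String]
  }

  sealed trait OffsetStrategy
  case object SparkCheckpoints extends OffsetStrategy
  case object KafkaOffsets     extends OffsetStrategy
  case object Latest           extends OffsetStrategy

  def chooseStrategy(consul: ConsulClient, currentVersion: String): OffsetStrategy =
    consul.get("app/forced-strategy") match {
      // An operator can force a strategy for the next run via Consul.
      case Some("latest") => Latest
      case Some("kafka")  => KafkaOffsets
      case Some("spark")  => SparkCheckpoints
      case _ =>
        // Spark checkpoints are bound to the serialized application code,
        // so reuse them only when the app version has not changed;
        // otherwise fall back to the offsets committed in Kafka.
        if (consul.get("app/previous-version").contains(currentVersion))
          SparkCheckpoints
        else
          KafkaOffsets
    }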
31. WILD ENRICHMENT
• Decoupled enrichment system
• Every enricher accepts an Avro GenericRecord
• Some enrichers require dimensions (e.g. billing)
• We use a Couchbase cluster to keep the dimensions we need
• Enrichers may interact with Couchbase
• We update Couchbase data on a streaming basis
• We use aggressive caching on the Spark side
[Diagram: Kafka → Spark running a chain of enrichers ↔ Couchbase cluster]
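A hedged sketch of the enricher contract this implies. The Enricher trait, the BillingEnricher, and the cache are assumptions for illustration; the slide only states that every enricher takes an Avro GenericRecord and that dimension lookups hit Couchbase with aggressive Spark-side caching.

  import org.apache.avro.generic.GenericRecord
  import scala.collection.concurrent.TrieMap

  // Common contract: each enricher transforms a GenericRecord.
  trait Enricher {
    def enrich(record: GenericRecord): GenericRecord
  }

  // Example of a dimension-backed enricher. `lookupDimension` stands in
  // for a Couchbase read; the local cache avoids hitting Couchbase for
  // every record in a batch. Assumes the record schema already has the
  // "accountId" and "billingPlan" fields.
  class BillingEnricher(lookupDimension: String => Option[String]) extends Enricher {
    private val cache = TrieMap.empty[String, Option[String]]

    override def enrich(record: GenericRecord): GenericRecord = {
      val accountId = String.valueOf(record.get("accountId"))
      val plan      = cache.getOrElseUpdate(accountId, lookupDimension(accountId))
      plan.foreach(p => record.put("billingPlan", p))
      record
    }
  }

  // Decoupling in practice: enrichers compose into a pipeline.
  def pipeline(enrichers: Seq[Enricher])(record: GenericRecord): GenericRecord =
    enrichers.foldLeft(record)((r, e) => e.enrich(r))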
38. SHARD AND OTHERS
• Spark writes to a column-store table
• The key is the batch identifier
• The shard key is an auto-increment value; other options caused huge data skew
• Colocated deployment scheme: child aggregator and leaf on the same machine
• Increased the connection pool on child aggregators; the default value caused locks
• Retention policy: 24 hours of data
• Removed the colocation logic from the Spark MemSQL connector
[Diagram: Spark writes data into MemSQL alongside dimension tables]
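For illustration, a sketch of the table layout these bullets imply, as MemSQL DDL embedded in Scala. Table and column names are assumptions, and exact DDL support (notably AUTO_INCREMENT on columnstore tables) varies by MemSQL version; treat this as a shape, not the production schema.

  // Hypothetical DDL: a clustered columnstore table keyed by batch,
  // sharded on an auto-increment id to avoid the data skew the slide
  // mentions with other shard-key choices.
  val createEventsTable: String =
    """CREATE TABLE events (
      |  id       BIGINT AUTO_INCREMENT,  -- shard key: even distribution
      |  batch_id BIGINT NOT NULL,        -- batch identifier used as the key
      |  payload  JSON,
      |  KEY (batch_id) USING CLUSTERED COLUMNSTORE,
      |  SHARD KEY (id)
      |)""".stripMargin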
46. CDH vs EMR
CDH                                        | EMR
Cannot scale out/in on demand              | Can scale out/in on demand
No extra cost (for the community license)  | Extra ~30% on top of EC2 costs
Adding machines requires restarting YARN   | No YARN restart
Easy configuration management via CM       | Limited configuration available during EMR creation
Classic YARN cluster                       | Usual YARN under the hood, but imposes an EMR-driven way to deploy apps
Single CDH per region                      | EMR cluster on demand as the unit of clustering
56. MATH BEHIND DECISIONS
• If enrichment is CPU-intensive, align the number of RDD partitions to (E * C)
• If possible, align the number of Kafka topic partitions to (E * C)
• Before writing to MemSQL, repartition to (M * 4)
• Do not be afraid to repartition/coalesce between these stages if needed
• Choose MemSQL and Kafka cluster sizes respecting the USE method (Utilization, Saturation, Errors) and the corresponding responsibility zone
E – # of executors
C – # of vcores per executor
M – # of machines in the MemSQL cluster
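A worked example of the sizing rules above. The cluster numbers are illustrative assumptions, not figures from the talk.

  val executors      = 16 // E: number of executors
  val vcoresPerExec  = 4  // C: vcores per executor
  val memsqlMachines = 8  // M: machines in the MemSQL cluster

  // CPU-intensive enrichment: one partition per available core.
  val enrichPartitions = executors * vcoresPerExec // E * C = 64

  // Write stage: 4 partitions per MemSQL machine.
  val writePartitions = memsqlMachines * 4 // M * 4 = 32

  // In a job this becomes explicit repartitioning between stages, e.g.:
  //   records.repartition(enrichPartitions)  // before enrichment
  //          .map(enrich)
  //          .coalesce(writePartitions)      // before the MemSQL write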