An adaptive and eventually self-healing framework for geo-distributed real-time data ingestion
Angad Singh
InMobi
The problem domain
Scale
● 15 billion events per day (post filtering)
● 1.5+ billion users, 200 million per day
● 4 geographically distributed data centers (DCs)
● A user’s request may land on a non-local DC
Ingestion requirements
● multiple tenants, multiple schemas per tenant
● batch, stream, micro-batch and on-demand ingestion
● 20+ streams, 100+ data types
● need to ingest, transform, validate and aggregate this data
● need to ingest streaming data in real-time (<1 min) for ad-serving/targeting use cases (strict SLA)
The problem domain
Usage/serving requirements
● need to pivot this data by user, activity type and other primary keys
● serve an aggregated view (profile) at the end in < 5ms p99 latency
● need both real-time serving of the view and batch summaries for analytics, inference algorithms, feedback loops
● need to be resilient to failure, absolutely no room for data loss/lag in ingestion
Data arrival, volume and velocity
● data may be received out of order, or duplicated
● data can arrive in periodic batches, as real-time streams, or only occasionally
● data may arrive in bursts or trickle slowly in some streams (requiring autoscaling)
● user data may be received in any DC, but needs to be collectively available in a single DC
The problem domain
Multi-tenancy
● Quotas
● Rate limiting/SLAs
● Isolation
Manageability
● needs to be self-serve, flexible for flow-specific changes, and easily deployable
● may need online migration, reprocessing, etc. of data
● hassle-free schema evolution across the stack
● monitoring, visibility, operability aspects for all of the above
The architecture
[Diagram] The stack has two layers. The serving layer is a user store on an aerospike cluster, fronted by an API that applies dedup, aggregation and business rules plus rate limiting/quotas; it serves ad serving at <5 ms with 99.95% success and does real-time enrichment on user engagement. Changes are published as notifications to pubsub (kafka) and picked up by notification listeners (storm); periodic dumps and streaming output land in an offline snapshot store (HDFS) that feeds batch inference jobs (MR/spark) and an analytics engine (cubes, lens). The ingestion layer runs per local DC: adaptors, routers and sinks (each on MR/storm) consume upstream batch and streaming ingestion sources and move data between local, remote and global DCs, with an Ingestion Service orchestrating and managing the whole flow.
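As a concrete slice of the notification path (kafka pubsub feeding storm listeners), a minimal topology might be wired roughly as below, using the storm-kafka spout of the 0.9.x era. The topic, ZK quorum, component names and the bolt's handling are assumptions for illustration, not details from the deck.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class NotificationListenerTopology {

    // Placeholder for the real listener logic (e.g. triggering real-time
    // enrichment on user engagement); the handling here is hypothetical.
    public static class NotificationBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String notification = tuple.getString(0);
            // ... react to the user-store change notification ...
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        SpoutConfig spoutConf = new SpoutConfig(
            new ZkHosts("zk1:2181,zk2:2181"),    // hypothetical ZK quorum
            "user-notifications",                // hypothetical topic name
            "/kafka-offsets", "notif-listener"); // offset root + consumer id in ZK
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("notifications", new KafkaSpout(spoutConf), 4);
        builder.setBolt("listener", new NotificationBolt(), 8)
               .shuffleGrouping("notifications");

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("notification-listeners", conf, builder.createTopology());
    }
}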
Cross-DC architecture
[Diagram] Three DCs: DC1 (global), DC2 and DC3 (slaves). Each DC runs adaptors and routers (MR in DC1, storm in DC2/DC3) and keeps User-Colo metadata in aerospike, synced across DCs by custom replication. Every incoming record contains a userid, and the routers call getColo(userid) against that metadata. If a tag is found, the data is delivered to the sinks of the owning DC, locally or via the kafka-data-replicator (storm topology). If no tag is found, the data goes to the global colo tagger (storm) in DC1, which assigns and writes the tag and re-emits the tagged data. The sinks in each DC feed a User Store (API, History, Profile), and profiles are replicated across DCs by aerospike XDR.
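The per-event routing decision itself is small; a minimal sketch follows, assuming a hypothetical metadata-client interface (the real implementation is the routers/tagger storm topology above).

public class ColoRouter {

    // Hypothetical client over the User-Colo metadata (aerospike) store.
    public interface UserColoMetadata {
        String getColo(String userId);  // null when the user has no tag yet
    }

    public enum Destination { LOCAL_SINKS, REMOTE_DC, GLOBAL_COLO_TAGGER }

    private final UserColoMetadata metadata;
    private final String localColo;

    public ColoRouter(UserColoMetadata metadata, String localColo) {
        this.metadata = metadata;
        this.localColo = localColo;
    }

    public Destination route(String userId) {
        String colo = metadata.getColo(userId);    // the getColo(userid) lookup in the diagram
        if (colo == null) {
            return Destination.GLOBAL_COLO_TAGGER; // tag not found: global tagger assigns + writes one
        }
        return localColo.equals(colo)
            ? Destination.LOCAL_SINKS              // tag found, user owned by this DC
            : Destination.REMOTE_DC;               // tag found elsewhere: ship via kafka-data-replicator
    }
}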
Cross-DC architecture (contd.)
[Diagram] The same three-DC topology, except that the User-Colo metadata is synced by aerospike XDR instead of custom replication, and the routers run on storm in all DCs. The slide also draws a comparison to map-reduce: the adaptors act as the Mapper, the getColo lookup as the Partitioner, the kafka-data-replicator as the Shuffler, and the sinks as the Reducer.
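The getColo(userid) lookup and the tagger's write-back against the aerospike metadata store could look roughly like this with the Aerospike Java client; the namespace, set and bin names are hypothetical.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class UserColoLookup {
    private final AerospikeClient client;

    public UserColoLookup(String host, int port) {
        this.client = new AerospikeClient(host, port);
    }

    // Returns the user's home colo tag, or null if the user is untagged.
    // Namespace "metadata", set "user_colo" and bin "colo" are hypothetical names.
    public String getColo(String userId) {
        Key key = new Key("metadata", "user_colo", userId);
        Record record = client.get(null, key);
        return record == null ? null : record.getString("colo");
    }

    // Writes the tag assigned by the global colo tagger.
    public void writeTag(String userId, String colo) {
        client.put(null, new Key("metadata", "user_colo", userId),
                   new Bin("colo", colo));
    }
}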
The ingestion layer
Current Features
Business-agnostic APIs
● Built on simple RESTful APIs: Schema, Feed, Sink, Source, Flow, Driver, Data router, Adaptor
● Unified APIs for batch, streaming and micro-batch ingestion.
● Self-serve system which provides rule validation, metrics, etc. and makes the expression of sources, sinks and flows easy with a custom DSL (a hypothetical registration call is sketched below).
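The deck names the entity types (Schema, Feed, Sink, Source, Flow, Driver, Data router, Adaptor) but not the API's shape, so this registration call is purely illustrative: the endpoint path, JSON fields and values are all assumptions.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RegisterFlow {
    public static void main(String[] args) throws Exception {
        // Hypothetical flow spec tying together previously registered entities.
        String flowSpec =
            "{ \"name\": \"clicks-to-userstore\","
          + "  \"source\": \"clicks-stream\","     // a registered Source/Feed
          + "  \"sink\": \"user-store\","          // a registered Sink
          + "  \"schema\": \"click-event:v3\","    // validated against the Schema entity
          + "  \"driver\": \"storm\" }";           // which execution engine runs the flow

        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://ingestion-service/api/v1/flows").openConnection(); // hypothetical endpoint
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(flowSpec.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}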
Platform-agnostic Flow Execution
● Pluggable execution engine (storm, hadoop, spark) - provides a Driver API (see the interface sketch below)
● Uses falcon for batch scheduling and an in-built scheduler for streaming drivers (storm, etc.)
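The deck only names the Driver API, so the contract below is a guess at its rough shape, not the actual interface.

import java.util.Map;

// Hypothetical Driver contract: the execution engine is pluggable behind it.
public interface Driver {
    // Validate and deploy a flow on the underlying engine
    // (storm topology, MR job, spark job).
    void submit(String flowName, Map<String, Object> flowSpec) throws Exception;

    // Stop a running flow.
    void kill(String flowName) throws Exception;

    // Engine-level status/metrics for the ingestion service.
    Map<String, Object> status(String flowName) throws Exception;
}

// Engine-specific implementations plug in behind the same contract:
//   class StormDriver implements Driver { ... }      // streaming, in-built scheduler
//   class MapReduceDriver implements Driver { ... }  // batch, scheduled via falcon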
Serialization support
● Pluggable schema serde support (thrift, avro)
Current Features
Schema management
● Schema is a first class citizen.
● Contracts between sources, sinks and flows are all based on and validated against schemas
● Schema versioning and compatibility checks (example below).
● Error-free schema evolution across data flows
● Clean abstractions to centrally manage all the schemas, data sources/feeds, sinks (key-value store, HDFS, etc.) and data flows (storm topologies, MR jobs) which are part of the ingestion pipelines
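One way such a compatibility gate can be implemented for the avro case is with avro's built-in checker; the toy schemas below are illustrative, not from the deck.

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class SchemaCompatCheck {
    public static void main(String[] args) {
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"userId\",\"type\":\"string\"}]}");

        // v2 adds a field with a default, which stays backward compatible.
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"userId\",\"type\":\"string\"},"
          + "{\"name\":\"activityType\",\"type\":\"string\",\"default\":\"unknown\"}]}");

        SchemaCompatibilityType result = SchemaCompatibility
            .checkReaderWriterCompatibility(v2, v1)  // can a v2 reader read v1 data?
            .getType();

        if (result != SchemaCompatibilityType.COMPATIBLE) {
            throw new IllegalStateException("Rejecting schema update: " + result);
        }
        System.out.println("v1 -> v2 evolution is safe");
    }
}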
Manageability, operability
● All entities - schemas, sinks and flows - can be updated online without any downtime.
● Retries, error handling, metrics, orchestration hooks, etc. come standard
Out of the box support for
● Cross-colo flow chaining
● Data routing
● Transformation, validation, conversion
● All based on pluggable code
The problems we’ve seen
Storm
● as usual, lots of knobs to tune based on lots of metrics: workers, threads, tasks, acks, max spout pending, buffer sizes, xmx, num slots, execute/process/ack latency, capacity, etc. (a few are shown in the config sketch after this list)
● debugging storm topologies isn’t easy: threads, workers, shared logs, shuffling of data between workers, netty, the ack system, etc.
● storm (0.9.x) doesn’t like heterogeneous load: unbalanced distribution between supervisors; heavy topologies can choke each other; rebalancing is not fully resource-aware (1.x tries to solve this)
● no rolling upgrades; supervisor failures cause unrecoverable errors
● zookeeper issues: too many executors lead to worker heartbeat update failures to ZK
● storm-kafka issue: the spout is unaware of purging (earliestOffset updates)
● storm-kafka issue: invisible data loss
● retries should be done cautiously
● etc.
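A few of those knobs, as they would be set on a 0.9.x topology config (the values are illustrative, not recommendations):

import backtype.storm.Config;

public class TuningExample {
    public static Config tunedConfig() {
        Config conf = new Config();
        conf.setNumWorkers(8);          // JVM workers (consumes supervisor slots)
        conf.setNumAckers(4);           // acker executors (the ack system)
        conf.setMaxSpoutPending(1000);  // cap on in-flight, un-acked tuples per spout task
        conf.setMessageTimeoutSecs(60); // replay tuples not fully acked in time
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g"); // per-worker heap (xmx)
        return conf;
    }
}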
Kafka
● topic deletion is asynchronous and slow
● num partitions must be tuned manually
● bad consumers can cause excessive logging on brokers
Features under development
● Autoscaling flows - rebalance a storm topology based on spout lag, priority and current throughput (or bolt capacity), driven by runtime metrics or linear regression on historical metrics (see the sketch after this list)
● Streaming and batch compaction/dedup of data based on domain-specific rules
● Automatic fallback from streaming to batch ingestion in case of huge backlogs, for low-priority ingestions
● Dynamic rerouting/sharding of data between DCs for load balancing cross-DC flows
● Eventual self-correction of data based on validations on the aggregated view (data received from multiple streams)
● Data lineage/auditing
● Backfill management
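A rough sketch of the autoscaling idea from the first bullet: derive a target spout parallelism from observed kafka lag and per-task throughput, then trigger storm's rebalance. The metric inputs, thresholds, cap and component names are hypothetical; the storm rebalance CLI invoked at the end is real.

public class AutoscaleSketch {

    // lagPerSec: how fast the kafka backlog is growing (events/s, from offsets)
    // throughputPerTask: what one spout task currently drains (events/s)
    static int targetParallelism(int current, double lagPerSec, double throughputPerTask) {
        if (lagPerSec <= 0) return current;            // keeping up, leave it alone
        int extra = (int) Math.ceil(lagPerSec / throughputPerTask);
        return Math.min(current + extra, 64);          // hypothetical hard cap
    }

    public static void main(String[] args) throws Exception {
        int target = targetParallelism(8, 5000.0, 1000.0);  // -> 13 tasks
        if (target != 8) {
            // storm's CLI rebalance: -w wait secs, -e component=parallelism
            new ProcessBuilder("storm", "rebalance", "ingest-topology",
                               "-w", "30", "-e", "kafka-spout=" + target)
                .inheritIO().start().waitFor();
        }
    }
}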
