An adaptive and eventually self-healing framework for geo-distributed real-time data ingestion
Angad Singh
InMobi
The problem domain
Scale
● 15 billion events per day (post filtering)
● 1.5+ billion users, 200 million per day
● 4 geographically distributed data centers (DCs)
● A user’s request may land on a non-local DC
Ingestion requirements
● multiple tenants, multiple schemas per tenant
● batch, stream, micro-batch and on-demand ingestion
● 20+ streams, 100+ data types
● need to ingest, transform, validate and aggregate this data
● need to ingest streaming data in real-time (<1 min) for ad-serving/targeting use cases (strict SLA)
The problem domain
Usage/serving requirements
● need to pivot this data by user, activity type and other primary keys
● serve an aggregated view (profile) at the end in < 5ms p99 latency
● need both real-time serving of the view and batch summaries for analytics, inference algorithms, feedback loops
● need to be resilient to failure, absolutely no room for data loss/lag in ingestion
Data arrival, volume and velocity
● data may be received out of order, or duplicated
● data can arrive in periodic batches, as real-time streams, or only occasionally
● data may arrive in bursts or trickle slowly in some streams (requiring autoscaling)
● user data may be received in any DC, but needs to be collectively available in a single DC
The problem domain
Multi-tenancy
● Quotas
● Rate limiting/SLAs
● Isolation
Manageability
● needs to be self-serve, flexible for flow-specific changes, and easily deployable
● may need online migration, reprocessing, etc. of data
● hassle-free schema evolution across the stack
● monitoring, visibility, operability aspects for all of the above
The architecture
[Diagram] The stack has two layers. The serving layer is a user store on an aerospike cluster, fronted by an API that applies dedup, aggregation and business rules plus rate limiting/quotas; it serves ad serving at <5 ms with 99.95% success and does real-time enrichment on user engagement. Changes are published as notifications to pubsub (kafka) and picked up by notification listeners (storm); periodic dumps and streaming output land in an offline snapshot store (HDFS) that feeds batch inference jobs (MR/spark) and an analytics engine (cubes, lens). The ingestion layer runs per local DC: adaptors, routers and sinks (each on MR/storm) consume upstream batch and streaming ingestion sources and move data between local, remote and global DCs, with an Ingestion Service orchestrating and managing the whole flow.
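As a concrete slice of the notification path (kafka pubsub feeding storm listeners), a minimal topology might be wired roughly as below, using the storm-kafka spout of the 0.9.x era. The topic, ZK quorum, component names and the bolt's handling are assumptions for illustration, not details from the deck.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class NotificationListenerTopology {

    // Placeholder for the real listener logic (e.g. triggering real-time
    // enrichment on user engagement); the handling here is hypothetical.
    public static class NotificationBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String notification = tuple.getString(0);
            // ... react to the user-store change notification ...
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        SpoutConfig spoutConf = new SpoutConfig(
            new ZkHosts("zk1:2181,zk2:2181"),    // hypothetical ZK quorum
            "user-notifications",                // hypothetical topic name
            "/kafka-offsets", "notif-listener"); // offset root + consumer id in ZK
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("notifications", new KafkaSpout(spoutConf), 4);
        builder.setBolt("listener", new NotificationBolt(), 8)
               .shuffleGrouping("notifications");

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("notification-listeners", conf, builder.createTopology());
    }
}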
Cross-DC architecture
[Diagram] Three DCs: DC1 (global), DC2 and DC3 (slaves). Each DC runs adaptors and routers (MR in DC1, storm in DC2/DC3) and keeps User-Colo metadata in aerospike, synced across DCs by custom replication. Every incoming record contains a userid, and the routers call getColo(userid) against that metadata. If a tag is found, the data is delivered to the sinks of the owning DC, locally or via the kafka-data-replicator (storm topology). If no tag is found, the data goes to the global colo tagger (storm) in DC1, which assigns and writes the tag and re-emits the tagged data. The sinks in each DC feed a User Store (API, History, Profile), and profiles are replicated across DCs by aerospike XDR.
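The per-event routing decision itself is small; a minimal sketch follows, assuming a hypothetical metadata-client interface (the real implementation is the routers/tagger storm topology above).

public class ColoRouter {

    // Hypothetical client over the User-Colo metadata (aerospike) store.
    public interface UserColoMetadata {
        String getColo(String userId);  // null when the user has no tag yet
    }

    public enum Destination { LOCAL_SINKS, REMOTE_DC, GLOBAL_COLO_TAGGER }

    private final UserColoMetadata metadata;
    private final String localColo;

    public ColoRouter(UserColoMetadata metadata, String localColo) {
        this.metadata = metadata;
        this.localColo = localColo;
    }

    public Destination route(String userId) {
        String colo = metadata.getColo(userId);    // the getColo(userid) lookup in the diagram
        if (colo == null) {
            return Destination.GLOBAL_COLO_TAGGER; // tag not found: global tagger assigns + writes one
        }
        return localColo.equals(colo)
            ? Destination.LOCAL_SINKS              // tag found, user owned by this DC
            : Destination.REMOTE_DC;               // tag found elsewhere: ship via kafka-data-replicator
    }
}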
Cross-DC architecture (contd.)
[Diagram] The same three-DC topology, except that the User-Colo metadata is synced by aerospike XDR instead of custom replication, and the routers run on storm in all DCs. The slide also draws a comparison to map-reduce: the adaptors act as the Mapper, the getColo lookup as the Partitioner, the kafka-data-replicator as the Shuffler, and the sinks as the Reducer.
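The getColo(userid) lookup and the tagger's write-back against the aerospike metadata store could look roughly like this with the Aerospike Java client; the namespace, set and bin names are hypothetical.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class UserColoLookup {
    private final AerospikeClient client;

    public UserColoLookup(String host, int port) {
        this.client = new AerospikeClient(host, port);
    }

    // Returns the user's home colo tag, or null if the user is untagged.
    // Namespace "metadata", set "user_colo" and bin "colo" are hypothetical names.
    public String getColo(String userId) {
        Key key = new Key("metadata", "user_colo", userId);
        Record record = client.get(null, key);
        return record == null ? null : record.getString("colo");
    }

    // Writes the tag assigned by the global colo tagger.
    public void writeTag(String userId, String colo) {
        client.put(null, new Key("metadata", "user_colo", userId),
                   new Bin("colo", colo));
    }
}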
The ingestion layer
Current Features
Business-agnostic APIs
● Built on simple RESTful APIs: Schema, Feed, Sink, Source, Flow, Driver, Data router, Adaptor
● Unified APIs for batch, streaming and micro-batch ingestion.
● Self-serve system which provides rule validation, metrics, etc. and makes the expression of sources, sinks and flows easy with a custom DSL (a hypothetical registration call is sketched below).
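The deck names the entity types (Schema, Feed, Sink, Source, Flow, Driver, Data router, Adaptor) but not the API's shape, so this registration call is purely illustrative: the endpoint path, JSON fields and values are all assumptions.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RegisterFlow {
    public static void main(String[] args) throws Exception {
        // Hypothetical flow spec tying together previously registered entities.
        String flowSpec =
            "{ \"name\": \"clicks-to-userstore\","
          + "  \"source\": \"clicks-stream\","     // a registered Source/Feed
          + "  \"sink\": \"user-store\","          // a registered Sink
          + "  \"schema\": \"click-event:v3\","    // validated against the Schema entity
          + "  \"driver\": \"storm\" }";           // which execution engine runs the flow

        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://ingestion-service/api/v1/flows").openConnection(); // hypothetical endpoint
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(flowSpec.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}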
Platform-agnostic Flow Execution
● Pluggable execution engine (storm, hadoop, spark) - provides a Driver API (see the interface sketch below)
● Uses falcon for batch scheduling and an in-built scheduler for streaming drivers (storm, etc.)
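The deck only names the Driver API, so the contract below is a guess at its rough shape, not the actual interface.

import java.util.Map;

// Hypothetical Driver contract: the execution engine is pluggable behind it.
public interface Driver {
    // Validate and deploy a flow on the underlying engine
    // (storm topology, MR job, spark job).
    void submit(String flowName, Map<String, Object> flowSpec) throws Exception;

    // Stop a running flow.
    void kill(String flowName) throws Exception;

    // Engine-level status/metrics for the ingestion service.
    Map<String, Object> status(String flowName) throws Exception;
}

// Engine-specific implementations plug in behind the same contract:
//   class StormDriver implements Driver { ... }      // streaming, in-built scheduler
//   class MapReduceDriver implements Driver { ... }  // batch, scheduled via falcon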
Serialization support
● Pluggable schema serde support (thrift, avro)
Current Features
Schema management
● Schema is a first class citizen.
● Contracts between sources, sinks and flows are all based on and validated against schemas
● Schema versioning and compatibility checks (example below).
● Error-free schema evolution across data flows
● Clean abstractions to centrally manage all the schemas, data sources/feeds, sinks (key-value store, HDFS, etc.) and data flows (storm topologies, MR jobs) which are part of the ingestion pipelines
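One way such a compatibility gate can be implemented for the avro case is with avro's built-in checker; the toy schemas below are illustrative, not from the deck.

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class SchemaCompatCheck {
    public static void main(String[] args) {
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"userId\",\"type\":\"string\"}]}");

        // v2 adds a field with a default, which stays backward compatible.
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"userId\",\"type\":\"string\"},"
          + "{\"name\":\"activityType\",\"type\":\"string\",\"default\":\"unknown\"}]}");

        SchemaCompatibilityType result = SchemaCompatibility
            .checkReaderWriterCompatibility(v2, v1)  // can a v2 reader read v1 data?
            .getType();

        if (result != SchemaCompatibilityType.COMPATIBLE) {
            throw new IllegalStateException("Rejecting schema update: " + result);
        }
        System.out.println("v1 -> v2 evolution is safe");
    }
}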
Manageability, operability
● All entities - schemas, sinks and flows - can be updated online without any downtime.
● Retries, error handling, metrics, orchestration hooks, etc. come standard
Out of the box support for
● Cross-colo flow chaining
● Data routing
● Transformation, validation, conversion
● All based on pluggable code
The problems we’ve seen
Storm
● as usual, lots of knobs to tune based on lots of metrics: workers, threads, tasks, acks, max spout pending, buffer sizes, xmx, num slots, execute/process/ack latency, capacity, etc. (a few are shown in the config sketch after this list)
● debugging storm topologies isn’t easy: threads, workers, shared logs, shuffling of data between workers, netty, the ack system, etc.
● storm (0.9.x) doesn’t like heterogeneous load: unbalanced distribution between supervisors; heavy topologies can choke each other; rebalancing is not fully resource-aware (1.x tries to solve this)
● no rolling upgrades; supervisor failures cause unrecoverable errors
● zookeeper issues: too many executors lead to worker heartbeat update failures to ZK
● storm-kafka issue: the spout is unaware of purging (earliestOffset updates)
● storm-kafka issue: invisible data loss
● retries should be done cautiously
● etc.
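A few of those knobs, as they would be set on a 0.9.x topology config (the values are illustrative, not recommendations):

import backtype.storm.Config;

public class TuningExample {
    public static Config tunedConfig() {
        Config conf = new Config();
        conf.setNumWorkers(8);          // JVM workers (consumes supervisor slots)
        conf.setNumAckers(4);           // acker executors (the ack system)
        conf.setMaxSpoutPending(1000);  // cap on in-flight, un-acked tuples per spout task
        conf.setMessageTimeoutSecs(60); // replay tuples not fully acked in time
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g"); // per-worker heap (xmx)
        return conf;
    }
}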
Kafka
● topic deletion is asynchronous and slow
● num partitions must be tuned manually
● bad consumers can cause excessive logging on brokers
Features under development
● Autoscaling flows - rebalance a storm topology based on spout lag, priority and current throughput (or bolt capacity), driven by runtime metrics or linear regression on historical metrics (see the sketch after this list)
● Streaming and batch compaction/dedup of data based on domain-specific rules
● Automatic fallback from streaming to batch ingestion in case of huge backlogs, for low-priority ingestions
● Dynamic rerouting/sharding of data between DCs for load balancing cross-DC flows
● Eventual self-correction of data based on validations on the aggregated view (data received from multiple streams)
● Data lineage/auditing
● Backfill management
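A rough sketch of the autoscaling idea from the first bullet: derive a target spout parallelism from observed kafka lag and per-task throughput, then trigger storm's rebalance. The metric inputs, thresholds, cap and component names are hypothetical; the storm rebalance CLI invoked at the end is real.

public class AutoscaleSketch {

    // lagPerSec: how fast the kafka backlog is growing (events/s, from offsets)
    // throughputPerTask: what one spout task currently drains (events/s)
    static int targetParallelism(int current, double lagPerSec, double throughputPerTask) {
        if (lagPerSec <= 0) return current;            // keeping up, leave it alone
        int extra = (int) Math.ceil(lagPerSec / throughputPerTask);
        return Math.min(current + extra, 64);          // hypothetical hard cap
    }

    public static void main(String[] args) throws Exception {
        int target = targetParallelism(8, 5000.0, 1000.0);  // -> 13 tasks
        if (target != 8) {
            // storm's CLI rebalance: -w wait secs, -e component=parallelism
            new ProcessBuilder("storm", "rebalance", "ingest-topology",
                               "-w", "30", "-e", "kafka-spout=" + target)
                .inheritIO().start().waitFor();
        }
    }
}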
