Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Moving Towards a Streaming Architecture

At improve digital we collect and store large volumes of machine generated and behavioural data from our fleet of ad servers. For some time we have performed mostly batch processing through a data warehouse that combines traditional RDBMs (MySQL), columnar stores (Infobright, impala+parquet) and Hadoop.

We wish to share our experiences in enhancing this capability with systems and techniques that process the data as streams in near-realtime. In particular we will cover:

• The architectural need for an approach to data collection and distribution as a first-class capability

• The different needs of the ingest pipeline required by streamed realtime data, the challenges faced in building these pipelines and how they forced us to start thinking about the concept of production-ready data.

• The tools we used, in particular Apache Kafka as the message broker, Apache Samza for stream processing and Apache Avro to allow schema evolution; an essential element to handle data whose formats will change over time.

• The unexpected capabilities enabled by this approach, including the value in using realtime alerting as a strong adjunct to data validation and testing.

• What this has meant for our approach to analytics and how we are moving to online learning and realtime simulation.

This is still a work in progress at Improve Digital with differing levels of production-deployed capability across the topics above. We feel our experiences can help inform others embarking on a similar journey and hopefully allow them to learn from our initiative in this space.

  • Be the first to comment

  • Be the first to like this

Moving Towards a Streaming Architecture

  1. 1. Moving Towards a Streaming Architecture Garry Turkington (CTO), Gabriele Modena (Data Scientist) Improve Digital
  2. 2. Real Time Adtech SSP @ImproveDigital
  3. 3. Data sources  A geographically distributed fleet of ad servers push out several types of data, primarily: - 1 minute metrics packets - 15 minute full log files  Previously used ad hoc system to process the former, Hadoop for the latter
  4. 4. Conceptualizing the data  We think of event streams when considering the operational system  But have this sliced into many log files for practical reasons  Recreating the stream-based view gets us back to what is happening  And brings data creation and processing closer together
  5. 5. How did we get here  Need to process more data, faster - Reporting - Decision Making  Evolution of the system to meet scalability requirements  Stream processing by definition  It is changing the way we think about data collection and distribution  Not everything is obvious  It opened new doors
  6. 6. Let’s take a step back  Aggregates grow in number and size  Keep a copy in HDFS  Run job, then go and make coffee  Generate new / other aggregates / ad hoc datasets  Analytics datawarehouse (column store), denormalized data model  BI, reporting  Hadoop for logs and event processing (eg. ETL, simulation, learning, optimization)
  7. 7. Hadoop workflows
  8. 8. Incremental processing  Ingest data every eg. 24 hours (/data/raw/event/d=YYYY-MM-DD)  Scheduling and partitioning are straightforward  Schema migrations: Avro + HCatalog  Fixed length time windows - What is a good length? - Ok if it takes 1 hour to process 15 minutes of data  Need to reprocess data  It can be made efficient (eg. DataFu Hourglass)  MapReduce
  9. 9. A more interactive system We added tools to circumvent MapReduce - Latency, really - Still a “batch” mindset Cloudera Impala (+ Parquet) - Fast, familiar Move computation closer to data - Spark (among other things) for prototyping - Python/R integration, batch and “streaming”
  10. 10. Outcome  Multiple overlapping data sources, processing tools - point-to-point  Different pipelines for analytics and production models - Divide - Where is the “correct” data ?
  11. 11. All in all  Data collection and distribution is critical  We are still looking at “yesterday’s” data  We need a different level of abstraction
  12. 12. Challenges of near-realtime data  Instructive to ask why near-realtime data analysis is hard  In Hadoop batch world we’d be happy processing 24hrs of data in 4 hrs  So why do we flinch to process 24hrs in 24hrs?  What allows batch workloads to process in sub-linear time: - Scheduling - Parallelization - Partitioning
  13. 13. The streaming view  Consider these from the streaming perspective: - Scheduling is challenging because of unknown and variable data rates - Parallelization breaks our nice logical model of a single data stream - Partitioning can’t be simple file size/no of blocks - The method of partitioning the stream is fundamental
  14. 14. Kafka  We use Kafka ( for all data we view as streams  It is resilient, performant and provides a very clear topic-based interface  Producers write to topics, consumers read from topics  Topics can be partitioned based on a producer-provided key  Provides at-least once message delivery semantics  Messages are persistently stored – topics can be re-read
  15. 15. Samza  We use Samza ( for stream processing  Analogous to MapReduce atop YARN atop HDFS  We have Samza atop YARN atop Kafka E L E M O D E N A L E A R N IN G H A D O O P 2 A p a c h e S a m z a afka for streaming ! arn for resource anagement and exec! amzaAPI for ocessing! weet spot: second, inutes Y A R N S a m z a A P I K a fk a
  16. 16. Samza API A simple call-back API that should be familiar to MapReduce developers: public interface StreamTask{ void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator) ; } public interface WindowableTask{ void window(MessageCollector collector, StreamCoordinator coordinator); }
  17. 17. Kafka/Samza integration  A Samza job is comprised of multiple tasks  Each Kafka partition topic is consumed by only one Samza task  Each YARN container runs multiple Samza tasks  The output of a task is often written to a different topic  Overall applications are then constructed by composition of jobs
  18. 18. Stream job output  You’ve processed the data, now what?  Need consider how the output of each stream partition will be processed  A downstream destination with an idempotent interface is ideal  Otherwise try to have expressions of unique state  Hardest is when the outputs need applied in sequence
  19. 19. Example: log data  It’s a common model: server process rolls a log every x minutes  Each resultant file is ingested into Kafka and read by multiple Samza jobs  Each Samza task receives a series of potentially interleaved log chunks  Consider a task receiving data from servers S1, S2 and S3 across time periods T1..T3  It’s view may become something like: S1[T1..T2], S2[T1..T2], S1[T2..T3], S3[T1..T2]…
  20. 20. Example - Streaming aggregation  Input is a Stream of records with the form (key, value)  Output is a series of (key, aggregate)  3 approaches to producing these outputs: - Each task pushes local counts downstream - Each task pushes to another stream aggregated by a different job - Each task uses local state to produce final aggregate values
  21. 21. Comparing stream aggregation  Option 1: each stream partition outputs local values for each key it sees across a time window - Each partial result is an update record (e.g. increment key k at time t by x) - Downstream consumer has the responsibility of combining partial results  Option 2 : aggregate partial results in a second stream processing job - The output becomes a series of state changes (set the value of key k at time t to x)
  22. 22. Comparing stream aggregation contd  Option 3 - push state into the state processors - Samza tasks can hold persistent local state - Creates a key/value store modelled as a Kafka topic - Ensure that all records for each key are sent to the same partition - State processing task can then create unique state aggregates locally
  23. 23. A side effect of Kafka  A firehose for event data - Can tap into with Samza - Can tap into with Spark  Unifies how we access and publish data - Streams are defined by application / use case - These can come from other teams - Single entry point, multiple outputs
  24. 24. Analytics on streams Incremental processing pipelines as a basis Tried to reuse logic  Not that simple Conceptualization vs. implementation
  25. 25. Time windows  Temporal constraints  When is data ready? - The key is defining “ready”, which is weaker than “complete” - Enough to make an informed decision (this varies case by case) - We can update the dataset to improve a confidence interval  When are we done processing? - Tradeoff between timeliness and accuracy
  26. 26. Schema evolution  Schema changes appear within / across time windows - Not a solved problem  Stop streaming, propagate changes, replay or resume  SAMZA-317 - Avro SerDe - Schema stream
  27. 27. Challenges  Data completeness  Windows length  Tradeoff between finding answers / satisfying temporal constraints  Need to adjust our approach (representation and data structures) - approximate - probabilistic - iterations to convergence  This is a work in progress
  28. 28. Model based outlier detection Previously: Batch and real time datasets = different approaches Now: Same dataset Model based anomaly detection  Reuse knowledge and infrastructure What is normal?  How do we set thresholds ?  Literature review (q-digest, t-digest)
  29. 29. Testing and alerting  Some features are hard to test  Functionally correct but unexpected consequences  unexpected ~ hard to predict  advertisment is a complex system  Bugs are outliers we can detect (sort of)
  30. 30. Testing as predictive modelling  Monitoring (next to testing)  Make software monitorable  Think of testing as an experiment  Establish short, continuous, feedback loops
  31. 31. Implications  How do we use real time output - “live” dashboards - Instrumentation and alerting (predict failure) - feed optimization models - back propagating methods into ETL  How we think about feedback loops  Make data more consumable
  32. 32. Conclusion  Streaming is part of the broader system  We are fine tuning how the two fit together  Data collection and distribution is important  publish results follows  Kafka + Samza  data and processing model that fits our applications well  Conceptualization vs. implementation  Real time != incremental  A certain degree of uncertainty  Tradeoffs