Stream processing

Slides for my talk at Oredev 2016. The talk introduces stream processing, some techniques, and example uses. It also introduces technologies such as Kafka, Cassandra, and Spark, with their pros and cons.

Video available at https://vimeo.com/191056269

Published in: Software

  1. STREAM PROCESSING @ASHIC HTTP://WWW.HEARTYSOFT.COM
  2. BIG DATA • What?
  3. BIG DATA • Hadoop • Map-Reduce • Spark
  4. BIG DATA • Optimisations • Parquet, etc.
  5. BIG DATA • Problems?
  6. BIG DATA • Problems?
  7. BIG DATA • Problems?
  8. STREAMING DATA • What?
  9. STREAMING DATA • Cheaper? • Timely results? • Approximations?
  10. STREAMING DATA
  11. EXAMPLES • Statistical Summaries: Mean, Standard Deviation
  12. EXAMPLES • Statistical Summaries: Hold n, sum, and sum of squares => Mean, Standard Deviation
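The bookkeeping on slide 12 can be sketched in a few lines. This is a minimal illustration, not code from the talk: the class name `RunningStats` and its method names are assumptions, and it computes the population standard deviation.

```python
import math


class RunningStats:
    """Keeps only n, sum, and sum of squares (hypothetical helper, not from the talk)."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        # Constant work and constant memory per stream element.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        if self.n == 0:
            raise ValueError("no data seen yet")
        return self.total / self.n

    def std(self):
        # Population variance: E[x^2] - (E[x])^2, clamped at 0 against rounding error.
        m = self.mean()
        variance = self.total_sq / self.n - m * m
        return math.sqrt(max(variance, 0.0))
```

Because only three numbers are stored, summaries from separate partitions can be merged by adding the fields. For values with a large mean and a small spread, the subtraction in `std` can lose precision, so a production version may prefer a numerically stabler update.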
  13. EXAMPLES • Statistical Summaries: Approximation of Median
  14. EXAMPLES • Statistical Summaries • Start with a value • If item > value, add learning rate • If item < value, subtract learning rate => Approximation of Median
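The update rule on slide 14 translates almost directly into code. The sketch below is illustrative only: the class name `StreamingMedian`, the default starting value, and the default learning rate are assumptions.

```python
class StreamingMedian:
    """Nudges an estimate towards the median, one item at a time (hypothetical helper)."""

    def __init__(self, start=0.0, learning_rate=1.0):
        self.value = start
        self.learning_rate = learning_rate

    def add(self, item):
        # Move up if the item is above the estimate, down if it is below.
        if item > self.value:
            self.value += self.learning_rate
        elif item < self.value:
            self.value -= self.learning_rate
        return self.value
```

The estimate settles where upward and downward nudges are equally likely, which is the median. A larger learning rate converges faster but oscillates more around the true value.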
  15. EXAMPLES • Taking Representative Samples • From weblogs (i.e. ip-timestamp tuples), approximate the average percentage of users who have revisited.
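Slide 15 does not say how the sample is taken. One common approach, assumed here, is to sample users rather than log lines: hash the IP into buckets and keep every record for users whose bucket falls in the sample, so each sampled user's full history is available. The function names and the bucket parameters below are hypothetical.

```python
import hashlib
from collections import Counter


def in_sample(ip, buckets=10, keep=1):
    """True if this user falls in the sampled buckets (keep/buckets of all users)."""
    digest = hashlib.sha256(ip.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets < keep


def revisit_fraction(log, buckets=10, keep=1):
    """Fraction of sampled users with more than one visit; log is (ip, timestamp) tuples."""
    visits = Counter(ip for ip, _ts in log if in_sample(ip, buckets, keep))
    if not visits:
        return None  # no sampled user appeared in the log
    return sum(1 for count in visits.values() if count > 1) / len(visits)
```

Sampling individual log lines instead would under-count revisits, because a user's second visit would often be dropped while the first was kept.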
  16. EXAMPLES • Filtering Streams: Filter Out (or In) Things That May Not Be Needed
  17. EXAMPLES • Filtering Streams: Bloom Filter • Hash based on criterion • Matching hash means entry may be in there • Non-matching hash means it’s definitely not
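A small Bloom filter shows the "maybe present, definitely absent" behaviour from slide 17. This is a sketch under assumptions: the class name, the bit-array size, the number of hash functions, and the use of salted SHA-256 as the hash family are all illustrative choices.

```python
import hashlib


class BloomFilter:
    """Set membership with false positives but no false negatives (hypothetical helper)."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive several hash functions by salting one strong hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # All positions set: the item may be present. Any position clear: definitely absent.
        return all(self.bits[pos] for pos in self._positions(item))
```

The false positive rate grows as the array fills up, so the size and hash count are normally chosen from the expected number of entries.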
  18. EXAMPLES How Many Distinct Things Did We Get?
  19. EXAMPLES • Approximate Distinct Elements: Flajolet-Martin Algorithm • Hash element (or identifier) to longs using many hash functions. Count trailing zeroes of hash. Let it be r. • Approximation for distinct elements = 2^R where R = max(r) • Combine groups of hashes: Take average for each group, then take median of the averages.
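The steps on slide 19 can be sketched as follows. The class name, the group sizes, and the use of salted SHA-256 truncated to 64 bits as the family of hash functions are assumptions made for illustration.

```python
import hashlib
import statistics


def trailing_zeros(n, width=64):
    """Number of trailing zero bits in n (width if n is 0)."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1


class FlajoletMartin:
    """Approximate distinct count from max trailing zeros per hash function."""

    def __init__(self, num_groups=8, group_size=8):
        self.group_size = group_size
        self.max_r = [0] * (num_groups * group_size)  # R for each hash function
        self.seen_any = False

    def _hash(self, item, seed):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        self.seen_any = True
        for seed in range(len(self.max_r)):
            r = trailing_zeros(self._hash(item, seed))
            if r > self.max_r[seed]:
                self.max_r[seed] = r

    def estimate(self):
        if not self.seen_any:
            return 0.0
        estimates = [2 ** r for r in self.max_r]  # 2^R per hash function
        # Average within each group, then take the median of the group averages.
        averages = [
            statistics.mean(estimates[i:i + self.group_size])
            for i in range(0, len(estimates), self.group_size)
        ]
        return statistics.median(averages)
```

Repeated elements hash to the same values and so never raise any maximum, which is why the estimate tracks distinct elements rather than total elements. The memory used is one small integer per hash function, regardless of stream length.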
  20. EXAMPLES • Clustering • Bradley, Fayyad, Reina (BFR) approach. • BDMO algorithm.
  21. BACK TO…
  22. USEFUL TECHNOLOGY • Apache Kafka • Apache Cassandra • Apache Spark
  23. KAFKA • Scale out, clustered, durable message broker. • Fault tolerant, replicated. • Uses topics, which have partitions. • Messages within partitions have guaranteed ordering.
  24. KAFKA • Kafka Streams: Lightweight Kafka => [x] library • Kafka Connect: Enables streaming large amounts of data reliably between Kafka and other systems • Schema Registry: Well…registry for schemas
  25. KAFKA
  26. KAFKA - GOTCHAS • Messages in a partition are ordered; message processing may not be. • At least once… downstream idempotence required. • Disk. • Rebalances.
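The "at least once… downstream idempotence required" point on slide 26 can be illustrated without any Kafka client. The sketch below is plain Python and uses no Kafka API: `IdempotentSink` and its fields are hypothetical, and it assumes each message arrives with its partition and offset and that offsets within a partition are delivered in order.

```python
class IdempotentSink:
    """Applies each (partition, offset) at most once, even if it is redelivered."""

    def __init__(self):
        self.applied = {}  # partition -> highest offset already applied
        self.totals = {}   # key -> running total (the downstream state)

    def handle(self, partition, offset, key, amount):
        # A redelivered message has an offset we have already applied: skip it.
        if offset <= self.applied.get(partition, -1):
            return False
        self.totals[key] = self.totals.get(key, 0) + amount
        self.applied[partition] = offset
        return True
```

In a real system the state change and the recorded offset would need to be persisted together, otherwise a crash between the two writes reintroduces duplicates.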
  27. CASSANDRA • Partitioned row store. • Fault tolerant, masterless. • Very fast writes, fast reads. • Tunable consistency. • Multi-datacentre aware. • OLTP + OLAP (via Spark).
  28. CASSANDRA - DATACENTRES
  29. CASSANDRA – SCHEMA • Collection Types • User defined types • Static Columns • Materialised Views
  30. CASSANDRA - CQL
  31. CASSANDRA – DATA MODELLING • NOT a relational database • KNOW YOUR QUERIES • Model for queries, not normalisation • Consolidate to the minimal number of tables that get the job done • Unbounded partition growth will bring down nodes, then quorum
  32. CASSANDRA + SPARK
  33. SPARK • General purpose data processing • Ability to cache things in memory, and re-use across steps.
  34. SPARK
  35. SPARK STREAMING • Microbatches • Similar API to non-streaming Spark
  36. SPARK STREAMING WC
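The code shown on slide 36 is not part of this transcript. As a stand-in, the micro-batch idea from slide 35 can be illustrated in plain Python, without any Spark API: events are grouped into fixed time intervals and each group is processed as a small batch. The function names and the event shape `(timestamp, line)` are assumptions.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp, line) events into consecutive fixed-length time windows."""
    batches = defaultdict(list)
    for ts, line in events:
        batches[int(ts // interval_seconds)].append(line)
    return [batches[window] for window in sorted(batches)]


def word_count(batch):
    """Ordinary batch word count, applied to one micro-batch at a time."""
    return Counter(word for line in batch for word in line.split())


events = [(0.2, "a b"), (0.9, "a"), (1.5, "b b")]
counts_per_batch = [word_count(batch) for batch in micro_batches(events, 1)]
```

The per-batch function is the same one a non-streaming job would use, which is the sense in which the streaming and batch APIs look similar.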
  37. SPARK + KAFKA: Kafka Direct Stream
  38. SPARK + CASSANDRA • rdd.saveToCassandra • sc.cassandraTable
  39. KAFKA + CASSANDRA • Cassandra Sink • Cassandra Connect
  40. STREAM PROCESSING • Lots of open problems • RISELab (Real-time, Intelligent, and Secure Execution)
  41. THANK YOU @ashic http://github/Heartysoft/cassy-up
