
High-Volume Data Collection and Real Time Analytics Using Redis

In this talk, we describe using Redis, an open source, in-memory key-value store, to capture large volumes of data from numerous remote sources while also allowing real-time monitoring and analytics. With this approach, we captured a high volume of continuous data from numerous remote environmental sensors while consistently querying our database for real-time monitoring and analytics.

* See more of my work at http://www.codehenge.net



  1. 1. Large-Scale Data Collection Using Redis. C. Aaron Cois, Ph.D. and Tim Palko, CMU Software Engineering Institute. © 2011 Carnegie Mellon University
  2. 2. Us: C. Aaron Cois, Ph.D. (@aaroncois), Software Architect, Team Lead, CMU Software Engineering Institute, Digital Intelligence and Investigations Directorate. Tim Palko, Senior Software Engineer, CMU Software Engineering Institute, Digital Intelligence and Investigations Directorate. © 2011 Carnegie Mellon University
  3. 3. Overview: • Problem Statement • Sensor Hardware & System Requirements • System Overview – Data Collection – Data Modeling – Data Access – Event Monitoring and Notification • Conclusions and Future Work
  4. 4. The Goal: Critical infrastructure/facility protection via environmental monitoring
  5. 5. Why? Stuxnet • Two major components: 1) Send centrifuges spinning wildly out of control 2) Record ‘normal operations’ and play them back to operators during the attack [1] • Environmental monitoring provides secondary indicators, such as abnormal heat/motion/sound. [1] http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&
  6. 6. The Broader Vision: Quick, flexible out-of-band monitoring • Set up monitoring in minutes • Versatile sensors, easily repurposed • Data communication is secure (P2P VPN) and requires no existing systems other than outbound networking
  7. 7. The Platform: A CMU research project called Sensor Andrew • Features: – Open-source sensor platform – Scalable and generalist system supporting a wide variety of applications – Extensible architecture • Can integrate diverse sensor types
  8. 8. Sensor Andrew
  9. 9. Sensor Andrew Overview [diagram: Nodes, Gateways, Server, End Users]
  10. 10. What is a Node? A node collects data and sends it to a collector, or gateway. Environment Node sensors: Light, Audio, Humidity, Pressure, Motion, Temperature, Acceleration. Power Node sensors: Current, Voltage, True Power, Energy. Radiation Node sensors: Alpha particle count per minute. Particulate Node sensors: Small Part. Count, Large Part. Count.
  11. 11. What is a Gateway? • A gateway receives UDP data from all nodes registered to it • An internal service: – Receives data continuously – Opens a server on a specified port – Continually transmits UDP data over this port
  12. 12. Requirements: We need to... 1. Collect data from nodes once per second 2. Scale to 100 gateways, each with 64 nodes 3. Detect events in real-time 4. Notify users about events in real-time 5. Retain all data collected for years, at least
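
To make the scale of requirements 1 and 2 concrete, the arithmetic can be sketched as follows. This is only a back-of-the-envelope calculation from the figures on the slide (one reading per node per second, 100 gateways, 64 nodes each); the function names are illustrative and not part of the original talk.

```python
# Figures taken from the requirements slide.
GATEWAYS = 100
NODES_PER_GATEWAY = 64
READINGS_PER_NODE_PER_SECOND = 1

SECONDS_PER_DAY = 24 * 60 * 60


def readings_per_second(gateways=GATEWAYS, nodes=NODES_PER_GATEWAY,
                        rate=READINGS_PER_NODE_PER_SECOND):
    """Datapoints arriving per second across the whole deployment."""
    return gateways * nodes * rate


def readings_per_day(gateways=GATEWAYS, nodes=NODES_PER_GATEWAY,
                     rate=READINGS_PER_NODE_PER_SECOND):
    """Datapoints arriving per day across the whole deployment."""
    return readings_per_second(gateways, nodes, rate) * SECONDS_PER_DAY


if __name__ == "__main__":
    print(readings_per_second())  # 6400
    print(readings_per_day())     # 552960000
```

At full scale the store would have to absorb 6,400 writes per second while still serving reads for event detection.
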
  14. 14. What Is Big Data? “When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share it.”
  20. 20. Problems: Size, Transmission, Rate, Storage, Retrieval
  21. 21. Collecting Data. Problem: Store and retrieve immense amounts of data at a high rate. Constraints: Data cannot remain on the nodes or gateways due to security concerns. Limited infrastructure. [diagram: Gateway, 8 GB / hour, destination unknown (?)]
  22. 22. We Tried PostgreSQL… • Advantages: – Reliable, tested and scalable – Relational => complex queries => analytics • Problems: – Performance problems reading while writing at a high rate; real-time event detection suffers – ‘COPY FROM’ doesn’t permit horizontal scaling
  24. 24. Q: How can we decrease I/O load? A: Read and write collected data directly from memory
  25. 25. Enter Redis: Redis is an in-memory NoSQL database, commonly used as a web application cache or pub/sub server
  26. 26. Redis • Created in 2009 • Fully in-memory key-value store – Fast I/O: R/W operations are equally fast – Advanced data structures • Publish/Subscribe functionality – In addition to data store functions – Separate from stored key-value data
  27. 27. Persistence • Snapshotting – Data is asynchronously transferred from memory to disk • AOF (Append Only File) – Each modifying operation is written to a file – Can recreate data store by replaying operations – Without interrupting service, Redis can rewrite the AOF as the shortest sequence of commands needed to rebuild the current dataset in memory
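
The append-only-file idea can be illustrated with a toy model. The sketch below is not how Redis implements AOF; it is a pure-Python illustration, assuming a log made only of hypothetical `SET` and `DEL` tuples, of the two properties the slide names: replaying the log recreates the data store, and the log can be rewritten as the shortest command sequence that rebuilds the current dataset. In a real deployment the rewrite is triggered with the Redis `BGREWRITEAOF` command.

```python
def replay(log):
    """Rebuild the in-memory state by applying each logged operation in order."""
    state = {}
    for op, key, *args in log:
        if op == "SET":
            state[key] = args[0]
        elif op == "DEL":
            state.pop(key, None)
        else:
            raise ValueError(f"unsupported operation: {op}")
    return state


def rewrite(log):
    """Return a minimal log for the same state: one SET per surviving key."""
    return [("SET", key, value) for key, value in replay(log).items()]


# Five logged operations, but only two keys survive.
log = [
    ("SET", "a", 1),
    ("SET", "a", 2),   # overwrites the first write to "a"
    ("SET", "b", 3),
    ("DEL", "b"),      # "b" no longer exists after this
    ("SET", "c", 4),
]
compact = rewrite(log)
print(compact)  # [('SET', 'a', 2), ('SET', 'c', 4)]
```

Replaying `compact` yields the same state as replaying `log`, with two operations instead of five.
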
  28. 28. Replication • Redis supports master-slave replication • Master-slave replication can be chained • Be careful: – Slaves are writeable! – Potential for data inconsistency • Fully compatible with Pub/Sub features
  29. 29. Redis Features: Advanced Data Structures. List: [A, B, C, D]. Set: {A, B, C, D}. Sorted Set {value:score}: {C:1, D:2, A:3, B:4}. Hash {key:value}: {field1:"A", field2:"B", field3:"C", field4:"D"}
  30. 30. Our Data Model
  31. 31. Constraints: Our data store must: – Hold time-series data – Be flexible in querying (by time, node, sensor) – Allow efficient querying of many records – Accept data out of order
  32. 32. Tradeoffs: Efficiency vs. Flexibility. One record per timestamp (Motion, Light, Audio, Temperature, Pressure, Humidity, Acceleration together in a single record) vs. one record per sensor data type (a separate record each for Motion, Light, Audio, Temperature, Pressure, Humidity, Acceleration)
  33. 33. Our Solution: Sorted Set. Datapoint key: sensor:env:101. Score: 1357542004000. Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
  37. 37. Sorted Set
          1357542004000: {"temp":523,..}
          1357542005000: {"temp":523,..}
          1357542007000: {"temp":530,..}
          1357542008000: {"temp":531,..}
          1357542009000: {"temp":540,..}
          1357542001000: {"temp":545,..}
          …
  38. 38. Sorted Set
          1357542004000: {"temp":523,..}
          1357542005000: {"temp":523,..}
          1357542006000: {"temp":527,..} <- fits nicely
          1357542007000: {"temp":530,..}
          1357542008000: {"temp":531,..}
          1357542009000: {"temp":540,..}
          1357542001000: {"temp":545,..}
          …
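
The "fits nicely" behaviour, where a late datapoint lands in score order regardless of arrival order, can be sketched without a Redis server. `ToySortedSet` below is a hypothetical pure-Python stand-in built on the standard `bisect` module; it mimics only the ordering property and, unlike a Redis sorted set, does not enforce member uniqueness.

```python
import bisect


class ToySortedSet:
    """Pure-Python stand-in for a sorted set: entries stay ordered by score."""

    def __init__(self):
        self._entries = []  # (score, member) tuples, always kept sorted

    def add(self, score, member):
        # insort finds the correct position, so arrival order does not matter.
        bisect.insort(self._entries, (score, member))

    def scores(self):
        return [score for score, _ in self._entries]

    def range_by_score(self, low, high):
        """Members whose score lies in the inclusive range [low, high]."""
        return [member for score, member in self._entries if low <= score <= high]


readings = ToySortedSet()
for ts, temp in [(1357542004000, 523), (1357542005000, 523), (1357542007000, 530)]:
    readings.add(ts, '{"temp": %d}' % temp)

# A delayed reading for ...006000 arrives after ...007000 was already stored.
readings.add(1357542006000, '{"temp": 527}')
print(readings.scores())
```

Querying by time range then works the same whether or not data arrived in order, which is what the "accept data out of order" constraint asks for.
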
  39. 39. Know your data structure! A set is still a set… Datapoint score: 1357542004000. Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
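
Members of a Redis sorted set are unique: adding a member that already exists updates its score instead of creating a second entry. The datapoint on the slides carries its timestamp inside the value, which keeps two otherwise identical readings distinct. A minimal sketch of writing such a datapoint follows, assuming redis-py 3.0 or later, where `zadd` takes a `{member: score}` mapping. The helper names `datapoint_key` and `store_datapoint` are hypothetical, and the key layout follows the `sensor:env:101` example.

```python
import json


def datapoint_key(node_type, node_id):
    """Build a key such as 'sensor:env:101' (layout taken from the slides)."""
    return f"sensor:{node_type}:{node_id}"


def store_datapoint(client, node_id, datapoint):
    """Add one datapoint to the node's sorted set, scored by its timestamp.

    `client` is assumed to be a redis.Redis instance (redis-py >= 3.0).
    """
    key = datapoint_key(datapoint["node_type"], node_id)
    # The timestamp is part of the serialized member, so readings with the
    # same sensor values at different times remain separate set members.
    member = json.dumps(datapoint, sort_keys=True)
    client.zadd(key, {member: datapoint["timestamp"]})
    return key, member
```

With a running server this would be called as `store_datapoint(redis.Redis(), "101", datapoint)`. Had the value held only `{"temp": 523}`, the second reading at 523 would have replaced the first rather than being stored alongside it.
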
  40. 40. Requirement Satisfied [diagram: Gateway, Redis]
  41. 41. There is a disturbance in the Force..
  42. 42. Collecting Data [diagram: Gateway, Redis]
  43. 43. “In Memory” Means Many Things• The data store capacity is aggressively capped – Redis can only store as much data as the server has RAM
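
A rough estimate shows why the RAM cap matters. The sketch below uses the 8 GB / hour figure quoted on the "Collecting Data" slide; the 64 GB server size is purely an assumption for illustration, and the calculation ignores Redis's own memory overhead.

```python
INGEST_GB_PER_HOUR = 8  # figure quoted on the "Collecting Data" slide


def hours_until_full(ram_gb, ingest_gb_per_hour=INGEST_GB_PER_HOUR):
    """Hours of continuous ingest before the given amount of RAM is exhausted."""
    return ram_gb / ingest_gb_per_hour


if __name__ == "__main__":
    # Hypothetical server with 64 GB of RAM available to Redis.
    print(hours_until_full(64))  # 8.0
```
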
  44. 44. Collecting Big Data [diagram: Gateway, Redis]
  45. 45. We could throw away data… • If we only cared about current values • However, our data – Must be stored for 1+ years for compliance – Must be able to be queried for historical/trend analysis
  46. 46. We Still Need Long-term Data Storage. Solution? Migrate data to an archive with expansive storage capacity
  47. 47. Winning [diagram: Gateway, Redis, Archiver, PostgreSQL]
  48. 48. Winning? [diagram: Gateway, Redis, Archiver, PostgreSQL, Some Poor Client (?)]
  49. 49. Yes, Winning [diagram: Gateway, Redis, Archiver, PostgreSQL, API, Some Happy Client]
  50. 50. Best of both worlds: Redis allows quick access to real-time data, for monitoring and event detection. PostgreSQL allows complex queries and scalable storage for deep and historical analysis. [diagram: Gateway, Redis, Archiver, PostgreSQL, API]
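
The slides do not show how the Archiver works, so the following is only one plausible shape for it, assuming redis-py's `zrangebyscore` and `zremrangebyscore`. The function `archive_older_than` and the `write_rows` callback (which would wrap a PostgreSQL insert) are hypothetical names introduced for illustration.

```python
def archive_older_than(client, key, cutoff_ms, write_rows):
    """Move datapoints with timestamp <= cutoff_ms from Redis to the archive.

    `client` is assumed to be a redis.Redis instance; `write_rows` is a
    caller-supplied function that persists (key, timestamp, value) rows.
    Returns the number of datapoints archived.
    """
    entries = client.zrangebyscore(key, "-inf", cutoff_ms, withscores=True)
    if not entries:
        return 0
    rows = [(key, int(score), member) for member, score in entries]
    write_rows(rows)  # persist to the archive first...
    client.zremrangebyscore(key, "-inf", cutoff_ms)  # ...then free the memory
    return len(rows)
```

Writing to the archive before deleting means a failed archive write leaves the data in Redis. One caveat of this sketch: a late datapoint older than the cutoff that arrives between the read and the delete would be removed without being archived, so a production archiver would need to guard against that.
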
  53. 53. We Have the Data, Now What? Incoming data must be monitored and analyzed, to detect significant events What is “significant”? What about new data types?
  54. 54. New guy: Django App (with App DB). Provides a way to read the data and create rules, e.g. motion > x && pressure < y && audio > z. [diagram: Gateway, Redis, Archiver, PostgreSQL, API, Django App, App DB]
  55. 55. New guy: Event Monitor. Reads the rules and data, triggers alarms. All true? motion > x, pressure < y, audio > z. [diagram: Gateway, Redis, Archiver, PostgreSQL, API, Event Monitors, Django App, App DB]
  56. 56. Event monitor services can be scaled independently. [diagram: Gateway, Redis, Archiver, PostgreSQL, API, Event Monitors, Django App, App DB]
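
A rule of the form `motion > x && pressure < y && audio > z` can be evaluated against a datapoint in a few lines. The rule representation below (a list of field, operator, threshold triples) is an assumption made for illustration, not the format used by the project; the thresholds are invented, and `audio_p2p` is used because that is the audio field name in the example datapoint.

```python
import operator

# Comparison operators a rule condition may use.
OPERATORS = {">": operator.gt, "<": operator.lt}


def rule_matches(rule, datapoint):
    """True only if every condition holds, i.e. conditions are ANDed together."""
    return all(
        field in datapoint and OPERATORS[op](datapoint[field], threshold)
        for field, op, threshold in rule
    )


# Hypothetical thresholds standing in for x, y and z.
rule = [("motion", ">", 200), ("pressure", "<", 100000), ("audio_p2p", ">", 400)]

# Sensor values taken from the example datapoint on the data model slides.
datapoint = {"motion": 203, "pressure": 99007, "audio_p2p": 460}
print(rule_matches(rule, datapoint))  # True
```

Because each evaluation depends only on the rule and the datapoint, several event monitor processes can run the same check on different nodes' data, which is consistent with scaling them independently.
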
  60. 60. Getting The Message Out. Considerations: • The event monitor already has a job; avoid re-tasking it as a notification engine • Notifications should be “pushed” rather than polled for • The notification system should be generalized, e.g. SMTP, SMS
  61. 61. If only…
  62. 62. Pub/Sub with synchronized workers is an optimal solution to real-time event notifications. No need to add another system; Redis offers pub/sub services as well! [diagram: Gateway, Redis Data, Redis Pub/Sub, Archiver, PostgreSQL, API, Event Monitors, Notification Workers, SMTP, Django App, App DB]
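
The publish and subscribe sides can be sketched with redis-py's `publish`, `pubsub().subscribe` and `listen` calls. The channel name `events`, the JSON event payload, and the helper names are assumptions for illustration; the talk does not specify them.

```python
import json

EVENT_CHANNEL = "events"  # hypothetical channel name


def publish_event(client, event, channel=EVENT_CHANNEL):
    """Event monitor side: push an event to all subscribed workers.

    `client` is assumed to be a redis.Redis instance. Returns the number of
    subscribers that received the message.
    """
    return client.publish(channel, json.dumps(event))


def run_worker(pubsub, notify, channel=EVENT_CHANNEL):
    """Notification worker side: block on the channel and forward each event.

    `pubsub` is assumed to come from redis.Redis().pubsub(); `notify` is a
    caller-supplied function that sends the notification (e.g. via SMTP).
    """
    pubsub.subscribe(channel)
    for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscribe confirmations
        notify(json.loads(message["data"]))
```

Every subscriber on a channel receives every message, so running several identical workers would send duplicate notifications unless they coordinate; that may be what "synchronized workers" refers to, though the slides do not elaborate.
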
  63. 63. Conclusions • Redis is a powerful tool for collecting large amounts of data in real-time • In addition to maintaining a rapid pace of data insertion, we were able to concurrently query, monitor, and detect events on our Redis data collection system • Bonus: Redis also enabled a robust, scalable real-time notification system using pub/sub
  64. 64. Things to watch • Data persistence – If Redis needs to restart, it takes 10-20 seconds per gigabyte to re-load all data into memory [1] – Redis is unresponsive during startup. [1] http://oldblog.antirez.com/post/redis-persistence-demystified.html
  65. 65. Future Work • Improve scalability through: – Data encoding – Data compression – Parallel batch inserts for all nodes on a gateway • Deep historical data analytics
  66. 66. Acknowledgements • Project engineers Chris Taschner and Jeff Hamed @ CMU SEI • Prof. Anthony Rowe & CMU ECE WiSE Lab http://wise.ece.cmu.edu/ • Our organizations: CMU https://www.cmu.edu, CERT http://www.cert.org, SEI http://www.sei.cmu.edu, Cylab https://www.cylab.cmu.edu
  68. 68. Thank You. Questions?
  69. 69. Slides of Live Redis Demo
  70. 70. A Closer Look at Redis Data
          redis> keys *
          1) "sensor:environment:f80"
          2) "sensor:environment:f81"
          3) "sensor:environment:f82"
          4) "sensor:environment:f83"
          5) "sensor:environment:f84"
          6) "sensor:power:f85"
          7) "sensor:power:f86"
          8) "sensor:radiation:f87"
          9) "sensor:particulate:f88"
  71. 71. A Closer Look at Redis Data
          redis> keys sensor:power:*
          1) "sensor:power:f85"
          2) "sensor:power:f86"
  72. 72. A Closer Look at Redis Data
          redis> zcount sensor:power:f85 -inf +inf
          (integer) 3565958
          (45.38s)
  73. 73. A Closer Look at Redis Data
          redis> zcount sensor:power:f85 1359728113000 +inf
          (integer) 47
  74. 74. A Closer Look at Redis Data
          redis> zrange sensor:power:f85 -1000 -1
          1) "{"long_energy1": 73692453, "total_secs": 6784, "energy": [49, 175, 62, 0, 0, 0], "c2_center": 485, "socket_state": 1, "node_type": "power", "c_p2p_low2": 437, "socket_state1": 0, "mac_address": "103", "c_p2p_low": 494, "rms_current": 6, "true_power": 1158, "timestamp": 1359728143000, "v_p2p_low": 170, "c_p2p_high": 511, "rms_current1": 113, "freq": 60, "long_energy": 4108081, "v_center": 530, "c_p2p_high2": 719, "energy1": [37, 117, 100, 4, 0, 0], "v_p2p_high": 883, "c_center": 509, "rms_voltage": 255, "true_power1": 23235}"
          2) …
  75. 75. Redis Python API

          import redis

          # Connect through a shared connection pool
          pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=0)
          r = redis.Redis(connection_pool=pool)

          # Last 50 datapoints, selected by index
          byindex = r.zrange("sensor:env:f85", -50, -1)
          # [{"acc_z":663,"bat":0,"gpio_state":1,"temp":663,"light":…

          # Datapoints in a one-second window, selected by score (timestamp)
          byscore = r.zrangebyscore("sensor:env:f85", 1361423071000, 1361423072000)
          # [{"acc_z":734,"bat":0,"gpio_state":1,"temp":734,"light":…

          # Total number of datapoints stored for this node
          size = r.zcount("sensor:env:f85", "-inf", "+inf")
          # 237327L
