High-Volume Data Collection and Real Time Analytics Using Redis

In this talk, we describe using Redis, an open source, in-memory key-value store, to capture large volumes of data from numerous remote sources while also allowing real-time monitoring and analytics. With this approach, we captured a continuous, high-volume stream of data from numerous remote environmental sensors while continually querying our database for real-time monitoring and analytics.

* See more of my work at http://www.codehenge.net

  • START AARON
  • Welcome. <Introductions, who we are, where we’re from>
  • AARON LAST SLIDE. Let’s start with some background. We’ve been working with a CMU research group on applications of a research project called Sensor Andrew. The vision of Sensor Andrew is to provide a generalized environmental sensor network, capable of being leveraged for a wide variety of applications, both academic and commercial.
  • START TIM
  • A Sensor Andrew system consists primarily of nodes, like this <hold one up if possible>, each of which contains a variety of embedded sensors, and a gateway with a specialized receiver, allowing it to receive wireless messages from each of up to 64 nodes concurrently. Our collaborators have provided hardware design and gear, firmware on all embedded components, and some baseline software to work from when interfacing with the hardware systems.
  • Let’s look in more detail at the type of data we are collecting. We currently have two types of nodes, environmental and power nodes <show samples>. Environmental nodes can be placed anywhere, and measure light, audio, humidity, pressure, motion, temperature, and acceleration (in x, y, z components) in the environment immediately surrounding the node. Power nodes must be plugged into a wall outlet, with a current-drawing device drawing its power through the node. This allows the power node to measure and transmit data on current, voltage, power, etc. Data is transmitted from the nodes as UDP packets. For reference, an environmental data packet is ____ bytes in size, and a power data packet is ____ bytes.
  • Packets are sent over UDP and the information is stored as an encoded string, so the network load is already pretty small. Compression, in addition to the encoding of the data, might be an option in the future, and would be a small hurdle if we need it.
  • A terabyte per week isn’t tera-bly big, but it adds up when the data needs to stick around for a long time. Compression can ease the pain. Again, not expensive to implement if necessary.
  • This is an interesting part of the architecture. The nodes are pinging only once per second, and even at the gateway and collector stage, we’re actually limited to 64 pings per second. This pushes the point of convergence to..
  • .. storage. We need fast writing, but we also need fast reading.
  • and we also need fast reading, simultaneously.
  • 120 loaded gateways x 64 nodes = 7,680 nodes. At 1 record/sec per node, that is about 27.6 million records/hour. At roughly 300 bytes/record, that is about 8 GB/hour, or roughly 190 GB/day.
  • There are two primary I/O bottlenecks in all network applications: 1) Network I/O and 2) Filesystem I/O. In general, we will have no control over the network infrastructures of deployment sites, so we really can’t do anything about Network I/O. That leaves Filesystem I/O.
  • The best way to mitigate the Filesystem I/O bottleneck is to avoid the filesystem altogether.
  • TIM LAST SLIDE
  • START AARON
  • AARON LAST SLIDE. We originally tried separating out each data value into a separate key (you can talk more about this on the next slide, when you have the example datapoint in front of you). This allowed extremely efficient querying, as we could query ‘motion’ data independently from ‘audio’ data. However, the overhead was significant in two respects: (1) We had to store metadata (timestamp, node type, node MAC address, etc.) with each record, so there was a lot more data duplication and space inefficiency. (2) The number of inserts per second skyrocketed, e.g. 7x the inserts per second for environmental nodes.
  • START TIM
  • If two data packets had exactly the same environmental values but different scores, Redis would update the existing set member with the new score instead of creating a new set member. To keep every member unique, the timestamp is also stored inside the value, even though it is already the score. This leads to some data duplication, which adds up over millions of records.
  • TIM LAST SLIDE
  • START AARON
  • Which can make Redis burst at the seams…
  • AARON LAST SLIDE
  • START TIM
  • AARON
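The throughput figures quoted in the notes (120 gateways, 64 nodes each, one record per node per second) can be checked with a quick calculation. This is only a sketch: the record size of about 300 bytes is an assumption based on the size of the JSON datapoints shown in the slides, and the variable names are illustrative.

```python
# Back-of-the-envelope data volume, using the figures quoted in the notes.
GATEWAYS = 120                 # fully loaded gateways
NODES_PER_GATEWAY = 64         # receiver limit per gateway
RECORDS_PER_NODE_PER_SEC = 1   # each node reports once per second
BYTES_PER_RECORD = 300         # assumption: roughly the size of one JSON datapoint

nodes = GATEWAYS * NODES_PER_GATEWAY
records_per_hour = nodes * RECORDS_PER_NODE_PER_SEC * 3600
bytes_per_hour = records_per_hour * BYTES_PER_RECORD
bytes_per_day = bytes_per_hour * 24

GIB = 2 ** 30  # binary gigabyte
print(f"nodes:        {nodes:,}")
print(f"records/hour: {records_per_hour:,}")
print(f"GiB/hour:     {bytes_per_hour / GIB:.1f}")
print(f"GiB/day:      {bytes_per_day / GIB:.0f}")
```

Changing BYTES_PER_RECORD or the gateway count shows how quickly the daily volume grows, which is why long-term retention becomes the harder problem.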
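The sorted-set behaviour described in the notes (adding a member that already exists only updates its score) can be illustrated without a server. The class below is a toy stand-in written for this example, not Redis itself; it only mimics the rule that members of a sorted set are unique. The reading values are borrowed from the example datapoint in the slides.

```python
import json


class TinySortedSet:
    """Toy model of a sorted set: each member is unique and has one score."""

    def __init__(self):
        self._scores = {}  # member -> score

    def zadd(self, member, score):
        """Return 1 if a new member was added, 0 if an existing score was updated."""
        is_new = member not in self._scores
        self._scores[member] = score
        return 1 if is_new else 0

    def zcard(self):
        """Number of members currently stored."""
        return len(self._scores)


reading = {"temp": 523, "humidity": 22}
timestamps = (1357542004000, 1357542005000)

# Without the timestamp in the value, two identical readings collapse into one member.
without_ts = TinySortedSet()
for ts in timestamps:
    without_ts.zadd(json.dumps(reading, sort_keys=True), ts)

# With the timestamp inside the value, every member is distinct.
with_ts = TinySortedSet()
for ts in timestamps:
    with_ts.zadd(json.dumps({**reading, "timestamp": ts}, sort_keys=True), ts)

print(without_ts.zcard(), with_ts.zcard())
```

The cost of the second variant is that the timestamp is stored twice per record, once as the score and once inside the value.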
  • High-Volume Data Collection and Real Time Analytics Using Redis

    1. Large-Scale Data Collection Using Redis • C. Aaron Cois, Ph.D. -- Tim Palko • CMU Software Engineering Institute • © 2011 Carnegie Mellon University
    2. Us • C. Aaron Cois, Ph.D., Software Architect, Team Lead, CMU Software Engineering Institute, Digital Intelligence and Investigations Directorate, @aaroncois • Tim Palko, Senior Software Engineer, CMU Software Engineering Institute, Digital Intelligence and Investigations Directorate • © 2011 Carnegie Mellon University
    3. Overview • Problem Statement • Sensor Hardware & System Requirements • System Overview – Data Collection – Data Modeling – Data Access – Event Monitoring and Notification • Conclusions and Future Work
    4. The Goal: Critical infrastructure/facility protection via Environmental Monitoring
    5. Why? Stuxnet • Two major components: 1) Send centrifuges spinning wildly out of control 2) Record ‘normal operations’ and play them back to operators during the attack [1] • Environmental monitoring provides secondary indicators, such as abnormal heat/motion/sound • [1] http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&
    6. The Broader Vision: Quick, flexible out-of-band monitoring • Set up monitoring in minutes • Versatile sensors, easily repurposed • Data communication is secure (P2P VPN) and requires no existing systems other than outbound networking
    7. The Platform: A CMU research project called Sensor Andrew • Features: – Open-source sensor platform – Scalable and generalist system supporting a wide variety of applications – Extensible architecture • Can integrate diverse sensor types
    8. Sensor Andrew
    9. Sensor Andrew Overview (diagram labels: End Users, Nodes, Gateway, Server, Gateway)
    10. What is a Node? A node collects data and sends it to a collector, or gateway • Environment Node sensors: Light, Audio, Humidity, Pressure, Motion, Temperature, Acceleration • Power Node sensors: Current, Voltage, True Power, Energy • Radiation Node sensors: Alpha particle count per minute • Particulate Node sensors: Small Part. Count, Large Part. Count
    11. What is a Gateway? • A gateway receives UDP data from all nodes registered to it • An internal service: – Receives data continuously – Opens a server on a specified port – Continually transmits UDP data over this port
    12. Requirements: We need to.. 1. Collect data from nodes once per second 2. Scale to 100 gateways each with 64 nodes 3. Detect events in real-time 4. Notify users about events in real-time 5. Retain all data collected for years, at least
    13. What Is Big Data?
    14. What Is Big Data? “When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share it.”
    15. Problems: Size, Transmission, Rate, Storage
    16. Problems: Size, Transmission, Rate, Storage
    17. Problems: Size, Transmission, Rate, Storage
    18. Problems: Size, Transmission, Rate, Storage
    19. Problems: Size, Transmission, Rate, Storage
    20. Problems: Size, Transmission, Rate, Storage, Retrieval
    21. Collecting Data • Problem: Store and retrieve immense amounts of data at a high rate. • Constraints: Data cannot remain on the nodes or gateways due to security concerns. Limited infrastructure. (diagram: Gateway, 8 GB / hour, ?)
    22. We Tried PostgreSQL… • Advantages: – Reliable, tested and scalable – Relational => complex queries => analytics • Problems: – Performance problems reading while writing at a high rate; real-time event detection suffers – ‘COPY FROM’ doesn’t permit horizontal scaling
    23. Q: How can we decrease I/O load?
    24. Q: How can we decrease I/O load? A: Read and write collected data directly from memory
    25. Enter Redis: Redis is an in-memory NoSQL database. Commonly used as a web application cache or pub/sub server
    26. Redis • Created in 2009 • Fully in-memory key-value store – Fast I/O: R/W operations are equally fast – Advanced data structures • Publish/Subscribe Functionality – In addition to data store functions – Separate from stored key-value data
    27. Persistence • Snapshotting – Data is asynchronously transferred from memory to disk • AOF (Append Only File) – Each modifying operation is written to a file – Can recreate data store by replaying operations – Without interrupting service, will rebuild AOF as the shortest sequence of commands needed to rebuild the current dataset in memory
    28. Replication • Redis supports master-slave replication • Master-slave replication can be chained • Be careful: – Slaves are writeable! – Potential for data inconsistency • Fully compatible with Pub/Sub features
    29. Redis Features: Advanced Data Structures • List: [A, B, C, D] • Set: {A, B, C, D} • Sorted Set {value:score}: {C:1, D:2, A:3, B:4} • Hash {key:value}: {field1:"A", field2:"B"…}
    30. Our Data Model
    31. Constraints: Our data store must: – Hold time-series data – Be flexible in querying (by time, node, sensor) – Allow efficient querying of many records – Accept data out of order
    32. Tradeoffs: Efficiency vs. Flexibility • One record per timestamp VS One record per sensor data type • Sensor data types shown: Motion, Light, Audio, Temperature, Pressure, Humidity, Acceleration
    33. Our Solution: Sorted Set • Datapoint sensor:env:101 • Score: 1357542004000 • Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
    34. Our Solution: Sorted Set • Datapoint sensor:env:101 • Score: 1357542004000 • Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
    35. Our Solution: Sorted Set • Datapoint sensor:env:101 • Score: 1357542004000 • Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
    36. Our Solution: Sorted Set • Datapoint sensor:env:101 • Score: 1357542004000 • Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
    37. Sorted Set • 1357542004000: {"temp":523,..} • 1357542005000: {"temp":523,..} • 1357542007000: {"temp":530,..} • 1357542008000: {"temp":531,..} • 1357542009000: {"temp":540,..} • 1357542001000: {"temp":545,..} • …
    38. Sorted Set • 1357542004000: {"temp":523,..} • 1357542005000: {"temp":523,..} • 1357542006000: {"temp":527,..} <- fits nicely • 1357542007000: {"temp":530,..} • 1357542008000: {"temp":531,..} • 1357542009000: {"temp":540,..} • 1357542001000: {"temp":545,..} • …
    39. Know your data structure! A set is still a set… • Datapoint • Score: 1357542004000 • Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
    40. Requirement Satisfied (diagram: Gateway, Redis)
    41. There is a disturbance in the Force..
    42. Collecting Data (diagram: Gateway, Redis)
    43. “In Memory” Means Many Things • The data store capacity is aggressively capped – Redis can only store as much data as the server has RAM
    44. Collecting Big Data (diagram: Gateway, Redis)
    45. We could throw away data… • If we only cared about current values • However, our data – Must be stored for 1+ years for compliance – Must be able to be queried for historical/trend analysis
    46. We Still Need Long-term Data Storage • Solution? Migrate data to an archive with expansive storage capacity
    47. Winning (diagram: Gateway, Redis, Archiver, PostgreSQL)
    48. Winning? (diagram: Gateway, Redis, Archiver, PostgreSQL; Some Poor Client, with question marks)
    49. Yes, Winning (diagram: Gateway, Redis, Archiver, PostgreSQL, API, Some Happy Client)
    50. Best of both worlds • Redis allows quick access to real-time data, for monitoring and event detection • PostgreSQL allows complex queries and scalable storage for deep and historical analysis (diagram: Gateway, Redis, Archiver, PostgreSQL, API)
    51. We Have the Data, Now What? Incoming data must be monitored and analyzed, to detect significant events
    52. We Have the Data, Now What? Incoming data must be monitored and analyzed, to detect significant events • What is “significant”?
    53. We Have the Data, Now What? Incoming data must be monitored and analyzed, to detect significant events • What is “significant”? • What about new data types?
    54. New guy: provide a way to read the data and create rules, e.g. motion > x && pressure < y && audio > z (diagram: Gateway, Redis, Archiver, PostgreSQL, API, App DB, Django App)
    55. New guy: Event Monitor: read the rules and data, trigger alarms • All true? motion > x, pressure < y, audio > z (diagram: Gateway, Redis, Archiver, PostgreSQL, API, Event Monitor, Event Monitor, App DB, Django App)
    56. Event monitor services can be scaled independently (diagram: Gateway, Redis, Archiver, PostgreSQL, API, Event Monitor, Event Monitor, App DB, Django App)
    57. Getting The Message Out
    58. Getting The Message Out: Considerations • Event monitor already has a job, avoid re-tasking as a notification engine
    59. Getting The Message Out: Considerations • Event monitor already has a job, avoid re-tasking as a notification engine • Notifications most efficiently should be a “push” instead of needing to poll
    60. Getting The Message Out: Considerations • Event monitor already has a job, avoid re-tasking as a notification engine • Notifications most efficiently should be a “push” instead of needing to poll • Notification system should be generalized, e.g. SMTP, SMS
    61. If only…
    62. Pub/Sub with synchronized workers is an optimal solution to real-time event notifications. No need to add another system, Redis offers pub/sub services as well! (diagram: Gateway, Redis Data, Redis Pub/Sub, Archiver, PostgreSQL, API, Event Monitor, Event Monitor, Worker, Worker, Notification Worker, SMTP, App DB, Django App)
    63. Conclusions • Redis is a powerful tool for collecting large amounts of data in real-time • In addition to maintaining a rapid pace of data insertion, we were able to concurrently query, monitor, and detect events on our Redis data collection system • Bonus: Redis also enabled a robust, scalable real-time notification system using pub/sub
    64. Things to watch • Data persistence – if Redis needs to restart, it takes 10-20 seconds per gigabyte to re-load all data into memory [1] – Redis is unresponsive during startup • [1] http://oldblog.antirez.com/post/redis-persistence-demystified.html
    65. Future Work • Improve scalability through: – Data encoding – Data compression – Parallel batch inserts for all nodes on a gateway • Deep historical data analytics
    66. Acknowledgements • Project engineers Chris Taschner and Jeff Hamed @ CMU SEI • Prof. Anthony Rowe & CMU ECE WiSE Lab http://wise.ece.cmu.edu/ • Our organizations: CMU https://www.cmu.edu, CERT http://www.cert.org, SEI http://www.sei.cmu.edu, Cylab https://www.cylab.cmu.edu
    67. Thank You
    68. Thank You • Questions?
    69. Slides of Live Redis Demo
    70. A Closer Look at Redis Data
        redis> keys *
        1) "sensor:environment:f80"
        2) "sensor:environment:f81"
        3) "sensor:environment:f82"
        4) "sensor:environment:f83"
        5) "sensor:environment:f84"
        6) "sensor:power:f85"
        7) "sensor:power:f86"
        8) "sensor:radiation:f87"
        9) "sensor:particulate:f88"
    71. A Closer Look at Redis Data
        redis> keys sensor:power:*
        1) "sensor:power:f85"
        2) "sensor:power:f86"
    72. A Closer Look at Redis Data
        redis> zcount sensor:power:f85 -inf +inf
        (integer) 3565958
        (45.38s)
    73. A Closer Look at Redis Data
        redis> zcount sensor:power:f85 1359728113000 +inf
        (integer) 47
    74. A Closer Look at Redis Data
        redis> zrange sensor:power:f85 -1000 -1
        1) "{"long_energy1": 73692453, "total_secs": 6784, "energy": [49, 175, 62, 0, 0, 0], "c2_center": 485, "socket_state": 1, "node_type": "power", "c_p2p_low2": 437, "socket_state1": 0, "mac_address": "103", "c_p2p_low": 494, "rms_current": 6, "true_power": 1158, "timestamp": 1359728143000, "v_p2p_low": 170, "c_p2p_high": 511, "rms_current1": 113, "freq": 60, "long_energy": 4108081, "v_center": 530, "c_p2p_high2": 719, "energy1": [37, 117, 100, 4, 0, 0], "v_p2p_high": 883, "c_center": 509, "rms_voltage": 255, "true_power1": 23235}"
        2) …
    75. Redis Python API
        import redis
        pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=0)
        r = redis.Redis(connection_pool=pool)
        byindex = r.zrange("sensor:env:f85", -50, -1)
        # [{"acc_z":663,"bat":0,"gpio_state":1,"temp":663,"light":…
        byscore = r.zrangebyscore("sensor:env:f85", 1361423071000, 1361423072000)
        # [{"acc_z":734,"bat":0,"gpio_state":1,"temp":734,"light":…
        size = r.zcount("sensor:env:f85", "-inf", "+inf")
        # 237327L
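The sorted-set data model from slides 33 to 38 (one sorted set per node, timestamp as score, JSON datapoint as value) could be wrapped in two small helpers. This is a minimal sketch, not the authors' code: the function names are made up for illustration, and `client` is assumed to be a redis-py style object exposing `zadd(name, mapping)` and `zrangebyscore(name, min, max)`.

```python
import json


def sensor_key(node_type, mac_address):
    """Build a per-node key such as 'sensor:env:20f'."""
    return f"sensor:{node_type}:{mac_address}"


def store_datapoint(client, packet):
    """Add one decoded packet to its node's sorted set, scored by timestamp.

    The timestamp stays inside the JSON value so that two identical
    readings taken at different times remain distinct members.
    """
    key = sensor_key(packet["node_type"], packet["mac_address"])
    member = json.dumps(packet, sort_keys=True)
    client.zadd(key, {member: packet["timestamp"]})
    return key


def query_range(client, key, start_ms, end_ms):
    """Return decoded datapoints whose score lies in [start_ms, end_ms]."""
    raw = client.zrangebyscore(key, start_ms, end_ms)
    return [json.loads(item) for item in raw]
```

Because the set is ordered by score, a packet that arrives late still lands in the right position, which is the "fits nicely" case on slide 38.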
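Slides 46 to 50 describe an archiver that moves data out of Redis into long-term storage, but the deck does not show how it works. The following is a hypothetical sketch of one way such a step might look: `write_rows` is an assumed callable that persists a batch to the archive (for example a PostgreSQL insert) and raises on failure, and `client` is assumed to expose redis-py style `zrangebyscore` and `zrem` methods.

```python
def archive_older_than(client, key, cutoff_ms, write_rows):
    """Move datapoints with score <= cutoff_ms from a sorted set to the archive.

    Returns the number of datapoints archived.
    """
    old = client.zrangebyscore(key, "-inf", cutoff_ms)
    if not old:
        return 0
    # Archive first: if the write fails, nothing has been removed from Redis.
    write_rows(old)
    # Remove exactly the members that were archived, so a late-arriving
    # datapoint added in the meantime is not deleted by accident.
    client.zrem(key, *old)
    return len(old)
```

For very large backlogs the read and the removal would likely need to be done in smaller batches, since a single call here handles everything at or below the cutoff at once.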
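The rule check on slides 54 and 55 (motion > x && pressure < y && audio > z, alarm only if all are true) can be sketched as a small evaluator. The rule representation below, a list of (field, operator, threshold) tuples, is an assumption made for illustration; the deck does not say how rules are stored in the Django app.

```python
import operator

# Supported comparison operators for a rule condition.
OPS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
}


def rule_matches(conditions, datapoint):
    """Return True only if every (field, op, threshold) condition holds.

    A datapoint that lacks one of the fields never matches.
    """
    for field, op, threshold in conditions:
        if field not in datapoint:
            return False
        if not OPS[op](datapoint[field], threshold):
            return False
    return True


# Hypothetical thresholds, checked against values from the example datapoint.
rule = [("motion", ">", 200), ("pressure", "<", 100000), ("audio_p2p", ">", 400)]
sample = {"motion": 203, "pressure": 99007, "audio_p2p": 460}
print(rule_matches(rule, sample))
```

Keeping rules as plain data is what lets several event monitor processes evaluate the same rule set independently, as slide 56 suggests.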
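The pub/sub notification path on slide 62 (event monitors publish, workers push notifications out) might look roughly like this. It is a sketch under stated assumptions: the channel name and message layout are invented for the example, `client` is assumed to expose a redis-py style `publish(channel, message)`, and `send` stands in for whatever delivery helper (SMTP, SMS) a worker uses.

```python
import json

EVENT_CHANNEL = "events"  # hypothetical channel name


def publish_event(client, rule_name, datapoint):
    """Event monitor side: publish a JSON event and return the payload sent."""
    payload = json.dumps({"rule": rule_name, "datapoint": datapoint}, sort_keys=True)
    client.publish(EVENT_CHANNEL, payload)
    return payload


def handle_message(message, send):
    """Worker side: forward one pub/sub message through `send`.

    Subscription confirmations and other non-'message' entries are skipped.
    Returns True if a notification was sent.
    """
    if message.get("type") != "message":
        return False
    event = json.loads(message["data"])
    node = event["datapoint"].get("mac_address")
    send(f"Rule '{event['rule']}' fired for node {node}")
    return True
```

With redis-py, a worker would typically create `pubsub = client.pubsub()`, call `pubsub.subscribe(EVENT_CHANNEL)`, and pass each item from `pubsub.listen()` to `handle_message`. Since published messages are not stored, a worker that is offline misses events sent in the meantime.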
