High-Volume Data Collection and Real Time Analytics Using Redis
 


In this talk, we describe using Redis, an open source, in-memory key-value store, to capture large volumes of data from numerous remote sources while also allowing real-time monitoring and analytics. With this approach, we were able to capture a high volume of continuous data from numerous remote environmental sensors while consistently querying our database for real time monitoring and analytics.

* See more of my work at http://www.codehenge.net


Usage Rights

© All Rights Reserved


  • Great presentation, thank you! Re: slide #29 ('Redis Features Advanced Data Structures'):
    1. I'm not sure that depicting the sorted set as a directed graph is accurate.
    2. While correct, showing the implementation of the hash (i.e. with the hash table in the middle) is a little confusing; I believe a simpler representation would have sufficed.
  • START AARON
  • Welcome.
  • AARON LAST SLIDE. Let's start with some background. We've been working with a CMU research group on applications of a research project called Sensor Andrew. The vision of Sensor Andrew is to provide a generalized environmental sensor network, capable of being leveraged for a wide variety of applications, both academic and commercial.
  • START TIM
  • A Sensor Andrew system consists primarily of nodes, like this, each of which contains a variety of embedded sensors, and a gateway with a specialized receiver, allowing it to receive wireless messages from each of up to 64 nodes concurrently. Our collaborators have provided hardware design and gear, firmware on all embedded components, and some baseline software to work from when interfacing with the hardware systems.
  • Let's look at some more detail on the type of data we are collecting. We currently have two types of nodes, environmental and power nodes. Environmental nodes can be set anywhere, and will detect measures of light, audio, humidity, pressure, motion, temperature, and acceleration (in x, y, z components) relative to the environment immediately surrounding the node. Power nodes must be plugged into a wall outlet, with a current-drawing device using it to draw power. This allows the power node to detect and transmit numerous measures of data involving current, voltage, power, etc. Data is transmitted from the nodes as UDP packets. For reference, an environmental data packet is ____ bytes in size, and a power data packet is ____ bytes.
  • Packets are UDP and the information is stored as an encoded string, so the network load is already pretty small.Compression, in addition to the encoding of the data, might be an option in the future, but that’s a small hurdle if we need it.
  • A terabyte per week isn’t tera-bly big, but it adds up when the data needs to stick around for a long time. Compression can ease the pain. Again, not expensive to implement if necessary.
  • This is an interesting part of the architecture. The nodes are pinging only once per second, and even at the gateway and collector stage, we're actually limited to 64 pings per second. This pushes the point of convergence to..
  • .. storage. We need fast writing, but we also need fast reading, simultaneously.
  • 120 loaded gateways = 7,680 nodes. 1 record/sec => 27.2 million records/hour. ~300 bytes/record => ~8 GB/hour, ~184 GB/day.
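  • The back-of-envelope arithmetic above can be sanity-checked in a few lines (the ~300 bytes/record figure is the assumption implied by the 8 GB/hour total; the talk's own numbers are rounded slightly differently):

```python
gateways = 120
nodes = gateways * 64             # 64 nodes per gateway -> 7,680 nodes
records_per_hour = nodes * 3600   # one record per node per second
bytes_per_record = 300            # assumption: ~300 bytes per encoded record

gb_per_hour = records_per_hour * bytes_per_record / 1e9
print(nodes, records_per_hour, round(gb_per_hour, 1), round(gb_per_hour * 24))
```

This lands at roughly 8 GB/hour and just under 200 GB/day, in line with the talk's figures.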
  • There are two primary I/O bottlenecks in all network applications: 1) Network I/O and 2) Filesystem I/O. In general, we will have no control over the network infrastructures of deployment sites, so we really can’t do anything about Network I/O. That leaves Filesystem I/O.
  • The best way to mitigate the Filesystem I/O bottleneck is to avoid the filesystem altogether.
  • TIM LAST SLIDE
  • START AARON
  • AARON LAST SLIDE. We originally tried separating out each data value into a separate key (you can talk more about this on the next slide, when you have the example datapoint in front of you). This allowed extremely efficient querying, as we could query 'motion' data independently from 'audio' data. However, the overhead was significant in two respects: we had to store metadata (timestamp, node type, node MAC address, etc.) with each record, so there was a lot more data duplication and space inefficiency; and the number of inserts per second skyrocketed, e.g. 7x more inserts per second for environmental nodes.
  • START TIM
  • If two data packets had exactly the same environmental values but a different score, Redis would update the existing set member with the new score instead of creating a new set member. Since a sorted set member must be unique, the earlier record is silently overwritten and lost, which adds up over millions of records.
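  • This overwrite behavior can be modeled with a plain dict, since a Redis sorted set keys on the member value just as a dict keys on its keys (ZADD of an existing member only updates its score; the sketch below is illustrative, not from the talk):

```python
# Model of a sorted set: member -> score (ZADD semantics)
zset = {}

def zadd(member, score):
    zset[member] = score  # existing member: score updated, no new member added

# Two packets with identical sensor values but different timestamps/scores
packet = '{"temp": 523, "humidity": 22, "motion": 203}'  # timestamp NOT in the value
zadd(packet, 1357542004000)
zadd(packet, 1357542005000)

print(len(zset))  # only one member survives; the first record is gone
```

Embedding the timestamp inside the value string makes every member unique, which avoids the collision.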
  • TIM LAST SLIDE
  • START AARON
  • Which can make Redis burst at the seams…
  • AARON LAST SLIDE
  • START TIM
  • AARON

High-Volume Data Collection and Real Time Analytics Using Redis: Presentation Transcript

  • Large-Scale Data Collection Using Redis. C. Aaron Cois, Ph.D. and Tim Palko, CMU Software Engineering Institute. © 2011 Carnegie Mellon University
  • Us: C. Aaron Cois, Ph.D., Software Architect, Team Lead (@aaroncois), and Tim Palko, Senior Software Engineer; both at the CMU Software Engineering Institute, Digital Intelligence and Investigations Directorate. © 2011 Carnegie Mellon University
  • Overview: Problem Statement; Sensor Hardware & System Requirements; System Overview (Data Collection, Data Modeling, Data Access, Event Monitoring and Notification); Conclusions and Future Work
  • The Goal: Critical infrastructure/facility protection via environmental monitoring
  • Why? Stuxnet had two major components: 1) send centrifuges spinning wildly out of control, and 2) record 'normal operations' and play them back to operators during the attack.¹ Environmental monitoring provides secondary indicators, such as abnormal heat/motion/sound. ¹ http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&
  • The Broader Vision: Quick, flexible out-of-band monitoring. Set up monitoring in minutes; versatile sensors, easily repurposed; data communication is secure (P2P VPN) and requires no existing systems other than outbound networking
  • The Platform: A CMU research project called Sensor Andrew. Features: open-source sensor platform; scalable and generalist system supporting a wide variety of applications; extensible architecture that can integrate diverse sensor types
  • Sensor Andrew
  • Sensor Andrew Overview (diagram: Nodes -> Gateway -> Gateway Server -> End Users)
  • What is a Node? A node collects data and sends it to a collector, or gateway. Environment Node sensors: light, audio, humidity, pressure, motion, temperature, acceleration. Power Node sensors: current, voltage, true power, energy. Radiation Node sensors: alpha particle count per minute. Particulate Node sensors: small particle count, large particle count.
  • What is a Gateway? A gateway receives UDP data from all nodes registered to it. An internal service receives data continuously, opens a server on a specified port, and continually transmits UDP data over this port.
  • Requirements. We need to: 1) collect data from nodes once per second; 2) scale to 100 gateways, each with 64 nodes; 3) detect events in real-time; 4) notify users about events in real-time; 5) retain all data collected for years, at least
  • What Is Big Data? "When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share it."
  • Problems: Size, Rate, Transmission, Storage, Retrieval
  • Collecting Data. Problem: store and retrieve immense amounts of data at a high rate. Constraints: data cannot remain on the nodes or gateways due to security concerns; limited infrastructure. (diagram: Gateway -> ?, at 8 GB/hour)
  • We Tried PostgreSQL… Advantages: reliable, tested and scalable; relational => complex queries => analytics. Problems: performance problems reading while writing at a high rate, so real-time event detection suffers; 'COPY FROM' doesn't permit horizontal scaling
  • Q: How can we decrease I/O load? A: Read and write collected data directly from memory
  • Enter Redis. Redis is an in-memory NoSQL database, commonly used as a web application cache or pub/sub server
  • Redis: created in 2009; fully in-memory key-value store (fast I/O: R/W operations are equally fast; advanced data structures); publish/subscribe functionality (in addition to data store functions, separate from stored key-value data)
  • Persistence. Snapshotting: data is asynchronously transferred from memory to disk. AOF (Append Only File): each modifying operation is written to a file, so the data store can be recreated by replaying operations; without interrupting service, Redis will rebuild the AOF as the shortest sequence of commands needed to rebuild the current dataset in memory
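  • Both persistence modes are switched on via directives in redis.conf (the directives are real; the specific thresholds below are illustrative, not from the talk):

```conf
# Snapshotting: dump to disk if at least 1000 keys changed within 60s
save 60 1000

# Append-only file: log every write, fsync at most once per second
appendonly yes
appendfsync everysec

# Rewrite the AOF once it doubles in size (and is at least 64mb)
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
```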
  • Replication: Redis supports master-slave replication, and master-slave replication can be chained. Be careful: slaves are writeable, so there is potential for data inconsistency. Replication is fully compatible with Pub/Sub features
  • Redis Features Advanced Data Structures. List: [A, B, C, D]. Set: {A, B, C, D}. Sorted Set ({value:score}): {C:1, D:2, A:3, B:4}. Hash ({key:value}): {field1:"A", field2:"B", field3:"C", field4:"D"}
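  • The semantics of the four structures can be sketched with Python builtins, purely for intuition (real code would use redis-py commands such as RPUSH, SADD, ZADD and HSET):

```python
# List: ordered, duplicates allowed (cf. RPUSH / LRANGE)
lst = []
for v in ["A", "B", "C", "D"]:
    lst.append(v)

# Set: unordered, unique members (cf. SADD / SMEMBERS)
st = set()
for v in ["A", "B", "B", "C", "D"]:
    st.add(v)  # the duplicate "B" is absorbed

# Sorted set: unique members ranked by a numeric score (cf. ZADD / ZRANGE)
zset = {"C": 1, "D": 2, "A": 3, "B": 4}
by_score = sorted(zset, key=zset.get)  # members in score order

# Hash: field -> value map stored under one key (cf. HSET / HGETALL)
hsh = {"field1": "A", "field2": "B", "field3": "C", "field4": "D"}

print(lst, sorted(st), by_score, hsh["field1"])
```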
  • Our Data Model
  • Constraints. Our data store must: hold time-series data; be flexible in querying (by time, node, sensor); allow efficient querying of many records; accept data out of order
  • Tradeoffs: Efficiency vs. Flexibility. One record per timestamp (motion, light, audio, temperature, pressure, humidity, acceleration stored together) vs. one record per sensor data type (each measure stored separately)
  • Our Solution: Sorted Set Datapoint. Key: sensor:env:101. Score: 1357542004000. Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
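  • The scheme above (one sorted set per sensor, score = millisecond timestamp, value = the JSON-encoded datapoint) can be sketched without a live server; `bisect.insort` on a (score, value) list mimics how the sorted set keeps members ordered by score, so out-of-order arrivals still land in the right place (names like `store_datapoint` are illustrative, not from the talk):

```python
import bisect
import json

store = {}  # key -> list of (score, json_value) pairs, kept sorted by score

def store_datapoint(key, datapoint):
    # Score is the millisecond timestamp; value is the encoded record
    score = datapoint["timestamp"]
    bisect.insort(store.setdefault(key, []),
                  (score, json.dumps(datapoint, sort_keys=True)))

def range_by_score(key, lo, hi):
    # Analogous to: ZRANGEBYSCORE key lo hi
    return [v for s, v in store.get(key, []) if lo <= s <= hi]

# An out-of-order insert (the 005 record arrives after 007) still fits in place
for ts, temp in [(1357542004000, 523), (1357542007000, 530), (1357542005000, 523)]:
    store_datapoint("sensor:env:101", {"timestamp": ts, "temp": temp, "node_type": "env"})

ts_list = [json.loads(v)["timestamp"]
           for v in range_by_score("sensor:env:101", 1357542004000, 1357542005000)]
print(ts_list)
```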
  • Sorted Set:
    1357542004000: {"temp":523,..}
    1357542005000: {"temp":523,..}
    1357542006000: {"temp":527,..} <- fits nicely
    1357542007000: {"temp":530,..}
    1357542008000: {"temp":531,..}
    1357542009000: {"temp":540,..}
    1357542001000: {"temp":545,..}
    …
  • Know your data structure! A set is still a set… Datapoint Score: 1357542004000. Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
  • Requirement Satisfied (diagram: Gateway -> Redis)
  • There is a disturbance in the Force..
  • Collecting Data (diagram: Gateway -> Redis)
  • "In Memory" Means Many Things. The data store capacity is aggressively capped: Redis can only store as much data as the server has RAM
  • Collecting Big Data (diagram: Gateway -> Redis)
  • We could throw away data… if we only cared about current values. However, our data must be stored for 1+ years for compliance, and must be able to be queried for historical/trend analysis
  • We Still Need Long-term Data Storage. Solution? Migrate data to an archive with expansive storage capacity
  • Winning (diagram: Gateway -> Redis -> Archiver -> PostgreSQL)
  • Winning? (diagram: Gateway -> Redis -> Archiver -> PostgreSQL, with Some Poor Client unsure how to query any of it)
  • Yes, Winning (diagram: Gateway -> Redis -> Archiver -> PostgreSQL, with an API serving Some Happy Client)
  • Best of both worlds (diagram: Gateway -> Redis -> Archiver -> PostgreSQL, behind an API). Redis allows quick access to real-time data, for monitoring and event detection. PostgreSQL allows complex queries and scalable storage for deep and historical analysis
  • We Have the Data, Now What? Incoming data must be monitored and analyzed, to detect significant events. What is "significant"? What about new data types?
  • (diagram: Gateway -> Redis -> Archiver -> PostgreSQL, plus a Django app with its own DB) New guy: provide a way to read the data and create rules, e.g. motion > x && pressure < y && audio > z
  • (diagram: an Event Monitor is added) New guy: read the rules and the data; if all clauses are true, trigger alarms
  • Event monitor services can be scaled independently (diagram: multiple Event Monitors)
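  • The check an event monitor performs per datapoint can be sketched as a pure function (the rule shape and names such as `matches` are illustrative; the talk only specifies conjunctions like motion > x && pressure < y && audio > z):

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def matches(datapoint, rule):
    # rule: list of (field, op, threshold) clauses, ANDed together
    return all(OPS[op](datapoint[field], threshold)
               for field, op, threshold in rule)

# Clauses mirror the slide: motion > x && pressure < y && audio > z
rule = [("motion", ">", 200), ("pressure", "<", 100000), ("audio_p2p", ">", 400)]
datapoint = {"motion": 203, "pressure": 99007, "audio_p2p": 460}

triggered = matches(datapoint, rule)
print(triggered)
```

In the architecture above, a monitor would poll recent scores out of Redis, run each stored rule against each datapoint, and trigger an alarm when `matches` is true.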
  • Getting The Message Out. Considerations: the event monitor already has a job, so avoid re-tasking it as a notification engine; notifications should most efficiently be a "push" instead of needing to poll; the notification system should be generalized, e.g. SMTP, SMS
  • If only…
  • Pub/Sub with synchronized workers is an optimal solution to real-time event notifications. No need to add another system: Redis offers pub/sub services as well! (diagram: Gateway -> Redis -> Archiver -> PostgreSQL; Event Monitors publish to Redis Pub/Sub; notification workers subscribe and send SMTP, etc.; Django app and DB alongside)
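  • A minimal notification-worker sketch using redis-py's pub/sub API (the channel name "alerts" and the event shape are assumptions, not from the talk; the wiring is kept inside `run_worker` since it needs a live Redis server and the redis package):

```python
import json

def format_notification(event):
    # Turn a published event into a human-readable alert line for SMTP/SMS
    return "ALERT {sensor}: {rule}".format(sensor=event["sensor"], rule=event["rule"])

def run_worker(channel="alerts"):
    import redis  # requires redis-py and a running Redis server
    r = redis.Redis(host="127.0.0.1", port=6379)
    p = r.pubsub()
    p.subscribe(channel)
    for msg in p.listen():  # blocks; each event monitor PUBLISHes here
        if msg["type"] == "message":
            event = json.loads(msg["data"])
            print(format_notification(event))  # hand off to an SMTP/SMS sender

line = format_notification({"sensor": "sensor:env:101", "rule": "motion > 200"})
print(line)
```

Because pub/sub is fire-and-forget, each subscribed worker receives every event; workers that must split the load would instead pop from a shared list (e.g. BLPOP on a queue key).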
  • Conclusions: Redis is a powerful tool for collecting large amounts of data in real-time. In addition to maintaining a rapid pace of data insertion, we were able to concurrently query, monitor, and detect events on our Redis data collection system. Bonus: Redis also enabled a robust, scalable real-time notification system using pub/sub
  • Things to watch: data persistence. If Redis needs to restart, it takes 10-20 seconds per gigabyte to re-load all data into memory,¹ and Redis is unresponsive during startup. ¹ http://oldblog.antirez.com/post/redis-persistence-demystified.html
  • Future Work: improve scalability through data encoding, data compression, and parallel batch inserts for all nodes on a gateway; deep historical data analytics
  • Acknowledgements: project engineers Chris Taschner and Jeff Hamed @ CMU SEI; Prof. Anthony Rowe & CMU ECE WiSE Lab http://wise.ece.cmu.edu/; our organizations: CMU https://www.cmu.edu, CERT http://www.cert.org, SEI http://www.sei.cmu.edu, CyLab https://www.cylab.cmu.edu
  • Thank You. Questions?
  • Slides of Live Redis Demo
  • A Closer Look at Redis Data
    redis> keys *
    1) "sensor:environment:f80"
    2) "sensor:environment:f81"
    3) "sensor:environment:f82"
    4) "sensor:environment:f83"
    5) "sensor:environment:f84"
    6) "sensor:power:f85"
    7) "sensor:power:f86"
    8) "sensor:radiation:f87"
    9) "sensor:particulate:f88"
  • A Closer Look at Redis Data
    redis> keys sensor:power:*
    1) "sensor:power:f85"
    2) "sensor:power:f86"
  • A Closer Look at Redis Data
    redis> zcount sensor:power:f85 -inf +inf
    (integer) 3565958
    (45.38s)
  • A Closer Look at Redis Data
    redis> zcount sensor:power:f85 1359728113000 +inf
    (integer) 47
  • A Closer Look at Redis Data
    redis> zrange sensor:power:f85 -1000 -1
    1) "{"long_energy1": 73692453, "total_secs": 6784, "energy": [49, 175, 62, 0, 0, 0], "c2_center": 485, "socket_state": 1, "node_type": "power", "c_p2p_low2": 437, "socket_state1": 0, "mac_address": "103", "c_p2p_low": 494, "rms_current": 6, "true_power": 1158, "timestamp": 1359728143000, "v_p2p_low": 170, "c_p2p_high": 511, "rms_current1": 113, "freq": 60, "long_energy": 4108081, "v_center": 530, "c_p2p_high2": 719, "energy1": [37, 117, 100, 4, 0, 0], "v_p2p_high": 883, "c_center": 509, "rms_voltage": 255, "true_power1": 23235}"
    2) …
  • Redis Python API
    import redis
    pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=0)
    r = redis.Redis(connection_pool=pool)
    byindex = r.zrange("sensor:env:f85", -50, -1)
    # [{"acc_z":663,"bat":0,"gpio_state":1,"temp":663,"light":…
    byscore = r.zrangebyscore("sensor:env:f85", 1361423071000, 1361423072000)
    # [{"acc_z":734,"bat":0,"gpio_state":1,"temp":734,"light":…
    size = r.zcount("sensor:env:f85", "-inf", "+inf")
    # 237327L