Time Series Data Storage in MongoDB

Skyline Innovations, a renewable energy company in Washington DC, uses MongoDB to store its time series data from its solar installations. This talk tells how, and why.

www.skylineinnovations.com

Given at MongoDC2011

Comments (3)

  • You have an array (polysun_proj) in the 'install' dimension. Would that relate to a M:M dimension in a classic star / snowflake schema? In other words, in your example, are each of those numeric values in polysun_proj a related project to the install?
  • Thanks for sharing your presentation. Just wondering if you could share an example of your map reduce code that you used to build your other dimensions in your cube?

    Thanks
  • I need help please I want to search for time series and historical data management

    Time Series Data Storage in MongoDB: Document Transcript

    • +Sunday, July 24, 2011
    • ajackson @ skylineinnovations.com
    • a tale of rapid prototyping, data warehousing, solar power, an architecture designed for data analysis at “scale” ...and arduinos!
      So here’s what I’d like to talk about: who we are, how we got started, and most importantly, how we’ve been able to use MongoDB to help us. We’re not a traditional startup -- and while I know that this is not a “startups” talk, but a Mongo one, I’d like to show how Mongo’s flexible nature really helped us as a business, and how Mongo specifically has been a good choice for us as we build some of our tools. Here are some themes:
    • Scaling
      Mongo has come to have a pretty strong association with the word “scaling.” Scaling is a word we throw around a lot, and it almost always means “software performance, as inputs grow by orders of magnitude.” But scaling also means performance as the variety of inputs increases. I’d argue that it’s scaling to go from 10 users to 10,000, and it’s also scaling to go from ten ‘kinds’ of input to a hundred. There’s another word for this.
    • Scaling Flexibility
      Particularly when you scale in the real world, you start to find that it’s complicated and messy and entropic in ways that software isn’t always equipped to handle. So for us, when we say “mongo helps us scale”, we don’t necessarily mean scaling to petabytes of data. We’ll come back to them as well.
    • Business-first development
      This generally means flexible, lightweight processes. Things that become fixed & unchangeable quickly become obsolete and sad :’(
    • When Does “Context” become “Yak Shaving”?
      When I read new things or hear about new stuff, I’m always trying to put it in context. So, sometimes I put too much context in my talks :( To avoid it, I sometimes go a little too fast over the context that *is* important. So please stop me to ask questions! Also, the problem domain here is a little different than what we might be used to, so bear with me as we go into plumbing & construction.
    • Preliminaries
    • Est. 8/2009
    • Project Development + Technology
    • “Project Development”
    • finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings.
      We’ll pay to put stuff on your roof, and we’ll keep it at its maximally awesome.
      Right now, this means solar thermal, more efficient lighting retrofits, and maybe HVAC.
      So, here’s the interesting part. Since we put stuff on your roof for free, we need to get that money back. What we do is, we’ll charge you for the energy that it saved you, but, here’s the twist. Other companies have done similar things, where they say “we’ll pay for a system/retrofit/whatever, and you’ll agree to pay us an arbitrary number, and we say you’ll get savings, but you won’t actually be able to tell, really.” That always seemed sketchy to us. So, we actually measure the performance of this stuff, collect the data, and guarantee that you save money.
    • (not webapps)
    • Topics not covered:
    • • Why solar thermal?
      • Why hasn’t anyone else done this before?
      • Pivots? Iterations?
      • What’s the market size?
      • Funding? Capital structures?
      • Wait, how do you guys make money?
      Oh, right, this isn’t a startup talk. But feel free to ask me these later!
    • Solar Thermal in Five Minutes ( mongo next, i promise! )
    • Municipal => Roof => Tank => Customer
    • Relevant Data to Track
    • Temperatures (about a dozen)
    • Flow Rates (at least two)
    • Parallel data streams (hopefully many)
      e.g., weather data, insolation data. It’d be nice if we didn’t have to collect it all ourselves.
    • how much data? 20 data points @ 4 bytes, 1 minute intervals, at 1000 projects (I wish!), for 10 years: 80 * 60 * 24 * 365 * 10 * 1000 ≈ 400 GB? ...not much, really, “in the raw”
      Unfortunately, we can’t really store it with maximal efficiency, because of things like timestamps, metadata, etc., but still.
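The estimate on this slide is easy to check. The sketch below just multiplies out the slide's own figures; the constant names are made up for illustration, and treating each data point as a 4-byte value (for example a 32-bit float) is an assumption the slide only implies.

```python
# Back-of-envelope storage estimate using the figures from the slide.
POINTS_PER_SAMPLE = 20      # data points per reading
BYTES_PER_POINT = 4         # assumed: e.g. a 32-bit float per point
SAMPLES_PER_DAY = 60 * 24   # one reading per minute
DAYS_PER_YEAR = 365
YEARS = 10
PROJECTS = 1000


def raw_bytes():
    """Total raw payload in bytes, ignoring timestamps and metadata."""
    bytes_per_sample = POINTS_PER_SAMPLE * BYTES_PER_POINT  # 80 bytes
    return bytes_per_sample * SAMPLES_PER_DAY * DAYS_PER_YEAR * YEARS * PROJECTS


total = raw_bytes()
print(total)                  # 420480000000
print(round(total / 1e9, 1))  # 420.5 (decimal gigabytes)
```

So the raw payload comes to roughly 420 GB, which the slide rounds to 400 GB.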
    • I hope this provides enough context on the business problems we’re trying to solve. It looks like we’ll need a data pipeline, and we’ll need one fast. We’ve got data that we’ll need to use to build, monitor, and monetize these energy technologies. Having worked at other smart grid companies before, I’ve seen some good data pipelines and some bad data pipelines. I’d like to build a good one. The less stuff I have to build, the better.
    • As I do some research, I find that a lot of these data pipelines have a few well-defined areas of responsibility.
    • Acquisition, Storage, Search, Retrieval, Analytics.
      Acquisition, Storage, Search, Retrieval } Designed for these
      Analytics <= Users are here. Business value is here!
      These should be self explanatory. What’s interesting is that not only are most of the end-users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before. It’s important to remember that, while you can’t get good analytics without the other stuff, the analytics is where almost all of the value is! Search & retrieval are approaching “solved”.
    • So, here’s how I started thinking about things. This is a design diagram from the early days of the company.
    • Easy, python, no problem. There are some interesting topics here, but they’re not mongoDB related. I was pretty sure I knew how to build this part, and I was pretty sure I knew what the data would look like.
    • This part was also easy -- e-mail reports, csvs, maybe some fancy graphs, possibly some light webapps for internal use. These would be dictated by business goals first, but the technological questions were straightforward.
    • Here was the real question. What would some use cases of an analyst having a good experience look like? What would they expect the tools to do?
    • Now we can think about what the data looks like
      So, let’s think about what this data looks like, how it’s structured and what it is. Then, after that, we can look at the best ways to organize it for future usefulness.
    • Time series?
      Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count
      Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458
      Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
      Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
      Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468
      Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471
      Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472
      Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472
      Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477
      Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581
      Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614
    • TIME SERIES DATA
      So what is time series data?
    • Features, Over Time: Thing (Feature vector, v), Time (t)
      multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout
    • A couple of ideas: sampling rates, “regularity”, “completeness”; analog vs. digital; instantaneous vs. cumulative (tradeoffs)
    • t_n, t_n+1
      Finding known interesting ranges (definitely the most common)
    • y, y’ (thresholds); t, t’, etc.
      Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
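The second kind of lookup, using a feature to find interesting ranges, can be sketched in a few lines. This is only an illustration of the threshold idea from the slides, not the pipeline's actual code; the function name `ranges_above` and the sample values are made up.

```python
def ranges_above(samples, threshold):
    """Return (start, end) time pairs for runs where value > threshold.

    `samples` is an iterable of (t, value) pairs, sorted by t.
    """
    ranges = []
    start = None   # time at which the current run began, if any
    last_t = None  # most recent time seen inside the current run
    for t, value in samples:
        if value > threshold:
            if start is None:
                start = t
            last_t = t
        elif start is not None:
            # The run just ended; record it and reset.
            ranges.append((start, last_t))
            start = None
    if start is not None:
        # The series ended while still above the threshold.
        ranges.append((start, last_t))
    return ranges


samples = [(0, 1.0), (1, 5.0), (2, 6.0), (3, 2.0), (4, 7.0)]
print(ranges_above(samples, 4.0))  # [(1, 2), (4, 4)]
```

Looking up a known range (t_n to t_n+1) only needs the timestamps; this kind of search needs the feature values themselves, which is why both access patterns matter for the storage design.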
    • (more complicated stuff can be thought of as transformations...)
      e.g., frequency analysis, wavelets, whatever.
    • At this point, I go off and do a bunch of research on existing technologies. I really hate reinventing the wheel, and we really don’t have the manpower.
    • Time series specific tools
      Scientific tools & libraries
      Traditional data-warehousing approaches
      So, these were some of the options I looked at. I want to quickly point out why I eliminated the first two classes of tools.
    • Time series specific tools: RRDtool -- Round Robin Database
      There are really surprisingly few of these. One of the best is the RRDtool. It’s pretty sweet, and I highly recommend it. Unfortunately, it’s really designed for applications that are highly regular, and that are already pretty digital, for instance, sampling latencies, or temperatures in a datacenter. It’s not really good for unreliable sensors, nor is it really designed for long term persistence. It also has a really high lock-in, with legacy data formats, etc. Don’t get me wrong, it’s totally rad, but I didn’t think it was for us.
    • Scientific tools & libraries, e.g., PyTables
      Pretty cool, but not many of these were mature & ready for primetime. Some that were, like PyTables, didn’t really match our business use-case.
    • Traditional data-warehousing approaches
      So, these were some of the options I looked at. I want to quickly point out why I eliminated the first two classes of tools. [...]. That leaves us with the traditional approaches. This represents a pretty well established field, but very few of the tools are free, lightweight, and mature.
    • Enterprise buzzwords (Just google for OLAP)
      But the biggest idea I learned is that most data warehousing revolves around the idea of a “fact table”. They call it a “multidimensional OLAP cube”, but basically it exists as a totally denormalized SQL table.
    • “Measures” and their “Dimensions”
      (or facts)
    • pretty neat!
    • “how elegant!”
    • in practice...
    • (from “How to Build OLAP Application Using Mondrian + XMLA + SpagoBI”)
      to which the only acceptable response is:
    • ha! Yeah right.
    • Time series are not relational!
      Even extracted features are not inherently relational! Also: you don’t know what you’re looking for, you don’t know when you’ll find it, you won’t know when you’ll have to start looking for something different. Why would you lock yourself into a schema?
    • We don’t know what we’ll want to know.
      We won’t know what we want to know. Not only are we warehousing time-series of multidimensional feature vectors, we don’t even know the dimensions we’ll be interested in yet!
    • natural fit for documents
      This makes a schema-less database a natural fit for these sorts of things. Think about all the alter-table calls I’ve avoided...
    • "_id" : {
          "install.name" : "agni-3501",
          "timestamp" : ISODate("2010-08-06T00:00:00Z"),
          "frequency" : "daily"
      },
      "measures" : {
          "total-delta" : -85.78773442284201,
          "Energy Sold" : 450087.1186574721,
          "Generation" : 57273.159890170136,
          "consumed-delta" : 12.569841951556597,
          "lbs-sold" : 18848.4,
          "Gallons Loop" : 740.5,
          "Coincident Usage" : 400,
          "Stored Energy" : 1306699.6439737699,
          "Gallons Sold" : 2260,
          "Energy Delivered" : 360069.6949259777,
          "Total Usage" : -1605086.7261496289,
          "Stratification" : -4.905050370111111,
          "gen-delta-roof" : 4.819865854785763,
          "lbs-loop" : 6520.1025
      },
      "day_of_year" : 218,
      "day_of_week" : 4,
      "month" : 8,
      "week_of_year" : 31,
      "install" : {
          "panels" : 32,
          "name" : "agni-3501",
          "num_files" : "3744",
          "heater_efficiency" : 0.8,
          "storage" : 1612,
          "install_completed" : ISODate("2010-08-06T00:00:00Z"),
          "logger_type" : "emerald",
          "_id" : ObjectId("4d2905536edfdb022f000212"),
          "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ],
          "last_seen" : "2011-01-08 05:26:35.352782"
      },
      "year" : 2010,
      "day" : 6
      Isn’t this better?
    • (the same document, annotated: the "measures" sub-document is labeled “measures”; the remaining keys, such as "day_of_year", "month", and "install", are labeled “dimensions”) ...right?
      measures & dimensions. This would be a nice, clean division, except that it isn’t. Frequently we’ll look for measures by other measures -- i.e., each measure serves as a dimension.
    • ...actually, not a good model.
      The line gets pretty blurry, in practice. Multi-dimensional vectors mean every measure provides another dimension. Anyway!
    • (the same document again)
      How do we build these quickly & efficiently?
    • the goal: good numbers!
      Remember, the goal here is to make it easy for analysts to get comparable numbers, so when I ask for the delivered energy for one system, compared to the delivered energy from another, I can just get the time-series data, without having to worry about if sensors changed, when the network was out, when a logger was replaced with another one, etc.
    • So, the OLTP layer serving as our inputs essentially serves up timestamped data as CSV series. It doesn’t really provide a lot of intelligence, and is basically the raw numbers.
    • from rows to columns
      So, most of what our pipeline does is turn things from rows to columns, in a flexible, useful way. I’m gonna walk through that process, quickly.
    • Let’s just look at one (shown over the "measures" of the same fact document)
    • row-major data (the same CSV sample as on the “Time series?” slide)
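Going “from rows to columns” can be illustrated with the standard library alone. This is a sketch under assumptions, not the talk's pipeline: `rows_to_columns` is a made-up name, and the sample keeps only the first three columns of the logger CSV. The full header repeats the name "array in/out", so a real version would need to disambiguate duplicate column names before keying on them.

```python
import csv
import io
from datetime import datetime

# First three columns of the logger output shown on the slide.
SAMPLE = """Time,municipal water in T,solar heated water out T
Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383
Tue Mar 9 23:02:44 2010,14.958038343,53.764889193
"""

TIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # matches "Tue Mar 9 23:01:44 2010"


def rows_to_columns(text):
    """Turn row-major CSV text into {column name: [values]}."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        # The first field is the timestamp; the rest are numeric readings.
        columns[header[0]].append(datetime.strptime(row[0], TIME_FORMAT))
        for name, cell in zip(header[1:], row[1:]):
            columns[name].append(float(cell))
    return columns


cols = rows_to_columns(SAMPLE)
print(cols["municipal water in T"])
```

Each column is now a series of its own, aligned with the "Time" column, which is the shape the measure formulas want to work on.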
    • “Functional”
      import functools

      class Mass(BasicMeasure):
          def __init__(self, density, volume):
              ...
              # Bind density and volume now; only the data is supplied later.
              self._result_func = functools.partial(
                  lambda data, density, volume: density * volume(data),
                  density=density, volume=volume)

          def __call__(self, data):
              return self._result_func(data)
      quasi-functional classes that describe how to calculate a value from data.
    • A formula: E = ∆t × F
      # pseudocode
      class LoopEnergy(BasicMeasure):
          def __init__(self, heat_cap, delta, mass):
              ...
              def result_func(data):
                  return self.delta(data) * self.mass(data) * self.heat_cap
              self._result_func = result_func

          def __call__(self, data):
              return self._result_func(data)
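To see how such measure classes compose, here is a minimal runnable sketch. It is an assumption-laden illustration rather than the talk's code: the slides never show `BasicMeasure`, so the base class here is hypothetical, as are the `Column` and `Delta` helpers, the field names in the sample, and the constants (chosen only to keep the arithmetic easy to check).

```python
class BasicMeasure:
    """Hypothetical base class: a measure is a callable from data to a value."""

    def __call__(self, data):
        raise NotImplementedError


class Column(BasicMeasure):
    """Hypothetical helper: reads one named reading out of a sample."""

    def __init__(self, name):
        self.name = name

    def __call__(self, data):
        return data[self.name]


class Delta(BasicMeasure):
    """Hypothetical helper: difference between two other measures."""

    def __init__(self, hot, cold):
        self.hot, self.cold = hot, cold

    def __call__(self, data):
        return self.hot(data) - self.cold(data)


class Mass(BasicMeasure):
    def __init__(self, density, volume):
        self.density, self.volume = density, volume

    def __call__(self, data):
        return self.density * self.volume(data)


class LoopEnergy(BasicMeasure):
    def __init__(self, heat_cap, delta, mass):
        self.heat_cap, self.delta, self.mass = heat_cap, delta, mass

    def __call__(self, data):
        return self.delta(data) * self.mass(data) * self.heat_cap


# Made-up constants and field names, for illustration only.
loop_energy = LoopEnergy(
    heat_cap=3.0,
    delta=Delta(Column("supply_T"), Column("return_T")),
    mass=Mass(density=2.0, volume=Column("gallons")),
)
sample = {"supply_T": 50.0, "return_T": 40.0, "gallons": 10.0}
print(loop_energy(sample))  # 600.0
```

Because every measure is just a callable over the data, new formulas can be defined in terms of existing ones without touching the stored schema.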
    • Creating a Cube
      For each install, for each chunk of data:
          apply all known formulas to get values
          make some convenience keys (e.g., day_of_year)
          stuff it in mongo
      Then, map/reduce to whatever dimensionalities you’re interested in: e.g., downsampling.
      Here’s some pseudocode for how to make a cube of multidimensional data. So, what’s the payoff?
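The first half of that pseudocode might look like the sketch below. The function `build_fact` and its arguments are made up for illustration. The weekday convention (Monday is 0) and the ISO week number are assumptions; they happen to reproduce the values in the example document for 2010-08-06. Storing the result (for example with pymongo's `insert_one`) is left out so the sketch runs without a server.

```python
from datetime import datetime


def build_fact(install, timestamp, frequency, data, measures):
    """Apply every known formula to one chunk of data and add convenience keys."""
    return {
        "_id": {
            "install.name": install["name"],
            "timestamp": timestamp,
            "frequency": frequency,
        },
        "measures": {name: formula(data) for name, formula in measures.items()},
        "install": install,
        # Convenience keys, so later queries need no date arithmetic.
        "year": timestamp.year,
        "month": timestamp.month,
        "day": timestamp.day,
        "day_of_year": timestamp.timetuple().tm_yday,
        "day_of_week": timestamp.weekday(),          # assumed: Monday == 0
        "week_of_year": timestamp.isocalendar()[1],  # assumed: ISO week number
    }


# One hypothetical formula; the readings are made up to sum to the slide's 2260.
measures = {"Gallons Sold": lambda data: sum(data["gallons"])}
fact = build_fact(
    {"name": "agni-3501", "panels": 32},
    datetime(2010, 8, 6),
    "daily",
    {"gallons": [1000, 1260]},
    measures,
)
print(fact["day_of_year"], fact["day_of_week"], fact["week_of_year"])  # 218 4 31
```

Denormalizing the install details and the calendar keys into every fact is what lets the later queries stay close to one line each.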
    • How much water did [x] use, monthly?
      > db.facts_monthly.find({"install.name": [foo]}, {"measures.Gallons Sold": 1}).sort({"_id": 1})
      Complicated analytical queries can be boiled down to nearly single line mongo-queries. Here’s some examples:
    • What were our highest production days?
      > db.facts_daily.find({}, {"measures.Energy Sold": 1}).sort({"measures.Energy Sold": -1})
    • How does the distribution of [x] on the weekend compare to its distribution on the weekdays?
      > weekends = db.facts_daily.find({"day_of_week": {$in: [5,6]}})
      > weekdays = db.facts_daily.find({"day_of_week": {$nin: [5,6]}})
      > do stuff
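The “do stuff” step is left open on the slide. One possible version, sketched in plain Python over already-fetched documents, is below. The helper names are made up, and apart from the 2260 taken from the example document, the values are invented for illustration.

```python
from statistics import mean, median


def split_by_weekend(facts, measure):
    """Split one measure's daily values into (weekend, weekday) lists."""
    weekend, weekday = [], []
    for fact in facts:
        value = fact["measures"][measure]
        # Same test as {"day_of_week": {$in: [5,6]}} in the shell query.
        if fact["day_of_week"] in (5, 6):
            weekend.append(value)
        else:
            weekday.append(value)
    return weekend, weekday


def summarize(values):
    """A few summary statistics for comparing two distributions."""
    return {"n": len(values), "mean": mean(values), "median": median(values)}


# Stand-ins for documents returned by db.facts_daily.find(...).
facts = [
    {"day_of_week": 4, "measures": {"Gallons Sold": 2260}},
    {"day_of_week": 5, "measures": {"Gallons Sold": 1800}},
    {"day_of_week": 6, "measures": {"Gallons Sold": 1600}},
    {"day_of_week": 0, "measures": {"Gallons Sold": 2400}},
]
weekend, weekday = split_by_weekend(facts, "Gallons Sold")
print(summarize(weekend))
print(summarize(weekday))
```

The precomputed "day_of_week" key does the heavy lifting here: neither the query nor the analysis has to parse a timestamp.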
    • What’s the production of installs north of a certain latitude, with a certain class of panel, on Tuesdays?
      For hours where the average delivered temperature delta was above [x], what was our generation efficiency?
      Normalize by number of panels? (map/reduce)
      Normalize by distance from equinox? (map/reduce)
      ...etc.
    • • Building a cube can be done in parallel
      • Map/reduce is an easy way to think about transforms.
      • Not maximally efficient, but parallelizes on commodity hardware.
      Some advantages. Re #3 -- so what? It’s not a webapp.
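The downsampling transform mentioned earlier fits the map/reduce shape naturally. The sketch below imitates it in plain Python rather than using MongoDB's own map/reduce, so it is an analogy and not the talk's code; the function names and the second day's value are made up. Summing is only appropriate for cumulative measures such as gallons; something like "Stratification" would need a different reducer.

```python
from collections import defaultdict


def map_daily(fact):
    """Map step: emit ((install, year, month), measures) for one daily fact."""
    key = (fact["install"]["name"], fact["year"], fact["month"])
    return key, fact["measures"]


def reduce_sum(values):
    """Reduce step: sum each measure across all values emitted for one key."""
    totals = defaultdict(float)
    for measures in values:
        for name, value in measures.items():
            totals[name] += value
    return dict(totals)


def downsample(daily_facts):
    """Group the mapped facts by key, then reduce each group."""
    grouped = defaultdict(list)
    for fact in daily_facts:
        key, value = map_daily(fact)
        grouped[key].append(value)
    return {key: reduce_sum(values) for key, values in grouped.items()}


daily = [
    {"install": {"name": "agni-3501"}, "year": 2010, "month": 8,
     "measures": {"Gallons Sold": 2260}},
    {"install": {"name": "agni-3501"}, "year": 2010, "month": 8,
     "measures": {"Gallons Sold": 1740}},
]
monthly = downsample(daily)
print(monthly)
```

The map step looks at one fact at a time and the reduce step at one key at a time, which is what makes the work easy to spread across machines.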
    • mongoDB: The future of enterprise business intelligence. (they just don’t know it yet)
      So, here’s my thesis: document-databases are far superior to relational databases for business intelligence cases. Not only that, but mongoDB and some common sense lets you replace multimillion dollar IBM-level enterprise solutions with open-source awesomeness. All this in a rapid, agile way.
    • Lastly...
    • Mongo expands in an organization.
      It’s cool, don’t fight it. Once we started using it for our analytics, we realized there was a lot of other schema-loose data that we could use it for -- like the definitions of the measures themselves, or the details about an install, etc., etc.
    • Final Thoughts
      Ok, I want to close up with a few jumping-off points.
    • “Business Intelligence” no longer requires megabucks
    • Flexible tools means business responsiveness should be easy
    • “Scaling” doesn’t just mean depth-first.
      Businesses grow deep, in the sense of adding more users, but they also grow broad.
    • Questions?
    • Epilogue: Quest for Logging Hardware
    • This’ll be easy!
      This is such an obvious and well explored problem space, I’m sure we’ll be able to find a solution that matches our needs without breaking the bank!
    • Shopping List!
      16 temperature sensors
      4 flow sensors
      maybe some miscellaneous ones
      internet backhaul
      no software/data lock in.
    • Conventions FTW!
      And since we’ve walked a couple convention floors and product catalogs from major industrial supply vendors, I’m sure it’s in here somewhere!
    • derp derp “internet”?
      I’m sure there’s a reason why all of these loggers have to connect via USB...
      Pace Scientific XR5: 8 analog, 3 pulse, ONE MB, no internet? $500?!?
    • yay windows?
      ...and require proprietary (windows!) software or subscription plans that route my data through their servers (basically all of them!)
    • Maybe the gov’t can help!
      Perhaps there’s some kind of standard that the governments require for solar thermal monitoring systems to be eligible for incentives or tax credits.
    • Vive la France!
      An obscure standard by the Organisation Internationale de Métrologie Légale appears! Neat!
    • A “Certified” Logger
      two temperature sensors
      one pulse
      no increase in accuracy
      no data backhaul -- at all
      ... what’s the price?
    • $1,000
    • Hmm... I can solder, and arduinos are pretty cheap
    • It’s on!
    • arduino + netbook!
    • TL; DR: Existing loggers are terrible.
      Also, existing industries aren’t really ready for rapid prototyping and its destructive effects.
    • • http://www.flickr.com/photos/rknight/4358119571/
      • http://4.bp.blogspot.com/_8vNzwxlohg0/TJoUWqsF4LI/AAAAAAAABMg/QaUiKwCEZn8/s320/turtles-all-the-way-down.jpg
      • http://www.flickr.com/photos/rhk313/3801302914/
      • http://www.flickr.com/photos/benny_lin/481411728/
      • http://spagobi.blogspot.com/2010_08_01_archive.html
      • http://community.qlikview.com/forums/t/37106.aspx