MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregation Framework and Hadoop
  • Compound unique index on linkId & Interval
    update field used to identify new documents for aggregation
  • Priority
    Floating point number between 0..1000
    Highest member that is up to date wins
    Up to date == within 10 seconds of primary
    If a higher priority member catches up, it will force election and win

    Slave Delay
    Lags behind master by configurable time delay
    Automatically hidden from clients
    Protects against operator errors
    Fat fingering
    Application corrupts data
  • Makes MongoDB a Hadoop-enabled file system
    Read and write to live data, in-place
    Copy data between Hadoop and MongoDB
    Uses MongoDB indexes to filter data
    Full support for data processing
    Hive
    MapReduce
    Pig
    Streaming
  • MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregation Framework and Hadoop

    1. Time Series Data: Aggregations in Action. Jay Runkel, Solution Architect, @jayrunkel
    2. Agenda
        • Review Traffic Use Case
        • Review Schema Design
        • Document Retention Model
        • Aggregation Queries
        • Map Reduce
        • Hadoop
    3. Use Case Review
    4. We need to prepare for this
    5. Develop nationwide traffic monitoring system
    6. Traffic sensors to monitor interstate conditions
        • 16,000 sensors
        • Measure at one minute intervals
        • Speed
        • Travel time
        • Weather, pavement, and traffic conditions
        • Support desktop, mobile, and car navigation systems
    7. What we want from our data: Charting and Trending
    8. What we want from our data: Historical & Predictive Analysis
    9. What we want from our data: Real Time Traffic Dashboard
    10. Review Schema Design
    11. Document Structure
        { _id: ObjectId("5382ccdd58db8b81730344e2"),
          linkId: 900006,
          date: ISODate("2014-03-12T17:00:00Z"),
          data: [
            { speed: NaN, time: NaN },
            { speed: NaN, time: NaN },
            { speed: NaN, time: NaN },
            ...
          ],
          conditions: {
            status: "Snow / Ice Conditions",
            pavement: "Icy Spots",
            weather: "Light Snow"
          }
        }
    12. Sample Document Structure. A compound, unique index (on linkId and date) identifies the individual document
        { _id: ObjectId("5382ccdd58db8b81730344e2"),
          linkId: 900006,
          date: ISODate("2014-03-12T17:00:00Z"),
          data: [
            { speed: NaN, time: NaN },
            { speed: NaN, time: NaN },
            { speed: NaN, time: NaN },
            ...
          ],
          conditions: {
            status: "Snow / Ice Conditions",
            pavement: "Icy Spots",
            weather: "Light Snow"
          }
        }
    13. Sample Document Structure. Using a "linkId:date" string as _id saves an extra index
        { _id: "900006:14031217",
          data: [
            { speed: NaN, time: NaN },
            { speed: NaN, time: NaN },
            { speed: NaN, time: NaN },
            ...
          ],
          conditions: {
            status: "Snow / Ice Conditions",
            pavement: "Icy Spots",
            weather: "Light Snow"
          }
        }
    14. Sample Document Structure. Range queries: /^900006:1403/. The regex must be left-anchored and case-sensitive
        { _id: "900006:14031217",
          data: [
            { speed: NaN, time: NaN },
            { speed: NaN, time: NaN },
            { speed: NaN, time: NaN },
            ...
          ],
          conditions: {
            status: "Snow / Ice Conditions",
            pavement: "Icy Spots",
            weather: "Light Snow"
          }
        }
    15. Sample Document Structure. Pre-allocated, 60 element array of per-minute data
        { _id: "900006:14031217",
          data: [
            { speed: NaN, time: NaN },
            { speed: NaN, time: NaN },
            { speed: NaN, time: NaN },
            ...
          ],
          conditions: {
            status: "Snow / Ice Conditions",
            pavement: "Icy Spots",
            weather: "Light Snow"
          }
        }
    16. Advantages
        1. In-place updates are efficient
        2. Dashboards need only simple queries
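The pre-allocated hourly document can be sketched in plain JavaScript as follows. This is only an illustration: the helper names (`pad2`, `hourlyId`, `newHourlyDocument`, `recordSample`) are hypothetical, and the use of UTC for the "linkId:yymmddhh" key is an assumption inferred from the sample `_id` and `date` values on the slides.

```javascript
// Hypothetical helpers; not part of the original deck.
function pad2(n) {
  return (n < 10 ? "0" : "") + n;
}

// Build the "linkId:yymmddhh" key for the hour containing `date`.
// Assumption: the key is derived from the UTC timestamp.
function hourlyId(linkId, date) {
  return linkId + ":" +
    pad2(date.getUTCFullYear() % 100) +
    pad2(date.getUTCMonth() + 1) +
    pad2(date.getUTCDate()) +
    pad2(date.getUTCHours());
}

// Pre-allocate one slot per minute so later writes never grow the document.
function newHourlyDocument(linkId, date) {
  var data = [];
  for (var minute = 0; minute < 60; minute++) {
    data.push({ speed: NaN, time: NaN });
  }
  return { _id: hourlyId(linkId, date), data: data };
}

// In-memory stand-in for an in-place update of a single minute slot.
function recordSample(doc, date, speed, time) {
  doc.data[date.getUTCMinutes()] = { speed: speed, time: time };
}

var doc = newHourlyDocument(900006, new Date("2014-03-12T17:00:00Z"));
recordSample(doc, new Date("2014-03-12T17:05:00Z"), 52, 237);
console.log(doc._id, doc.data.length, doc.data[5]);
```

Because every slot exists from the start, writing a sample replaces a value of the same shape instead of appending to the array, which is the property the "in-place updates" advantage relies on.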
    17. Dashboards
        [Chart: values on a 0 to 70 axis, plotted per timestamp from Mon Mar 10 2014 04:57 to Wed Mar 12 2014 10:46]
        db.linkData.find({_id : /^20484087:2014031/})
    18. Supporting Queries From Navigation Systems
    19. Navigation System Queries: What is the average speed for the last 10 minutes on 50 upcoming road segments?
    20. Current Real-Time Conditions. Last ten minutes of speeds and times
        { _id : "I-87:10656",
          description : "NYS Thruway Harriman Section Exits 14A - 16",
          update : ISODate("2013-10-10T23:06:37.000Z"),
          speeds : [ 52, 49, 45, 51, ... ],
          times : [ 237, 224, 246, 233, ... ],
          pavement : "Wet Spots",
          status : "Wet Conditions",
          weather : "Light Rain",
          averageSpeed : 50.23,
          averageTime : 234,
          maxSafeSpeed : 53.1,
          "location" : {
            "type" : "LineString",
            "coordinates" : [ [ -74.056, 41.098 ], [ -74.077, 41.104 ] ]
          }
        }
    21. Current Real-Time Conditions. Pre-aggregated metrics
        { _id : "I-87:10656",
          description : "NYS Thruway Harriman Section Exits 14A - 16",
          update : ISODate("2013-10-10T23:06:37.000Z"),
          speeds : [ 52, 49, 45, 51, ... ],
          times : [ 237, 224, 246, 233, ... ],
          pavement : "Wet Spots",
          status : "Wet Conditions",
          weather : "Light Rain",
          averageSpeed : 50.23,
          averageTime : 234,
          maxSafeSpeed : 53.1,
          "location" : {
            "type" : "LineString",
            "coordinates" : [ [ -74.056, 41.098 ], [ -74.077, 41.104 ] ]
          }
        }
    22. Current Real-Time Conditions. Geo-spatially indexed road segment
        { _id : "I-87:10656",
          description : "NYS Thruway Harriman Section Exits 14A - 16",
          update : ISODate("2013-10-10T23:06:37.000Z"),
          speeds : [ 52, 49, 45, 51, ... ],
          times : [ 237, 224, 246, 233, ... ],
          pavement : "Wet Spots",
          status : "Wet Conditions",
          weather : "Light Rain",
          averageSpeed : 50.23,
          averageTime : 234,
          maxSafeSpeed : 53.1,
          "location" : {
            "type" : "LineString",
            "coordinates" : [ [ -74.056, 41.098 ], [ -74.077, 41.104 ] ]
          }
        }
    23. Maintaining the current conditions
        db.linksAvg.update(
          {"_id" : linkId},
          { "$set" : {"lUpdate" : date},
            "$push" : {
              "times" : { "$each" : [ time ], "$slice" : -10 },
              "speeds" : { "$each" : [ speed ], "$slice" : -10 }
            }
          })
        Each update pushes the new value onto the end of the array, and $slice: -10 keeps only the 10 most recent values, so the oldest one is dropped
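The effect of combining `$push` with `$each` and a negative `$slice` can be modelled on a plain array. The sketch below is not MongoDB code; `pushAndTrim` is a hypothetical helper that mimics the update on an in-memory array.

```javascript
// Plain-JavaScript model of { $push: { field: { $each: [value], $slice: -keep } } }.
function pushAndTrim(values, newValue, keep) {
  var updated = values.concat([newValue]); // append the new value at the end
  return updated.slice(-keep);             // keep only the last `keep` elements
}

var speeds = [];
for (var i = 1; i <= 12; i++) {
  speeds = pushAndTrim(speeds, i, 10);
}
console.log(speeds); // the two oldest values (1 and 2) have been dropped
```

The array therefore behaves like a fixed-size window over the most recent readings, which is what the navigation query for the last 10 minutes needs.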
    24. Document Retention
    25. Document retention
        Doc per hour: 2 weeks
        Doc per day: 2 months
        Doc per month: 1 year
    26. Rollup – 1 day
        // daily document
        // retained for 2 months
        {
          _id: "link:date",
          // 24 element array
          hourly: [
            { speed: { sum: , count: }, time: { sum: , count: } },
            { speed: { sum: , count: }, time: { sum: , count: } }
          ]
        }
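A daily rollup of this shape could be produced from the hourly documents roughly as follows. This is a sketch under assumptions: `summarizeHour` and `rollupDay` are hypothetical names, the example `_id` is made up, and skipping minutes that are still `NaN` is an assumed policy that the slides do not state.

```javascript
// Reduce one hourly document's samples to sums and counts.
// Assumption: slots still holding NaN were never measured and are skipped.
function summarizeHour(hourDoc) {
  var out = { speed: { sum: 0, count: 0 }, time: { sum: 0, count: 0 } };
  hourDoc.data.forEach(function (sample) {
    if (!isNaN(sample.speed)) {
      out.speed.sum += sample.speed;
      out.speed.count++;
    }
    if (!isNaN(sample.time)) {
      out.time.sum += sample.time;
      out.time.count++;
    }
  });
  return out;
}

// Build the daily document: one { speed, time } summary per hourly document.
function rollupDay(dayId, hourDocs) {
  return { _id: dayId, hourly: hourDocs.map(summarizeHour) };
}

var hour = {
  data: [
    { speed: 50, time: 200 },
    { speed: NaN, time: NaN },
    { speed: 40, time: 220 }
  ]
};
console.log(JSON.stringify(rollupDay("900006:140312", [hour])));
```

Storing sums and counts rather than averages keeps the rollup composable: a monthly document can add the daily sums and counts without revisiting the raw samples.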
    27. Analysis With The Aggregation Framework
    28. Pipelining operations: grep | sort | uniq. Piping command line operations
    29. Pipelining operations: $match | $group | $sort. Piping aggregation operations: a stream of documents goes in, a result document comes out
    30. What is the average speed for a given road segment?
        > db.linkData.aggregate(
            { $match: { "_id" : /^20484097:/ } },
            { $project: { "data.speed": 1, linkId: 1 } },
            { $unwind: "$data" },
            { $group: { _id: "$linkId", ave: { $avg: "$data.speed" } } }
          );
        { "_id" : 20484097, "ave" : 47.067650676506766 }
    31. What is the average speed for a given road segment? Select documents on the target segment
        > db.linkData.aggregate(
            { $match: { "_id" : /^20484097:/ } },
            { $project: { "data.speed": 1, linkId: 1 } },
            { $unwind: "$data" },
            { $group: { _id: "$linkId", ave: { $avg: "$data.speed" } } }
          );
        { "_id" : 20484097, "ave" : 47.067650676506766 }
    32. What is the average speed for a given road segment? Keep only the fields we really need
        > db.linkData.aggregate(
            { $match: { "_id" : /^20484097:/ } },
            { $project: { "data.speed": 1, linkId: 1 } },
            { $unwind: "$data" },
            { $group: { _id: "$linkId", ave: { $avg: "$data.speed" } } }
          );
        { "_id" : 20484097, "ave" : 47.067650676506766 }
    33. What is the average speed for a given road segment? Loop over the array of data points
        > db.linkData.aggregate(
            { $match: { "_id" : /^20484097:/ } },
            { $project: { "data.speed": 1, linkId: 1 } },
            { $unwind: "$data" },
            { $group: { _id: "$linkId", ave: { $avg: "$data.speed" } } }
          );
        { "_id" : 20484097, "ave" : 47.067650676506766 }
    34. What is the average speed for a given road segment? Use the handy $avg operator
        > db.linkData.aggregate(
            { $match: { "_id" : /^20484097:/ } },
            { $project: { "data.speed": 1, linkId: 1 } },
            { $unwind: "$data" },
            { $group: { _id: "$linkId", ave: { $avg: "$data.speed" } } }
          );
        { "_id" : 20484097, "ave" : 47.067650676506766 }
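To make the role of each stage concrete, the same computation can be written over an in-memory array. The sketch below is plain JavaScript rather than the aggregation framework; `averageSpeed` is a hypothetical function and the three sample documents are invented for illustration.

```javascript
// Invented sample data: two hourly documents for segment 20484097, one for another segment.
var linkData = [
  { _id: "20484097:14031217", data: [ { speed: 40, time: 200 }, { speed: 50, time: 180 } ] },
  { _id: "20484097:14031218", data: [ { speed: 60, time: 150 } ] },
  { _id: "900006:14031217",   data: [ { speed: 10, time: 900 } ] }
];

function averageSpeed(docs, linkId) {
  // $match: keep documents whose _id starts with "linkId:".
  var prefix = new RegExp("^" + linkId + ":");
  var matched = docs.filter(function (d) { return prefix.test(d._id); });

  // $project + $unwind: flatten every per-minute sample down to its speed.
  var speeds = [];
  matched.forEach(function (d) {
    d.data.forEach(function (sample) { speeds.push(sample.speed); });
  });

  // $group with $avg: one average over all flattened samples.
  if (speeds.length === 0) {
    return null; // nothing matched, so there is no group to report
  }
  var sum = speeds.reduce(function (a, b) { return a + b; }, 0);
  return sum / speeds.length;
}

console.log(averageSpeed(linkData, 20484097)); // 50
```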
    35. More Sophisticated Pipelines: average speed with variance
        { "$project" : {
            mean: "$meanSpd",
            spdDiffSqrd : {
              "$map" : {
                "input" : {
                  "$map" : {
                    "input" : "$speeds",
                    "as" : "samp",
                    "in" : { "$subtract" : [ "$$samp", "$meanSpd" ] }
                  }
                },
                as: "df",
                in: { $multiply: [ "$$df", "$$df" ] }
              }
            }
          }
        },
        { $unwind: "$spdDiffSqrd" },
        { $group: { _id: { mean: "$mean" }, variance: { $avg: "$spdDiffSqrd" } } }
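The arithmetic behind this pipeline is the population variance: subtract the mean from each sample, square the differences, and average the squares. A plain JavaScript sketch of the same steps, with the hypothetical function name `speedVariance`, using the four speeds shown on the current-conditions slide:

```javascript
function speedVariance(speeds) {
  var mean = speeds.reduce(function (a, b) { return a + b; }, 0) / speeds.length;

  // Inner $map: difference between each sample and the mean.
  var diffs = speeds.map(function (samp) { return samp - mean; });

  // Outer $map: square each difference.
  var squared = diffs.map(function (df) { return df * df; });

  // $unwind + $group with $avg: average of the squared differences.
  var variance = squared.reduce(function (a, b) { return a + b; }, 0) / squared.length;

  return { mean: mean, variance: variance };
}

console.log(speedVariance([52, 49, 45, 51])); // { mean: 49.25, variance: 7.1875 }
```

Worked by hand: the differences are 2.75, -0.25, -4.25 and 1.75, their squares sum to 28.75, and 28.75 / 4 = 7.1875.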
    36. Analysis With MapReduce
    37. Historic Analysis: How do weather and road conditions affect traffic? The ask: what are the average speeds per weather, status, and pavement condition?
    38. MapReduce
        function map() {
          for( var i = 0; i < this.data.length; i++ ) {
            emit( this.conditions.weather, { speed : this.data[i].speed } );
            emit( this.conditions.status, { speed : this.data[i].speed } );
            emit( this.conditions.pavement, { speed : this.data[i].speed } );
          }
        }
    39. MapReduce: "Snow", 34
        function map() {
          for( var i = 0; i < this.data.length; i++ ) {
            emit( this.conditions.weather, { speed : this.data[i].speed } );
            emit( this.conditions.status, { speed : this.data[i].speed } );
            emit( this.conditions.pavement, { speed : this.data[i].speed } );
          }
        }
    40. MapReduce: "Icy spots", 34
        function map() {
          for( var i = 0; i < this.data.length; i++ ) {
            emit( this.conditions.weather, { speed : this.data[i].speed } );
            emit( this.conditions.status, { speed : this.data[i].speed } );
            emit( this.conditions.pavement, { speed : this.data[i].speed } );
          }
        }
    41. MapReduce: "Delays", 34
        function map() {
          for( var i = 0; i < this.data.length; i++ ) {
            emit( this.conditions.weather, { speed : this.data[i].speed } );
            emit( this.conditions.status, { speed : this.data[i].speed } );
            emit( this.conditions.pavement, { speed : this.data[i].speed } );
          }
        }
    42. MapReduce
    43. MapReduce. Weather: "Rain", speed: 44
    44. MapReduce. Weather: "Rain", speed: 39
    45. MapReduce. Weather: "Rain", speed: 46
    46. MapReduce
        function reduce ( key, values ) {
          // start the count at 0 so it equals the number of values summed
          var result = { count : 0, speedSum : 0 };
          values.forEach( function( v ){
            result.speedSum += v.speed;
            result.count++;
          });
          return result;
        }
    47. MapReduce
        function reduce ( key, values ) {
          // start the count at 0 so it equals the number of values summed
          var result = { count : 0, speedSum : 0 };
          values.forEach( function( v ){
            result.speedSum += v.speed;
            result.count++;
          });
          return result;
        }
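One point worth illustrating: MongoDB may call the reduce function more than once for the same key, feeding earlier reduce outputs back in as values, so the emitted values and the reduce result need the same shape. The sketch below is a variant of the slide's functions, not the deck's own code; `emitValue`, `reduceSpeeds` and `finalizeAverage` are hypothetical names, and it runs as plain JavaScript without a database.

```javascript
// Value a map function could emit per sample: already in the reduce output shape.
function emitValue(speed) {
  return { count: 1, speedSum: speed };
}

// Sums counts and speed totals, so its output can safely be reduced again.
function reduceSpeeds(key, values) {
  var result = { count: 0, speedSum: 0 };
  values.forEach(function (v) {
    result.count += v.count;
    result.speedSum += v.speedSum;
  });
  return result;
}

// Turn the final totals into an average speed.
function finalizeAverage(key, reduced) {
  return reduced.speedSum / reduced.count;
}

// The three "Rain" readings from the slides: 44, 39 and 46.
var rain = [44, 39, 46].map(emitValue);

var oneStep = reduceSpeeds("Rain", rain);
var partial = reduceSpeeds("Rain", rain.slice(0, 2));
var twoStep = reduceSpeeds("Rain", [partial, rain[2]]);

console.log(oneStep, twoStep, finalizeAverage("Rain", oneStep)); // both totals agree; the average is 43
```

Reducing everything at once and reducing in two steps give the same totals, which is the property a reduce function needs when the engine is free to re-reduce.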
    48. Results
        results: [
          { "_id" : "Generally Clear and Dry Conditions", "value" : { "count" : 902, "speedSum" : 45100 } },
          { "_id" : "Icy Spots", "value" : { "count" : 242, "speedSum" : 9438 } },
          { "_id" : "Light Snow", "value" : { "count" : 122, "speedSum" : 7686 } },
          { "_id" : "No Report", "value" : { "count" : 782, "speedSum" : NaN } }
        ]
    49. Analysis With Hadoop (using the MongoDB Connector)
    50. Processing Large Data Sets
        • Need to break data into smaller pieces
        • Process data across multiple nodes
        [Diagram: multiple Hadoop nodes]
    51. Benefits of the Hadoop Connector
        • Increased parallelism
        • Access to analytics libraries
        • Separation of concerns
        • Integrates with existing tool chains
    52. MongoDB Hadoop Connector
        • Multi-source analytics
        • Interactive & Batch
        • Data lake
        • Online, Real-time
        • High concurrency & HA
        • Live analytics
        [Diagram labels: Operational, Post Processing, MongoDB Connector for Hadoop]
    53. Questions? @jayrunkel, jay.runkel@mongodb.com. Part 3: July 16th, 2:00 PM EST
    54. Sign up for our "Path to Proof" Program and get expert advice on implementation, architecture, and configuration. www.mongodb.com/lp/contact/path-proof-program
    55. HVDF: https://github.com/10gen-labs/hvdf. Hadoop Connector: https://github.com/mongodb/mongo-hadoop
    56. Thank You. Bryan Reinero, Consulting Engineer, MongoDB Inc. #ConferenceHashtag
