MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and Visualization Using Flight Data
1. Mongo Analytics –
Learn aggregation by example
Exploratory Analytics and
Visualization using Flight Data
www.jsonstudio.com
2. Analyzing Flight Data
• JSON data imported from CSV downloaded from www.transtats.bts.gov
• Sample document for a flight:
{
"_id": { "$oid": "534205f61c479f6149a92709" },
"YEAR": 2013, "QUARTER": 1,
"MONTH": 1,
"DAY_OF_MONTH": 18,
"DAY_OF_WEEK": 5,
"FL_DATE": "2013-01-18",
"UNIQUE_CARRIER": "DL",
"AIRLINE_ID": 19790,
"CARRIER": "DL",
"TAIL_NUM": "N325US",
"FL_NUM": 1497,
"ORIGIN_AIRPORT_ID": 14100,
"ORIGIN_AIRPORT_SEQ_ID": 1410002,
"ORIGIN_CITY_MARKET_ID": 34100,
"ORIGIN": "PHL",
"ORIGIN_CITY_NAME": "Philadelphia, PA",
"ORIGIN_STATE_ABR": "PA",
"ORIGIN_STATE_FIPS": 42,
"DEST_AIRPORT_ID": 13487,
"DEST_AIRPORT_SEQ_ID": 1348702,
"DEST_CITY_MARKET_ID": 31650,
"DEST": "MSP",
"DEST_CITY_NAME": "Minneapolis, MN",
"DEST_STATE_ABR": "MN",
"DEST_STATE_FIPS": 27,
"DEST_STATE_NM": "Minnesota",
"DEST_WAC": 63,
"CRS_DEP_TIME": 805,
"DEP_TIME": 758,
"DEP_DELAY": -7,
"DEP_DELAY_NEW": 0,
"DEP_DEL15": 0,
"DEP_DELAY_GROUP": -1,
"DEP_TIME_BLK": "0800-0859",
"TAXI_OUT": 24,
"WHEELS_OFF": 822,
"WHEELS_ON": 958,
"TAXI_IN": 4,
"CRS_ARR_TIME": 1015,
"ARR_TIME": 1002,
"ARR_DELAY": -13,
"ARR_DELAY_NEW": 0,
"ARR_DEL15": 0,
"ARR_DELAY_GROUP": -1,
"ARR_TIME_BLK": "1000-1059",
"CANCELLED": 0,
"CANCELLATION_CODE": "",
"DIVERTED": 0,
"CRS_ELAPSED_TIME": 190,
"ACTUAL_ELAPSED_TIME": 184,
"AIR_TIME": 156,
"FLIGHTS": 1,
"DISTANCE": 980,
"DISTANCE_GROUP": 4,
"CARRIER_DELAY": "",
"WEATHER_DELAY": "",
"NAS_DELAY": "",
"SECURITY_DELAY": "",
"LATE_AIRCRAFT_DELAY": "",
"FIRST_DEP_TIME": "",
"TOTAL_ADD_GTIME": "",
"LONGEST_ADD_GTIME": "",
"": ""
}
• We will build aggregation pipelines and visualize data using JSON Studio (www.jsonstudio.com)
• You will fall madly in love with the Aggregation Framework and its capabilities
5. Some Carrier Stats
{
"$group": {
"_id": {
"CARRIER": "$CARRIER"
},
"_avg_DEP_DELAY": {
"$avg": "$DEP_DELAY"
},
"_avg_ARR_DELAY": {
"$avg": "$ARR_DELAY"
},
"_avg_DISTANCE_GROUP": {
"$avg": "$DISTANCE_GROUP"
},
"_avg_TAXI_IN": {
"$avg": "$TAXI_IN"
},
"_avg_TAXI_OUT": {
"$avg": "$TAXI_OUT"
}
}
}
{
"_id": {
"CARRIER": "9E"
},
"_avg_DEP_DELAY": 8.45451754385965,
"_avg_ARR_DELAY": 3.3237368838726744,
"_avg_DISTANCE_GROUP": 2.2188688815622624,
"_avg_TAXI_IN": 7.082464246424642,
"_avg_TAXI_OUT": 20.558167120639663
}
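To make the semantics of the $group stage above concrete, here is a minimal pure-Python sketch of grouping by carrier and averaging fields. The sample documents and the helper name group_avg are made up for illustration; only the field names come from the slides. It assumes, as MongoDB documents for $avg, that non-numeric values (such as the empty strings left by the CSV import) are ignored.

```python
from collections import defaultdict

# Hypothetical sample documents; field names follow the flight document,
# values are invented for illustration.
flights = [
    {"CARRIER": "DL", "DEP_DELAY": -7, "ARR_DELAY": -13},
    {"CARRIER": "DL", "DEP_DELAY": 3, "ARR_DELAY": 5},
    {"CARRIER": "9E", "DEP_DELAY": 10, "ARR_DELAY": ""},  # empty string from CSV
]


def group_avg(docs, key_field, value_fields):
    """Mimic {"$group": {"_id": {key: "$key"}, "_avg_X": {"$avg": "$X"}}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        group = buckets[doc[key_field]]  # create the group even if no value is numeric
        for field in value_fields:
            value = doc.get(field)
            # Skip non-numeric values; bool is excluded because Python treats it as int.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                group[field].append(value)
    results = []
    for key, fields in buckets.items():
        row = {"_id": {key_field: key}}
        for field in value_fields:
            values = fields.get(field, [])
            # No numeric input at all yields None (null in MongoDB).
            row["_avg_" + field] = sum(values) / len(values) if values else None
        results.append(row)
    return results


for row in group_avg(flights, "CARRIER", ["DEP_DELAY", "ARR_DELAY"]):
    print(row)
```

The output rows have the same shape as the result document shown above: an "_id" sub-document holding the group key, plus one "_avg_" field per averaged column.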
16. Order Does Matter
http://docs.mongodb.org/manual/core/aggregation-pipeline-optimization/
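One way to see why stage order matters is to compare filtering before grouping with filtering after it. The sketch below is a pure-Python illustration, not MongoDB itself: the sample documents and both function names are hypothetical. Both orders return the same counts, but the early filter hands far fewer documents to the grouping step.

```python
from collections import Counter

# Hypothetical sample documents, invented for illustration.
flights = [
    {"CARRIER": "DL", "ORIGIN": "PHL"},
    {"CARRIER": "DL", "ORIGIN": "JFK"},
    {"CARRIER": "AA", "ORIGIN": "JFK"},
    {"CARRIER": "UA", "ORIGIN": "EWR"},
]


def match_then_group(docs, carrier):
    """Filter on CARRIER first, then count per carrier (like $match, then $group)."""
    matched = [d for d in docs if d["CARRIER"] == carrier]
    counts = Counter(d["CARRIER"] for d in matched)
    return dict(counts), len(matched)  # second value: documents that reached the grouping


def group_then_match(docs, carrier):
    """Count per carrier first, then filter the groups (like $group, then $match)."""
    counts = Counter(d["CARRIER"] for d in docs)
    kept = {k: v for k, v in counts.items() if k == carrier}
    return kept, len(docs)  # every document had to be grouped


print(match_then_group(flights, "DL"))
print(group_then_match(flights, "DL"))
```

This equivalence only holds when the filter refers to a field that survives grouping unchanged (here the group key). A filter on a computed value, such as a count threshold, has to come after the stage that computes it.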
17. An example for $unwind
Count how many airports each carrier lands in
{
"_id": {
"$oid": "5383623b7bfb8767e2e9ca1f"
},
"iata": "00M",
"airport": "Thigpen ",
"city": "Bay Springs",
"state": "MS",
"country": "USA",
"lat": 31.95376472,
"long": -89.23450472,
"carriers": [
"AA",
"UA",
"DL",
"BA"
]
}
…
[
{
"_id": {
"$oid": "5383623b7bfb8767e2e9ca1f"
},
"iata": "00M",
"airport": "Thigpen ",
"city": "Bay Springs",
"state": "MS",
"country": "USA",
"lat": 31.95376472,
"long": -89.23450472,
"carriers": "AA"
},
{
"_id": {
"$oid": "542217ffc026b858b47a6640"
},
"iata": "00M",
"airport": "Thigpen ",
"city": "Bay Springs",
"state": "MS",
"country": "USA",
"lat": 31.95376472,
"long": -89.23450472,
"carriers": "UA"
}
…
]
[
{
"_id": {
"carriers": "BA"
},
"count": 10
},
{
"_id": {
"carriers": "DL"
},
"count": 10
}
…
]
Pipeline stages: airports2 → $unwind → $group
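The slide shows the input and output documents but not the stages themselves. A plausible pipeline, written here as Python dictionaries in the form a driver such as pymongo accepts, is given below together with a pure-Python emulation of what the two stages do. The pipeline is an assumption inferred from the output shape, the second airport document is hypothetical, and the emulation assumes the "carriers" field is always an array.

```python
from collections import Counter

# Assumed pipeline, inferred from the result documents on the slide.
pipeline = [
    {"$unwind": "$carriers"},
    {"$group": {"_id": {"carriers": "$carriers"}, "count": {"$sum": 1}}},
]

# Trimmed sample documents; the second one is hypothetical.
airports = [
    {"iata": "00M", "carriers": ["AA", "UA", "DL", "BA"]},
    {"iata": "ZZZ", "carriers": ["DL", "BA"]},
]


def unwind(docs, field):
    """Emit one copy of each document per array element, like $unwind."""
    for doc in docs:
        for item in doc.get(field) or []:  # missing or empty arrays yield nothing
            yield {**doc, field: item}


def count_by(docs, field):
    """Count documents per distinct value of `field`, like $group with $sum: 1."""
    counts = Counter(doc[field] for doc in docs)
    return [{"_id": {field: key}, "count": n} for key, n in counts.items()]


result = count_by(unwind(airports, "carriers"), "carriers")
for row in sorted(result, key=lambda r: r["_id"]["carriers"]):
    print(row)
```

Because each unwound document carries exactly one carrier, counting documents per carrier is the same as counting the airports that list that carrier.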
20. Hub airports – try 3
{ $group: { _id: { ORIGIN: "$ORIGIN", CARRIER: "$CARRIER" }, count: { $sum: 1 } } },
{ $project: { airport: "$_id.ORIGIN", carrier: "$_id.CARRIER", "count": 1 } },
{ $match: { "count": { $gte: "$$hub_threshold" } } },
{ $group: {
_id: { airport: "$airport" },
airlines: { $sum: 1 },
flights: { $sum: "$count" },
avg_airline: { $avg: "$count" },
max_airline: { $max: "$count" } } },
{ $project: {
"airlines": 1,
"flights": 1,
"avg_airline": 1,
"max_airline": 1,
"avg_no_max": { $divide: [ { $subtract: [ "$flights", "$max_airline" ] }, "$airlines" ] } } },
{ $sort: { "flights": -1 } }
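The stages after the first $group can be followed step by step with a small in-memory sketch. The input rows below stand in for the output of the first $group and $project; their values, the threshold of 100, and the function name hub_stats are hypothetical. The arithmetic mirrors the pipeline as written on the slide, including dividing by the full airline count in avg_no_max.

```python
from collections import defaultdict

hub_threshold = 100  # hypothetical value substituted for "$$hub_threshold"

# Hypothetical per-(airport, carrier) flight counts.
pair_counts = [
    {"airport": "ATL", "carrier": "DL", "count": 900},
    {"airport": "ATL", "carrier": "FL", "count": 300},
    {"airport": "ATL", "carrier": "AA", "count": 50},
    {"airport": "PHL", "carrier": "US", "count": 400},
]


def hub_stats(rows, threshold):
    """Emulate the $match, second $group, $project and $sort stages."""
    by_airport = defaultdict(list)
    for row in rows:
        if row["count"] >= threshold:  # $match: keep carriers with enough flights
            by_airport[row["airport"]].append(row["count"])
    stats = []
    for airport, counts in by_airport.items():
        flights = sum(counts)
        stats.append({
            "_id": {"airport": airport},
            "airlines": len(counts),               # $sum: 1
            "flights": flights,                    # $sum: "$count"
            "avg_airline": flights / len(counts),  # $avg: "$count"
            "max_airline": max(counts),            # $max: "$count"
            # Same formula as the slide: (flights - max_airline) / airlines
            "avg_no_max": (flights - max(counts)) / len(counts),
        })
    return sorted(stats, key=lambda s: s["flights"], reverse=True)  # $sort flights desc


for row in hub_stats(pair_counts, hub_threshold):
    print(row)
```

A large gap between avg_airline and avg_no_max indicates that one carrier dominates the airport, which is the signal the hub analysis looks for.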
22. From-to Insensitive
{ $group: { _id: { UNIQUE_CARRIER: "$UNIQUE_CARRIER", ORIGIN: "$ORIGIN",
DEST: "$DEST" }, count: { $sum: 1 } } },
{ $match: { "count": { $gt: "$$count_threshold" } } },
{ $project: { _id_UNIQUE_CARRIER: "$_id.UNIQUE_CARRIER", "count": 1,
rroute: {
$cond: [
{ $lt: [ { $cmp: [ "$_id.ORIGIN", "$_id.DEST" ] }, 0 ] },
{ $concat: [ "$_id.ORIGIN", "$_id.DEST" ] },
{ $concat: [ "$_id.DEST", "$_id.ORIGIN" ] }
] } }
},
{ $group: { _id: { _id_UNIQUE_CARRIER: "$_id_UNIQUE_CARRIER", rroute: "$rroute" },
_sum_count: { $sum: "$count" } } }
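The $cond expression builds a route key that is the same in both directions by always putting the alphabetically smaller airport code first. A minimal Python sketch of that idea follows; the sample rows and the function names route_key and merge_directions are hypothetical, and plain string comparison is assumed to order the airport codes the same way $cmp does for uppercase ASCII codes.

```python
from collections import Counter


def route_key(origin, dest):
    """Direction-insensitive key: smaller code first, as in the $cond above."""
    return origin + dest if origin < dest else dest + origin


# Hypothetical per-direction counts, standing in for the first $group output.
segments = [
    {"UNIQUE_CARRIER": "DL", "ORIGIN": "PHL", "DEST": "MSP", "count": 120},
    {"UNIQUE_CARRIER": "DL", "ORIGIN": "MSP", "DEST": "PHL", "count": 118},
    {"UNIQUE_CARRIER": "AA", "ORIGIN": "JFK", "DEST": "LAX", "count": 300},
]


def merge_directions(rows):
    """Sum counts per (carrier, route key), like the final $group."""
    totals = Counter()
    for row in rows:
        key = (row["UNIQUE_CARRIER"], route_key(row["ORIGIN"], row["DEST"]))
        totals[key] += row["count"]
    return dict(totals)


print(merge_directions(segments))
```

PHL to MSP and MSP to PHL both map to the key "MSPPHL", so their counts are added together.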
24. Using “R” for Advanced Analytics
• Using a MongoDB driver for “R”
• Using the JSON Studio Gateway (including using aggregation output)
install.packages("jSonarR")
library('jSonarR')
con2 <- sonarR::new.SonarConnection('https://localhost:8443', 'localhost', 'flights', port=47017, username="ron",
pwd="<pwd>")
nyc_by_day <- sonarR::sonarAgg(con2, 'delays_by_day', 'NYCFlights',
colClasses=c(X_avg_AirTime='numeric', X_avg_ArrDelay='numeric',X_avg_DepDelay='numeric'))
lm.out = lm(nyc_by_day$X_sum_ArrDelay ~ nyc_by_day$X_sum_AirTime)
26. NYC Flights – Quiz Questions
• Of the three airports, which has the most flights?
– Nyc1
• Which has the most cancellations and the highest cancellation ratio?
– Nyc2
• Taxi in/out times?
– Nyc3
• What about delays?
– Nyc4
• How do delays differ by month?
– Nyc5 + nyc5
– (summer vs. winter / bubble size vs. y-axis)
• What about weather delays only? Which months are worse? Are the three airports equivalent?
– Nyc7 + nyc7
• Where can I fly to if I work for Boeing and am very loyal (and on which aircraft)?
– Nyc8 + map