Data Con LA 2020
Description
Amid the COVID-19 pandemic, we set out to explore and visualize its impact on global airline traffic, including flight frequencies, volumes, and schedules. We are also interested in bringing non-flight datasets into this analysis, from organizations such as the Johns Hopkins University Center for Systems Science and Engineering. OnPrem Solution Partners has built a fully operationalized data ingestion, transformation, and analysis platform using modern tools such as Snowflake, dbt, and StreamSets Data Collector to capture and analyze flight data. Our primary data source is minute-by-minute state data from flights around the world. We have been collecting and storing this state data since November 2018 and have now accumulated billions of records for analysis. In addition, we load auxiliary data into Snowflake to form a comprehensive analysis and reporting platform. We have explored the following use cases:
* How does the overall number of flights change as the virus spreads?
* How does the number of flights change for a specific region as the number of positive tests/deaths changes?
* How does the number of flights to/from the countries with the highest number of cases (hot zones) change?
Speaker
Yasha Mouradi, OnPrem Solution Partners, Data & Analytics Manager
6. onprem.com
Other positive developments presented in the outlook include increased connectivity, with airlines connecting more cities than ever and at a lower cost. IATA forecasts that the number of unique city pairs served will rise above 23,000 for the first time in 2020.
Emerging markets are experiencing the strongest growth in air travel, with China leading the way at 8.5% year-on-year from October 2018 to October 2019.
Technical
○ Scalability of data ingest and processing from 10s to 100s of billions of rows
○ Maintaining known good state of system - version control or otherwise
○ Automated testing of data
○ Economical processing/run costs
○ Possibility of near-real-time ingest if needed in the future
○ Avoid development of platforms and tooling where possible
○ Modular in stack where possible
Staffing/Organization
○ Focus on SQL - easier to staff and easier to train
○ User expectation of rapid changes/updates - days/weeks not months/years
- scalable, flexible technologies
- minimal infrastructure upkeep
- abstracted storage and compute
- ingest data as raw as possible
- only perform required data type changes
- modern software development best-practices
- version controlled data processing scripts
- data pipelines testable and integrated into CI/CD frameworks
- data outputs monitored for data quality and errors
- tech stack usable by a variety of consumers
- data models are easy to join and query
- collapse the division between "thinkers/planners" and "doers"
● Filter/cleanse/deduplicate raw OpenSky data (~50% reduction)
● Group sets of flight points down into a set of flight legs
● For each flight leg, locate the start/end airports (1 degree box)
● Group sets of flight legs into flights (flights with layovers)
[Diagram: airports A, B, and C, a 1x1 box, and a legend: Airport, Flight Point, Flight Leg, Flight]
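The grouping steps above can be illustrated with a small Python sketch. This is not the pipeline described in the talk, which is built in SQL on Snowflake and dbt. It is one naive heuristic under stated assumptions: a new leg starts after a time gap longer than a hypothetical `MAX_GAP_SECONDS`, the "1 degree box" is taken to be a 1x1 degree square centered on the leg's first or last point, and legs are chained into one flight when the callsign matches and one leg ends at the airport where the next begins. All function names, the `Point` type, the callsign, and the sample track are illustrative; the airport coordinates are approximate.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Point:
    icao24: str    # transponder hex code
    callsign: str
    ts: int        # Unix seconds
    lat: float
    lon: float


MAX_GAP_SECONDS = 2 * 60 * 60  # hypothetical threshold, not from the talk
Airports = Dict[str, Tuple[float, float]]  # code -> (lat, lon)


def split_into_legs(points: List[Point], max_gap: int = MAX_GAP_SECONDS) -> List[List[Point]]:
    """Split one aircraft's points into legs wherever the time gap exceeds max_gap."""
    legs: List[List[Point]] = []
    for p in sorted(points, key=lambda p: p.ts):
        if legs and p.ts - legs[-1][-1].ts <= max_gap:
            legs[-1].append(p)
        else:
            legs.append([p])
    return legs


def nearest_airport_in_box(lat: float, lon: float, airports: Airports,
                           half_width: float = 0.5) -> Optional[str]:
    """Closest airport inside a square box centered on (lat, lon), else None.

    Ignores longitude wrap-around at the antimeridian.
    """
    best, best_d = None, None
    for code, (alat, alon) in airports.items():
        dlat, dlon = abs(alat - lat), abs(alon - lon)
        if dlat <= half_width and dlon <= half_width:
            d = dlat * dlat + dlon * dlon
            if best_d is None or d < best_d:
                best, best_d = code, d
    return best


def leg_endpoints(leg: List[Point], airports: Airports) -> Tuple[Optional[str], Optional[str]]:
    """Look up airports near the first and last point of a leg."""
    first, last = leg[0], leg[-1]
    return (nearest_airport_in_box(first.lat, first.lon, airports),
            nearest_airport_in_box(last.lat, last.lon, airports))


def group_legs_into_flights(legs: List[List[Point]], airports: Airports) -> List[List[List[Point]]]:
    """Chain consecutive legs that share a callsign and connect at the same airport."""
    flights: List[List[List[Point]]] = []
    for leg in legs:
        start, _ = leg_endpoints(leg, airports)
        if flights:
            prev = flights[-1][-1]
            _, prev_end = leg_endpoints(prev, airports)
            if prev[0].callsign == leg[0].callsign and prev_end is not None and prev_end == start:
                flights[-1].append(leg)
                continue
        flights.append([leg])
    return flights


# Illustrative data only: approximate airport coordinates and a made-up track.
AIRPORTS: Airports = {"LAX": (33.94, -118.41), "DEN": (39.86, -104.67), "JFK": (40.64, -73.78)}
SAMPLE_POINTS = [
    Point("abc123", "TST123", 0, 33.95, -118.40),
    Point("abc123", "TST123", 3600, 37.00, -111.00),
    Point("abc123", "TST123", 7200, 39.85, -104.70),
    Point("abc123", "TST123", 18000, 39.87, -104.65),  # 3-hour gap: new leg
    Point("abc123", "TST123", 21600, 40.50, -88.00),
    Point("abc123", "TST123", 25200, 40.65, -73.80),
]

legs = split_into_legs(SAMPLE_POINTS)
print([leg_endpoints(leg, AIRPORTS) for leg in legs])  # [('LAX', 'DEN'), ('DEN', 'JFK')]
print(len(group_legs_into_flights(legs, AIRPORTS)))    # 1
```

A fixed time-gap threshold is the weakest part of this sketch: coverage gaps, for example over open ocean, can split a single leg in two, so a real implementation would need additional signals.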
dbt is an open source tool for authoring, testing, and orchestrating data engineering pipelines.
Key strengths:
- Data transformations are written using modular, performant, maintainable SQL
- Natively integrates with version control
- Integrated unit testing framework
- Uses the Jinja templating language to support sophisticated macros
- Continuous Integration / Continuous Deployment (CI/CD)
- Metadata catalog / automated documentation
- Native logging / alerting capabilities
- dbt Core - open source / no licensing costs
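To make the testing point concrete, here is a small Python sketch of what column-level data tests check. dbt itself declares tests such as `unique` and `not_null` in project configuration and compiles them to SQL queries that return failing records; a test passes when nothing comes back. The Python below only mirrors that idea. The function names, column names, and sample rows are hypothetical and are not part of dbt.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def not_null_failures(rows: List[Row], column: str) -> List[Row]:
    """Rows where the column is missing or None."""
    return [r for r in rows if r.get(column) is None]


def unique_failures(rows: List[Row], column: str) -> List[Any]:
    """Non-null values that appear more than once in the column."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return [value for value, n in counts.items() if n > 1]


# Hypothetical sample of a flight-leg table.
ROWS: List[Row] = [
    {"leg_id": 1, "origin_airport": "LAX"},
    {"leg_id": 2, "origin_airport": None},
    {"leg_id": 2, "origin_airport": "DEN"},
]

# A check "passes" when it returns no failures.
print(not_null_failures(ROWS, "origin_airport"))  # [{'leg_id': 2, 'origin_airport': None}]
print(unique_failures(ROWS, "leg_id"))            # [2]
```

Returning the failing records, rather than a bare pass/fail flag, makes it easier to see what went wrong when a check breaks in CI.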
● General
○ Any information about the airline beyond the 3-character ICAO airline code
○ Any information about airports (location, size, status, etc.)
○ Any information about individual airplanes beyond a six-digit Hex code identifying the transponder (which has data quality issues too)
○ "Clean" data: we get negative velocities, negative altitudes, garbled callsigns, and missing timestamps. Every single field can have a problem
● Points vs Legs/Flights
○ Any simple notion of how to group points into a "leg"
○ Any points over the open ocean (transponder network is land-based, not satellite)
■ This makes it harder to group points into a "leg": we can't assume that points for a single leg will be close to each other in time
○ Points are much more reliable and consistent in some areas of the world than others. Continental US very good. Mexico and China, for example, are very sparse.
○ Sometimes points over land in US are missing for certain areas too (military/FCC/other reasons? Area 51?)
● Legs/Flights
○ Any notion of how multiple legs for a multi-leg flight with a single callsign relate to each other
○ Any simple way to separate commercial, cargo, private, charter, and military flights from each other
○ Takeoff or landing airport for a given flight (even after we can group points together to a leg)
○ Any notion of a flight schedule for the future: having a list of legs/flights at particular times in the past is not simple to extrapolate into the future
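The data quality problems listed under "Clean" data (negative velocities, negative altitudes, garbled callsigns, missing timestamps) suggest a simple rule-based filter followed by deduplication. The Python sketch below is an illustration, not the cleansing logic used in the talk. It assumes each point is a dictionary using OpenSky state vector field names (`icao24`, `callsign`, `time_position`, `velocity`, `baro_altitude`); the callsign pattern, the choice to reject every negative altitude, and the deduplication key are hypothetical choices.

```python
import re
from typing import Any, Dict, List

StatePoint = Dict[str, Any]

# Hypothetical rule: 2 to 8 uppercase letters or digits after trimming padding.
CALLSIGN_PATTERN = re.compile(r"[A-Z0-9]{2,8}")


def point_problems(point: StatePoint) -> List[str]:
    """List the data quality problems found in one state point."""
    problems: List[str] = []
    if point.get("time_position") is None:
        problems.append("missing timestamp")
    velocity = point.get("velocity")
    if velocity is not None and velocity < 0:
        problems.append("negative velocity")
    altitude = point.get("baro_altitude")
    if altitude is not None and altitude < 0:
        problems.append("negative altitude")
    callsign = (point.get("callsign") or "").strip()
    if not CALLSIGN_PATTERN.fullmatch(callsign):
        problems.append("garbled callsign")
    return problems


def cleanse(points: List[StatePoint]) -> List[StatePoint]:
    """Drop points with any problem, then keep the first point per (icao24, time_position)."""
    seen = set()
    kept: List[StatePoint] = []
    for point in points:
        if point_problems(point):
            continue
        key = (point.get("icao24"), point["time_position"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(point)
    return kept


# Made-up sample points for illustration.
SAMPLE: List[StatePoint] = [
    {"icao24": "abc123", "callsign": "TST123  ", "time_position": 100,
     "velocity": 230.0, "baro_altitude": 10000.0},
    {"icao24": "abc123", "callsign": "TST123  ", "time_position": 100,
     "velocity": 230.0, "baro_altitude": 10000.0},   # exact duplicate
    {"icao24": "abc123", "callsign": "TST123", "time_position": 160,
     "velocity": -5.0, "baro_altitude": 10000.0},    # negative velocity
    {"icao24": "def456", "callsign": "@@##", "time_position": None,
     "velocity": 200.0, "baro_altitude": -300.0},    # several problems
]

print(len(cleanse(SAMPLE)))       # 1
print(point_problems(SAMPLE[3]))  # ['missing timestamp', 'negative altitude', 'garbled callsign']
```

Rejecting every negative altitude is a simplification, since a few airports sit below sea level; a production rule would likely use a lower bound rather than zero.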