Automating Federal Aviation Administration’s (FAA) System Wide Information Management (SWIM) Data Ingestion and Analysis

The System Wide Information Management (SWIM) Program is a National Airspace System (NAS)-wide information system that supports Next Generation Air Transportation System (NextGen) goals. SWIM facilitates the data-sharing requirements for NextGen, providing the digital data-sharing backbone of NextGen.

  1. 1. Automating Federal Aviation Administration’s (FAA) System Wide Information Management (SWIM) Data Ingestion and Analysis
      Dr. Mehdi Hashemipour, Data Scientist, Bureau of Transportation Statistics
      Marcelo Zambrana, Cloud Solutions Architect, Microsoft
      Sheila Stewart, Solutions Architect, Databricks
  2. 2. Agenda
      ▪ SWIM Overview: Mehdi Hashemipour, PhD
      ▪ Automating Infrastructure Architecture: Marcelo Zambrana
      ▪ SWIM Data Processing: Sheila Stewart
  3. 3. Objectives and Benefits
      Objective: Use FAA flight data to build a Commercial Flight Database that validates airline data and supports the BTS mandate to measure and report aviation system performance.
      Potential Benefits:
      ▪ Enable timely estimates of enplanements and on-time performance
      ▪ Provide a point of validation for airline-submitted data
      ▪ Expand BTS’s analytical capabilities and breadth of reporting
      ▪ Support special aviation studies
      ▪ Provide a source of data for aviation dashboards and other statistical products
      ▪ Serve as the aviation component of the Transportation Disruption and Disaster System
  4. 4. The System Wide Information Management (SWIM)
      The SWIM service provides a single interface point to multiple data services, including airport, flight, aeronautical, and weather data.
      STDD Stream: Access to data from over 200 airports. Data from over 400 individual systems.
  5. 5. Potential BTS Use Cases for SWIM Data
      Airport/Airline Performance
      • Airport Time Delays
      • Ground Stop History, Status and Impact
      • On-time Performance estimate by causes
      • System Passenger Loading
      • Airline Data Quality Assurance Check
      • OAG Replacement
      Air Cargo Traffic
      • Freight Aircraft Location On Ground
      • Air Cargo Patterns and Seasonality
      • Multi-modal Cargo Movement
      Economic Impact of Delays/Diversions/Cancellations
      • Planned vs. Actual Flight Path Analysis
      • Actual flight path deviations from the “norm”
      • Fuel Cost and Ticket Price Correlation
      • Financial Impact of Delays
      Operational Impact of Delays/Diversions/Cancellations
      • Gate availability
      • Flight pattern interruption
      • Late Arriving flight pattern
      • Morning Flight Delay Impact
      Passenger Impact of Delays/Diversions/Cancellations
      • Re-direct diverted passengers
      • Passenger Impact of Cancellations
  6. 6. BTS Conceptual SWIM Architecture
      [Architecture diagram; recoverable labels, grouping inferred from the slide text:]
      ▪ FAA SWIM Data Service: TFM Flight, TFM Flow, TBFM, FDPS, ITWS, …
      ▪ FAA NW Gateway
      ▪ Bureau of Transportation Statistics (BTS), DOT Virtual Machine (XML Message Handling): BTS SWIM MSG Service, Kafka, Temp Raw XML File Storage
      ▪ DOT Cloud Computing Environment (Data Transformation and Storage): XML MSG Processing, Data Lake
      ▪ Data Analytics: Performance, Air Cargo Traffic, Economic Impact, Weather Impact, Ground Movement, Mapping & Animation, …
      ▪ Data Analyst
  7. 7. Infrastructure Architecture
  8. 8. Initial Goals
      ▪ Automate as much as possible
        ▪ Infrastructure.
        ▪ Server Configuration.
        ▪ Databricks resources.
      ▪ Multiple Ingestion Sources
        ▪ SWIM offers multiple types of sources.
        ▪ On-prem data sources.
      ▪ Security and Scalability
        ▪ Internal traffic only.
        ▪ Multiple environments.
  9. 9. Infrastructure Architecture
  10. 10. Automating the Environment
      Infrastructure
      ▪ Initial Networking
      ▪ Security
      ▪ VMs
      ▪ Storage
      ▪ Databricks Workspace
      Kafka Cluster
      ▪ Software Requirements
      ▪ Solace Connector
      ▪ SWIM Access configuration
      ▪ Kafka Configuration
      Databricks Cluster
      ▪ Cluster Creation
      ▪ Libraries Configuration
      ▪ Notebooks
      ▪ Secrets
  11. 11. Terraform
      ▪ Infrastructure as Code
        ▪ Helps automate infrastructure management.
        ▪ Shows infrastructure changes before they are applied.
        ▪ Lets you build, change, and version infrastructure.
      ▪ Multi-cloud
        ▪ Common language for different providers.
      ▪ Feature rich
        ▪ Module Registry.
        ▪ Providers.
        ▪ Workspaces.
        ▪ Variables.

      # Project Structure
      ├── LICENSE
      ├── README.md
      ├── main.tf
      ├── networking.tf
      ├── outputs.tf
      ├── security.tf
      ├── storage.tf
      ├── variables.tf
      ├── vm.tf
      └── workspace.tf

      # Common Commands
      terraform fmt
      terraform init
      terraform validate
      terraform plan
      terraform apply
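
The common commands on the Terraform slide are usually run in a fixed order. A minimal Python sketch of that sequence, as a CI job might script it, is shown here. The flags (`-check`, `-input=false`, `-out`) and the plan file name `tfplan` are assumptions added for non-interactive use; they are not taken from the slides.

```python
import subprocess
from typing import List


def terraform_steps(plan_file: str = "tfplan") -> List[List[str]]:
    """Return the Terraform commands in the order a pipeline would run them.

    The plan file name and the non-interactive flags are illustrative
    assumptions, not settings from the presentation.
    """
    return [
        ["terraform", "fmt", "-check"],            # fail if files are not formatted
        ["terraform", "init", "-input=false"],     # download providers and modules
        ["terraform", "validate"],                 # check configuration syntax
        ["terraform", "plan", f"-out={plan_file}"],  # save the proposed changes
        ["terraform", "apply", "-input=false", plan_file],  # apply the saved plan
    ]


def run_steps(steps: List[List[str]], workdir: str = ".") -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in steps:
        print("+", " ".join(cmd))
        subprocess.run(cmd, cwd=workdir, check=True)


# Usage (requires the terraform binary on PATH):
# run_steps(terraform_steps())
```

Applying a saved plan file means the changes that were reviewed in the `plan` step are exactly the ones that get applied.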
  12. 12. Configuration Management: Ansible/Chef
      ▪ Consistency
        ▪ No more snowflake servers.
        ▪ Version control of all configurations.
        ▪ Replicated environments.
      ▪ Scalability
        ▪ Add more SWIM source configurations.
        ▪ Easy to deploy new environments.
      ▪ Documentation
        ▪ Building up knowledge.
        ▪ Change history.
  13. 13. Databricks CLI
      ▪ Easy interface to the Databricks platform
        ▪ Open source.
        ▪ Built on top of the Databricks REST API.
        ▪ Lets you interact with workspace, clusters, fs, groups, jobs, runs, libraries, and secrets.
        ▪ Supports multiple profiles.
      ▪ Experimental
        ▪ Still under active development.

      # Create Databricks Cluster
      databricks clusters create --json-file config/cluster.json

      # Import Libraries
      databricks libraries install --cluster-id CLUSTER_ID --maven-coordinates com.databricks:spark-xml_2.11:0.9.0

      # Import Notebooks
      databricks workspace import -l PYTHON -f DBC Notebooks/TFMS.dbc /Users/USER/tfms

      # Secret Management
      ## Create secret scope (the scope name must match the one used when storing secrets)
      databricks secrets create-scope --scope bts-swim --initial-manage-principal users
      ## Create new secret: from an inline value, from a file, or interactively in an editor
      databricks secrets put --scope bts-swim --key bts-swim-sp --string-value my-value
      databricks secrets put --scope bts-swim --key bts-swim-sp --binary-file config/SP.txt
      databricks secrets put --scope bts-swim --key bts-swim-sp
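
The `databricks clusters create --json-file config/cluster.json` command expects a cluster specification in JSON. The slides do not show that file, so the sketch below generates a plausible one in Python. Every value here (cluster name, runtime version, node type, worker count, auto-termination) is an assumption for illustration; only the field names follow the Databricks Clusters API.

```python
import json
from typing import Any, Dict


def build_cluster_spec(name: str, num_workers: int = 2) -> Dict[str, Any]:
    """Build a cluster spec dict; all values are illustrative assumptions."""
    return {
        "cluster_name": name,
        # Assumed: a Scala 2.11 runtime, to match the spark-xml_2.11 artifact
        "spark_version": "6.4.x-scala2.11",
        # Assumed Azure VM size; choose one available in your workspace
        "node_type_id": "Standard_DS3_v2",
        "num_workers": num_workers,
        # Assumed idle timeout so an unused cluster shuts itself down
        "autotermination_minutes": 30,
    }


spec = build_cluster_spec("swim-tfms")
cluster_json = json.dumps(spec, indent=2)
print(cluster_json)  # save this output as config/cluster.json
```

Generating the file from code keeps the cluster definition under version control alongside the Terraform and configuration-management files.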
  14. 14. GitHub – GitHub Actions
      ▪ Automate from code to cloud
      ▪ Workflow automation
      ▪ Any OS, any language, and any cloud.
  15. 15. CI/CD Architecture
  16. 16. Lessons Learned
      ▪ Infrastructure and Configuration as Code
        ▪ Initial setup takes time.
        ▪ Test, fail, and improve faster.
        ▪ Learning curve.
      ▪ Version Control
        ▪ Easy to review changes.
        ▪ Helps onboard new developers.
      ▪ Security
        ▪ Internal network only.
        ▪ Limited access.
  17. 17. https://github.com/Chambras/SparkSummit2020
  18. 18. SWIM Data Processing
  19. 19. Future State Architecture: SWIM Data Lake Architecture with Streaming SWIM into Databricks
      [Architecture diagram; recoverable labels, grouping inferred from the slide text:]
      ▪ Ingress Data: Streaming SWIM-TFMS, Streaming SWIM-Other topics, Batch Data Stores (Oracle, Sybase)
      ▪ ETL, Stream, and Store Data (Spark ETL on Databricks Runtime, Azure Data Lake Storage)
        ▪ Bronze: raw XML data, staging batch data
        ▪ Silver: parsed XML data, schema validation with spark-xml
        ▪ Gold: joined and aggregated data
        ▪ Summary/Platinum (optional): potential further aggregations
      ▪ Build JIT Data Warehouse: Enrichment, Operations, SWIM DataLake
      ▪ Analytics and BI: Predictive Analysis and Advanced Analytics, Adhoc & Graph Analysis, Tableau, Dashboards and Apps
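
The step from raw XML (Bronze) to parsed XML (Silver) can be illustrated without Spark. The sketch below uses Python's standard library to flatten one message into a row. The message layout (`flightMessage`, `flight`, `departure`, `arrival`) is a hypothetical stand-in, not the real SWIM/TFMS schema, and the presenters' pipeline uses spark-xml rather than this code.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical message layout for illustration only; real SWIM messages differ.
SAMPLE_MESSAGE = """
<flightMessage>
  <flight id="ABC123">
    <departure airport="KJFK" time="2020-06-01T12:00:00Z"/>
    <arrival airport="KLAX" time="2020-06-01T18:05:00Z"/>
  </flight>
</flightMessage>
"""


def parse_flight(xml_text: str) -> Optional[Dict[str, Optional[str]]]:
    """Flatten one raw XML message into a row, or return None if unusable."""
    try:
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError:
        return None  # malformed XML stays in the raw layer only
    flight = root.find("flight")
    if flight is None:
        return None
    dep = flight.find("departure")
    arr = flight.find("arrival")
    if dep is None or arr is None:
        return None
    return {
        "flight_id": flight.get("id"),
        "origin": dep.get("airport"),
        "departure_time": dep.get("time"),
        "destination": arr.get("airport"),
        "arrival_time": arr.get("time"),
    }


row = parse_flight(SAMPLE_MESSAGE)
print(row)
```

Returning `None` for malformed or incomplete messages keeps bad records out of the parsed layer while the raw copy remains available for reprocessing.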
  20. 20. DEMONSTRATION
  21. 21. Lessons Learned
      ▪ spark-xml is improving
        ▪ Need to investigate new features for handling complex nested XML schemas
      ▪ XML Schema Validation
        ▪ Copying the schema to executors mitigates file I/O latencies by making use of memory for fast validation
      ▪ XML Schema Inference
        ▪ Batch processing of XML data at hourly or daily periodicity, based on SLAs, allows for more accurate inference
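
The schema-validation lesson comes down to loading the schema once and keeping it in memory, rather than rereading it for every message. The sketch below shows that idea in plain Python. It is not XSD validation and not the spark-xml mechanism: a hypothetical set of required element names stands in for the schema, and `functools.lru_cache` stands in for keeping a local copy on each executor.

```python
import xml.etree.ElementTree as ET
from functools import lru_cache
from typing import FrozenSet

LOAD_COUNT = 0  # counts how often the "schema" is actually loaded


@lru_cache(maxsize=None)
def load_required_elements(schema_name: str) -> FrozenSet[str]:
    """Load a schema once; later calls are served from memory.

    A real pipeline would read and parse a schema file here. The
    hard-coded element names are hypothetical.
    """
    global LOAD_COUNT
    LOAD_COUNT += 1
    schemas = {"flight": frozenset({"departure", "arrival"})}
    return schemas[schema_name]


def is_valid(xml_text: str, schema_name: str = "flight") -> bool:
    """Check that a message parses and has every required child element."""
    required = load_required_elements(schema_name)
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    present = {child.tag for child in root}
    return required <= present


messages = [
    "<flight><departure/><arrival/></flight>",  # complete
    "<flight><departure/></flight>",            # missing <arrival>
    "<flight><departure>",                      # malformed XML
]
results = [is_valid(m) for m in messages]
print(results, LOAD_COUNT)
```

Three messages are validated but the schema is loaded only once, which is the saving the slide attributes to keeping the schema in executor memory.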
  22. 22. Next Steps
      ▪ Validate SWIM data against data provided by airlines
      ▪ Deeper dive into predictive modeling to gather insights on flight delays and passengers affected
      ▪ Open up data pipeline to more SWIM data feeds
  23. 23. Contact Us
      marcelo.zambrana@microsoft.com, @ch4mbr4s
      sheila.stewart@databricks.com
      m.hashemipour@dot.gov
  24. 24. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
