US Ozone and PM2.5 Maps and Trends
Marc Borowczak
2015-07-04
Contents
Summary
System and Platform Documentation
Reproducible Data Gathering and Transformation
Functional Core
  Screener, Liner and Normalize Helper Functions
  Screener ‘yhat’ Module
Analysis
  Averaging Pollutants
  Histograms
  ggplot Maps
  Retrieving county Data for 50 US States
  Checking Out AK and HI
  Radar Charts to the Rescue
  Choroplethr Approach with Animation
Conclusions
References
Summary
From the annual US AirData reports available for 2008 through 2015 on the US EPA download site and data repositories, two pollutants of interest, Ozone and particulate matter up to 2.5 micrometers (PM2.5), were selected for comparative and trend analysis. The tool was derived from the basic example covered by Dr. Roger D. Peng in the Coursera ‘Developing Data Products’ course, Week 4 yhat module. The core data filtering is contained in the screener function, where the minimum data is retrieved using an adaptive radius: the screening algorithm averages the nearest 5 reporting sites instead of averaging everything within a fixed, user-supplied radius. The module also retrieves all data available to date, aggregated on a yearly basis. A yhat module function was also developed to verify functionality, and could be used in a multi-user distributed environment. Multiple methods are applied, including averaging a single value per county and mapping the average of the 5 nearest neighbors. Pollutant levels (Ozone and PM2.5) and reporting radii are analyzed, indicating an absence of data from HI and very little reporting in AK. The use of radar charts is also briefly demonstrated. The evolution of levels and radii shows clear trends over the period 2008 through 2015, as well as overall progress in monitoring density. A number of interesting differences, distributions and radar charts clearly illustrate the evolution observed in the US. Finally, the choroplethr animation summarizes the trends well in a dynamic web player and extends the mapping to all 50 states.
System and Platform Documentation
Before any analysis is performed, let's start with system and platform documentation in a fresh directory to ensure reproducibility.
Sys.info()[1:5] # system info but exclude login and user info
## sysname release version nodename machine
## "Windows" "7 x64" "build 9200" "LEARNING-PC" "x86-64"
userdir<-getwd() # user-defined startup directory
library(plyr)
library(dplyr)
library(tidyr)
library(data.table) # for liner
library(pbapply) # for progressbar
library(magrittr)
library(ggplot2)
library(maps)
library(mapproj)
library(RColorBrewer) # for brewer.pal(...)
library(fields) # for rdist() in screener
library(choroplethr)
library(choroplethrMaps)
library(animation)
sessionInfo() # to document platform
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] animation_2.3 yhatr_0.13.6 choroplethrMaps_1.0
## [4] choroplethr_3.1.0 fields_8.2-1 spam_1.0-1
## [7] RColorBrewer_1.1-2 mapproj_1.2-2 maps_2.3-9
## [10] ggplot2_1.0.1 magrittr_1.5 pbapply_1.1-1
## [13] data.table_1.9.4 tidyr_0.2.0 dplyr_0.4.2
## [16] plyr_1.8.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.11.6 formatR_1.2 tools_3.2.1
## [4] rpart_4.1-9 digest_0.6.8 evaluate_0.7
## [7] gtable_0.1.2 lattice_0.20-31 WDI_2.4
## [10] DBI_0.3.1 yaml_2.1.13 parallel_3.2.1
## [13] proto_0.3-10 gridExtra_0.9.1 cluster_2.0.1
## [16] stringr_1.0.0 knitr_1.10.5 nnet_7.3-9
## [19] R6_2.0.1 XML_3.98-1.3 survival_2.38-1
## [22] foreign_0.8-63 rmarkdown_0.7 latticeExtra_0.6-26
## [25] Formula_1.2-1 reshape2_1.4.1 acs_1.2
## [28] scales_0.2.5 Hmisc_3.16-0 htmltools_0.2.6
## [31] MASS_7.3-40 splines_3.2.1 assertthat_0.1
## [34] colorspace_1.2-6 stringi_0.5-5 acepack_1.3-3.3
## [37] munsell_0.4.2 chron_2.3-47
datadir <- "./data"
figdir <- "./Pollution_files/figure-html"
anidir <- "./Animations"
if (!file.exists("data")) { dir.create("data") } # data will reside in subdir ./data
if (!file.exists("figures")) { dir.create("figures") } # figures in subdir ./figures
if (!file.exists("Animations")) { dir.create("Animations") } # for animated gif
Reproducible Data Gathering and Transformation
All data is obtained reproducibly from the documented US AirData source, aggregating the two pollutants Ozone and PM2.5 - Local Conditions. Monitoring sites and pollutant averages populate the two corresponding data frames, as introduced in the Coursera module.
y_monitors<-NULL
y_pollavg<-NULL
for (i in 2008:2015)
{
url<-paste0("http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/annual_all_",
as.character(i),".zip",sep='')
filename <- paste(datadir,strsplit(url,"/")[[1]][length(strsplit(url,"/")[[1]])],
sep="/")
download.file(url, dest=filename, mode="wb")
unzip (filename, exdir = datadir) # unzip creates and populates the data structure
unlink(filename) # remove the zip file
filename <-gsub(".zip",".csv",filename)
d <- read.csv(filename,stringsAsFactors=FALSE) # populate data frame
sub<-subset(d, Parameter.Name %in% c("PM2.5 - Local Conditions","Ozone")
& Pollutant.Standard %in% c("Ozone 8-Hour 2008","PM25 Annual 2006"),
c(Year,Longitude,Latitude,Parameter.Name, Arithmetic.Mean))
y_poll<-aggregate(sub[,"Arithmetic.Mean"],
sub[,c("Year","Longitude","Latitude","Parameter.Name")],
mean,na.rm=TRUE)
names(y_poll)[5]<-"level"
y_poll$Parameter.Name[y_poll$Parameter.Name=="PM2.5 - Local Conditions"]<-"PM2p5"
y_poll<-transform(y_poll,Parameter.Name=factor(Parameter.Name))
if(is.null(nrow(y_monitors))) {
y_pollavg<-y_poll
y_monitors<-data.matrix(y_pollavg[,c("Year","Longitude","Latitude")])
}
else {
y_pollavg<-rbind(y_pollavg,y_poll)
y_monitors<-rbind(y_monitors,
data.matrix(y_pollavg[,c("Year","Longitude","Latitude")]))
}
}
names(y_pollavg)[1]<-"yr"
rm(d,sub,datadir,filename,url,i,y_poll)
y_monitors<-as.data.frame(y_monitors)
y_monitors$Year<-as.integer(y_monitors$Year)
y<-spread(y_pollavg,Parameter.Name,level)
# eliminate missing values records
y<-y[!is.na(y$Ozone),]
y<-y[!is.na(y$PM2p5),]
y_pollavg<-gather(y,"Parameter.Name","level",4:5)
rm(y)
Functional Core
Screener, Liner and Normalize Helper Functions
The core data filtering processes a positional data frame containing longitude and latitude. It returns the same information in 8 records, one for each year in the 2008:2015 range:
• yr : the year, 2008:2015
• radius : the radius (in miles) enclosing the 5 closest sites reporting both pollutants
• Ozone : the mean level, in ppm, and
• PM2p5 : the mean level, in ppm
screener <- function(df)
{
#
# input is a data.frame df
# lon lat
# 1 -86.70649 32.4302
# ...
#
# output the complete data.frame df
# lon lat yr radius Ozone PM2p5
# 1 -86.70649 32.4302 2008 77.67187 0.04523967 12.698744
# 2 -86.70649 32.4302 2009 66.45152 0.03956100 10.457460
# 3 -86.70649 32.4302 2010 66.45152 0.04697900 11.594312
# 4 -86.70649 32.4302 2011 66.45152 0.04703300 11.413982
# 5 -86.70649 32.4302 2012 77.67187 0.04431500 10.488044
# 6 -86.70649 32.4302 2013 77.67187 0.03956567 9.688020
# 7 -86.70649 32.4302 2014 77.67187 0.04164267 9.960654
# 8 -86.70649 32.4302 2015 77.67187 0.03565967 9.902436
# ...
n<-nrow(df)
df.in<-df
df.out<-data.frame()
# pb <- txtProgressBar(min = 0, max = n, style = 3,char="+")
for (k in 1:n) {
df<-data.frame(df.in[k,])
dlevel1<-data.frame()
for (j in 2008:2015) {
use<-vector()
d<-rdist.earth(subset(y_pollavg,yr==j,select=-yr),data.matrix(df[,c("lon","lat")]))
use<-lapply(seq_len(ncol(d)),function(i) {
head(data.frame(sort(d[,i],decreasing=FALSE,index.return=TRUE)),5)$ix })
r<-max(sapply(seq_len(ncol(d)),function(i) {
head(data.frame(sort(d[,i],decreasing=FALSE,index.return=TRUE)),5)$x }))
l.Ozone<-sapply (use,function(idx) {
with(subset(y_pollavg,yr==j & Parameter.Name=="Ozone",select=-yr)[idx,],
tapply(level, Parameter.Name=="Ozone", mean))})
l.PM2p5<-sapply (use,function(idx) {
with(subset(y_pollavg,yr==j & Parameter.Name=="PM2p5",select=-yr)[idx,],
tapply(level, Parameter.Name=="PM2p5", mean))})
dlevel1 <- rbind(dlevel1,c(j,r,l.Ozone,l.PM2p5))
# cleanup time
rm(d,r,l.Ozone,l.PM2p5,use)
}
names(dlevel1)<-c("yr","radius","Ozone","PM2p5")
df<-data.frame(df,dlevel1,row.names = NULL)
df.out<-rbind(df.out,df)
# setTxtProgressBar(pb,k)
}
# close(pb)
df<-df.out
rm(df.in,df.out,dlevel1,k,j)
df
}
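Called directly on a one-row data frame holding the example coordinates from the header comment, screener() returns the eight-year table shown there (the same table is returned later through the deployed yhat module):
screener(data.frame(lon = -86.70649, lat = 32.4302))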
liner() is a helper wrapper which may be used on systems where available memory is low. It invokes screener() row by row (listwise) and offers the advantage of a text progress bar. The returned list is bound quickly using the data.table function rbindlist.
liner <- function(v)
{
p <- pblapply(seq_len(nrow(v)), function(i)
{
screener(v[i, ])
})
v <- data.frame(rbindlist(p))
v
}
normalize() is a convenience rescaling function capable of addressing both vectors and data frames and is used for radar charting.
normalize <- function(df)
{
if (is.vector(df) == TRUE)
{
m <- max(df)
n <- min(df)
df <- (df - n)/(m - n)
} else
{
df <- data.frame(sapply(seq_len(ncol(df)), function(i)
{
m <- max(df[, i])
n <- min(df[, i])
df[, i] <- (df[, i] - n)/(m - n)
df[, i]
}))
}
df
}
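As a quick check, applying normalize() to a plain numeric vector (values chosen purely for illustration) rescales it onto the 0..1 range:
normalize(c(2, 4, 6)) # returns 0.0 0.5 1.0 (min-max rescaling)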
Screener ‘yhat’ Module
library(yhatr)
model.require <- function() {
library(fields)
library(pbapply)
}
model.transform <- function(df) {
df
}
model.predict <- function(df) {
screener(df)
}
The yhat.config call follows but is not rendered, to keep the apikey private, and the function is not redeployed; a placeholder sketch of the configuration is shown below. Only invoke yhat.deploy when changing the model, so we will not re-deploy now.
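For reference only, a minimal sketch of what such a configuration typically looks like with the yhatr package, using placeholder credentials rather than the author's actual values:
yhat.config <- c(
  username = "user@example.com",        # placeholder account name
  apikey   = "YOUR_API_KEY",            # placeholder; the real key is kept private
  env      = "http://cloud.yhathq.com/" # placeholder yhat server URL
)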
yhat.deploy("screener")
However, since the function has been deployed previously, we can call yhat.predict. It returns the expected output illustrated in the screener function.
data <-c("lon"=-86.70649,"lat"= 32.4302)
yhat.predict("screener",data)
## lon lat yr radius Ozone PM2p5
## 1 -86.70649 32.4302 2008 77.67187 0.04523967 12.698744
## 2 -86.70649 32.4302 2009 66.45152 0.03956100 10.457460
## 3 -86.70649 32.4302 2010 66.45152 0.04697900 11.594312
## 4 -86.70649 32.4302 2011 66.45152 0.04703300 11.413982
## 5 -86.70649 32.4302 2012 77.67187 0.04431500 10.488044
## 6 -86.70649 32.4302 2013 77.67187 0.03956567 9.688020
## 7 -86.70649 32.4302 2014 77.67187 0.04164267 9.960654
## 8 -86.70649 32.4302 2015 77.67187 0.03565967 9.902436
Analysis
Averaging Pollutants
As a first step in the analysis, the average values by county are computed.
pollutant<-data.frame()
set.seed(1) # for reproducible example
# this creates an example formatted as a pollutant.map
df<-data.frame(map_data('county'))
names(df)[1]<-'lon'
df[,c("lon","lat")] %>%
screener %>%
merge(df,.,by=c("lon","lat"),all=TRUE) %>%
group_by(region,subregion,yr) %>%
select(-group,-order) %>%
summarise_each(funs(mean)) %>%
as.data.frame %>%
gather("type","level",7:8) -> pollutavg
names(pollutavg)[1:2]<-c("id","region")
#
df %>%
group_by(region,subregion) %>%
select(lon,lat) %>%
summarise_each (funs(mean)) %>%
as.data.frame-> df
df[,c("lon","lat")] %>%
screener %>%
merge(df,.,by=c("lon","lat"),all=TRUE) %>%
as.data.frame -> pollutant
pollutant$yr<-as.factor(pollutant$yr)
Histograms
It is important to observe pollutant levels, but it is also essential to realize at what distance the monitoring occurs. Since the screener routine adapts itself to the mean of the 5 nearest neighbors, we need to see how these radii are distributed.
m1<-ggplot(pollutant,aes(x=radius)) +
geom_histogram(binwidth=20,color="grey50",fill='blue') +
labs(title="2008 ~ 2015 Radius (Miles) reporting Pollutants",
x="Radius (Miles)",y="Count")+
theme_bw()
m1
[Figure 1: 2008 ~ 2015 Radius (Miles) reporting Pollutants (histogram of counts by reporting radius).]
m2<-ggplot(pollutant,aes(x=radius,fill=yr)) +
geom_histogram(binwidth=20,color="grey50") +
facet_wrap(~yr) +
labs(title="Radius (Miles) reporting Pollutants by Year",
x="Radius (Miles)",y="Count") +
theme_bw()
m2
ggplot Maps
With a little re-arranging we can chart the level and radius trends for Ozone and PM2.5, first using ggplot(). We tidy the data and merge the county information from map_data("county"). Since reporting for 2015 is not complete, we base comparisons and trends on data up to 2014 at this time.
# time to tidy the data
pollutant %>% gather("type","level",7:8) -> pollutant
names(pollutant)[3:4]<-c("id","region")
#
m.usa <- map_data("county")
m.usa <- m.usa[ ,-5]
names(m.usa)[5] <- 'region'
[Figure 2: Radius (Miles) reporting Pollutants by Year (faceted histograms, 2008-2015).]
We are now ready to chart these county ggplots. We slightly adapt the scales as indicated to produce meaningful ranges for Ozone and PM2.5 levels (in ppm) and radii (in miles).
g1 <- ggplot(subset(pollutant,type=="Ozone"), aes(map_id = region)) +
geom_map(aes(fill = level), map = m.usa) +
expand_limits(x = m.usa$long, y = m.usa$lat) +
scale_fill_gradientn("ppm",colours=brewer.pal(9,"YlGnBu"))
g1 <- last_plot() + coord_map() + facet_wrap(~ yr,nrow=3) +
labs(title="Ozone Pollutant level by County, ppm",x="",y="")+ theme_bw()
g1$scales$scales[[1]]$limits<-c(0,0.07)
g1
## radius.change
## 2509 57.47642
## 2513 57.79976
## 466 58.94323
## 462 59.94095
## 261 77.46404
## 467 84.37639
## 246 96.91446
## 305 98.00507
## 284 142.38816
## 271 181.87518
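The chunk that produced the radius.change table above falls on the pages occupied by the county maps, so only its tail output survives here. As a hedged sketch, one way such a per-county change in reporting radius could be computed from the pollutant data frame (column names assumed from the preceding code; a hypothetical reconstruction, not the author's original chunk) is:
library(dplyr)
library(tidyr)
# Hypothetical reconstruction: change in reporting radius per county between 2008 and 2014,
# with the ten largest increases shown last.
pollutant %>%
  select(id, region, yr, radius) %>%
  distinct() %>%
  spread(yr, radius) %>%
  mutate(radius.change = `2014` - `2008`) %>%
  arrange(radius.change) %>%
  tail(10)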
Retrieving county Data for 50 US States
Finally, we attempt another approach with choroplethr. Although there is little data for AK and HI, we want to take up the challenge of visualizing not only the lower 48 but also the 49th and 50th US states. This requires downloading the complete county list for all states, a dataset available from the US Census data repository.
pollutant<-data.frame()
set.seed(1) # for reproducible example
# this creates an example formatted as a pollutant.map
datadir<-"./data" ; if (!file.exists("data")) { dir.create("data") }
url<-"http://www2.census.gov/geo/docs/maps-data/data/gazetteer/Gaz_counties_national.zip"
filename <- paste(datadir,strsplit(url,"/")[[1]][length(strsplit(url,"/")[[1]])],sep="/")
download.file(url, dest=filename, mode="wb")
unzip (filename, exdir = datadir) # unzip creates and populates the data structure
unlink(filename) # remove the zip file
filename <-gsub(".zip",".txt",filename)
d <- read.delim(filename,header=TRUE,sep="\t",stringsAsFactors=FALSE) # populate from the tab-delimited file
# cleanup
rm(datadir,filename,url)
names(d)<-tolower(names(d))
subset(d,select=-c(ansicode:awater_sqmi)) %>%
rename(state.abb=usps,region=geoid,lat=intptlat,lon=intptlong) -> d
d$region<-as.numeric(d$region)
data(county.regions)
df <- data.frame(county.regions)
subset(df,select=-c(county.fips.character,state.fips.character)) %>%
merge(d,.,by=c("region","state.abb"),all=FALSE) -> df
# cleanup
rm(d,county.regions)
df[,c("lon","lat")] %>%
screener %>%
merge(df,.,by=c("lon","lat"),all=TRUE) %>%
rename(state=state.name) %>%
as.data.frame -> t -> pollutant
rm(df) # cleanup
We now have a pollutant data frame that can contain counties from all 50 states, and we are ready to check out AK and HI.
[Figure: US Lower 48 Ozone Level and Radius Statistics by State and Year.]
We observe that the minimum reporting radius is usually below 50 miles, with notable exceptions.
data.frame(subset(t, type == "PM2p5" & !funs=="n" & !(state.abb %in% c("AK", "HI")))) %>%
rename(id=state.abb) %>% as.data.frame -> df
gg4 <- ggplot(df, aes(x=id, y=value)) + geom_point(aes(shape = funs, color = yr)) +
labs(title = "US Lower 48 PM2.5 Level and Radius Statistics by State and Year") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+
facet_wrap( ~ metric, nrow=4,scales="free_y")
gg4
rm(df) # cleanup
It is important to note the high levels of PM2.5 reported in CA and NV, in excess of 20 ppm, while the Ozone level remained below 0.07 ppm overall.
To generate the radar charts, we will again use the ggplot function and chart the normalized data. We produce charts faceted by year and superimpose Ozone and PM2.5 on each chart.
# Define a new coordinate system
coord_radar <- function(...) {
structure(coord_polar(...), class = c("radar", "polar", "coord"))
}
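# Treat the radar coordinate as linear so geom_path draws straight segments between spokes rather than arcs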
is.linear.radar <- function(coord) TRUE
subset(t,funs!="n" & metric=="radius") %>%
unique %>% rename(id=state.abb) -> y
y$value <-normalize(y$value)
rownames(y)<-NULL
ggplot(y,aes(x=id,y=value)) +
geom_path(aes(group=type, color=funs)) +
coord_radar()+facet_wrap(type ~ yr,nrow=4) +
labs(title = "Annual Ozone and PM2.5 Radius Statistics for All 50 US States",
x="",y="Normalized value") +
theme(strip.text.x = element_text(size = rel(0.8)),
axis.text.x = element_text(size = rel(0.8)))
# (continuation above reconstructed to mirror the lower-48 chart below; the original lines were lost with the figure pages)
[Figure: Annual Ozone and PM2.5 Radius Statistics for All 50 US States (faceted radar charts of normalized radius statistics by state, pollutant and year).]
We observe that both AK and HI exhibit very large radii, so let's exclude these...
subset(t,funs!="n" & metric=="radius" & !state.abb %in% c("AK","HI")) %>%
unique %>% rename(id=state.abb)-> y
y$value <-normalize(y$value)
rownames(y)<-NULL
ggplot(y,aes(x=id,y=value)) +
geom_path(aes(group=type, color=funs)) +
coord_radar()+facet_wrap(type ~ yr,nrow=4) +
labs(title = "Annual Ozone and PM2.5 Radius Statistics for Lower 48 US States",
x="",y="Normalized value") +
theme(strip.text.x = element_text(size = rel(0.8)),
axis.text.x = element_text(size = rel(0.8)))
rm(y,t) # cleanup
Choroplethr Approach with Animation
To perform this last step, we limit our observations to a radius of less than 250 miles. This provides visual support for the progress achieved between 2008 and 2015. Recognizing that such a radius is quite large, we also map the current progress in collecting denser information. The evolution of Ozone levels, PM2.5 levels and reporting radii is captured in a series of 8 annual charts per variable, animated with the choroplethr player implemented as an html file.
subset(pollutant,radius<250) %>% gather("type","value",9:10) -> t
# the animated story...
choropleths<-list()
setwd(figdir) # point to the figure output sub-directory
[Figure 10: Annual Ozone and PM2.5 Radius Statistics for Lower 48 US States (faceted radar charts of normalized radius statistics by state, pollutant and year).]
for (j in 2008:2015) {
i<-j-2007
df<-subset(t,type=="Ozone" & yr==as.character(j),select=c(region,value))
choropleths[[i]]=county_choropleth(df,
title = paste(as.character(j),"Ozone level (ppm) Reporting Radius < 250 Miles"),
legend = "ppm level", num_colors = 1, state_zoom = NULL, county_zoom = NULL)
choropleths[[i]]$scales$scales[[1]]$limits<-c(0,0.07)
}
for (j in 2008:2015) {
i<-j-1999
df<-subset(t,type=="PM2p5" & yr==as.character(j),select=c(region,value))
choropleths[[i]]=county_choropleth(df,
title = paste(as.character(j),"PM2.5 level (ppm) Reporting Radius < 250 Miles"),
legend = "ppm level", num_colors = 1, state_zoom = NULL, county_zoom = NULL)
choropleths[[i]]$scales$scales[[1]]$limits<-c(0,21)
}
for (j in 2008:2015) {
i<-j-1991
df<-subset(t,type=="Ozone" & yr==as.character(j),select=c(region,radius))
df %>% rename(value=radius) -> df
choropleths[[i]]=county_choropleth(df,
title = paste(as.character(j),"Pollutant Reporting Radius (Miles)"),
legend = "Miles", num_colors = 1, state_zoom = NULL, county_zoom = NULL)
choropleths[[i]]$scales$scales[[1]]$limits<-c(0,250)
}
choroplethr_animate(choropleths)
for (j in 1:9) {
orig<-paste0("choropleth_",j,".png")
dest<-paste0("choropleth_",formatC(j,width=2,flag="0"),".png")
file.rename(from=orig,to=dest) # rename moves the file, so no separate file.remove() is needed
}
for(j in 1:24) {
orig<-paste0("choropleth_",formatC(j,width=2,flag="0"),".png")
dest<-paste0("choroplethz_",formatC(j,width=2,flag="0"),".png")
shell(paste("convert -depth 3 -background white -quality 70",orig,dest))
}
setwd(userdir) # return to working directory
rm(t,df,orig,dest) # cleanup
The choroplethr animation is provided with its own player. All figures have been written to the ./figure-html subdirectory and can be viewed individually as well. Alternatively, we can render an animated gif directly, provided we convert the png files with the ImageMagick converter by executing a shell command. We will also use the animation library to illustrate animation in a pdf document; a sketch follows the code below.
setwd(figdir)
filestocopy<-as.vector(list.files(pattern="choropleth_"))
anidir<-paste(userdir,strsplit(anidir,"/")[[1]][length(strsplit(anidir,"/")[[1]])],sep="/")
file.copy(from=filestocopy, to=anidir, copy.mode=TRUE)
setwd(anidir)
# now convert in anidir these sequentially numbered files to one gif using ImageMagick
shell("convert -delay 100 -loop 0 -depth 3 -background white -quality 70 *.png choroplethr.gif")
# for html, use animated gif via
#![](./Animations/choroplethr.gif)
# for pdf, use animate package in LaTex
# now remove all the *.png files
file.remove(list.files(pattern=".png"))
setwd(userdir)
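As a sketch of how the animation library mentioned above could be used for the same purpose, assuming ImageMagick is installed and the list of choropleth plots built earlier is still in memory, something like the following would write the gif without composing the shell command by hand:
library(animation)
# Minimal sketch (assumes the `choropleths` list is still available):
# print each stored map in turn and let saveGIF() drive ImageMagick to assemble the frames.
saveGIF({
  for (p in choropleths) print(p)
}, movie.name = "choroplethr_animation.gif", interval = 1)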
Conclusions
This analysis shows definite trends and explores techniques to screen data content. ggplot maps and choroplethr visualizations complement the more common histograms and point charts, and the less frequently used but powerful radar charts. Both external players and animated gifs allow for versatile animation display in an HTML environment. The next target will be to port this type of visualization to a shiny app and extend it to other pollutants selected dynamically in the application from the US EPA records.
References
The following sources are referenced as they provided significant help and information to develop this analysis:
1. Coursera Developing Data Product - Week 4 yhat(Part1) by Roger D. Peng, PhD
2. US EPA download site and data repositories
3. US Census and data repository
4. yhat
5. choroplethr Package
6. stackoverflow ggplot/mapping US counties thread
7. pbapply Package
8. ImageMagick converter