This document summarizes an exploratory data analysis (EDA) of air pollution data from major Chinese cities conducted using IBM's Watson Analytics. The analysis found that northern Chinese cities generally had higher levels of air pollution than southern cities, as indicated by higher emissions, PM10 concentrations, and fewer days of good air quality. Population levels and industrial activity contributed more to air pollution than other factors like living emissions. The EDA process involved initial questions from Watson Analytics, developing new questions, experimenting with filters and visualizations, and refining the data.
Decision Management Systems
Title
Instructor:
Student:
Contact:
Date:
Introduction
Despite being a rising star in the global economy, China
still has many issues to address, such as political freedom,
health care shortcomings, and environmental crises.
Smog has become a major concern in recent years. It is both a
health problem and an environmental problem.
On December 8, 2015, Beijing's city government issued its first
red alert for pollution. It closed schools and construction
sites and restricted the number of cars on the road. Ten days
later, a second red alert was issued as air pollution
continued to choke the city sky. The smog problem is not limited
to Beijing; it has spread to other major cities across the
country.
The data set in my study combines three data sets
(Tables 1, 2 and 3) retrieved from the National Bureau of
Statistics of China, a source that supports the quality of the
data. There are no missing values or bad records in the
data set. The primary data sets are Ambient Air Quality in
Major Cities, Main Pollutant Emission in Waste Gas in Main
Cities and Pollutant Emission in Waste Water in Main Cities.
There are a total of 21 variables in the final data set. City name
serves as the index and year indicates the data collection period.
The target variable is “Days of Air Quality Equal to or Above
Grade II”. Other variables are also of interest, such as
“Volume of Consumption Soot Emission” and “Annual Average
Concentration of PM10”. The first section of the data contains
quantitative measurements of chemical concentrations and air
quality. The second section contains measurements of waste gas
emissions and the third section records waste water emissions.
Waste water emissions might seem only loosely related to
air quality, but they are part of overall environmental
pollution. Exploring the data in Watson Analytics is a useful
way to check whether they are related.
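Combining the three source tables amounts to joining them on city
and year. The following is a minimal sketch of that join in plain
Python, not the procedure actually used for the study. The PM10 and
good-air-day values are the 2013 rows from Table 1; the waste gas
and waste water values are placeholders, and the short column names
and the function merge_on_city_year are hypothetical.

```python
# Air quality rows: PM10 and good-air days are the 2013 values in Table 1.
air_quality = [
    {"City": "Beijing", "Year": 2013, "PM10": 108, "GoodAirDays": 167},
    {"City": "Changchun", "Year": 2013, "PM10": 130, "GoodAirDays": 230},
]
# Placeholder values only; these are NOT figures from the data set.
waste_gas = [
    {"City": "Beijing", "Year": 2013, "SootEmission": 1.0},
    {"City": "Changchun", "Year": 2013, "SootEmission": 2.0},
]
waste_water = [
    {"City": "Beijing", "Year": 2013, "WasteWater": 10.0},
    {"City": "Changchun", "Year": 2013, "WasteWater": 20.0},
]


def merge_on_city_year(*tables):
    """Inner-join tables on (City, Year); later tables add columns."""
    merged = {}
    for i, table in enumerate(tables):
        keys_seen = set()
        for row in table:
            key = (row["City"], row["Year"])
            keys_seen.add(key)
            if i == 0:
                merged[key] = dict(row)
            elif key in merged:
                merged[key].update(row)
        if i > 0:
            # Drop city-year pairs that are absent from this table.
            merged = {k: v for k, v in merged.items() if k in keys_seen}
    return list(merged.values())


combined = merge_on_city_year(air_quality, waste_gas, waste_water)
```

Each row of combined then carries the columns of all three tables
for one city and year. A city-year pair that is missing from any
table is dropped, which matches a data set with no missing values.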
Data exploration
Data exploration in Watson Analytics helps uncover the meaning of
the variables: it covers all aspects of the data and also
visualizes them. The questions it suggests are a good
starting point.
The first question examined the volume of consumption soot
emission per city. The visualization (Exploration Figure 1)
shows that Harbin has the highest volume of this
emission, followed by Beijing, the capital of China.
Nanjing has the lowest, similar to Kunming and
Fuzhou.
The next question visualized the living emissions
of the cities (Exploration Figure 2). Shanghai,
Chongqing and Guangzhou are the top three. According to a
report from the United Nations, these three cities were among the
five most populous cities in China between 2010 and 2015,
which is consistent with their high
living emissions.
Next, we explore Days of Air
Quality Equal to or Above Grade II by city (Exploration
Figure 3). Shijiazhuang occupies the smallest
area, which means that it has the fewest days of good air
quality. Shijiazhuang is an area of concentrated heavy industry,
which explains the low air quality. Meanwhile, Haikou and
Kunming have the best air quality.
Both cities have low living
emissions, which is plausible because neither
Haikou nor Kunming is highly populated. In addition,
both are located far from the heavy industrial areas.
To explore the data further, we apply the available features
to uncover more detail. To focus on
Days of Air Quality ≥ Grade II, we create a new variable
called “Region” that groups the cities into four groups by
location. The North shows the highest Volume of
Consumption Soot Emission and the South the lowest
(Exploration Figure 4), so we would like to see what other
differences appear as we apply more filters.
Region is applied as a global filter (Exploration Figure 5),
restricted to South and North. The cities in the
southern region all have more days of good air quality than the
northern ones. After we calculated total emission in waste
gas, the outcome was similar: all the southern cities have less
total industrial waste gas than the cities in the north. If we
apply an additional filter, Annual Average
Concentration of PM10 < 101 (meaning the air is not polluted),
five cities remain, and all of them are in the southern region.
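The Region grouping and the two filters described above can be
sketched in plain Python. This only illustrates the logic and is
not how Watson Analytics implements filters. The PM10 values are
the 2013 figures for three cities from Table 1; the region assigned
to each city and the function name apply_filters are assumptions
made for this example.

```python
# PM10 values (ug/m3) are the 2013 rows from Table 1.
# The Region labels are an assumption made for illustration.
rows = [
    {"City": "Beijing", "Region": "North", "PM10": 108},
    {"City": "Changchun", "Region": "North", "PM10": 130},
    {"City": "Changsha", "Region": "South", "PM10": 94},
]


def apply_filters(rows, regions=None, max_pm10=None):
    """Keep rows in the given regions whose PM10 is strictly below max_pm10."""
    kept = []
    for row in rows:
        if regions is not None and row["Region"] not in regions:
            continue
        if max_pm10 is not None and row["PM10"] >= max_pm10:
            continue
        kept.append(row)
    return kept


# Global filter: restrict every view to the North and South regions.
north_south = apply_filters(rows, regions={"North", "South"})
# Additional filter: Annual Average Concentration of PM10 < 101.
clean_air = apply_filters(north_south, max_pm10=101)
clean_cities = [row["City"] for row in clean_air]
```

With these three rows, only the city labeled South passes the
PM10 < 101 filter, which mirrors the pattern reported for the full
data set.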
Data exploration is helpful in finding relationships
between variables, and it could be used in many settings. In my
field, research, principal investigators could use these features to
explore key variables and see which ones most affect the
outcomes, even before deeper analysis by statisticians. This
would help them plan subsequent
investigations.
Data refinement
Even though the data set is of good quality, some
additional modification is still necessary in order to present
meaningful data for exploration.
Starting with the data matrix, Days of Air Quality Equal to or Above
Grade II (day) is the best variable, with a score of 100%, and the
variable with the lowest quality is Common Industrial Solid Wastes
Disposed (56%). This is not a surprise because the data was
collected across the country, covering both heavy industrial and
light industrial cities. The outliers reflect the nature of the data,
so they should be kept. For the same reason, the variable Volume of
Industrial sulfur dioxide (ton) should be treated the same way even
though it is of medium quality, with a score of 59%.
Because the data was collected from widespread areas
including heavy industry and light industry cities, a new
variable was created to group the cities into North, South, West
and East. It was used with the data exploration tool to examine
differences and similarities.
A hierarchy, “Pollution hierarchy”, was created from the
“Annual Average Concentration of PM10”. It was used in
conjunction with the new variable “Total emission in waste gas
waste”, which was created by summing the industrial
waste gas emissions. Once again the northern cities showed
serious pollution levels, with PM10 values from 150 to 305,
followed by the west region.
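The hierarchy and the calculated variable can be sketched as two
small functions. This is an illustration under stated assumptions,
not the Watson Analytics definition: the cut-offs 101 and 150 are
taken from the thresholds mentioned in this report, the level names
are chosen for the example, and the component column names and
their values are placeholders rather than real columns or data.

```python
def pollution_level(pm10):
    """Bin a PM10 value (ug/m3) using assumed cut-offs of 101 and 150."""
    if pm10 < 101:
        return "not polluted"
    if pm10 < 150:
        return "lightly polluted"
    return "seriously polluted"


def total_waste_gas(row, components):
    """Sum the named industrial emission columns of one row."""
    return sum(row[name] for name in components)


# PM10 values 94 and 108 appear in Table 1; 305 is the reported maximum.
levels = [pollution_level(v) for v in (94, 108, 305)]

# Hypothetical column names with placeholder values, not real data.
example_row = {"so2_ton": 1.0, "nox_ton": 2.0, "soot_ton": 3.0}
example_total = total_waste_gas(example_row, ["so2_ton", "nox_ton", "soot_ton"])
```

Keeping the cut-offs in one function makes it easy to change them
if a different grading scheme for PM10 is preferred.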
Conclusion
We can conclude that air pollution in
China is serious. The overall mean PM10 indicates that air is lightly
polluted, but that is misleading without considering the large
variance. PM10 ranges from 47 to 305. The north region
cities have the largest variance, followed by the west region cities.
From the data exploration we can tell that high population
has an impact on living emissions. However, emission in
waste gas contributes more significantly to air pollution.
This is consistent with the distribution of industry in
China. Since the data was collected during 2013 and 2014, it
only tells an older story. The pollution red alert issued at the end
of 2015 is an encouraging sign that the Chinese
government is taking the problem seriously.
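To see why a mean alone can mislead, PM10 can be summarized per
region with its mean, range and variance. The sketch below uses
Python's statistics module. The PM10 values are the 2013 figures
from Table 1; the region labels and the function name
summarize_by_region are assumptions. With only three cities the
numbers illustrate the computation, not the findings above.

```python
from collections import defaultdict
from statistics import mean, pvariance

# PM10 values (ug/m3) from Table 1; Region labels are an assumption.
rows = [
    {"City": "Beijing", "Region": "North", "PM10": 108},
    {"City": "Changchun", "Region": "North", "PM10": 130},
    {"City": "Changsha", "Region": "South", "PM10": 94},
]


def summarize_by_region(rows):
    """Return mean, min, max and population variance of PM10 per region."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Region"]].append(row["PM10"])
    return {
        region: {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "variance": pvariance(values),
        }
        for region, values in groups.items()
    }


summary = summarize_by_region(rows)
```

Reporting the range and variance next to the mean shows at a glance
which regions contain both cleaner and heavily polluted cities.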
References
Ambient Air Quality in Key Cities of Environmental Protection
(2013). Data retrieved from:
http://www.stats.gov.cn/english/Statisticaldata/AnnualData/
Hunt, K., Lu, S. (2015). Smog in China closes schools and
construction sites, cuts traffic in Beijing. CNN. Retrieved from:
http://www.cnn.com/2015/12/07/asia/china-beijing-pollution-red-alert/
Main Pollutant Emission in Waste Gas in Main Cities (2013).
Data retrieved from:
http://www.stats.gov.cn/english/Statisticaldata/AnnualData/
Main Pollutant Emission in Waste Gas in Main Cities (2014).
Data retrieved from:
http://www.stats.gov.cn/english/Statisticaldata/AnnualData/
Most populated cities in China. Data retrieved from:
http://www.nationsonline.org/oneworld/china_cities.htm
Rohde, R., Muller, R. (2015). Air Pollution in China: Mapping
of Concentrations and Sources. PLoS ONE 10(8): e0135749.
doi: 10.1371/journal.pone.0135749. Retrieved from:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0135749
Wong, E. (2015). World Briefing | Asia; China: ‘Red Alert’ on
Beijing’s Air. New York. P10. Retrieved from:
http://www.nytimes.com/2015/12/18/world/asia/beijing-issues-a-second-red-alert-on-pollution.html
Zhang, Y., Li, M., Bravo, M. A., Jin, L., Nori-Sarma, A., Xu,
Y., … Bell, M. L. (2014). Air Quality in Lanzhou, a Major
Industrial City in China: Characteristics of Air Pollution and
Review of Existing Evidence from Air Pollution and Health
Studies. Water, Air, and Soil Pollution, 225(10), 2187.
http://doi.org/10.1007/s11270-014-2187-3
Appendix:
Table 1.
City | Year | Annual Average Concentration of SO2 (μg/m3) | Annual Average Concentration of NO2 (μg/m3) | Annual Average Concentration of PM10 (μg/m3) | 95th Percentile Daily Average Concentration of CO (mg/m3) | 90th Percentile Daily Maximum 8 hours Average Concentration of O3 (μg/m3) | Annual Average Concentration of PM2.5 (μg/m3) | Days of Air Quality Equal to or Above Grade II (day)
Beijing | 2013 | 26 | 56 | 108 | 3.4 | 188 | 89 | 167
Changchun | 2013 | 44 | 44 | 130 | 2.1 | 127 | 73 | 230
Changsha | 2013 | 33 | 46 | 94 | 2.3 | 134 | 83 |
Exploration Figure 3.
Exploration Figure 4.
Exploration Figure 5.
Decision Management Systems
Assignment №2 – Exploratory Data Analysis (EDA) using
Watson Analytics
Deadline: Last day of week 5, 11:59 pm Eastern Time
Submission via LEO.
This is an individual assignment. Each student will complete the
assignment outlined below and post his/her written results to the
appropriate assignment. Please note that only 1 document is
allowed to be submitted. See content on p.2.
Grading criteria
Submitted assignments will be graded for (a) content, (b)
document quality (i.e. formatting, following guidelines,
pleasant to read, etc.), and (c) timeliness of submission.
Late assignments will lose 5 points for each
day they are late.
Activities
1. Select from the dataset provided (or ones designated by your
instructor). Provide a brief description of the datasets to include
the number of cases, description of the inputs, description of the
variables that could be used to develop predictive models, etc.
2. Examine the dataset and eliminate mistakes, bad records, data
entry errors, and outliers.
Using Watson Analytics:
3. Explore the dataset, including:
a) Examine the initial set of questions posed by Watson
Analytics. Provide any insights gained from this initial dataset.
b) Develop new specific questions which provide additional
insights into and answer specific questions from the dataset.
Discuss how these insights could be useful. Did Watson
Analytics provide the answers necessary? Discuss how you
would improve the relevancy.
c) Experiment with the available filters and visualization
options at the bottom of the screen and summarize the results.
Create and explain at least one insightful global and one local
filter for your dataset.
d) Create and explain at least one insightful calculation.
Discuss why this would be useful.
4. Refine the dataset.
a) Which variables have the highest quality score? Which ones
have the lowest quality score and why? Discuss how the quality
of the dataset could be improved.
b) Utilize the available grouping, filtering and hierarchical
functionalities to refine the data. Summarize the approach you
took and the outcome. What suggestions or insights are gained?
Submission
Each student will submit a single document conforming to the
guidelines and standards outlined below.
Document format:
· Limited to 5 pages (excluding title page, references, and
appendix),
· Double-spaced, 12 point Times New Roman font, 1” margins,
Bottom-right page numbering.
Note: Submitted report must be either in MS Word or PDF
format and titled:
“Assignment4_LastName”.
Only one document may be submitted.
Content (note that the document must have clearly marked
sections for the items listed below)
1) Title page (1 page limit): course number and term,
assignment number and project title, student name and contact
information, instructor’s name. Format it so it looks pleasant
and presentable. Follow formatting guidelines above.
2) Introduction. Provide a brief outline of the dataset you are
using for this assignment. Briefly explain the content of the
data. Include a screenshot of the data (not all, but partial as far
as all relevant variables are visible).
3) Discuss the data exploration process followed and the results.
Include any specific ideas or suggestions as to how this could
be used in your organization.
4) Discuss the data refinement process followed and the results.
5) References (1 page limit): List all references in APA format
used in preparing this report. It is strongly recommended to use
outside knowledge in setting-up the analysis or discussing the
results where possible.
6) Appendix (4 page limit):
a) Appendix A: Include any appropriate workbooks and/or
screenshots (figures, tables, diagrams) used in this assignment.
Make sure all tables, figures, or diagrams are properly
numbered and titled. For example, “Table 1. Model Results”.
Make sure all tables or figures or diagrams are easily readable
and visually presentable.
General guidance
· Assignments that: 1) adequately address all required tasks; 2)
are submitted on time; 3) are properly formatted (APA format
for references, no typos or misspelled words, no grammar
errors, cover page, etc.) will receive a grade of B (80-89,
depending on content).
· In order to increase (but not guarantee) your chances of
receiving a higher grade, you need to show clear evidence of
critical thinking. Critical thinking can take many forms,
depending on the type of assignment. In some instances,
showing greater depth (e.g., such as creating more models,
looking at more than one insightful fact or relationships, and
comparing them on key criteria) is one method for providing
evidence of critical thinking. In other cases, it might include
providing more explanation to include the pros and cons of the
approach used or the arguments in favor and against the
proposal as well as some criteria for choosing among the
alternatives. Still another example would be providing
significant insights as to how the assignment outcome would
benefit (or would meet resistance) in your organization and
what steps might be employed to facilitate acceptance.
Certainly, this is not a complete list, but gives some examples
of critical thinking aspects.
Decision Management Systems
Student Name: Assignment №2 – EDA using
Watson Analytics Total points: 100
Content | Explanation | Total points | Earned points | Comment
Written Report
a. Introduction | Is the dataset fully described and outlined? Is the intent of the assignment discussed at an appropriate level of detail? | 10
b. Dataset cleansing | Is the dataset fully described and cleansed of outliers (as appropriate), mistakes and erroneous entries? | 5
c. Data exploration | a) Are the initial questions discussed and their relevance analyzed? Are insights from the questions provided? | 15
 | b) Are additional specific questions developed and discussed? Are the answers provided by Watson Analytics insightful and do they provide specific answers? | 20
 | c) Are the available filters and visualization options utilized and explained? Are insights provided from using the filters and visualization options? | 15
a. Data refinement | a) Are quality scores compared and explained? Are suggestions provided to improve the scores? | 15
 | b) Are the grouping, filtering, and hierarchical features fully explored and summarized? Are suggestions for improving the outcomes provided? | 15
b. Mechanics (spelling, grammar) | Is the paper free of grammatical, spelling and punctuation errors? Is the paper properly formatted? | 3
c. Citations and References | Are all references and citations correctly written and presented? | 2
Total | | 100
d) Less
1. Formatting | Does the submission follow formatting guidelines? | -5
2. Page limit | Is the submission written within specified page limits? | -5
3. Late submission (less) | 5 points will be deducted for each day the assignment is late. | -5 each day
Final Grade for Assignment 2