Economical and Energetical Analysis of United
States (EEAUS)
Matteo Stabile
Master of Science in Engineering
in Computer Science
Rome, Italy
Email: stabile.1547019@studenti.uniroma1.it
Leonisio Schepis
Master of Science in Engineering
in Computer Science
Rome, Italy
Email: schepis.1533794@studenti.uniroma1.it
Abstract - This system represents economic and energy information about the United States of America for the years 2010 to 2014. The visualizations it provides are intended for educational purposes only. The dataset we use is named "United States Energy, Census, and GDP 2010-2014". The main goal of the system is to provide a way to analyze both the energy consumption and the economic situation of each US state. It is addressed to a specific kind of user, for example those employed in administrative tasks. One intended use could be, for instance, gathering information about future investments in certain trades.
1 Introduction
The dataset we analyze contains three main groups of information, namely:
- Energy information: how much of each kind of energy was consumed or produced by each state in each of the five years of interest. The most important indicators are TotalC, the total energy consumption in a given year, and TotalP, the total energy production in a given year. For each individual energy source the dataset also reports the consumption and the production in a given year. Price information is included as well, but it is outside the scope of our system. Note that all consumption and production measurements are expressed in billion BTU. BTU stands for British Thermal Unit; its SI counterpart is the joule, and one BTU is approximately 1055 J. One BTU is the amount of energy needed to raise the temperature of one pound of water by one degree Fahrenheit.
- Economic information: this part contains only the GDP index, namely the Gross Domestic Product, a monetary measure of the market value of all final goods and services produced in a period. Here too there is one value for each year in the interval.
- Demographic information: population density estimates, birth and death rates, and migration rates. This part also contains specific geographic indicators, such as the number of lakes or the presence of coasts.
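As a small worked example of the unit used throughout the dataset, the following sketch converts a value in billion BTU to terajoules. The conversion factor is the commonly quoted International Table value and the sample figure is hypothetical; neither comes from the dataset itself.

```python
# Approximate conversion factor (International Table BTU). Other BTU
# definitions differ slightly, so treat the result as approximate.
JOULES_PER_BTU = 1055.06


def billion_btu_to_terajoules(value_billion_btu: float) -> float:
    """Convert a value expressed in billion BTU to terajoules (1 TJ = 1e12 J)."""
    joules = value_billion_btu * 1e9 * JOULES_PER_BTU
    return joules / 1e12


# Hypothetical consumption figure, not taken from the dataset.
example_tj = billion_btu_to_terajoules(1000.0)
print(f"1000 billion BTU is about {example_tj:.2f} TJ")
```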
The system provides a set of tools that use color and specific kinds of visualization in order to:
◦ give users a perceptual overview of how the data are distributed across all the states;
◦ provide a mechanism for comparing states;
◦ show the details of that information.
To represent information about each individual energy source, we use computed subsets of the whole dataset that distinguish renewable from non-renewable energy sources.
2 Preliminaries
Fig. 1. The first two principal components as computed in Python. On the left the whole representation, on the right two zoomed regions showing more detail. Notice that California, Texas and Hawaii are the border cases used in the analysis.
In order to analyze the large dataset and investigate in depth the correlations among our data, we need an analytical method. This is why we adopt the well-known Principal Component Analysis (PCA) method. Before discussing it in detail, let us explain exactly which tools we use:
- the numpy and pandas libraries for performing the PCA in Python;
- XLSTAT 2017 for exploring all the aspects of the analysis in depth.
The main goal of this preliminary study is to understand which features of the dataset are the most significant to represent.
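The PCA step described above can be sketched with numpy alone. This is a minimal illustration of a correlation-based PCA, not the authors' actual script: the function name `pca_correlation` and the synthetic four-column data (stand-ins for GDP, total consumption, price and total production) are assumptions made for the example.

```python
import numpy as np


def pca_correlation(X):
    """Correlation-based PCA: returns (scores, eigenvalues, eigenvectors)."""
    X = np.asarray(X, dtype=float)
    # Standardize each feature (column) to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the features; rows of X are observations.
    R = np.corrcoef(X, rowvar=False)
    # eigh is meant for symmetric matrices and returns ascending
    # eigenvalues, so reorder to put the largest component first.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Project the standardized data onto the components.
    scores = Z @ eigvecs
    return scores, eigvals, eigvecs


# Synthetic stand-in data (50 hypothetical states), for illustration only.
rng = np.random.default_rng(0)
size = rng.lognormal(mean=0.0, sigma=1.0, size=50)
X = np.column_stack([
    size * 100 + rng.normal(0, 5, 50),   # stand-in for GDP
    size * 80 + rng.normal(0, 5, 50),    # stand-in for total consumption
    rng.normal(20, 3, 50),               # stand-in for energy price
    size * 30 + rng.normal(0, 20, 50),   # stand-in for total production
])

scores, eigvals, eigvecs = pca_correlation(X)
first_two = scores[:, :2]             # coordinates for a 2-D scatter plot
explained = eigvals / eigvals.sum()   # share of variance per component
print("explained variance ratio:", np.round(explained, 3))
```

Plotting the two columns of `first_two` against each other would give a scatter plot of the kind shown in Figure 1.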
2.1 Correlation matrix analysis.
By looking at the intermediate steps of PCA we can inspect the resulting correlation matrix, which tells us how strongly each pair of features is associated. By definition, the closer the coefficient in a cell is to one, the more strongly the two corresponding features are positively correlated. Our first idea was to represent a possible relationship between GDP and Total Consumption. Indeed, checking these first, we find that the corresponding coefficient is around 0.8 for every possible pair, regardless of the year.
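The pairwise check between GDP and Total Consumption across years can be illustrated as follows. The per-year arrays are synthetic stand-ins and the helper name `pearson` is hypothetical; with the real dataset the arrays would be the corresponding yearly columns.

```python
import numpy as np

rng = np.random.default_rng(1)
years = [2010, 2011, 2012, 2013, 2014]
base = rng.lognormal(0.0, 1.0, size=50)  # hypothetical "state size" factor

# Synthetic stand-ins for the yearly GDP and total-consumption columns.
gdp = {y: base * 100 + rng.normal(0, 10, 50) for y in years}
total_c = {y: base * 80 + rng.normal(0, 40, 50) for y in years}


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    return float(np.corrcoef(a, b)[0, 1])


# One coefficient for every (GDP year, consumption year) pair.
pairs = {(yg, yc): pearson(gdp[yg], total_c[yc]) for yg in years for yc in years}
print("lowest coefficient: ", round(min(pairs.values()), 2))
print("highest coefficient:", round(max(pairs.values()), 2))
```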
2.1.1 First component and incidences analysis.
Running the PCA in Python produces what is shown in Figure 1. On its own, this plot does not tell us which attributes contribute to the first component and which to the second, so we have to investigate further. Among the tables produced by XLSTAT we are interested in the variable contributions table, which reports the contribution of each variable to each component produced by PCA. From it we can identify the features that mainly affect the first component (F1): Total Consumption at approximately 27% and GDP at around 30%. This reading is easily confirmed by observing Figure 1. In particular, we can see that there are three border cases:
1) California has high values of the attributes that affect both the first and the second component. Indeed, in the dataset it has very high values of total consumption and GDP.
2) Texas has high values of the attributes that affect the first component, namely consumption and GDP.
3) Hawaii, finally, represents the opposite case: very low values of the first-component attributes and high values of the second-component attributes. Indeed, its GDP and total consumption are very low in the dataset, regardless of the year.
These observations confirm what has been explained so far.
Fig. 2. Map of the USA representing different kinds of information encoded by means of colors.
Fig. 3. Configuration panel with the default combination.
We can perform the same analysis on the second component (F2) produced by PCA. On that component the energy price contributes around 31% and the energy production level around 15%.
Of course, all the other attributes in the dataset have their own contribution levels on the components. However, their contributions are low enough that we can leave them out of this analysis.
The main conclusion of this preliminary procedure is that our initial idea holds, and so our system is based on the representation of the two attributes that mainly affect the first component.
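A variable contributions table of the kind discussed in this section can be approximated from the eigenvectors of the correlation matrix. The sketch below uses a common definition (the squared eigenvector entry, as a percentage) and assumes that the XLSTAT table follows an equivalent one; the 3x3 correlation matrix is made up for illustration.

```python
import numpy as np


def contributions_percent(R):
    """Percentage contribution of each variable (row) to each component (column).

    R is a correlation matrix. Because eigenvectors have unit length,
    every column of the result sums to 100.
    """
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]   # largest component first
    eigvecs = eigvecs[:, order]
    return 100.0 * eigvecs ** 2


# Hypothetical correlation matrix: variables 0 and 1 are strongly
# correlated (0.8), variable 2 is only weakly related to both.
R = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

contrib = contributions_percent(R)
print(np.round(contrib, 1))
```

In this toy case the two correlated variables dominate the first column, which mirrors the way GDP and Total Consumption dominate F1 in the paper.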
3 Visualization and Interaction
EEAUS is built on the interaction among three different visualizations: the USA map, the timeseries and the histograms.
3.1 USA map
This first visualization allows users to:
- choose a single state to inspect with the other parts of the system;
- inspect information about any state by selecting different configurations.
The system offers a default configuration given by the ratio between the average total consumption and the average GDP (as shown in Figure 3), where the averages are computed over all five years. Figure 2 shows how colors encode different values of that ratio. We use the green scale taken from ColorBrewer since it stands out well against the background color of the visualizations (close to black). Notice that the higher the ratio, the more intense the green of that state. For example, Louisiana has the highest ratio in the USA, which means that its average total consumption is very high relative to its average GDP.
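The default configuration can be sketched as a ratio of averages followed by a mapping onto a sequential green palette. The five hex values are the ColorBrewer 5-class "Greens" scheme; the number of classes, the equal-interval binning, the palette direction and the per-state figures are all assumptions made for this example rather than details taken from the system.

```python
# ColorBrewer 5-class "Greens", lightest to darkest. The class count and
# the direction of the mapping are assumptions for this sketch.
GREENS_5 = ["#edf8e9", "#bae4b3", "#74c476", "#31a354", "#006d2c"]


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def consumption_gdp_ratio(total_c_by_year, gdp_by_year):
    """Ratio of average total consumption to average GDP over the years."""
    return mean(total_c_by_year) / mean(gdp_by_year)


def color_for(value, lo, hi, palette=GREENS_5):
    """Equal-interval binning of a value in [lo, hi] onto the palette."""
    if hi == lo:
        return palette[0]
    t = (value - lo) / (hi - lo)
    index = min(int(t * len(palette)), len(palette) - 1)
    return palette[index]


# Hypothetical states: (yearly total consumption, yearly GDP).
states = {
    "State A": ([100, 110, 105, 108, 112], [50, 52, 53, 55, 56]),
    "State B": ([400, 410, 395, 405, 415], [60, 61, 63, 64, 66]),
    "State C": ([80, 82, 85, 83, 84], [90, 92, 95, 97, 99]),
}

ratios = {name: consumption_gdp_ratio(c, g) for name, (c, g) in states.items()}
lo, hi = min(ratios.values()), max(ratios.values())
colors = {name: color_for(r, lo, hi) for name, r in ratios.items()}
print(colors)
```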
The configurations the user can select to change perspective on the visualization are the following:
1- ∆TotalCons/∆GDP over any of the five available years. We offer this option for a specific reason: the GDP index also reflects total consumption (as well as total production, price, etc.), so the ratio also encodes how much total consumption weighs on total GDP in a given year.
2- Average GDP over any combination of the five years. It simply gives a clear picture of the GDP distribution.
3- Average Total Consumption, which, like GDP, gives a picture of the consumption levels of all kinds of energy during a certain period, for each state.
Fig. 4. On the left, the timeseries representation on a logarithmic scale for total consumption values. On the right, the same representation on a linear scale bounded by the maximum total consumption value.
Fig. 5. The histogram representation of non-renewable energies. On the right, the initial view, where LPG levels are very low and therefore barely visible. On the left, the same LPG levels are clearly perceptible thanks to the brushing tool.
3.2 Timeseries
The second visualization lets users better explore the behaviour of the total consumption value across the five years. Each state is assigned to its own region of the USA, which provides a way to analyze all the states belonging to one specific part of the country. Two implementation choices in this representation are worth noting:
A- The y axis of the timeseries graph (as shown in Figure 4) uses a logarithmic scale, so that states with very different levels of total consumption, including those with very high values, remain perceptually distinguishable.
B- To represent the GDP value for each year we encode it as the area of a circle. In other words, we emphasize the presence of a difference between GDP values rather than its exact magnitude.
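The two choices above can be sketched as two small mapping functions: a base-10 logarithmic scale for the y axis and a radius computed so that the circle area, not the radius, is proportional to GDP. The function names, pixel ranges and sample values are hypothetical.

```python
import math


def log_scale(value, domain_min, domain_max, range_min, range_max):
    """Map a positive value onto [range_min, range_max] on a log10 scale."""
    span = math.log10(domain_max) - math.log10(domain_min)
    t = (math.log10(value) - math.log10(domain_min)) / span
    return range_min + t * (range_max - range_min)


def radius_for_area(value, max_value, max_radius):
    """Radius for which the circle area is proportional to value."""
    return max_radius * math.sqrt(value / max_value)


# Hypothetical consumption domain of 10..1000 mapped to 0..200 pixels:
# a value of 100 lands in the middle of the axis.
print(log_scale(100.0, 10.0, 1000.0, 0.0, 200.0))

# A GDP four times smaller than the maximum gets half the radius,
# that is, one quarter of the area.
print(radius_for_area(25.0, 100.0, 20.0))
```

Scaling the radius by the square root keeps perceived size honest: scaling the radius linearly would make a state with twice the GDP look four times as large.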
The last aspect is the variance filter. This tool mainly serves to increase the degree of interaction with the timeseries. It provides a mechanism for inspecting specific intervals of variance values. Users who want to know which states have a variance that differs from the mean by less than a certain amount simply choose the interval and then check which timeseries are still visible. Here too we use a logarithmic scale, because a large share of the variance values is concentrated in a narrow interval (such as [0.0, 0.1]).
Variances between 0 and 1 may not seem intuitive, but in our case they are normalized over the maximum and minimum of the set of variances, so the range makes sense in our analysis.
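The normalization and filtering just described can be sketched as follows. The use of population variance, the function names and the three sample series are assumptions; the paper does not specify which variance estimator the system uses.

```python
from statistics import pvariance


def normalized_variances(series_by_state):
    """Min-max normalize the variance of each state's series to [0, 1]."""
    variances = {s: pvariance(v) for s, v in series_by_state.items()}
    lo, hi = min(variances.values()), max(variances.values())
    span = hi - lo
    return {s: (v - lo) / span if span else 0.0 for s, v in variances.items()}


def visible_states(normalized, lower, upper):
    """States whose normalized variance falls inside [lower, upper]."""
    return sorted(s for s, v in normalized.items() if lower <= v <= upper)


# Hypothetical five-year consumption series, for illustration only.
series = {
    "State A": [10.0, 10.0, 10.0, 10.0, 10.0],   # variance 0
    "State B": [10.0, 12.0, 14.0, 16.0, 18.0],   # variance 8
    "State C": [10.0, 20.0, 30.0, 40.0, 50.0],   # variance 200
}

norm = normalized_variances(series)
print(visible_states(norm, 0.0, 0.1))    # low-variance states
print(visible_states(norm, 0.75, 1.0))   # only the most variable state
```

The sample also shows why most values crowd near zero: a single state with a very large variance stretches the normalized range for everyone else.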
Fig. 6. The case study described in the text. On the left, the circles for the GDPs of Texas (at the top), California (the second timeline down), Hawaii (the blue one) and Vermont (at the bottom). On the right, the variance interval in which Texas is the only state still visible.
3.3 Histograms
The last visualization shows in detail the actual consumption level of each individual energy source for a selected state and year. Our goal is also to let users compare it with the average consumption level of the whole country. Notice that the division between renewable and non-renewable energy sources is ours and is not present in the dataset itself. The two groups are:
Renewable:
◦ biomass
◦ geothermal
◦ hydropower
Non-renewable:
◦ coal
◦ fossil fuel
◦ natural gas
◦ LPG
To improve the user's perceptual experience, the system allows the y axis scale to be adjusted dynamically. Since some consumption levels are too low to be visible at the default scale, this mechanism makes those values easier to inspect (as shown in Figure 5).
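The grouping, the comparison with the national average and the adjustable y axis can be sketched together. The source names follow the two lists above; the function names, the two sample states and the way a brush selection caps the axis are assumptions for illustration, not a description of the actual implementation.

```python
RENEWABLE = ("biomass", "geothermal", "hydropower")
NON_RENEWABLE = ("coal", "fossil fuel", "natural gas", "LPG")


def national_average(consumption_by_state, source):
    """Average consumption of one source over all states."""
    values = [row[source] for row in consumption_by_state.values()]
    return sum(values) / len(values)


def histogram_rows(consumption_by_state, state, sources):
    """(source, state value, national average) triples for one group."""
    return [
        (s, consumption_by_state[state][s], national_average(consumption_by_state, s))
        for s in sources
    ]


def y_axis_max(rows, brush_max=None):
    """Upper bound of the y axis; a brush selection can lower it."""
    data_max = max(max(value, avg) for _, value, avg in rows)
    return data_max if brush_max is None else min(brush_max, data_max)


# Hypothetical consumption values for one year, for illustration only.
consumption = {
    "State A": {"biomass": 120.0, "geothermal": 3.0, "hydropower": 40.0,
                "coal": 900.0, "fossil fuel": 1500.0,
                "natural gas": 700.0, "LPG": 12.0},
    "State B": {"biomass": 80.0, "geothermal": 1.0, "hydropower": 60.0,
                "coal": 300.0, "fossil fuel": 500.0,
                "natural gas": 900.0, "LPG": 8.0},
}

rows = histogram_rows(consumption, "State A", NON_RENEWABLE)
print(rows)
print(y_axis_max(rows))                  # full scale: LPG is barely visible
print(y_axis_max(rows, brush_max=50.0))  # zoomed scale: LPG becomes readable
```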
4 Conclusions
First of all, we want to show how the PCA results discussed in this paper relate to the data visualizations we implemented. This is easily done through a case study. As mentioned in the PCA section, some states behave as outliers on the first and second components and their associated attributes. In our dataset Texas, Hawaii, California and Vermont are such special cases; in particular:
- Texas and California have high values of total consumption and GDP, whereas Vermont has low values of those attributes.
- Hawaii has low values of total consumption and GDP despite a high value on the second component.
What we expect from our visualizations is distinctive timelines for the states mentioned in the first point above: Texas and California with very high values of both GDP (so their circles have the largest areas) and total consumption (so their timelines lie at the top of the graph). Vermont, on the other hand, has very low values of GDP and total consumption in the dataset. Figure 6 shows this. Since we do not visually represent anything about the second component, we have deliberately left that information out; Hawaii is therefore less relevant here.
Looking at Figure 6 we can also see how the logarithmic scale on total consumption works. Vermont, whose total consumption varies within a very narrow range, nevertheless has a timeline that changes visibly, as a mathematical consequence of the logarithm. Most of what the timeseries show is unremarkable, but something interesting stands out about Texas and California. The former is the only state that grows markedly compared with all the others (this is less evident on the logarithmic scale than on the linear one, as we can observe in Figure 6, where Texas is the timeline at the top; it is exactly the opposite of the Vermont case cited above). Its trend is more or less the same across all five years represented. The latter stands out for its GDP: it is the state with the highest GDP values (and therefore the largest circles) regardless of the year. The variances are also worth noting.
Further evidence that Texas behaves differently comes from playing with the filter interval. If we narrow the selected range, we notice that for values within [0.75, 1.00] the only timeline that remains visible is that of Texas. This means that its variance differs greatly from the mean of the entire sample (here, the set of variance values of all states), revealing that it is an outlier in our dataset.
The last observation concerns another possible point of view on the dataset. One could build an entire system like EEAUS based on the attributes that affect the second component (again in relation to the GDP index), because GDP actually involves all the features that the dataset describes separately. It might be a good idea, but we chose to focus on consumption for two main reasons:
1- The PCA points out the attributes that affect the first component and, by definition, the first component captures the maximum variance of the data.
2- Our intention is not to perform a precise economic analysis (which could indeed be achieved using the attributes that weigh on the second component) but rather to focus on the energy consumption point of view.
References
[1] Kaggle. United States Energy, Census, and GDP 2010-2014. Examining the relationships between various data collected by the US government. https://wwww.kaggle.com/lislejoem/us_energy_census_gdp_10-14
[2] Slides. Theoretical part. Course material of Visual Analytics, held by Prof. Giuseppe Santucci.
[3] Slides. Practical part. Course material of Visual Analytics, held by Dott. Marco Angelini.
[4] Colors. http://colorbrewer2.org/
