TLAD 2015, The 13th International Workshop on Teaching, Learning and Assessment in Databases.
Birmingham, UK, 3rd July.
Better with Data:
A case study in sourcing Linked Data
into a Business Intelligence analysis
Matthew Love
Dept. of Computing, Sheffield Hallam University, Sheffield S1 1WB
m.love@shu.ac.uk

Charles Boisvert
Dept. of Computing, Sheffield Hallam University, Sheffield S1 1WB
c.boisvert@shu.ac.uk

Amin Chowdhury
Dept. of Computing, Sheffield Hallam University, Sheffield S1 1WB
mdamin.chowdhury@gmail.com

Ian Ibbotson
Better with Data Society
ianibbo@gmail.com
http://betterwithdata.co
Abstract
This paper describes a case study investigating the relationship between weather
conditions and levels of air pollution. The case study illustrates aspects of finding
and accessing Open Data, exploring Linked Data, sections of the Extract-Transform-
Load processes of data warehousing, building an analytic cube, and application of
data mining tools.
The paper is intended to aid tutors and students of databases by providing a practical
and repeatable case example that gives an overview of how several topics in the area
of data collection and analysis integrate together.
All the data sources and tools used are free for use in academic contexts
(permissions from data source owners should be sought). Web links are given to all
resources discussed.
Keywords
Business Intelligence, Data Mining, Data Warehousing, SPARQL, Linked Data
1. Introduction
This paper (and associated downloads) outlines a number of topics related to finding
and accessing Open Data, merging sources, and analysing data using self-service and
data mining tools. All the data and the software tools are available to
academics/students free of charge (though permission should be sought for some of
the data). The case study presented involves more data than might normally be used
for overviews of the subject and is ‘rich’ enough in content to help highlight genuine
issues, but remains sufficiently structured and simple as to not become
overwhelming. It is intended that tutors can use the resources described to give
overview presentations of the topic without needing to overcome problematic
barriers.
2. Case Study: Air pollution in a major city
Air pollution kills people. It is estimated that around 29,000 people per year in the
UK die early through breathing difficulties at times of poor air quality (PHE). The UK
government has imposed targets for reducing the quantities and/or frequencies of
the main pollutants (some figures given below). Local Authorities are responsible for
monitoring and publishing pollution levels in their areas.
Sheffield City Council uses two types of monitoring devices: diffusion tubes and fully
automated processing units.
Fig 1: Nitrogen Dioxide diffusion tube (left) and automated station (right)
Both types of device are illustrated in Fig 1 above. There are around 160 diffusion
tube devices and six fully automated processing stations.
The diffusion tubes have the advantage of being spread throughout the city area.
However, they give data only when sent in for analysis, and typically this is once
every six to eight weeks per tube. The results are aggregated to an annual level prior
to publication.
The six automated processing stations, named “Groundhogs”, measure a variety of
pollutants, and one also measures temperature and air pressure. Between three and
eight readings are taken per hour. The public can read a log of all readings up to
around one hour previously. The council occasionally may correct readings, or delete
readings from the log. Although some of the stations have been operating since
2000, there are a number of gaps in the data logs. In addition, the stations are
occasionally moved (usually to help investigate new pollution “hot spot” concerns).
The council maintains a web site (SCC-1) giving informative descriptions of the types
of pollutants commonly found in air and of the devices pictured above; further
descriptions and the monitoring data itself are available at a companion site (SCC-2).
Readers (and their students) are invited to visit the SCC-2 web link and then select
“Station info”. This is shown in Fig 2 (left) below.
Fig 2 - Council pollution monitoring information (left) and automated station results (right)
Fig 2 (left) illustrates a very common problem with data sourced from the internet.
The information is presented as textual descriptions, with no obvious way of
automatically deriving further information. We are told that Groundhog1 is at
“Orphanage Road, Firhill” but it would take a human-based web search to find the
geographical location and then further searches to discover the nature of the
location (residential or industrial area, nearness to main road, etc).
Tutors may use this to introduce a discussion on why data published on the internet
is not necessarily “Open Data”, even when user rights are clear. The Open Data
Handbook [ODH] is a useful resource.
Readers (and their students) should then visit the “Latest data” tab of the same site,
and then click on any of the Groundhogs (Groundhog1 is often the best choice). This
is shown in Fig 2 (right).
Notice that navigation to this page is not designed for automation. The user must
click on a visual map to select the page. Notice too, that the URL for data pages does
not reflect the name of the Groundhog being visited so likewise manual navigation is
necessary. Students may discuss how, in an age of internet-sourced data, URLs can
and should be designed to allow for automated discovery by data harvesting tools.
Note, however, from the above image, that this page allows data (of any user-
selected range) to be downloaded, in a choice of PostScript, “Raw” (i.e. comma
separated values) or Excel formats.
If following this paper as an exercise, readers should attempt to download some
example CSV files. They will find this must be done one ground station, and one
pollution type, at a time. Downloads may be inspected using WordPad (not Notepad,
as the end-of-line characters are not compatible); it can then be seen that each row
contains a date (YYMMDD), a time (HH:MM) and an NO2 reading, with roughly 3 to 8
readings per hour.
3. Linked data
The above discussion has highlighted the need for better methods of publishing data
for automated discovery and consumption. An important method is Linked Data
[W3S], which extends standard data by giving information about its relationship to
other data. Chains can be followed to discover more information about the context
of something of interest (for example, to discover the type of neighbourhood of each
of the Groundhog sites).
Very informally, each Linked Data item is presented as a triple:
• a unique identifier for the data item that is the subject of the relation
• a predicate describing the nature of the relation
• the information being related: the object of the relation.
The identifier and predicate take the form of a URI, but the object may be either a
URI or a literal value. URIs are often also the URL of a description file: in this case a
user can fetch human-readable information about the item into a browser.
Because chains of triples link items to one another in this way, Linked Data is said to
form a graph database.
For example, as illustrated in Fig 3 below, the data about the “Groundhog1” NO2
sensor shows what it measures, its location, the type of device it is, and actual
measured values for a given date and time.
Fig 3: A subset of the Sheffield Air Quality+ database. Data points in bold, red text are URIs.
One of this paper’s authors created the Air Quality+ database, which allows over-
the-web interrogation of the Sheffield pollution measurements as linked data. Each
of the Groundhog stations has its own URI, and the database holds the
measurements of each sensor, for both the Groundhog stations and diffusion tubes; so
Groundhog1 sensors, for example, include not just NO2 but also SO2, micro-particles
(e.g. diesel fumes), air pressure and air temperature. Each of these sensors records
frequent measurements that are archived in the database as triples. Using the URI, all
but literal values can be further investigated, for instance to find out more about the
NO2 compound.
To query the Subject / Predicate / Object triples in the database, we use the SPARQL
query language. SPARQL is designed to facilitate exploring a linked data graph; for
example:
• If given the URI of a schema for Groundhogs, it can return a list of all Groundhogs.
• If given the URI of a specific Groundhog, it can return its schema, which (for
example) can be used to discover what pollutants that particular Groundhog is set
up to measure.
• If given the URI of a specific pollutant, it can return all the values, i.e. all the
readings for that pollutant.
[Fig 3 shows, as a graph: the Groundhog1 NO2 sensor (type: sensing device; measurement property: NO2; latitude 53.40266; longitude -1.463957) linked to an observation, GroundHog1NO2 2015051300500, of type observation value, with value 13.1894 and end time 2015-05-10T05:00Z.]
All queries can contain filters, for example to only return values within a selected date
range.
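By way of illustration, the following is a minimal sketch of such a query, assuming a hypothetical prefix and property names (aq:sensor, aq:endTime, aq:value) rather than the store's exact vocabulary; the sample queries at (BWDS) show the real terms.

# Sketch only: NO2 readings from the Groundhog1 NO2 sensor within a date range.
# The prefix and property names below are illustrative placeholders.
PREFIX aq:  <http://example.org/airquality#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?obs ?time ?value
WHERE {
  ?obs aq:sensor  aq:Groundhog1NO2 ;
       aq:endTime ?time ;
       aq:value   ?value .
  FILTER (?time >= "2015-05-01T00:00:00Z"^^xsd:dateTime &&
          ?time <  "2015-06-01T00:00:00Z"^^xsd:dateTime)
}
ORDER BY ?time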
From the above informal description the reader may be able to see that SPARQL can
be used to programmatically discover what Groundhogs there are, what pollutants
each monitors, and then get the readings of those pollutants. Most SPARQL systems
offer a range of formats for retrieving results, including CSV (useful for databases),
JSON (useful for JavaScript web pages), XML (useful for further transformations), etc.
Figure 3 - The SPARQL editor written by C. Boisvert for the Sheffield project
Figure 3 illustrates a SPARQL query asking for hourly readings from all available
Groundhogs between selected dates, together with the results of this query.
One of this paper’s authors installed a SPARQL endpoint on a hosted server, and set up the
necessary triples to allow over-the-web interrogation of the Groundhog database. A
second author fed the SPARQL syntax definition into standard open editor tools to
generate an editor for queries. Users are invited to use the editor (Boisvert). Some
example queries for Groundhogs are available (BWDS).
There are also some very good, student-appropriate tutorial guides for SPARQL, e.g.
(CAM), accessing some well-documented knowledge stores, including the DBpedia
triple store built from Wikipedia.
Tutors following this paper should note that if students cannot master using SPARQL
for downloading Groundhog data then the manual page-scraping approach
discussed in the previous section ultimately will deliver the same data.
4. Integration of further data sources
One of the principles of Data Warehousing when used for analytical purposes (as
opposed to “data store housing” for safe custody of data) is to try to give added
context to facts, through Dimension descriptors added from other sources.
In the current case study Groundhog1 lists temperature and air pressure readings.
But other factors may influence pollution formation and/or dispersal as well. Obvious
factors are wind strength and humidity.
Wind direction is also a factor, but a more complex one: if the monitor is directly east of
a large polluting factory then a strong wind towards the east will increase
measurement values; if the monitor were directly west of the factory, the same
towards-east wind would remove the pollution from the area of the monitor. So wind
direction is relevant, but its effect is different for each Groundhog.
Detailed historic weather data is commercially valuable, and quite hard to find for
free download. Sheffield is fortunate in having a local enthusiast who has monitored
and published readings at five-minute intervals for all the desired measures.
Unfortunately the data is published in PDF format, with documents of around 200
pages per month of data. The tool “Bytescout PDF Viewer” can be used to extract all
pages into one CSV file. All readers are very strongly requested to contact the data
owner to get permission to use the data for study purposes (any commercial use of
the data could cause the site to be closed).
5. Creation of a Data Warehouse, and ETL processes
All data values were then uploaded into a Microsoft SQL Server (Business
Intelligence edition) database. This software is free (for academic use) to install from
Microsoft Dreamspark (MS1) onto university teaching systems and student laptops.
Alternatively, students can have 150-day free use of the same software from the
Microsoft Azure cloud platform (MS2). MS Azure has convenient setup options for
SQL Server Business Intelligence.
Alternatives, such as the Weka tools [WEKA], are also possible for the steps that
follow, and offer opportunities to investigate the processing algorithms further. The
authors have found Microsoft tools more appropriate for their student population.
Once the data is loaded into tables on the SQL Server it needs to be transformed into
formats (discussed below) suitable for data analysis. The case study demonstrates a
realistic but manageable number of steps that can be found in many Extract-
Transform-Load systems of Data Warehouses.
All the scripts used for ETL are available from (Love).
One of the useful teaching illustrations of the ETL process is to contrast using the
Server’s menu-driven wizard approaches for uploading files into tables with SQL
scripts that do the same tasks. Students are not always aware that SQL has
commands that allow for manipulation of database structure (as opposed to
manipulation of data values), but quickly start to see the value of relatively short
scripts that can be reused across multiple uploads.
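As an illustration of such a script (a sketch only: the table name, file path and BULK INSERT options below are hypothetical, and would need adjusting to the actual download, including the end-of-line characters noted in Section 2), a staging table matching the CSV layout can be created and loaded as follows:

-- Staging table matching the downloaded layout: date (YYMMDD), time (HH:MM), NO2 reading.
CREATE TABLE stgGroundhogNO2 (
    ReadingDate CHAR(6)      NOT NULL,
    ReadingTime CHAR(5)      NOT NULL,
    NO2         DECIMAL(9,4) NULL
);

-- Load one downloaded file; re-run with a different path for each station / pollutant.
BULK INSERT stgGroundhogNO2
FROM 'C:\data\groundhog1_no2.csv'        -- hypothetical path
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

Re-running the same few statements with a different file path replaces the wizard's many dialogue steps with a script that students can keep, inspect and reuse.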
A second useful teaching illustration comes from the “Sheffield Weather Page” data
being at five-minute frequency, while the Groundhog readings arrive at twelve- to
twenty-minute intervals. This was resolved by writing further scripts in SQL
(again downloadable from the above link) that first summarised the respective
Groundhog and the Weather data into hourly readings (taking the means of readings
within each hour, except for wind direction where the most frequent wind direction
was taken), and then integrated these into a single observations table.
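A minimal sketch of that summarisation step is given below; the table and column names are illustrative rather than those in the downloadable scripts (Love). The first statement takes hourly means of the raw readings; the second picks the most frequent wind direction in each hour.

-- Hourly means of the raw Groundhog readings (illustrative names).
SELECT  StationId,
        DATEADD(hour, DATEDIFF(hour, 0, ReadingDateTime), 0) AS ObsHour,
        AVG(NO2) AS MeanNO2
INTO    hourlyGroundhog
FROM    stgGroundhogReadings
GROUP BY StationId, DATEADD(hour, DATEDIFF(hour, 0, ReadingDateTime), 0);

-- Modal (most frequent) wind direction per hour from the five-minute weather data.
SELECT  ObsHour, WindDirection
INTO    hourlyWindDirection
FROM (
    SELECT  DATEADD(hour, DATEDIFF(hour, 0, ReadingDateTime), 0) AS ObsHour,
            WindDirection,
            ROW_NUMBER() OVER (
                PARTITION BY DATEADD(hour, DATEDIFF(hour, 0, ReadingDateTime), 0)
                ORDER BY COUNT(*) DESC) AS rn
    FROM    stgWeatherReadings
    GROUP BY DATEADD(hour, DATEDIFF(hour, 0, ReadingDateTime), 0), WindDirection
) ranked
WHERE   rn = 1;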
A second database on the same server was opened, and the observations table copied
(via a short script command) into it. Descriptive tables were created that gave
informative names and attributes for the Groundhogs, and descriptive category names
and range limits for each of the weather attributes (for example dimWindSpeed: no
wind = 0 kph; very light breeze = 1-3 kph, through to strong winds, 20 kph and over).
A script then created a data “Star” based on (Kimball)’s designs, with a single Facts
table linked to relevant rows in each of the Dimension tables.
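The shape of such a star can be sketched as below; the names and wind-speed bands are illustrative only, further dimensions (temperature, humidity, date) are omitted for brevity, and the full design is in the downloadable scripts.

CREATE TABLE dimStation (
    StationKey   INT IDENTITY PRIMARY KEY,
    StationName  VARCHAR(50),            -- e.g. 'Groundhog1'
    LocationInfo VARCHAR(200)
);

CREATE TABLE dimWindSpeed (
    WindSpeedKey INT IDENTITY PRIMARY KEY,
    Category     VARCHAR(30),            -- 'No wind', 'Very light breeze', ... 'Strong winds'
    LowKph       DECIMAL(5,1),
    HighKph      DECIMAL(5,1)
);

CREATE TABLE factObservation (
    ObsHour      DATETIME NOT NULL,
    StationKey   INT NOT NULL REFERENCES dimStation(StationKey),
    WindSpeedKey INT NOT NULL REFERENCES dimWindSpeed(WindSpeedKey),
    NO2          DECIMAL(9,4),           -- hourly mean readings
    Temperature  DECIMAL(5,1),
    Pressure     DECIMAL(6,1)
);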
Having two databases on the same server, one for ETL data acquisition and
preparation, and one for storage of the integrated Star of facts and dimension tables,
helps students see for themselves the concept of a Data Staging area as described
throughout Kimball’s work. Just as Kimball describes, all the “messy” processes
happen, hidden from end-user view, in the Staging area. Clean, usable, subject-
structured data is then published to data marts.
6. Creation of Data Cube from Data Star
SQL Server with Business Intelligence contains a facility for defining Data Cubes for
fast analytical processing. Cubes can source their data directly from the uploaded
CSV files, but students quickly appreciate the simplicity of sourcing from the Star
created in the previous step. Refreshes of the data values in the Star (or even
alterations to the design of the Star) can quickly be pulled through into the Cube.
By default, when used for self-serve reporting (illustrated below) cubes automatically
report totals (sums) of data values, aggregated over the user-selected timeframe (or
geographic distribution, etc.). For example, selecting NO2 (Nitrogen Dioxide) would
automatically report the total of all readings ever, or totals per year, per month, per
day, or even per hour, depending on what date range the user happened to select.
Users usually start at the top level – the most aggregated – and then “drill down” for
more details.
In the current case study the averages of pollution values are much more relevant
than the totals. It is a lot easier to compare calendar months of data if averages are
used, as this eliminates the effect of some months being longer than others. Peak
values within any selected time frame are also of “headline” interest, but users need
to treat this information with caution, as a peak may well be caused by a local factor
such as a badly tuned lorry or tractor passing upwind of the monitoring station.
The “Calculated Measures” facility of the Data Cube was used to set formulas to
report the means of each of the numeric measures. The formula for mean is as
simple as “Sum of NO2 divided by Count of NO2”: the cube automatically applies
the context of the current level of drilling for all selected dimensions. Setting up medians is
beyond the scope of this simplified case study, but students can discuss how median
values can be used to ignore the effect of outlier readings.
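For readers without the cube to hand, the effect of such a calculated measure at one particular drill level can be reproduced directly against the star in SQL; this is only an illustrative equivalent, using the hypothetical names from the sketch in the previous section.

-- Mean NO2 by month and wind-speed band, mirroring "Sum of NO2 divided by Count of NO2".
SELECT  d.Category                AS WindSpeedBand,
        YEAR(f.ObsHour)           AS Yr,
        MONTH(f.ObsHour)          AS Mth,
        SUM(f.NO2) / COUNT(f.NO2) AS MeanNO2    -- equivalently AVG(f.NO2)
FROM    factObservation f
JOIN    dimWindSpeed d ON d.WindSpeedKey = f.WindSpeedKey
GROUP BY d.Category, YEAR(f.ObsHour), MONTH(f.ObsHour);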
Many texts on Data Warehousing utilise Inmon’s term of “subject-oriented”. In simple
case studies students often cannot see the difference between the data sources and
the DW subject-orientation. One differentiator is that “business rules” can be
encoded into the data or data presentation within the cubes. For nitrogen dioxide air
pollution, 40 µg/m³ is a threshold for concern, and 100 µg/m³ is a threshold for serious
concern. Facilities within the Data Cube were used to encode these levels into colours
of presentation. Key Performance Indicators, with “traffic light” colours and “trend”
arrows could also be set up. The threshold values can usefully be explained to
students as examples of Business Metadata, contrasting with Technical Metadata
(such as field data types) more often seen in tutorials.
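The same business rule can also be expressed outside the cube, for example as a simple banding expression over the fact table. This is a sketch using the hypothetical names above; the band labels are ours, and the thresholds are the 40 and 100 figures quoted in the text.

SELECT  f.ObsHour,
        f.NO2,
        CASE WHEN f.NO2 >= 100 THEN 'Serious concern'   -- thresholds quoted in the text
             WHEN f.NO2 >= 40  THEN 'Concern'
             ELSE 'Acceptable'
        END AS NO2Band                                   -- an example of business metadata
FROM    factObservation f;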
7. Self Service data exploration
Microsoft’s preferred “self-service” data exploration tool is Excel. Indeed, a single
button click from within the cube development tool will open the cube in Excel. Data
is presented via PivotTables. (Readers should note that the Azure cloud platform
does not contain Excel. Users can set end-points to allow their local copy of Excel to
link to the cloud server. Alternatively users can simply install 30-day trial copies of
Office onto Azure).
Figure 4 - A self-service display of data showing Nitrogen Dioxide levels per hour on days of week.
Students will very quickly (within minutes) start making discoveries about the data.
Figure 4 shows nitrogen dioxide pollution levels varying across time for each day of
the week. This image prompted a lot of discussion as to the timing of the apparent
peak times for pollution (the effect of driving?) and the clear difference between
Saturday and Sunday versus the rest of the week.
Figure 5 - Average NO2 levels for categories of temperature
Figure 6 - Average NO2 levels for source direction of wind (Groundhog 1 monitor)
It can also be discovered that freezing or near-freezing days are associated with high
NO2 pollution levels, and that for Groundhog1 winds from the east are associated
with worse pollution.
Students can “self-service” discover other relationships between the data. Some are
obvious (winter months tend to have colder days), but students do get to experience
the concept of a data analyst exploring the data themselves. Many students do not
know that displays other than line graphs and bar charts are available, and useful
discussions can be held about using comparative percentages as a means of spotting
patterns or exceptions.
8. Data Mining
Many students of databases get a few introductory classes on Data Mining, but may
not get to build and use a data mining facility for themselves. Having got the
pollution and weather data into SQL Server, the same environment can be used to
develop mining reports within a few minutes and with no further coding.
Figure 7 - Data Mining using the SQL Server Business Intelligence suite
Left image : Selecting inputs (weather conditions) and an output - what is to be predicted (NO2
reading)
Right image: Result of running the Cluster data mining tool
In Figure 7 above a Clustering algorithm has identified ten clusters of weather data.
The darker clusters (for example Cluster 9) contain a high proportion of bad pollution
days; the lighter clusters (for example Clusters 3 and 6) contain hardly any bad pollution days.
Figure 8 - Understanding the properties of Cluster 9 (the cluster with a large proportion of High NO2 readings)
Left image - Properties ranked by probability
Right image - Comparison of properties of Cluster 9 compared to all other clusters
Cluster 9 can be understood by viewing further screens (see Figure 8). The left image
appears to show that this cluster (with its high proportion of high pollution days)
tends to have dry, low-wind, high-pressure conditions, high or very high humidity, and
cold or near-freezing temperatures. In other words: murky, dry winter mornings.
(It is not necessary for the current discussion, but rainfall absorbs NO2, and sunshine
breaks down NO2 while giving a by-product of ozone; so warm dry days are not
necessarily clear of pollutants.)
Figure 9 - Further analysis of the Weather / Pollution dataset : Left - Association Rules, Right - Decision
Trees
Figure 9 (left) shows Association Rules. For example, the first rule listed states that
Strong Winds and Warm temperatures are associated with the lowest levels of NO2.
Figure 9 (right) shows a Decision Tree. The most influential factor for NO2 appears to
be wind speed. Time of day and then Air Temperature are the next deciding factors.
Other Data Mining models available include Neural Nets, Regression and Naive Bayes.
Our experience is that the default settings for each of the analyses shown produce
interpretable results quickly. Fine-tuning the parameters (controlling the number of
clusters, for example) can improve the results, but often the effect is marginal.
However, discussing with students what the parameters do can
help students understand the concepts of “supervised” versus “unsupervised”
learning. A frequent discussion point is whether Categories can then be fed back into
the Data Warehouse, to fine-tune the Dimension attributes.
9. Summary and Contribution to Knowledge
This paper has demonstrated the application of Data Warehousing and Data Mining
tools, using data gathered from internet sources. But rather than taking conveniently
prepared datasets, the paper has shown some of the common difficulties met by
database professionals when collecting data from non-traditional sources. Linked
Data has been explained and illustrated as a potentially very helpful way out of these
difficulties.
All the tools used, and all the data used, are available for free use in academic
contexts. Please do ask data providers for permission to use the data, though.
Perhaps the major contribution of this case study is that although it introduces and
discusses a number of “real-world” issues, particularly around the Extract-Transform-
Load procedures of data warehousing, the scale of the study remains small enough
for quick comprehension by students. Each of the steps can largely be done taking
default options, and mistakes in the design can be recovered simply by re-running
relevant steps. Of course there is a risk in this – that students may get the impression
that always selecting defaults without comprehension of the alternatives is the
correct thing to do. However, our experience is that it is very helpful to be able to see
the “end-to-end picture” at a relatively early stage, and then be able to revisit the
pieces to see their connection with their mainstream database and data analysis
studies.
10. References All web links accessed 14 May 2015
Boisvert SPARQL editor: http://www.boisvert.me.uk/opendata/sparql_aq+.html
BWDS Example Groundhog queries:
https://github.com/BetterWithDataSociety/ShefAirQualityAgent/wiki/Sample-
SPARQL
CAM SPARQL tutorial (with interactive editing and access to public datasets)
http://www.cambridgesemantics.com/semantic-university/sparql-by-example
Love Scripts for summarising data to hourly frequency and constructing the data Star
http://aces.shu.ac.uk/AirQuality
MS1 Microsoft Dreamspark (Software repository for academic users)
https://www.dreamspark.com
MS2 Microsoft Azure (Cloud-based platform supporting SQL Server)
http://azure.microsoft.com
ODH Open Data Handbook
http://opendatahandbook.org/en/what-is-open-data
PHE Estimating Local Mortality Burdens associated with Particulate Air
Pollution
Public Health England, 2014
https://www.gov.uk/government/uploads/system/uploads/attachment_
data/file/332854/PHE_CRCE_010.pdf
SCC-1 Sheffield City Council - Air Quality web pages
https://www.sheffield.gov.uk/environment/air-quality/monitoring.html
SCC-2 Sheffield City Council - Air Pollution Monitoring data pages
http://sheffieldairquality.gen2training.co.uk/sheffield/index.html
W3S Linked Data
http://www.w3.org/standards/semanticweb/data
WEKA Data Mining tools
http://www.cs.waikato.ac.nz/ml/weka

More Related Content

What's hot (9)

070726 Igarss07 Barcelona
070726 Igarss07 Barcelona070726 Igarss07 Barcelona
070726 Igarss07 Barcelona
 
Macintyre2011
Macintyre2011Macintyre2011
Macintyre2011
 
LinkSUM: Using Link Analysis to Summarize Entity Data
LinkSUM: Using Link Analysis to Summarize Entity DataLinkSUM: Using Link Analysis to Summarize Entity Data
LinkSUM: Using Link Analysis to Summarize Entity Data
 
A gentle introduction to riese
A gentle introduction to rieseA gentle introduction to riese
A gentle introduction to riese
 
2011 ITS World Congress - GO-Sync - A Framework to Synchronize Transit Agency...
2011 ITS World Congress - GO-Sync - A Framework to Synchronize Transit Agency...2011 ITS World Congress - GO-Sync - A Framework to Synchronize Transit Agency...
2011 ITS World Congress - GO-Sync - A Framework to Synchronize Transit Agency...
 
Transparent and scalable open url quality metrics
Transparent and scalable open url quality metricsTransparent and scalable open url quality metrics
Transparent and scalable open url quality metrics
 
At33264269
At33264269At33264269
At33264269
 
Chem Spider Building An Online Database Of Open Spectra
Chem Spider  Building An Online Database Of Open Spectra Chem Spider  Building An Online Database Of Open Spectra
Chem Spider Building An Online Database Of Open Spectra
 
Measuring Open Data Portal User-Orientation: A Computational Approach
Measuring Open Data Portal User-Orientation: A Computational ApproachMeasuring Open Data Portal User-Orientation: A Computational Approach
Measuring Open Data Portal User-Orientation: A Computational Approach
 

Viewers also liked

Budżet obywatelski w Kielcach- raport z monitoringu
Budżet obywatelski w Kielcach- raport z monitoringuBudżet obywatelski w Kielcach- raport z monitoringu
Budżet obywatelski w Kielcach- raport z monitoringuStowarzyszenie Bona Fides
 
PPT Experimental Analysis of Superlooping in Twisted Polymer Line
PPT Experimental Analysis of Superlooping in Twisted Polymer LinePPT Experimental Analysis of Superlooping in Twisted Polymer Line
PPT Experimental Analysis of Superlooping in Twisted Polymer LineMarlen Mahendraratnam
 
curriculum vitae
curriculum vitaecurriculum vitae
curriculum vitaeMark Wagner
 
Planmeca extranet - Janne Pakkanen
Planmeca extranet - Janne PakkanenPlanmeca extranet - Janne Pakkanen
Planmeca extranet - Janne PakkanenKnowit Oy
 
PPG Industries Simplifies with DRM
PPG Industries Simplifies with DRMPPG Industries Simplifies with DRM
PPG Industries Simplifies with DRMAlithya
 
образи на картинах стівена гарднера
образи на картинах стівена гарднераобрази на картинах стівена гарднера
образи на картинах стівена гарднераЛюдмила Квадріціус
 
Интернет-проект: от идеи до клиента!
Интернет-проект: от идеи до клиента!Интернет-проект: от идеи до клиента!
Интернет-проект: от идеи до клиента!Cybermarketing, Moscow
 
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...Knowit Oy
 
Introduction to Oracle Hyperion Financial Close Suite
Introduction to Oracle Hyperion Financial Close SuiteIntroduction to Oracle Hyperion Financial Close Suite
Introduction to Oracle Hyperion Financial Close SuiteAlithya
 

Viewers also liked (15)

Budżet obywatelski w Kielcach- raport z monitoringu
Budżet obywatelski w Kielcach- raport z monitoringuBudżet obywatelski w Kielcach- raport z monitoringu
Budżet obywatelski w Kielcach- raport z monitoringu
 
PPT Experimental Analysis of Superlooping in Twisted Polymer Line
PPT Experimental Analysis of Superlooping in Twisted Polymer LinePPT Experimental Analysis of Superlooping in Twisted Polymer Line
PPT Experimental Analysis of Superlooping in Twisted Polymer Line
 
Integrated dance and movement and Common Core Standards
Integrated dance and movement and Common Core StandardsIntegrated dance and movement and Common Core Standards
Integrated dance and movement and Common Core Standards
 
Brochure 2k16
Brochure 2k16Brochure 2k16
Brochure 2k16
 
Dayana y david
Dayana y davidDayana y david
Dayana y david
 
MÁQUINAS SIMPLES
MÁQUINAS SIMPLESMÁQUINAS SIMPLES
MÁQUINAS SIMPLES
 
Untitled Presentation
Untitled PresentationUntitled Presentation
Untitled Presentation
 
curriculum vitae
curriculum vitaecurriculum vitae
curriculum vitae
 
Planmeca extranet - Janne Pakkanen
Planmeca extranet - Janne PakkanenPlanmeca extranet - Janne Pakkanen
Planmeca extranet - Janne Pakkanen
 
10215 A 05
10215 A 0510215 A 05
10215 A 05
 
PPG Industries Simplifies with DRM
PPG Industries Simplifies with DRMPPG Industries Simplifies with DRM
PPG Industries Simplifies with DRM
 
образи на картинах стівена гарднера
образи на картинах стівена гарднераобрази на картинах стівена гарднера
образи на картинах стівена гарднера
 
Интернет-проект: от идеи до клиента!
Интернет-проект: от идеи до клиента!Интернет-проект: от идеи до клиента!
Интернет-проект: от идеи до клиента!
 
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...
 
Introduction to Oracle Hyperion Financial Close Suite
Introduction to Oracle Hyperion Financial Close SuiteIntroduction to Oracle Hyperion Financial Close Suite
Introduction to Oracle Hyperion Financial Close Suite
 

Similar to Tlad better with data - matthew love + charles (2)

Tlad 2015 presentation amin+charles-final
Tlad 2015 presentation   amin+charles-finalTlad 2015 presentation   amin+charles-final
Tlad 2015 presentation amin+charles-finalAmin Chowdhury
 
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...Werner Leyh
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureRoss Mounce
 
Developing Insights Between Urban Air Quality and Public Health Through the E...
Developing Insights Between Urban Air Quality and Public Health Through the E...Developing Insights Between Urban Air Quality and Public Health Through the E...
Developing Insights Between Urban Air Quality and Public Health Through the E...PerkinElmer, Inc.
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGChris Ewing
 
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAA HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAgerogepatton
 
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAA HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAijaia
 
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...Rudolf Husar
 
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and SqoopAnalysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoopsushantparte
 
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...Rudolf Husar
 
Analysing Transportation Data with Open Source Big Data Analytic Tools
Analysing Transportation Data with Open Source Big Data Analytic ToolsAnalysing Transportation Data with Open Source Big Data Analytic Tools
Analysing Transportation Data with Open Source Big Data Analytic Toolsijeei-iaes
 
The Linked Data Lifecycle
The Linked Data LifecycleThe Linked Data Lifecycle
The Linked Data Lifecyclegeoknow
 
Linked Data Generation for the University Data From Legacy Database
Linked Data Generation for the University Data From Legacy Database  Linked Data Generation for the University Data From Legacy Database
Linked Data Generation for the University Data From Legacy Database dannyijwest
 
Sampling of User Behavior Using Online Social Network
Sampling of User Behavior Using Online Social NetworkSampling of User Behavior Using Online Social Network
Sampling of User Behavior Using Online Social NetworkEditor IJCATR
 
Traffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier FactorsTraffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier FactorsITIIIndustries
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLAnubhav Jain
 
Service Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningService Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningIIRindia
 
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...IJDKP
 
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...IJDKP
 

Similar to Tlad better with data - matthew love + charles (2) (20)

Tlad 2015 presentation amin+charles-final
Tlad 2015 presentation   amin+charles-finalTlad 2015 presentation   amin+charles-final
Tlad 2015 presentation amin+charles-final
 
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | Future
 
Linked sensor data
Linked sensor dataLinked sensor data
Linked sensor data
 
Developing Insights Between Urban Air Quality and Public Health Through the E...
Developing Insights Between Urban Air Quality and Public Health Through the E...Developing Insights Between Urban Air Quality and Public Health Through the E...
Developing Insights Between Urban Air Quality and Public Health Through the E...
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIG
 
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAA HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
 
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAA HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
 
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...
 
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and SqoopAnalysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
 
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...
 
Analysing Transportation Data with Open Source Big Data Analytic Tools
Analysing Transportation Data with Open Source Big Data Analytic ToolsAnalysing Transportation Data with Open Source Big Data Analytic Tools
Analysing Transportation Data with Open Source Big Data Analytic Tools
 
The Linked Data Lifecycle
The Linked Data LifecycleThe Linked Data Lifecycle
The Linked Data Lifecycle
 
Linked Data Generation for the University Data From Legacy Database
Linked Data Generation for the University Data From Legacy Database  Linked Data Generation for the University Data From Legacy Database
Linked Data Generation for the University Data From Legacy Database
 
Sampling of User Behavior Using Online Social Network
Sampling of User Behavior Using Online Social NetworkSampling of User Behavior Using Online Social Network
Sampling of User Behavior Using Online Social Network
 
Traffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier FactorsTraffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier Factors
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
Service Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningService Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data Mining
 
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
 
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
 

More from Amin Chowdhury

OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLS
OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLSOPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLS
OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLSAmin Chowdhury
 
Database Project management
Database Project managementDatabase Project management
Database Project managementAmin Chowdhury
 
Database Industry perspective
Database Industry perspectiveDatabase Industry perspective
Database Industry perspectiveAmin Chowdhury
 
090321 - EEHCO Project Plan PSTC- Dhaka
090321 - EEHCO Project Plan PSTC- Dhaka090321 - EEHCO Project Plan PSTC- Dhaka
090321 - EEHCO Project Plan PSTC- DhakaAmin Chowdhury
 
E-commerce Project Development
E-commerce Project DevelopmentE-commerce Project Development
E-commerce Project DevelopmentAmin Chowdhury
 
Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernAmin Chowdhury
 

More from Amin Chowdhury (7)

OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLS
OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLSOPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLS
OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLS
 
Database Project management
Database Project managementDatabase Project management
Database Project management
 
Database Industry perspective
Database Industry perspectiveDatabase Industry perspective
Database Industry perspective
 
Database Sizing
Database SizingDatabase Sizing
Database Sizing
 
090321 - EEHCO Project Plan PSTC- Dhaka
090321 - EEHCO Project Plan PSTC- Dhaka090321 - EEHCO Project Plan PSTC- Dhaka
090321 - EEHCO Project Plan PSTC- Dhaka
 
E-commerce Project Development
E-commerce Project DevelopmentE-commerce Project Development
E-commerce Project Development
 
Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing Concern
 

Tlad better with data - matthew love + charles (2)

  • 1. TLAD 2015, The 13th International Workshop on Teaching, Learning and Assessment in Databases. Birmingham, UK, 3rd July. Better with Data: A case study in sourcing Linked Data into a Business Intelligence analysis Matthew Love Dept. of Computing Sheffield Hallam University Sheffield S1 1WB m.love@shu.ac.uk Charles Boisvert Dept. of Computing Sheffield Hallam University Sheffield S1 1WB c.boisvert@shu.ac.u k Amin Chowdhury Dept. of Computing Sheffield Hallam University Sheffield S1 1WB mdamin.chowdhur y@gmail.com Ian Ibbotson Better with Data Society ianibbo@gmail.co m http://betterwithda ta.co Abstract This paper describes a case study investigating the relationship between weather conditions and levels of air pollution. The case study illustrates aspects of finding and accessing Open Data, exploring Linked Data, sections of the Extract-Transform- Load processes of data warehousing, building an analytic cube, and application of data mining tools. The paper is intended to aid tutors and students of databases by providing a study that gives a practical and repeatable case example giving an overview of how several topics in the area of data collection and analysis integrate together. All the data sources and tools used are free for use in academic contexts (permissions from data source owners should be sought). Web links are given to all resources discussed. Keywords Business Intelligence, Data Mining, Data Warehousing, SPARQL, Linked Data 1. Introduction This paper (and associated downloads) outlines a number of topics related to finding and accessing Open Data, merging sources, and analysing data using self-service and
  • 2. Page 2 data mining tools. All the data and the software tools are available to academics/students free of charge (though permission should be sought for some of the data). The case study presented involves more data than might normally be used for overviews of the subject and is ‘rich’ enough in content to help highlight genuine issues, but remains sufficiently structured and simple as to not become overwhelming. It is intended that tutors can use the resources described to give overview presentations of the topic without needing to overcome problematic barriers. 2. Case Study : Air pollution in a major city Air pollution kills people. It is estimated that 29000 thousand people per year in the UK die early through breathing difficulties at times of low air quality (PHE). The UK government has imposed targets for reducing the quantities and/or frequencies of the main pollutants (some figures given below). Local Authorities are responsible for monitoring and publishing pollution levels in their areas. Sheffield City Council uses two types of monitoring devices: diffusion tubes and fully automated processing units. Fig 1: Nitrogen Dioxide diffusion tube (left) and automated station (right) Both types of devices are illustrated figure 1 above. There are around 160 diffusion tube devices and six fully automated processing stations. The diffusion tubes have the advantage of being spread throughout the city area. However, they give data only when sent in for analysis, and typically this is once every six to eight weeks per tube. The results are aggregated to an annual level prior to publication.
  • 3. Page 3 The six automated processing stations, named “Groundhogs”, measure a variety of pollutants, and one also measures temperature and air pressure. Between three and eight readings are taken per hour. The public can read a log of all readings up to around one hour previously. The council occasionally may correct readings, or delete readings from the log. Although some of the stations have been operating since 2000, there are a number of gaps in the data logs. In addition, the stations are occasionally moved (usually to help investigate new pollution “hot spot” concerns). The council maintains a web site (SCC-1) giving several informative descriptions as to the types of pollutants commonly found in air plus a description of the images, and further descriptions (SCC-2). Readers (and their students) are invited to visit the SCC-2 web link and then select “Station info”. This is Figure 2 left on the following page. Fig 2 - Council pollution monitoring information (left) and automated station results (right) Fig 2 (left) illustrates a very common problem with data sourced from the internet. The information is presented as textual descriptions, with no obvious way of automatically deriving further information. We are told that Groundhog1 is at “Orphanage Road, Firhill” but it would take a human-based web search to find the geographical location and then further searches to discover the nature of the location (residential or industrial area, nearness to main road, etc). Tutors may use this to introduce a discussion on why data on the internet, even when is not necessarily considered as “Open Data”, even when user rights are clear. The Open Data Handbook [ODH] is a useful resource.
  • 4. Page 4 Readers (and their students) should then visit the “Latest data” tab of the same site, and then click on any of the Groundhogs (Groundhog1 is often the best choice). This is Figure 2 right. Notice that navigation to this page is not designed for automation. The user must click on a visual map to select the page. Notice too, that the URL for data pages does not reflect the name of the Groundhog being visited so likewise manual navigation is necessary. Students may discuss how, in an age of internet-sourced data, URLs can and should be designed to allow for automated discovery by data harvesting tools. Note, however, from the above image, that this page allows data (of any user- selected range) to be downloaded, in a choice of PostScript, “Raw” (i.e. comma separated values) or Excel formats. If following this paper as an exercise, readers should attempt to download some example CSV files. They will find this must be done one groundstation, and one pollution type at a time. Downloads may be inspected using WordPad (not Notepad, as the EndOfLines are not compatible), when it can be seen that there are date (YYMMDD), time (HH:MM) and NO2 reading, roughly 3 to 8 readings per hour. 3. Linked data The above discussion has highlighted the need for better methods of publishing data for automated discovery and consumption. An important method is Linked Data [W3S], which extends standard data by giving information about its relationship to other data. Chains can be followed, to discover more information about the context of something of interest (for example to discover about the type of neighbourhood of each of the Groundhog sites) Very informally, Linked Data items are presented as a threefold item:  a unique identifier for the data item that is the subject of a relation  a predicate describing the nature of the relation  The information related the object of the relation. The identifier and predicate take the form of a URI, but the object may be either a URI or a literal value. URIs are often also the URL of a description file: in this case a user can fetch human-readable information about the item into a browser. Chains can be followed from triple to triple to discover more information about the context of something of interest, for example to discover information about each of the Groundhog sites. For this reason linked data is said to use a graph database.
  • 5. Page 5 For example, as illustrated figure 3 below, the data about the “Groundhog1” NO2 sensor shows what it measures, its location, the type of device it is, and actual measured values for a given date and time. Fig 3: A subset of the Sheffield Air Quality+ database. Data points in bold, red text are URIs; One of this paper’s authors created the Air Quality+ database, which allows over- the-web interrogation of the Sheffield pollution measurements as linked data. Each of the Groundhog stations have their own URI, and the database holds the measurements of each sensor, for the Groundhog stations and diffusion tubes; so Groundhog1 sensors, for example, include not just NO2 but also SO2, micro-particles (e.g. diesel fumes), air pressure and air temperature. Each of these sensors records frequent measurements that are archived in the database as triples. Using the URI, all but literal values can be further investigated, for instance to find out more about the NO2 compound. To query the Subject / Predicate / Value triples in the database, we use the SPARQL query language. SPARQL is designed to facilitate exploring a linked data graph; for example:  If given the URI of a schema for Groundhogs, it can return a list of all Groundhogs.  If given a URI of a specific Groundhog, it can return its schema, which (for example) can be used to discover what pollutions that particular Groundhog is set up to measure.  If given the URI of a specific pollutant, it can return all the values, i.e. all the readings for that pollutant. sensor Groundhog1 NO2 sensor sensing device NO2 type GroundHog1NO2 2015051300500 observation value 13.189 4 2015-05-10T05:00Z type has value end time measurement property 53.40266 latitude -1.463957 longitude
  • 6. Page 6 All queries can contain filters, for example to only return values within a selected date range. From the above informal description the reader may be able to see that SPARQL can be used to programmatically discover what Groundhogs there are, what pollutants each monitors, and then get the readings of those pollutants. Most SPARQL systems offer a range of formats for retrieving results, including CVS (useful for databases), JSON (useful for Javascript web pages), XML (useful for further transpositions) etc. Figure 3 - The SPARQL editor written by C. Boisvert for the Sheffield project Figure 3 illustrates a SPARQL query asking for hourly readings from all available Groundhogs between selected dates, together with the results of this query. One of this paper’s authors installed SPARQL on a hosted server, and set up the necessary triples to allow over-the-web interrogation of the Groundhog database. A second author fed the SPARQL syntax definition into standard open editor tools to generate an editor for queries. Users are invited to use the editor (Boisvert). Some example queries for Groundhogs are available (BWDS). There are also some very good, student appropriate, tutorial guides for SPARQL, e.g. (CAM), accessing some well documented knowledge stores, including Wikipedia’s triple store.
  • 7. Page 7 Tutors following this paper should note that if students cannot master using SPARQL for downloading Groundhog data then the manual page-scraping approach discussed in the previous section ultimately will deliver the same data. 4. Integration of further data sources One of the principles of Data Warehousing when used for analytical purposes (as opposed to “data store housing” for safe custody of data) is to try to give added context to facts, through Dimension descriptors added from other sources. In the current case study Groundhog1 lists temperature and air pressure readings. But other factors may influence pollution formation and/or dispersal as well. Obvious factors are wind strength and humidity. Wind direction is also a factor (but is more complex: if the monitor is directly east of a large polluting factory then a strong wind towards the east will increase measurement values; if the monitor was directly west of the factor, then the same strong towards-east wind would remove the pollution from the area of the monitor; so wind direction is relevant, but the effect is different for each Groundhog) Detailed historic weather data is commercially valuable, and quite hard to find for free download. Sheffield is fortunate in having a local enthusiast who had monitored and published readings at five-minute intervals for all the desired measures. Unfortunately the data is published in PDF format, with documents of around 200 pages per month of data. The tool “Bytescount PDF viewer” can be used to extract all pages into one CVS file. All readers are very strongly requested to contact the data owner to get permissions to use the data for study purposes (any commercial use of the data could cause the site to be closed). 5. Creation of a Data Warehouse, and ETL processes All data values were then uploaded into a Microsoft SQL Server with Business Intelligence database. This software is free (for academic use) to install from Microsoft Dreamspark (MS1) onto university teaching systems and student laptops. Alternatively, students can have 150-day free use of the same software from the Microsoft Azure cloud platform (MS2). MS Azure has convenient setup options for SQL Server Business Intelligence. Alternatives, such as the Weka tools [WEKA], are also envisageble for the steps that follow, and offer opportunities to investigate the processing algorithms further. The authors have found Microsoft tools more appropriate for their student population. Once the data is loaded into tables on the SQL Server it needs to be transformed into formats (discussed below) suitable for data analysis. The case study demonstrates a
  • 8. Page 8 realistic but manageable number of steps that can be found in many Extract- Transform-Load systems of Data Warehouses. All the scripts used for ETL are available from (Love). One of the useful teaching illustrations of the ETL process is to contrast using the Server’s menu-driven wizard approaches for uploading files into tables with SQL scripts that do the same tasks. Students are not always aware that SQL has commands that allow for manipulation of database structure (as opposed to manipulation of data values), but quickly start to see the value of relatively short scripts that can be reused across multiple uploads. A second useful teaching illustration comes from the “Sheffield Weather Page” data being at five minute frequency, while the Groundhog readings vary between twelve and twenty minute frequencies. This was resolved by writing further scripts in SQL (again downloadable from the above link) that first summarised the respective Groundhog and the Weather data into hourly readings (taking the means of readings within each hour, except for wind direction where the most frequent wind direction was taken), and then integrated these into a single observations table. A second database on the same server was opened, and the obervations table copied (via a short script command) into it. Descriptive tables were created that gave informative names and attributes for the Groudhogs, and descriptive category names range limits for each of the weather attributes (for example dimWindSpeed: No wind = 0 kph; very light breeze = 1-3 kph, through to strong winds 20 kph and over). A script then created a data “Star” based on (Kimball)’s designs, with a single Facts table linked to relevant rows in each of the Dimension tables. Having two databases on the same server, one for ETL data acquisition and preparation, and one for storage of the integrated Star of facts and dimension tables, helps students see for themselves the concept of a Data Staging area as described throughout Kimball’s work. Just as Kimball describes, all the “messy” processes happen, hidden from end-user view, in the Staging area. Clean, usable, subject- structured data is then published to data marts. 6. Creation of Data Cube from Data Star SQL Server with Business Intelligence contains a facility for defining Data Cubes for fast analytical processing. Cubes can source their data directly from the uploaded CVS files, but students quickly appreciate the simplicity of sourcing from the Star created in the previous step. Refreshes in the data values in the Star (or even alterations in the design of the Star) can quickly be pulled through into the Cube.
6. Creation of Data Cube from Data Star

SQL Server with Business Intelligence contains a facility for defining Data Cubes for fast analytical processing. Cubes can source their data directly from the uploaded CSV files, but students quickly appreciate the simplicity of sourcing from the Star created in the previous step. Refreshes of the data values in the Star (or even alterations to the design of the Star) can quickly be pulled through into the Cube.

By default, when used for self-service reporting (illustrated below), cubes automatically report totals (sums) of the data values, aggregated over the user-selected timeframe (or geographic distribution, etc.). For example, selecting NO2 (Nitrogen Dioxide) would automatically report the total of all readings ever, or totals per year, per month, per day, or even per hour, depending on what date range the user happened to select. Users usually start at the top (most aggregated) level and then "drill down" for more detail.

In the current case study the averages of pollution values are much more relevant than the totals. It is far easier to compare calendar months of data if averages are used, as this removes the effect of some months being longer than others. Peak values within any selected time frame are also of "headline" interest, but users need to treat such information with caution, as a peak may well be caused by a local factor such as a badly tuned lorry or tractor passing upwind of the monitoring station.

The "Calculated Measures" facility of the Data Cube was used to set up formulas reporting the mean of each of the numeric measures. The formula for a mean is as simple as "Sum of NO2 divided by Count of NO2": the cube automatically applies the context of the current level of drilling across all selected dimensions. Setting up medians is beyond the scope of this simplified case study, but students can discuss how median values could be used to reduce the influence of outlier readings.

Many texts on Data Warehousing use Inmon's term "subject-oriented". In simple case studies students often cannot see the difference between the data sources and the DW's subject orientation. One differentiator is that "business rules" can be encoded into the data, or into the data presentation, within the cubes. For nitrogen dioxide air pollution, 40 µg/m³ is a threshold for concern, and 100 µg/m³ is a threshold for serious concern. Facilities within the Data Cube were used to encode these levels into the colours of the presentation. Key Performance Indicators, with "traffic light" colours and "trend" arrows, could also be set up. The threshold values can usefully be explained to students as examples of Business Metadata, contrasting with the Technical Metadata (such as field data types) more often seen in tutorials.
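In the case study these thresholds are applied through the cube's own KPI and formatting facilities. Purely to illustrate the Business Metadata idea in familiar SQL terms, the same banding could be sketched as a derived column; the table and column names below are hypothetical, with the thresholds as quoted above.

-- Illustrative only: encoding the NO2 concern thresholds as a derived
-- category, mirroring what the cube's KPI / formatting facilities do.
ALTER TABLE FactObservations ADD NO2Band AS (
    CASE
        WHEN NO2 >= 100 THEN 'Serious concern'   -- red
        WHEN NO2 >= 40  THEN 'Concern'           -- amber
        ELSE                 'Within objective'  -- green
    END
);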
7. Self-Service data exploration

Microsoft's preferred "self-service" data exploration tool is Excel. Indeed, a single button click from within the cube development tool will open the cube in Excel, with the data presented via PivotTables. (Readers should note that the Azure cloud platform does not contain Excel. Users can set endpoints to allow their local copy of Excel to link to the cloud server; alternatively they can simply install a 30-day trial copy of Office onto Azure.)

Figure 4 - A self-service display showing Nitrogen Dioxide levels per hour on each day of the week

Students will very quickly (within minutes) start making discoveries about the data. Figure 4 shows nitrogen dioxide pollution levels varying across the day for each day of the week. This image prompted a lot of discussion about the timing of the apparent peak times for pollution (the effect of driving?) and the clear difference between Saturday and Sunday versus the rest of the week.

Figure 5 - Average NO2 levels for categories of temperature
Figure 6 - Average NO2 levels by source direction of the wind (Groundhog 1 monitor)

It can also be discovered that freezing or near-freezing days are associated with high NO2 pollution levels, and that for Groundhog 1 winds from the east bring worse pollution. Students can "self-service" their way to other relationships in the data. Some are obvious (winter months tend to have colder days), but students do get to experience for themselves what it is like to be a data analyst exploring a dataset. Many students do not know that displays other than line graphs and bar charts are available, and useful discussions can be held about using comparative percentages as a means of spotting patterns or exceptions.

8. Data Mining

Many students of databases get a few introductory classes on Data Mining, but may not get to build and use a data mining facility for themselves. Having got the pollution and weather data into SQL Server, the same environment can be used to develop mining reports within a few minutes and with no further coding.
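No coding is needed for the mining wizard itself, but it helps students to see the shape of the "case" data the wizard is pointed at. The sketch below is illustrative only: apart from dimWindSpeed, which is named in section 5, the table and column names are hypothetical.

-- Illustrative case view for the mining models: one row per hourly
-- observation, weather attributes as inputs, NO2 as the predictable value.
CREATE VIEW vMiningCases AS
SELECT
    f.ObsHour,
    t.TempCategory,        -- e.g. 'Freezing', 'Cold', 'Mild', 'Warm'
    w.WindSpeedCategory,   -- e.g. 'No wind', 'Very light breeze', ...
    wd.WindDirection,
    h.HumidityCategory,
    f.NO2                  -- the value the mining models try to predict
FROM FactObservations f
JOIN dimTemperature   t  ON t.TempKey           = f.TempKey
JOIN dimWindSpeed     w  ON w.WindSpeedKey      = f.WindSpeedKey
JOIN dimWindDirection wd ON wd.WindDirectionKey = f.WindDirectionKey
JOIN dimHumidity      h  ON h.HumidityKey       = f.HumidityKey;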
Figure 7 - Data Mining using the SQL Server Business Intelligence suite
Left image: selecting the inputs (weather conditions) and the output to be predicted (the NO2 reading)
Right image: the result of running the Cluster data mining tool

In Figure 7 above a Clustering algorithm has identified ten clusters of weather data. The darker clusters (for example Cluster 9) contain a high proportion of bad pollution days; the lighter clusters (for example Clusters 3 and 6) contain hardly any bad pollution days.

Figure 8 - Understanding the properties of Cluster 9 (the cluster with a large proportion of high NO2 readings)
Left image: properties ranked by probability
Right image: comparison of the properties of Cluster 9 against all other clusters

Cluster 9 can be understood by viewing further screens (see Figure 8). The left image appears to show that this cluster (with its high proportion of high-pollution days) tends to be dry, with low wind, high pressure, high or very high humidity, and cold or near-freezing temperatures. In other words: murky, dry winter mornings. (It is not necessary for the current discussion, but rainfall absorbs NO2, and sunshine breaks NO2 down while giving ozone as a by-product, so warm dry days are not necessarily clear of pollutants.)
Figure 9 - Further analysis of the Weather/Pollution dataset
Left: Association Rules. Right: Decision Trees

Figure 9 (left) shows Association Rules. For example, the first rule listed states that strong winds and warm temperatures are associated with the lowest levels of NO2. Figure 9 (right) shows a Decision Tree; the most influential factor for NO2 appears to be wind speed, with time of day and then air temperature the next deciding factors.

Other Data Mining tools available include Neural Nets, Regression and Naive Bayes. Our experience is that the default settings for each of the analyses shown produce interpretable results quickly. Fine-tuning the parameters (controlling the number of clusters, for example) can improve the interpretability of the results, but the effect is often marginal. However, discussing with students what the parameters do can help them understand the concepts of "supervised" versus "unsupervised" learning. A frequent discussion point is whether the mined categories can then be fed back into the Data Warehouse to fine-tune the Dimension attributes; one possible shape for this is sketched below.
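For classes that want to try this, the following is a sketch only, not part of the case study as run. It assumes the cluster labels have been exported from the mining model into a hypothetical ClusterAssignments table keyed on the observation hour.

-- Sketch: write a mined cluster label back onto the fact table so it can
-- be browsed like any other attribute in the cube.
ALTER TABLE FactObservations ADD WeatherCluster VARCHAR(20) NULL;
GO

UPDATE f
SET    f.WeatherCluster = c.ClusterLabel
FROM   FactObservations  AS f
JOIN   ClusterAssignments AS c ON c.ObsHour = f.ObsHour;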
9. Summary and Contribution to Knowledge

This paper has demonstrated the application of Data Warehousing and Data Mining tools using data gathered from internet sources. Rather than taking conveniently prepared datasets, the paper has shown some of the common difficulties met by database professionals when collecting data from non-traditional sources. Linked Data has been explained and illustrated as a potentially very helpful way out of these difficulties.

All the tools and all the data used are available for free use in academic contexts. Please do ask the data providers for permission to use them, though.

Perhaps the major contribution of this case study is that, although it introduces and discusses a number of "real-world" issues, particularly around the Extract-Transform-Load procedures of data warehousing, the scale of the study remains small enough for students to comprehend quickly. Each of the steps can largely be done by taking the default options, and mistakes in the design can be recovered from simply by re-running the relevant steps. Of course there is a risk in this: students may get the impression that always selecting defaults, without comprehending the alternatives, is the correct thing to do. However, our experience is that it is very helpful to be able to see the "end-to-end picture" at a relatively early stage, and then be able to revisit the pieces to see their connection with mainstream database and data analysis studies.

10. References

All web links accessed 14 May 2015.

Boisvert  SPARQL editor: http://www.boisvert.me.uk/opendata/sparql_aq+.html
BWDS  Example Groundhog queries: https://github.com/BetterWithDataSociety/ShefAirQualityAgent/wiki/Sample-SPARQL
CAM  SPARQL tutorial (with interactive editing and access to public datasets): http://www.cambridgesemantics.com/semantic-university/sparql-by-example
Love  Scripts for summarising data into hourly frequency and constructing the data Star: http://aces.shu.ac.uk/AirQuality
MS1  Microsoft Dreamspark (software repository for academic users): https://www.dreamspark.com
MS2  Microsoft Azure (cloud-based platform supporting SQL Server): http://azure.microsoft.com
ODH  Open Data Handbook: http://opendatahandbook.org/en/what-is-open-data
PHE  Estimating Local Mortality Burdens associated with Particulate Air Pollution. Public Health England, 2014:
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/332854/PHE_CRCE_010.pdf
SCC1  Sheffield City Council, Air Quality web pages: https://www.sheffield.gov.uk/environment/air-quality/monitoring.html
SCC2  Sheffield City Council, Air Pollution Monitoring data pages: http://sheffieldairquality.gen2training.co.uk/sheffield/index.html
W3S  Linked Data: http://www.w3.org/standards/semanticweb/data
WEKA  Data Mining tools: http://www.cs.waikato.ac.nz/ml/weka