TLAD 2015, The 13th International Workshop on Teaching, Learning and Assessment in Databases.
Birmingham, UK, 3rd July.
Better with Data:
A case study in sourcing Linked Data
into a Business Intelligence analysis
Matthew Love
Dept. of Computing, Sheffield Hallam University, Sheffield S1 1WB
m.love@shu.ac.uk

Charles Boisvert
Dept. of Computing, Sheffield Hallam University, Sheffield S1 1WB
c.boisvert@shu.ac.uk

Amin Chowdhury
Dept. of Computing, Sheffield Hallam University, Sheffield S1 1WB
mdamin.chowdhury@gmail.com

Ian Ibbotson
Better with Data Society
ianibbo@gmail.com
http://betterwithdata.co
Abstract
This paper describes a case study investigating the relationship between weather
conditions and levels of air pollution. The case study illustrates aspects of finding
and accessing Open Data, exploring Linked Data, sections of the Extract-Transform-
Load processes of data warehousing, building an analytic cube, and application of
data mining tools.
The paper is intended to aid tutors and students of databases by providing a practical
and repeatable case example that gives an overview of how several topics in the area
of data collection and analysis integrate together.
All the data sources and tools used are free for use in academic contexts
(permissions from data source owners should be sought). Web links are given to all
resources discussed.
Keywords
Business Intelligence, Data Mining, Data Warehousing, SPARQL, Linked Data
1. Introduction
This paper (and associated downloads) outlines a number of topics related to finding
and accessing Open Data, merging sources, and analysing data using self-service and
data mining tools. All the data and the software tools are available to
academics/students free of charge (though permission should be sought for some of
the data). The case study presented involves more data than might normally be used
for overviews of the subject and is ‘rich’ enough in content to help highlight genuine
issues, but remains sufficiently structured and simple as to not become
overwhelming. It is intended that tutors can use the resources described to give
overview presentations of the topic without needing to overcome problematic
barriers.
2. Case Study: Air pollution in a major city
Air pollution kills people. It is estimated that around 29,000 people per year in the
UK die early through breathing difficulties at times of poor air quality (PHE). The UK
government has imposed targets for reducing the quantities and/or frequencies of
the main pollutants (some figures given below). Local Authorities are responsible for
monitoring and publishing pollution levels in their areas.
Sheffield City Council uses two types of monitoring devices: diffusion tubes and fully
automated processing units.
Fig 1: Nitrogen Dioxide diffusion tube (left) and automated station (right)
Both types of device are illustrated in Fig 1 above. There are around 160 diffusion
tube devices and six fully automated processing stations.
The diffusion tubes have the advantage of being spread throughout the city area.
However, they give data only when sent in for analysis, and typically this is once
every six to eight weeks per tube. The results are aggregated to an annual level prior
to publication.
The six automated processing stations, named “Groundhogs”, measure a variety of
pollutants, and one also measures temperature and air pressure. Between three and
eight readings are taken per hour. The public can read a log of all readings up to
around one hour previously. The council occasionally may correct readings, or delete
readings from the log. Although some of the stations have been operating since
2000, there are a number of gaps in the data logs. In addition, the stations are
occasionally moved (usually to help investigate new pollution “hot spot” concerns).
The council maintains a web site (SCC-1) giving informative descriptions of the types
of pollutants commonly found in air and of the devices pictured above; further
descriptions and the monitoring data itself are available at a companion site (SCC-2).
Readers (and their students) are invited to visit the SCC-2 web link and then select
“Station info”. This is shown in Fig 2 (left) below.
Fig 2 - Council pollution monitoring information (left) and automated station results (right)
Fig 2 (left) illustrates a very common problem with data sourced from the internet.
The information is presented as textual descriptions, with no obvious way of
automatically deriving further information. We are told that Groundhog1 is at
“Orphanage Road, Firhill” but it would take a human-based web search to find the
geographical location and then further searches to discover the nature of the
location (residential or industrial area, nearness to main road, etc).
Tutors may use this to introduce a discussion on why data published on the internet
is not necessarily “Open Data”, even when user rights are clear. The Open Data
Handbook [ODH] is a useful resource.
Readers (and their students) should then visit the “Latest data” tab of the same site,
and then click on any of the Groundhogs (Groundhog1 is often the best choice). This
is shown in Fig 2 (right).
Notice that navigation to this page is not designed for automation. The user must
click on a visual map to select the page. Notice too, that the URL for data pages does
not reflect the name of the Groundhog being visited so likewise manual navigation is
necessary. Students may discuss how, in an age of internet-sourced data, URLs can
and should be designed to allow for automated discovery by data harvesting tools.
Note, however, from the above image, that this page allows data (of any user-
selected range) to be downloaded, in a choice of PostScript, “Raw” (i.e. comma
separated values) or Excel formats.
If following this paper as an exercise, readers should attempt to download some
example CSV files. They will find this must be done one ground station, and one
pollution type, at a time. Downloads may be inspected using WordPad (not Notepad,
as the end-of-line characters are not compatible); it can then be seen that each row
contains a date (YYMMDD), a time (HH:MM) and an NO2 reading, with roughly 3 to 8
readings per hour.
3. Linked data
The above discussion has highlighted the need for better methods of publishing data
for automated discovery and consumption. An important method is Linked Data
[W3S], which extends standard data by giving information about its relationship to
other data. Chains can be followed to discover more information about the context
of something of interest (for example, to discover the type of neighbourhood of each
of the Groundhog sites).
Very informally, each Linked Data item is presented as a triple:
• a unique identifier for the data item that is the subject of the relation
• a predicate describing the nature of the relation
• the information being related: the object of the relation.
The identifier and predicate take the form of a URI, but the object may be either a
URI or a literal value. URIs are often also the URL of a description file: in this case a
user can fetch human-readable information about the item into a browser.
Because chains of triples link items to one another in this way, Linked Data is said to
form a graph database.
For example, as illustrated in Fig 3 below, the data about the “Groundhog1” NO2
sensor shows what it measures, its location, the type of device it is, and actual
measured values for a given date and time.
Fig 3: A subset of the Sheffield Air Quality+ database. Data points in bold, red text are URIs.
One of this paper’s authors created the Air Quality+ database, which allows over-
the-web interrogation of the Sheffield pollution measurements as linked data. Each
of the Groundhog stations has its own URI, and the database holds the
measurements of each sensor, for both the Groundhog stations and diffusion tubes; so
Groundhog1 sensors, for example, include not just NO2 but also SO2, micro-particles
(e.g. diesel fumes), air pressure and air temperature. Each of these sensors records
frequent measurements that are archived in the database as triples. Using the URI, all
but literal values can be further investigated, for instance to find out more about the
NO2 compound.
To query the Subject / Predicate / Object triples in the database, we use the SPARQL
query language. SPARQL is designed to facilitate exploring a linked data graph; for
example:
• If given the URI of a schema for Groundhogs, it can return a list of all Groundhogs.
• If given the URI of a specific Groundhog, it can return its schema, which (for
example) can be used to discover what pollutants that particular Groundhog is set
up to measure.
• If given the URI of a specific pollutant, it can return all the values, i.e. all the
readings for that pollutant.
[Fig 3 shows, as a graph: the Groundhog1 NO2 sensor (type: sensing device; measurement property: NO2; latitude 53.40266; longitude -1.463957) linked to an observation, GroundHog1NO2 2015051300500, of type observation value, with value 13.1894 and end time 2015-05-10T05:00Z.]
All queries can contain filters, for example to only return values within a selected date
range.
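By way of illustration, the following is a minimal sketch of such a query, assuming a hypothetical prefix and property names (aq:sensor, aq:endTime, aq:value) rather than the store's exact vocabulary; the sample queries at (BWDS) show the real terms.

# Sketch only: NO2 readings from the Groundhog1 NO2 sensor within a date range.
# The prefix and property names below are illustrative placeholders.
PREFIX aq:  <http://example.org/airquality#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?obs ?time ?value
WHERE {
  ?obs aq:sensor  aq:Groundhog1NO2 ;
       aq:endTime ?time ;
       aq:value   ?value .
  FILTER (?time >= "2015-05-01T00:00:00Z"^^xsd:dateTime &&
          ?time <  "2015-06-01T00:00:00Z"^^xsd:dateTime)
}
ORDER BY ?time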
From the above informal description the reader may be able to see that SPARQL can
be used to programmatically discover what Groundhogs there are, what pollutants
each monitors, and then get the readings of those pollutants. Most SPARQL systems
offer a range of formats for retrieving results, including CSV (useful for databases),
JSON (useful for JavaScript web pages), XML (useful for further transformations), etc.
Figure 3 - The SPARQL editor written by C. Boisvert for the Sheffield project
Figure 3 illustrates a SPARQL query asking for hourly readings from all available
Groundhogs between selected dates, together with the results of this query.
One of this paper’s authors installed a SPARQL endpoint on a hosted server, and set up the
necessary triples to allow over-the-web interrogation of the Groundhog database. A
second author fed the SPARQL syntax definition into standard open editor tools to
generate an editor for queries. Users are invited to use the editor (Boisvert). Some
example queries for Groundhogs are available (BWDS).
There are also some very good, student-appropriate tutorial guides for SPARQL, e.g.
(CAM), accessing some well-documented knowledge stores, including the DBpedia
triple store built from Wikipedia.
Tutors following this paper should note that if students cannot master using SPARQL
for downloading Groundhog data then the manual page-scraping approach
discussed in the previous section ultimately will deliver the same data.
4. Integration of further data sources
One of the principles of Data Warehousing when used for analytical purposes (as
opposed to “data store housing” for safe custody of data) is to try to give added
context to facts, through Dimension descriptors added from other sources.
In the current case study Groundhog1 lists temperature and air pressure readings.
But other factors may influence pollution formation and/or dispersal as well. Obvious
factors are wind strength and humidity.
Wind direction is also a factor, but a more complex one: if the monitor is directly east of
a large polluting factory then a strong wind towards the east will increase
measurement values; if the monitor were directly west of the factory, the same
towards-east wind would remove the pollution from the area of the monitor. So wind
direction is relevant, but its effect is different for each Groundhog.
Detailed historic weather data is commercially valuable, and quite hard to find for
free download. Sheffield is fortunate in having a local enthusiast who has monitored
and published readings at five-minute intervals for all the desired measures.
Unfortunately the data is published in PDF format, with documents of around 200
pages per month of data. The tool “Bytescout PDF Viewer” can be used to extract all
pages into one CSV file. All readers are very strongly requested to contact the data
owner to get permission to use the data for study purposes (any commercial use of
the data could cause the site to be closed).
5. Creation of a Data Warehouse, and ETL processes
All data values were then uploaded into a Microsoft SQL Server (Business
Intelligence edition) database. This software is free (for academic use) to install from
Microsoft Dreamspark (MS1) onto university teaching systems and student laptops.
Alternatively, students can have 150-day free use of the same software from the
Microsoft Azure cloud platform (MS2). MS Azure has convenient setup options for
SQL Server Business Intelligence.
Alternatives, such as the Weka tools [WEKA], are also possible for the steps that
follow, and offer opportunities to investigate the processing algorithms further. The
authors have found Microsoft tools more appropriate for their student population.
Once the data is loaded into tables on the SQL Server it needs to be transformed into
formats (discussed below) suitable for data analysis. The case study demonstrates a
realistic but manageable number of steps that can be found in many Extract-
Transform-Load systems of Data Warehouses.
All the scripts used for ETL are available from (Love).
One of the useful teaching illustrations of the ETL process is to contrast using the
Server’s menu-driven wizard approaches for uploading files into tables with SQL
scripts that do the same tasks. Students are not always aware that SQL has
commands that allow for manipulation of database structure (as opposed to
manipulation of data values), but quickly start to see the value of relatively short
scripts that can be reused across multiple uploads.
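As an illustration of such a script (a sketch only: the table name, file path and BULK INSERT options below are hypothetical, and would need adjusting to the actual download, including the end-of-line characters noted in Section 2), a staging table matching the CSV layout can be created and loaded as follows:

-- Staging table matching the downloaded layout: date (YYMMDD), time (HH:MM), NO2 reading.
CREATE TABLE stgGroundhogNO2 (
    ReadingDate CHAR(6)      NOT NULL,
    ReadingTime CHAR(5)      NOT NULL,
    NO2         DECIMAL(9,4) NULL
);

-- Load one downloaded file; re-run with a different path for each station / pollutant.
BULK INSERT stgGroundhogNO2
FROM 'C:\data\groundhog1_no2.csv'        -- hypothetical path
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

Re-running the same few statements with a different file path replaces the wizard's many dialogue steps with a script that students can keep, inspect and reuse.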
A second useful teaching illustration comes from the “Sheffield Weather Page” data
being at five-minute frequency, while the Groundhog readings arrive at twelve- to
twenty-minute intervals. This was resolved by writing further scripts in SQL
(again downloadable from the above link) that first summarised the respective
Groundhog and the Weather data into hourly readings (taking the means of readings
within each hour, except for wind direction where the most frequent wind direction
was taken), and then integrated these into a single observations table.
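A minimal sketch of that summarisation step is given below; the table and column names are illustrative rather than those in the downloadable scripts (Love). The first statement takes hourly means of the raw readings; the second picks the most frequent wind direction in each hour.

-- Hourly means of the raw Groundhog readings (illustrative names).
SELECT  StationId,
        DATEADD(hour, DATEDIFF(hour, 0, ReadingDateTime), 0) AS ObsHour,
        AVG(NO2) AS MeanNO2
INTO    hourlyGroundhog
FROM    stgGroundhogReadings
GROUP BY StationId, DATEADD(hour, DATEDIFF(hour, 0, ReadingDateTime), 0);

-- Modal (most frequent) wind direction per hour from the five-minute weather data.
SELECT  ObsHour, WindDirection
INTO    hourlyWindDirection
FROM (
    SELECT  DATEADD(hour, DATEDIFF(hour, 0, ReadingDateTime), 0) AS ObsHour,
            WindDirection,
            ROW_NUMBER() OVER (
                PARTITION BY DATEADD(hour, DATEDIFF(hour, 0, ReadingDateTime), 0)
                ORDER BY COUNT(*) DESC) AS rn
    FROM    stgWeatherReadings
    GROUP BY DATEADD(hour, DATEDIFF(hour, 0, ReadingDateTime), 0), WindDirection
) ranked
WHERE   rn = 1;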
A second database on the same server was opened, and the observations table copied
(via a short script command) into it. Descriptive tables were created that gave
informative names and attributes for the Groundhogs, and descriptive category names
and range limits for each of the weather attributes (for example dimWindSpeed: no
wind = 0 kph; very light breeze = 1-3 kph, through to strong winds, 20 kph and over).
A script then created a data “Star” based on (Kimball)’s designs, with a single Facts
table linked to relevant rows in each of the Dimension tables.
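The shape of such a star can be sketched as below; the names and wind-speed bands are illustrative only, further dimensions (temperature, humidity, date) are omitted for brevity, and the full design is in the downloadable scripts.

CREATE TABLE dimStation (
    StationKey   INT IDENTITY PRIMARY KEY,
    StationName  VARCHAR(50),            -- e.g. 'Groundhog1'
    LocationInfo VARCHAR(200)
);

CREATE TABLE dimWindSpeed (
    WindSpeedKey INT IDENTITY PRIMARY KEY,
    Category     VARCHAR(30),            -- 'No wind', 'Very light breeze', ... 'Strong winds'
    LowKph       DECIMAL(5,1),
    HighKph      DECIMAL(5,1)
);

CREATE TABLE factObservation (
    ObsHour      DATETIME NOT NULL,
    StationKey   INT NOT NULL REFERENCES dimStation(StationKey),
    WindSpeedKey INT NOT NULL REFERENCES dimWindSpeed(WindSpeedKey),
    NO2          DECIMAL(9,4),           -- hourly mean readings
    Temperature  DECIMAL(5,1),
    Pressure     DECIMAL(6,1)
);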
Having two databases on the same server, one for ETL data acquisition and
preparation, and one for storage of the integrated Star of facts and dimension tables,
helps students see for themselves the concept of a Data Staging area as described
throughout Kimball’s work. Just as Kimball describes, all the “messy” processes
happen, hidden from end-user view, in the Staging area. Clean, usable, subject-
structured data is then published to data marts.
6. Creation of Data Cube from Data Star
SQL Server with Business Intelligence contains a facility for defining Data Cubes for
fast analytical processing. Cubes can source their data directly from the uploaded
CSV files, but students quickly appreciate the simplicity of sourcing from the Star
created in the previous step. Refreshes of the data values in the Star (or even
alterations to the design of the Star) can quickly be pulled through into the Cube.
By default, when used for self-serve reporting (illustrated below) cubes automatically
report totals (sums) of data values, aggregated over the user-selected timeframe (or
geographic distribution, etc.). For example, selecting NO2 (Nitrogen Dioxide) would
automatically report the total of all readings ever, or totals per year, per month, per
day, or even per hour, depending on what date range the user happened to select.
Users usually start at the top level – the most aggregated – and then “drill down” for
more details.
In the current case study the averages of pollution values are much more relevant
than the totals. It is a lot easier to compare calendar months of data if averages are
used, as this eliminates the effect of some months being longer than others. Peak
values within any selected time frame are also of “headline” interest, but users need
to treat this information with caution, as a peak may well be caused by a local factor
such as a badly tuned lorry or tractor passing upwind of the monitoring station.
The “Calculated Measures” facility of the Data Cube was used to set formulas to
report the means of each of the numeric measures. The formula for mean is as
simple as “Sum of NO2 divided by Count of NO2”: the cube automatically applies
the context of the current level of drilling for all selected dimensions. Setting up medians is
beyond the scope of this simplified case study, but students can discuss how median
values can be used to ignore the effect of outlier readings.
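For readers without the cube to hand, the effect of such a calculated measure at one particular drill level can be reproduced directly against the star in SQL; this is only an illustrative equivalent, using the hypothetical names from the sketch in the previous section.

-- Mean NO2 by month and wind-speed band, mirroring "Sum of NO2 divided by Count of NO2".
SELECT  d.Category                AS WindSpeedBand,
        YEAR(f.ObsHour)           AS Yr,
        MONTH(f.ObsHour)          AS Mth,
        SUM(f.NO2) / COUNT(f.NO2) AS MeanNO2    -- equivalently AVG(f.NO2)
FROM    factObservation f
JOIN    dimWindSpeed d ON d.WindSpeedKey = f.WindSpeedKey
GROUP BY d.Category, YEAR(f.ObsHour), MONTH(f.ObsHour);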
Many texts on Data Warehousing utilise Inmon’s term of “subject-oriented”. In simple
case studies students often cannot see the difference between the data sources and
the DW subject-orientation. One differentiator is that “business rules” can be
encoded into the data or data presentation within the cubes. For nitrogen dioxide air
pollution, 40 µg/m³ is a threshold for concern, and 100 µg/m³ is a threshold for serious
concern. Facilities within the Data Cube were used to encode these levels into colours
of presentation. Key Performance Indicators, with “traffic light” colours and “trend”
arrows could also be set up. The threshold values can usefully be explained to
students as examples of Business Metadata, contrasting with Technical Metadata
(such as field data types) more often seen in tutorials.
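The same business rule can also be expressed outside the cube, for example as a simple banding expression over the fact table. This is a sketch using the hypothetical names above; the band labels are ours, and the thresholds are the 40 and 100 figures quoted in the text.

SELECT  f.ObsHour,
        f.NO2,
        CASE WHEN f.NO2 >= 100 THEN 'Serious concern'   -- thresholds quoted in the text
             WHEN f.NO2 >= 40  THEN 'Concern'
             ELSE 'Acceptable'
        END AS NO2Band                                   -- an example of business metadata
FROM    factObservation f;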
7. Self Service data exploration
Microsoft’s preferred “self-service” data exploration tool is Excel. Indeed, a single
button click from within the cube development tool will open the cube in Excel. Data
is presented via PivotTables. (Readers should note that the Azure cloud platform
does not contain Excel. Users can set end-points to allow their local copy of Excel to
link to the cloud server. Alternatively users can simply install 30-day trial copies of
Office onto Azure).
Figure 4 - A self-service display of data showing Nitrogen Dioxide levels per hour on days of week.
Students will very quickly (within minutes) start making discoveries about the data.
Figure 4 shows nitrogen dioxide pollution levels varying across time for each day of
the week. This image prompted a lot of discussion as to the timing of the apparent
peak times for pollution (the effect of driving?) and the clear difference between
Saturday and Sunday versus the rest of the week.
Figure 5 - Average NO2 levels for categories of temperature
Figure 6 - Average NO2 levels for source direction of wind (Groundhog 1 monitor)
It can also be discovered that freezing or near-freezing days are associated with high
NO2 pollution levels, and that for Groundhog1 winds from the east are associated
with worse pollution.
Students can “self-service” discover other relationships between the data. Some are
obvious (winter months tend to have colder days), but students do get to experience
the concept of a data analyst exploring the data themselves. Many students do not
know that displays other than line graphs and bar charts are available, and useful
discussions can be held about using comparative percentages as a means of spotting
patterns or exceptions.
8. Data Mining
Many students of databases get a few introductory classes on Data Mining, but may
not get to build and use a data mining facility for themselves. Having got the
pollution and weather data into SQL Server, the same environment can be used to
develop mining reports within a few minutes and with no further coding.
Figure 7 - Data Mining using the SQL Server Business Intelligence suite
Left image : Selecting inputs (weather conditions) and an output - what is to be predicted (NO2
reading)
Right image: Result of running the Cluster data mining tool
In Figure 7 above a Clustering algorithm has identified ten clusters of weather data.
The darker clusters (for example Cluster 9) contain a high proportion of bad pollution
days; the lighter clusters (for example Clusters 3 and 6) contain hardly any bad pollution days.
Figure 8 - Understanding the properties of Cluster 9 (the cluster with a large proportion of High NO2 readings)
Left image - Properties ranked by probability
Right image - Comparison of properties of Cluster 9 compared to all other clusters
Cluster 9 can be understood by viewing further screens (see Figure 8). The left image
appears to show that this cluster (with its high proportion of high pollution days)
tends to have dry, low-wind, high-pressure conditions, high or very high humidity, and
cold or near-freezing temperatures. In other words: murky, dry winter mornings.
(It is not necessary for the current discussion, but rainfall absorbs NO2, and sunshine
breaks down NO2 while giving a by-product of ozone; so warm dry days are not
necessarily clear of pollutants.)
Figure 9 - Further analysis of the Weather / Pollution dataset : Left - Association Rules, Right - Decision
Trees
Figure 9 (left) shows Association Rules. For example, the first rule listed states that
Strong Winds and Warm temperatures are associated with the lowest levels of NO2.
Figure 9 (right) shows a Decision Tree. The most influential factor for NO2 appears to
be wind speed. Time of day and then Air Temperature are the next deciding factors.
Other Data Mining models available include Neural Nets, Regression and Naive Bayes.
Our experience is that the default settings for each of the analyses shown produce
interpretable results quickly. Fine-tuning the parameters (controlling the number of
clusters, for example) can improve the results, but often the effect is marginal.
However, discussing with students what the parameters do can
help students understand the concepts of “supervised” versus “unsupervised”
learning. A frequent discussion point is whether Categories can then be fed back into
the Data Warehouse, to fine-tune the Dimension attributes.
9. Summary and Contribution to Knowledge
This paper has demonstrated the application of Data Warehousing and Data Mining
tools, using data gathered from internet sources. But rather than taking conveniently
prepared datasets, the paper has shown some of the common difficulties met by
database professionals when collecting data from non-traditional sources. Linked
Data has been explained and illustrated as a potentially very helpful way out of these
difficulties.
All the tools used, and all the data used, are available for free use in academic
contexts. Please do ask data providers for permission to use the data, though.
Perhaps the major contribution of this case study is that although it introduces and
discusses a number of “real-world” issues, particularly around the Extract-Transform-
Load procedures of data warehousing, the scale of the study remains small enough
for quick comprehension by students. Each of the steps can largely be done taking
default options, and mistakes in the design can be recovered simply by re-running
relevant steps. Of course there is a risk in this – that students may get the impression
that always selecting defaults without comprehension of the alternatives is the
correct thing to do. However, our experience is that it is very helpful to be able to see
the “end-to-end picture” at a relatively early stage, and then be able to revisit the
pieces to see their connection with their mainstream database and data analysis
studies.
10. References All web links accessed 14 May 2015
Boisvert SPARQL editor: http://www.boisvert.me.uk/opendata/sparql_aq+.html
BWDS Example Groundhog queries:
https://github.com/BetterWithDataSociety/ShefAirQualityAgent/wiki/Sample-
SPARQL
CAM SPARQL tutorial (with interactive editing and access to public datasets)
http://www.cambridgesemantics.com/semantic-university/sparql-by-example
Love Scripts for summarising data to hourly frequency and constructing the data Star
http://aces.shu.ac.uk/AirQuality
MS1 Microsoft Dreamspark (Software repository for academic users)
https://www.dreamspark.com
MS2 Microsoft Azure (Cloud-based platform supporting SQL Server)
http://azure.microsoft.com
ODH Open Data Handbook
http://opendatahandbook.org/en/what-is-open-data
PHE Estimating Local Mortality Burdens associated with Particulate Air
Pollution
Public Health England, 2014
https://www.gov.uk/government/uploads/system/uploads/attachment_
data/file/332854/PHE_CRCE_010.pdf
SCC-1 Sheffield City Council - Air Quality web pages
https://www.sheffield.gov.uk/environment/air-quality/monitoring.html
SCC-2 Sheffield City Council - Air Pollution Monitoring data pages
http://sheffieldairquality.gen2training.co.uk/sheffield/index.html
W3S Linked Data
http://www.w3.org/standards/semanticweb/data
WEKA Data Mining tools
http://www.cs.waikato.ac.nz/ml/weka

More Related Content

What's hot (9)

070726 Igarss07 Barcelona
070726 Igarss07 Barcelona070726 Igarss07 Barcelona
070726 Igarss07 Barcelona
 
Macintyre2011
Macintyre2011Macintyre2011
Macintyre2011
 
LinkSUM: Using Link Analysis to Summarize Entity Data
LinkSUM: Using Link Analysis to Summarize Entity DataLinkSUM: Using Link Analysis to Summarize Entity Data
LinkSUM: Using Link Analysis to Summarize Entity Data
 
A gentle introduction to riese
A gentle introduction to rieseA gentle introduction to riese
A gentle introduction to riese
 
2011 ITS World Congress - GO-Sync - A Framework to Synchronize Transit Agency...
2011 ITS World Congress - GO-Sync - A Framework to Synchronize Transit Agency...2011 ITS World Congress - GO-Sync - A Framework to Synchronize Transit Agency...
2011 ITS World Congress - GO-Sync - A Framework to Synchronize Transit Agency...
 
Transparent and scalable open url quality metrics
Transparent and scalable open url quality metricsTransparent and scalable open url quality metrics
Transparent and scalable open url quality metrics
 
At33264269
At33264269At33264269
At33264269
 
Chem Spider Building An Online Database Of Open Spectra
Chem Spider  Building An Online Database Of Open Spectra Chem Spider  Building An Online Database Of Open Spectra
Chem Spider Building An Online Database Of Open Spectra
 
Measuring Open Data Portal User-Orientation: A Computational Approach
Measuring Open Data Portal User-Orientation: A Computational ApproachMeasuring Open Data Portal User-Orientation: A Computational Approach
Measuring Open Data Portal User-Orientation: A Computational Approach
 

Viewers also liked

Budżet obywatelski w Kielcach- raport z monitoringu
Budżet obywatelski w Kielcach- raport z monitoringuBudżet obywatelski w Kielcach- raport z monitoringu
Budżet obywatelski w Kielcach- raport z monitoringuStowarzyszenie Bona Fides
 
PPT Experimental Analysis of Superlooping in Twisted Polymer Line
PPT Experimental Analysis of Superlooping in Twisted Polymer LinePPT Experimental Analysis of Superlooping in Twisted Polymer Line
PPT Experimental Analysis of Superlooping in Twisted Polymer LineMarlen Mahendraratnam
 
curriculum vitae
curriculum vitaecurriculum vitae
curriculum vitaeMark Wagner
 
Planmeca extranet - Janne Pakkanen
Planmeca extranet - Janne PakkanenPlanmeca extranet - Janne Pakkanen
Planmeca extranet - Janne PakkanenKnowit Oy
 
PPG Industries Simplifies with DRM
PPG Industries Simplifies with DRMPPG Industries Simplifies with DRM
PPG Industries Simplifies with DRMAlithya
 
образи на картинах стівена гарднера
образи на картинах стівена гарднераобрази на картинах стівена гарднера
образи на картинах стівена гарднераЛюдмила Квадріціус
 
Интернет-проект: от идеи до клиента!
Интернет-проект: от идеи до клиента!Интернет-проект: от идеи до клиента!
Интернет-проект: от идеи до клиента!Cybermarketing, Moscow
 
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...Knowit Oy
 
Introduction to Oracle Hyperion Financial Close Suite
Introduction to Oracle Hyperion Financial Close SuiteIntroduction to Oracle Hyperion Financial Close Suite
Introduction to Oracle Hyperion Financial Close SuiteAlithya
 

Viewers also liked (15)

Budżet obywatelski w Kielcach- raport z monitoringu
Budżet obywatelski w Kielcach- raport z monitoringuBudżet obywatelski w Kielcach- raport z monitoringu
Budżet obywatelski w Kielcach- raport z monitoringu
 
PPT Experimental Analysis of Superlooping in Twisted Polymer Line
PPT Experimental Analysis of Superlooping in Twisted Polymer LinePPT Experimental Analysis of Superlooping in Twisted Polymer Line
PPT Experimental Analysis of Superlooping in Twisted Polymer Line
 
Integrated dance and movement and Common Core Standards
Integrated dance and movement and Common Core StandardsIntegrated dance and movement and Common Core Standards
Integrated dance and movement and Common Core Standards
 
Brochure 2k16
Brochure 2k16Brochure 2k16
Brochure 2k16
 
Dayana y david
Dayana y davidDayana y david
Dayana y david
 
MÁQUINAS SIMPLES
MÁQUINAS SIMPLESMÁQUINAS SIMPLES
MÁQUINAS SIMPLES
 
Untitled Presentation
Untitled PresentationUntitled Presentation
Untitled Presentation
 
curriculum vitae
curriculum vitaecurriculum vitae
curriculum vitae
 
Planmeca extranet - Janne Pakkanen
Planmeca extranet - Janne PakkanenPlanmeca extranet - Janne Pakkanen
Planmeca extranet - Janne Pakkanen
 
10215 A 05
10215 A 0510215 A 05
10215 A 05
 
PPG Industries Simplifies with DRM
PPG Industries Simplifies with DRMPPG Industries Simplifies with DRM
PPG Industries Simplifies with DRM
 
образи на картинах стівена гарднера
образи на картинах стівена гарднераобрази на картинах стівена гарднера
образи на картинах стівена гарднера
 
Интернет-проект: от идеи до клиента!
Интернет-проект: от идеи до клиента!Интернет-проект: от идеи до клиента!
Интернет-проект: от идеи до клиента!
 
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...
Siirtyminen ketteriin menetelmiin Trafissa, Knowit-aamiaisseminaari 8.10.2015...
 
Introduction to Oracle Hyperion Financial Close Suite
Introduction to Oracle Hyperion Financial Close SuiteIntroduction to Oracle Hyperion Financial Close Suite
Introduction to Oracle Hyperion Financial Close Suite
 

Similar to Tlad better with data - matthew love + charles (2)

Tlad 2015 presentation amin+charles-final
Tlad 2015 presentation   amin+charles-finalTlad 2015 presentation   amin+charles-final
Tlad 2015 presentation amin+charles-finalAmin Chowdhury
 
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...Werner Leyh
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureRoss Mounce
 
Developing Insights Between Urban Air Quality and Public Health Through the E...
Developing Insights Between Urban Air Quality and Public Health Through the E...Developing Insights Between Urban Air Quality and Public Health Through the E...
Developing Insights Between Urban Air Quality and Public Health Through the E...PerkinElmer, Inc.
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGChris Ewing
 
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAA HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAgerogepatton
 
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAA HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAijaia
 
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...Rudolf Husar
 
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and SqoopAnalysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoopsushantparte
 
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...Rudolf Husar
 
Analysing Transportation Data with Open Source Big Data Analytic Tools
Analysing Transportation Data with Open Source Big Data Analytic ToolsAnalysing Transportation Data with Open Source Big Data Analytic Tools
Analysing Transportation Data with Open Source Big Data Analytic Toolsijeei-iaes
 
The Linked Data Lifecycle
The Linked Data LifecycleThe Linked Data Lifecycle
The Linked Data Lifecyclegeoknow
 
Linked Data Generation for the University Data From Legacy Database
Linked Data Generation for the University Data From Legacy Database  Linked Data Generation for the University Data From Legacy Database
Linked Data Generation for the University Data From Legacy Database dannyijwest
 
Sampling of User Behavior Using Online Social Network
Sampling of User Behavior Using Online Social NetworkSampling of User Behavior Using Online Social Network
Sampling of User Behavior Using Online Social NetworkEditor IJCATR
 
Traffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier FactorsTraffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier FactorsITIIIndustries
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLAnubhav Jain
 
Service Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningService Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningIIRindia
 
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...IJDKP
 
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...IJDKP
 

Similar to Tlad better with data - matthew love + charles (2) (20)

Tlad 2015 presentation amin+charles-final
Tlad 2015 presentation   amin+charles-finalTlad 2015 presentation   amin+charles-final
Tlad 2015 presentation amin+charles-final
 
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ...
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | Future
 
Linked sensor data
Linked sensor dataLinked sensor data
Linked sensor data
 
Developing Insights Between Urban Air Quality and Public Health Through the E...
Developing Insights Between Urban Air Quality and Public Health Through the E...Developing Insights Between Urban Air Quality and Public Health Through the E...
Developing Insights Between Urban Air Quality and Public Health Through the E...
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIG
 
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAA HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
 
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAA HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
 
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...
2003-12-02 Environmental Information Systems for Monitoring, Assessment, and ...
 
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and SqoopAnalysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
 
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...
20051031 Biomass Smoke Emissions and Transport: Community-based Satellite and...
 
Analysing Transportation Data with Open Source Big Data Analytic Tools
Analysing Transportation Data with Open Source Big Data Analytic ToolsAnalysing Transportation Data with Open Source Big Data Analytic Tools
Analysing Transportation Data with Open Source Big Data Analytic Tools
 
The Linked Data Lifecycle
The Linked Data LifecycleThe Linked Data Lifecycle
The Linked Data Lifecycle
 
Linked Data Generation for the University Data From Legacy Database
Linked Data Generation for the University Data From Legacy Database  Linked Data Generation for the University Data From Legacy Database
Linked Data Generation for the University Data From Legacy Database
 
Sampling of User Behavior Using Online Social Network
Sampling of User Behavior Using Online Social NetworkSampling of User Behavior Using Online Social Network
Sampling of User Behavior Using Online Social Network
 
Traffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier FactorsTraffic Outlier Detection by Density-Based Bounded Local Outlier Factors
Traffic Outlier Detection by Density-Based Bounded Local Outlier Factors
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
Service Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningService Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data Mining
 
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
 
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...
 

More from Amin Chowdhury

OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLS
OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLSOPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLS
OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLSAmin Chowdhury
 
Database Project management
Database Project managementDatabase Project management
Database Project managementAmin Chowdhury
 
Database Industry perspective
Database Industry perspectiveDatabase Industry perspective
Database Industry perspectiveAmin Chowdhury
 
090321 - EEHCO Project Plan PSTC- Dhaka
090321 - EEHCO Project Plan PSTC- Dhaka090321 - EEHCO Project Plan PSTC- Dhaka
090321 - EEHCO Project Plan PSTC- DhakaAmin Chowdhury
 
E-commerce Project Development
E-commerce Project DevelopmentE-commerce Project Development
E-commerce Project DevelopmentAmin Chowdhury
 
Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernAmin Chowdhury
 

More from Amin Chowdhury (7)

OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLS
OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLSOPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLS
OPPORTUNITIES FOR THE USE OF DIGITAL TECHNOLOGY TOOLS
 
Database Project management
Database Project managementDatabase Project management
Database Project management
 
Database Industry perspective
Database Industry perspectiveDatabase Industry perspective
Database Industry perspective
 
Database Sizing
Database SizingDatabase Sizing
Database Sizing
 
090321 - EEHCO Project Plan PSTC- Dhaka
090321 - EEHCO Project Plan PSTC- Dhaka090321 - EEHCO Project Plan PSTC- Dhaka
090321 - EEHCO Project Plan PSTC- Dhaka
 
E-commerce Project Development
E-commerce Project DevelopmentE-commerce Project Development
E-commerce Project Development
 
Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing Concern
 

Tlad better with data - matthew love + charles (2)

  • 1. TLAD 2015, The 13th International Workshop on Teaching, Learning and Assessment in Databases. Birmingham, UK, 3rd July. Better with Data: A case study in sourcing Linked Data into a Business Intelligence analysis Matthew Love Dept. of Computing Sheffield Hallam University Sheffield S1 1WB m.love@shu.ac.uk Charles Boisvert Dept. of Computing Sheffield Hallam University Sheffield S1 1WB c.boisvert@shu.ac.u k Amin Chowdhury Dept. of Computing Sheffield Hallam University Sheffield S1 1WB mdamin.chowdhur y@gmail.com Ian Ibbotson Better with Data Society ianibbo@gmail.co m http://betterwithda ta.co Abstract This paper describes a case study investigating the relationship between weather conditions and levels of air pollution. The case study illustrates aspects of finding and accessing Open Data, exploring Linked Data, sections of the Extract-Transform- Load processes of data warehousing, building an analytic cube, and application of data mining tools. The paper is intended to aid tutors and students of databases by providing a study that gives a practical and repeatable case example giving an overview of how several topics in the area of data collection and analysis integrate together. All the data sources and tools used are free for use in academic contexts (permissions from data source owners should be sought). Web links are given to all resources discussed. Keywords Business Intelligence, Data Mining, Data Warehousing, SPARQL, Linked Data 1. Introduction This paper (and associated downloads) outlines a number of topics related to finding and accessing Open Data, merging sources, and analysing data using self-service and
  • 2. Page 2 data mining tools. All the data and the software tools are available to academics/students free of charge (though permission should be sought for some of the data). The case study presented involves more data than might normally be used for overviews of the subject and is ‘rich’ enough in content to help highlight genuine issues, but remains sufficiently structured and simple as to not become overwhelming. It is intended that tutors can use the resources described to give overview presentations of the topic without needing to overcome problematic barriers. 2. Case Study : Air pollution in a major city Air pollution kills people. It is estimated that 29000 thousand people per year in the UK die early through breathing difficulties at times of low air quality (PHE). The UK government has imposed targets for reducing the quantities and/or frequencies of the main pollutants (some figures given below). Local Authorities are responsible for monitoring and publishing pollution levels in their areas. Sheffield City Council uses two types of monitoring devices: diffusion tubes and fully automated processing units. Fig 1: Nitrogen Dioxide diffusion tube (left) and automated station (right) Both types of devices are illustrated figure 1 above. There are around 160 diffusion tube devices and six fully automated processing stations. The diffusion tubes have the advantage of being spread throughout the city area. However, they give data only when sent in for analysis, and typically this is once every six to eight weeks per tube. The results are aggregated to an annual level prior to publication.
  • 3. Page 3 The six automated processing stations, named “Groundhogs”, measure a variety of pollutants, and one also measures temperature and air pressure. Between three and eight readings are taken per hour. The public can read a log of all readings up to around one hour previously. The council occasionally may correct readings, or delete readings from the log. Although some of the stations have been operating since 2000, there are a number of gaps in the data logs. In addition, the stations are occasionally moved (usually to help investigate new pollution “hot spot” concerns). The council maintains a web site (SCC-1) giving several informative descriptions as to the types of pollutants commonly found in air plus a description of the images, and further descriptions (SCC-2). Readers (and their students) are invited to visit the SCC-2 web link and then select “Station info”. This is Figure 2 left on the following page. Fig 2 - Council pollution monitoring information (left) and automated station results (right) Fig 2 (left) illustrates a very common problem with data sourced from the internet. The information is presented as textual descriptions, with no obvious way of automatically deriving further information. We are told that Groundhog1 is at “Orphanage Road, Firhill” but it would take a human-based web search to find the geographical location and then further searches to discover the nature of the location (residential or industrial area, nearness to main road, etc). Tutors may use this to introduce a discussion on why data on the internet, even when is not necessarily considered as “Open Data”, even when user rights are clear. The Open Data Handbook [ODH] is a useful resource.
  • 4. Page 4 Readers (and their students) should then visit the “Latest data” tab of the same site, and then click on any of the Groundhogs (Groundhog1 is often the best choice). This is Figure 2 right. Notice that navigation to this page is not designed for automation. The user must click on a visual map to select the page. Notice too, that the URL for data pages does not reflect the name of the Groundhog being visited so likewise manual navigation is necessary. Students may discuss how, in an age of internet-sourced data, URLs can and should be designed to allow for automated discovery by data harvesting tools. Note, however, from the above image, that this page allows data (of any user- selected range) to be downloaded, in a choice of PostScript, “Raw” (i.e. comma separated values) or Excel formats. If following this paper as an exercise, readers should attempt to download some example CSV files. They will find this must be done one groundstation, and one pollution type at a time. Downloads may be inspected using WordPad (not Notepad, as the EndOfLines are not compatible), when it can be seen that there are date (YYMMDD), time (HH:MM) and NO2 reading, roughly 3 to 8 readings per hour. 3. Linked data The above discussion has highlighted the need for better methods of publishing data for automated discovery and consumption. An important method is Linked Data [W3S], which extends standard data by giving information about its relationship to other data. Chains can be followed, to discover more information about the context of something of interest (for example to discover about the type of neighbourhood of each of the Groundhog sites) Very informally, Linked Data items are presented as a threefold item:  a unique identifier for the data item that is the subject of a relation  a predicate describing the nature of the relation  The information related the object of the relation. The identifier and predicate take the form of a URI, but the object may be either a URI or a literal value. URIs are often also the URL of a description file: in this case a user can fetch human-readable information about the item into a browser. Chains can be followed from triple to triple to discover more information about the context of something of interest, for example to discover information about each of the Groundhog sites. For this reason linked data is said to use a graph database.
  • 5. Page 5 For example, as illustrated figure 3 below, the data about the “Groundhog1” NO2 sensor shows what it measures, its location, the type of device it is, and actual measured values for a given date and time. Fig 3: A subset of the Sheffield Air Quality+ database. Data points in bold, red text are URIs; One of this paper’s authors created the Air Quality+ database, which allows over- the-web interrogation of the Sheffield pollution measurements as linked data. Each of the Groundhog stations have their own URI, and the database holds the measurements of each sensor, for the Groundhog stations and diffusion tubes; so Groundhog1 sensors, for example, include not just NO2 but also SO2, micro-particles (e.g. diesel fumes), air pressure and air temperature. Each of these sensors records frequent measurements that are archived in the database as triples. Using the URI, all but literal values can be further investigated, for instance to find out more about the NO2 compound. To query the Subject / Predicate / Value triples in the database, we use the SPARQL query language. SPARQL is designed to facilitate exploring a linked data graph; for example:  If given the URI of a schema for Groundhogs, it can return a list of all Groundhogs.  If given a URI of a specific Groundhog, it can return its schema, which (for example) can be used to discover what pollutions that particular Groundhog is set up to measure.  If given the URI of a specific pollutant, it can return all the values, i.e. all the readings for that pollutant. sensor Groundhog1 NO2 sensor sensing device NO2 type GroundHog1NO2 2015051300500 observation value 13.189 4 2015-05-10T05:00Z type has value end time measurement property 53.40266 latitude -1.463957 longitude
  • 6. Page 6 All queries can contain filters, for example to only return values within a selected date range. From the above informal description the reader may be able to see that SPARQL can be used to programmatically discover what Groundhogs there are, what pollutants each monitors, and then get the readings of those pollutants. Most SPARQL systems offer a range of formats for retrieving results, including CVS (useful for databases), JSON (useful for Javascript web pages), XML (useful for further transpositions) etc. Figure 3 - The SPARQL editor written by C. Boisvert for the Sheffield project Figure 3 illustrates a SPARQL query asking for hourly readings from all available Groundhogs between selected dates, together with the results of this query. One of this paper’s authors installed SPARQL on a hosted server, and set up the necessary triples to allow over-the-web interrogation of the Groundhog database. A second author fed the SPARQL syntax definition into standard open editor tools to generate an editor for queries. Users are invited to use the editor (Boisvert). Some example queries for Groundhogs are available (BWDS). There are also some very good, student appropriate, tutorial guides for SPARQL, e.g. (CAM), accessing some well documented knowledge stores, including Wikipedia’s triple store.
  • 7. Page 7 Tutors following this paper should note that if students cannot master using SPARQL for downloading Groundhog data then the manual page-scraping approach discussed in the previous section ultimately will deliver the same data. 4. Integration of further data sources One of the principles of Data Warehousing when used for analytical purposes (as opposed to “data store housing” for safe custody of data) is to try to give added context to facts, through Dimension descriptors added from other sources. In the current case study Groundhog1 lists temperature and air pressure readings. But other factors may influence pollution formation and/or dispersal as well. Obvious factors are wind strength and humidity. Wind direction is also a factor (but is more complex: if the monitor is directly east of a large polluting factory then a strong wind towards the east will increase measurement values; if the monitor was directly west of the factor, then the same strong towards-east wind would remove the pollution from the area of the monitor; so wind direction is relevant, but the effect is different for each Groundhog) Detailed historic weather data is commercially valuable, and quite hard to find for free download. Sheffield is fortunate in having a local enthusiast who had monitored and published readings at five-minute intervals for all the desired measures. Unfortunately the data is published in PDF format, with documents of around 200 pages per month of data. The tool “Bytescount PDF viewer” can be used to extract all pages into one CVS file. All readers are very strongly requested to contact the data owner to get permissions to use the data for study purposes (any commercial use of the data could cause the site to be closed). 5. Creation of a Data Warehouse, and ETL processes All data values were then uploaded into a Microsoft SQL Server with Business Intelligence database. This software is free (for academic use) to install from Microsoft Dreamspark (MS1) onto university teaching systems and student laptops. Alternatively, students can have 150-day free use of the same software from the Microsoft Azure cloud platform (MS2). MS Azure has convenient setup options for SQL Server Business Intelligence. Alternatives, such as the Weka tools [WEKA], are also envisageble for the steps that follow, and offer opportunities to investigate the processing algorithms further. The authors have found Microsoft tools more appropriate for their student population. Once the data is loaded into tables on the SQL Server it needs to be transformed into formats (discussed below) suitable for data analysis. The case study demonstrates a
  • 8. Page 8 realistic but manageable number of steps that can be found in many Extract- Transform-Load systems of Data Warehouses. All the scripts used for ETL are available from (Love). One of the useful teaching illustrations of the ETL process is to contrast using the Server’s menu-driven wizard approaches for uploading files into tables with SQL scripts that do the same tasks. Students are not always aware that SQL has commands that allow for manipulation of database structure (as opposed to manipulation of data values), but quickly start to see the value of relatively short scripts that can be reused across multiple uploads. A second useful teaching illustration comes from the “Sheffield Weather Page” data being at five minute frequency, while the Groundhog readings vary between twelve and twenty minute frequencies. This was resolved by writing further scripts in SQL (again downloadable from the above link) that first summarised the respective Groundhog and the Weather data into hourly readings (taking the means of readings within each hour, except for wind direction where the most frequent wind direction was taken), and then integrated these into a single observations table. A second database on the same server was opened, and the obervations table copied (via a short script command) into it. Descriptive tables were created that gave informative names and attributes for the Groudhogs, and descriptive category names range limits for each of the weather attributes (for example dimWindSpeed: No wind = 0 kph; very light breeze = 1-3 kph, through to strong winds 20 kph and over). A script then created a data “Star” based on (Kimball)’s designs, with a single Facts table linked to relevant rows in each of the Dimension tables. Having two databases on the same server, one for ETL data acquisition and preparation, and one for storage of the integrated Star of facts and dimension tables, helps students see for themselves the concept of a Data Staging area as described throughout Kimball’s work. Just as Kimball describes, all the “messy” processes happen, hidden from end-user view, in the Staging area. Clean, usable, subject- structured data is then published to data marts. 6. Creation of Data Cube from Data Star SQL Server with Business Intelligence contains a facility for defining Data Cubes for fast analytical processing. Cubes can source their data directly from the uploaded CVS files, but students quickly appreciate the simplicity of sourcing from the Star created in the previous step. Refreshes in the data values in the Star (or even alterations in the design of the Star) can quickly be pulled through into the Cube.
6. Creation of Data Cube from Data Star

SQL Server with Business Intelligence contains a facility for defining Data Cubes for fast analytical processing. Cubes can source their data directly from the uploaded CSV files, but students quickly appreciate the simplicity of sourcing from the Star created in the previous step. Refreshes of the data values in the Star (or even alterations to the design of the Star) can quickly be pulled through into the Cube.

By default, when used for self-service reporting (illustrated below), cubes automatically report totals (sums) of the data values, aggregated over the user-selected timeframe (or geographic distribution, etc.). For example, selecting NO2 (Nitrogen Dioxide) would automatically report the total of all readings ever, or totals per year, per month, per day, or even per hour, depending on what date range the user happened to select. Users usually start at the top (most aggregated) level and then "drill down" for more detail.

In the current case study the averages of pollution values are much more relevant than the totals. It is far easier to compare calendar months of data if averages are used, as this removes the effect of some months being longer than others. Peak values within any selected time frame are also of "headline" interest, but users need to treat such information with caution, as a peak may well be caused by a local factor such as a badly tuned lorry or tractor passing upwind of the monitoring station.

The "Calculated Measures" facility of the Data Cube was used to set up formulas reporting the mean of each of the numeric measures. The formula for a mean is as simple as "Sum of NO2 divided by Count of NO2": the cube automatically applies the context of the current level of drilling across all selected dimensions. Setting up medians is beyond the scope of this simplified case study, but students can discuss how median values could be used to reduce the influence of outlier readings.

Many texts on Data Warehousing use Inmon's term "subject-oriented". In simple case studies students often cannot see the difference between the data sources and the DW's subject orientation. One differentiator is that "business rules" can be encoded into the data, or into the data presentation, within the cubes. For nitrogen dioxide air pollution, 40 µg/m³ is a threshold for concern, and 100 µg/m³ is a threshold for serious concern. Facilities within the Data Cube were used to encode these levels into the colours of the presentation. Key Performance Indicators, with "traffic light" colours and "trend" arrows, could also be set up. The threshold values can usefully be explained to students as examples of Business Metadata, contrasting with the Technical Metadata (such as field data types) more often seen in tutorials.
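In the case study these thresholds are applied through the cube's own KPI and formatting facilities. Purely to illustrate the Business Metadata idea in familiar SQL terms, the same banding could be sketched as a derived column; the table and column names below are hypothetical, with the thresholds as quoted above.

-- Illustrative only: encoding the NO2 concern thresholds as a derived
-- category, mirroring what the cube's KPI / formatting facilities do.
ALTER TABLE FactObservations ADD NO2Band AS (
    CASE
        WHEN NO2 >= 100 THEN 'Serious concern'   -- red
        WHEN NO2 >= 40  THEN 'Concern'           -- amber
        ELSE                 'Within objective'  -- green
    END
);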
7. Self-Service data exploration

Microsoft's preferred "self-service" data exploration tool is Excel. Indeed, a single button click from within the cube development tool will open the cube in Excel, with the data presented via PivotTables. (Readers should note that the Azure cloud platform does not contain Excel. Users can set endpoints to allow their local copy of Excel to link to the cloud server; alternatively they can simply install a 30-day trial copy of Office onto Azure.)

Figure 4 - A self-service display showing Nitrogen Dioxide levels per hour on each day of the week

Students will very quickly (within minutes) start making discoveries about the data. Figure 4 shows nitrogen dioxide pollution levels varying across the day for each day of the week. This image prompted a lot of discussion about the timing of the apparent peak times for pollution (the effect of driving?) and the clear difference between Saturday and Sunday versus the rest of the week.

Figure 5 - Average NO2 levels for categories of temperature
Figure 6 - Average NO2 levels by source direction of the wind (Groundhog 1 monitor)

It can also be discovered that freezing or near-freezing days are associated with high NO2 pollution levels, and that for Groundhog 1 winds from the east bring worse pollution. Students can "self-service" their way to other relationships in the data. Some are obvious (winter months tend to have colder days), but students do get to experience for themselves what it is like to be a data analyst exploring a dataset. Many students do not know that displays other than line graphs and bar charts are available, and useful discussions can be held about using comparative percentages as a means of spotting patterns or exceptions.

8. Data Mining

Many students of databases get a few introductory classes on Data Mining, but may not get to build and use a data mining facility for themselves. Having got the pollution and weather data into SQL Server, the same environment can be used to develop mining reports within a few minutes and with no further coding.
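No coding is needed for the mining wizard itself, but it helps students to see the shape of the "case" data the wizard is pointed at. The sketch below is illustrative only: apart from dimWindSpeed, which is named in section 5, the table and column names are hypothetical.

-- Illustrative case view for the mining models: one row per hourly
-- observation, weather attributes as inputs, NO2 as the predictable value.
CREATE VIEW vMiningCases AS
SELECT
    f.ObsHour,
    t.TempCategory,        -- e.g. 'Freezing', 'Cold', 'Mild', 'Warm'
    w.WindSpeedCategory,   -- e.g. 'No wind', 'Very light breeze', ...
    wd.WindDirection,
    h.HumidityCategory,
    f.NO2                  -- the value the mining models try to predict
FROM FactObservations f
JOIN dimTemperature   t  ON t.TempKey           = f.TempKey
JOIN dimWindSpeed     w  ON w.WindSpeedKey      = f.WindSpeedKey
JOIN dimWindDirection wd ON wd.WindDirectionKey = f.WindDirectionKey
JOIN dimHumidity      h  ON h.HumidityKey       = f.HumidityKey;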
Figure 7 - Data Mining using the SQL Server Business Intelligence suite
Left image: selecting the inputs (weather conditions) and the output to be predicted (the NO2 reading)
Right image: the result of running the Cluster data mining tool

In Figure 7 above a Clustering algorithm has identified ten clusters of weather data. The darker clusters (for example Cluster 9) contain a high proportion of bad pollution days; the lighter clusters (for example Clusters 3 and 6) contain hardly any bad pollution days.

Figure 8 - Understanding the properties of Cluster 9 (the cluster with a large proportion of high NO2 readings)
Left image: properties ranked by probability
Right image: comparison of the properties of Cluster 9 against all other clusters

Cluster 9 can be understood by viewing further screens (see Figure 8). The left image appears to show that this cluster (with its high proportion of high-pollution days) tends to be dry, with low wind, high pressure, high or very high humidity, and cold or near-freezing temperatures. In other words: murky, dry winter mornings. (It is not necessary for the current discussion, but rainfall absorbs NO2, and sunshine breaks NO2 down while giving ozone as a by-product, so warm dry days are not necessarily clear of pollutants.)
Figure 9 - Further analysis of the Weather/Pollution dataset
Left: Association Rules. Right: Decision Trees

Figure 9 (left) shows Association Rules. For example, the first rule listed states that strong winds and warm temperatures are associated with the lowest levels of NO2. Figure 9 (right) shows a Decision Tree; the most influential factor for NO2 appears to be wind speed, with time of day and then air temperature the next deciding factors.

Other Data Mining tools available include Neural Nets, Regression and Naive Bayes. Our experience is that the default settings for each of the analyses shown produce interpretable results quickly. Fine-tuning the parameters (controlling the number of clusters, for example) can improve the interpretability of the results, but the effect is often marginal. However, discussing with students what the parameters do can help them understand the concepts of "supervised" versus "unsupervised" learning. A frequent discussion point is whether the mined categories can then be fed back into the Data Warehouse to fine-tune the Dimension attributes; one possible shape for this is sketched below.
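For classes that want to try this, the following is a sketch only, not part of the case study as run. It assumes the cluster labels have been exported from the mining model into a hypothetical ClusterAssignments table keyed on the observation hour.

-- Sketch: write a mined cluster label back onto the fact table so it can
-- be browsed like any other attribute in the cube.
ALTER TABLE FactObservations ADD WeatherCluster VARCHAR(20) NULL;
GO

UPDATE f
SET    f.WeatherCluster = c.ClusterLabel
FROM   FactObservations  AS f
JOIN   ClusterAssignments AS c ON c.ObsHour = f.ObsHour;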
9. Summary and Contribution to Knowledge

This paper has demonstrated the application of Data Warehousing and Data Mining tools using data gathered from internet sources. Rather than taking conveniently prepared datasets, the paper has shown some of the common difficulties met by database professionals when collecting data from non-traditional sources. Linked Data has been explained and illustrated as a potentially very helpful way out of these difficulties.

All the tools and all the data used are available for free use in academic contexts. Please do ask the data providers for permission to use them, though.

Perhaps the major contribution of this case study is that, although it introduces and discusses a number of "real-world" issues, particularly around the Extract-Transform-Load procedures of data warehousing, the scale of the study remains small enough for students to comprehend quickly. Each of the steps can largely be done by taking the default options, and mistakes in the design can be recovered from simply by re-running the relevant steps. Of course there is a risk in this: students may get the impression that always selecting defaults, without comprehending the alternatives, is the correct thing to do. However, our experience is that it is very helpful to be able to see the "end-to-end picture" at a relatively early stage, and then be able to revisit the pieces to see their connection with mainstream database and data analysis studies.

10. References

All web links accessed 14 May 2015.

Boisvert  SPARQL editor: http://www.boisvert.me.uk/opendata/sparql_aq+.html
BWDS  Example Groundhog queries: https://github.com/BetterWithDataSociety/ShefAirQualityAgent/wiki/Sample-SPARQL
CAM  SPARQL tutorial (with interactive editing and access to public datasets): http://www.cambridgesemantics.com/semantic-university/sparql-by-example
Love  Scripts for summarising data into hourly frequency and constructing the data Star: http://aces.shu.ac.uk/AirQuality
MS1  Microsoft Dreamspark (software repository for academic users): https://www.dreamspark.com
MS2  Microsoft Azure (cloud-based platform supporting SQL Server): http://azure.microsoft.com
ODH  Open Data Handbook: http://opendatahandbook.org/en/what-is-open-data
PHE  Estimating Local Mortality Burdens associated with Particulate Air Pollution. Public Health England, 2014:
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/332854/PHE_CRCE_010.pdf
SCC1  Sheffield City Council, Air Quality web pages: https://www.sheffield.gov.uk/environment/air-quality/monitoring.html
SCC2  Sheffield City Council, Air Pollution Monitoring data pages: http://sheffieldairquality.gen2training.co.uk/sheffield/index.html
W3S  Linked Data: http://www.w3.org/standards/semanticweb/data
WEKA  Data Mining tools: http://www.cs.waikato.ac.nz/ml/weka