Introduction
This report discusses the programming process that I developed
and used to produce the data required for parts two and three.
The data covers one region over a two-month period in each of
the years 2011, 2012 and 2013, and it contains two different
types of information: the climatic conditions recorded and the
power consumption for that period of time.
Climatic conditions
The C program I developed for the weather data focuses on
extracting the right years. Its job is to process a large dataset
containing climatic conditions recorded across various regions,
reducing it down to just the data values that are relevant to the
desired region (Auckland) so that its details for January and
February of the given years can be obtained. The idea is to
collect the desired 2011 and 2012 information and write it to a
separate CSV file, and then to do the same for 2013.
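A minimal sketch of this filtering step is shown below. This is
not the actual program; the column positions (station name in
column 0, a YYYYMMDD date in column 1) and the station label
"Auckland" are assumptions, and the real NIWA files may lay the
fields out differently.

```c
#include <string.h>

/* Split one CSV line in place; returns the number of fields found.
   Note: strtok collapses consecutive commas, so empty fields are
   skipped (a simplification that is fine for this sketch). */
int split_csv(char *line, char *fields[], int max_fields) {
    int n = 0;
    char *p = strtok(line, ",\r\n");
    while (p != NULL && n < max_fields) {
        fields[n++] = p;
        p = strtok(NULL, ",\r\n");
    }
    return n;
}

/* Keep a row only if its station matches the desired region and its
   date (assumed YYYYMMDD) falls in January or February. */
int keep_row(char *fields[], int nfields) {
    if (nfields < 2) return 0;
    if (strcmp(fields[0], "Auckland") != 0) return 0; /* assumed station column */
    const char *date = fields[1];                     /* assumed date column    */
    if (strlen(date) < 6) return 0;
    int month = (date[4] - '0') * 10 + (date[5] - '0');
    return month == 1 || month == 2;
}
```

Each line read from the input file would be passed through
split_csv and then keep_row, with only the kept lines written to
the output file.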
Power consumption
The same idea applies to power consumption: here I used two
processes on a very large data file to generate a filtered file.
Although that large file contained only the required years, it
still held details for unwanted months that needed to be
excluded. The first process used a C program to extract the
desired two months by printing out the first two months of each
year: printing stops at the end of the second month of each year
and jumps ahead to the following year to complete the process.
The second process combined every two rows of the filtered
file, because each row held a power consumption reading taken
every 5 minutes, while the requirement was a ten-minute
reading for each row.
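The second process (pairing consecutive 5-minute rows into one
10-minute reading) can be sketched as follows; summing the two
consumption values is an assumption, and averaging would be the
one-line change noted in the comment:

```c
#include <stddef.h>

/* Combine consecutive pairs of 5-minute readings into 10-minute
   readings. Writes floor(n/2) combined values into out and returns
   that count; a trailing unpaired reading is dropped. */
size_t combine_pairs(const double *in, size_t n, double *out) {
    size_t m = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        /* Assumed: consumption values add up over time.
           Use (in[i] + in[i + 1]) / 2.0 instead if they are averages. */
        out[m++] = in[i] + in[i + 1];
    }
    return m;
}
```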
After completing these processes and generating the filtered
files, we need to load the information in these files into Weka
to undertake a data modelling task, and then use different
visualization techniques to see how well the resulting model
predicts. The following sections show how the generated data,
both the weather data and the power consumption data, is used
in data mining and data visualization.
STAT390-14B (Ham): Directed Study Project
Individual Project Focus: Work vs. Play
Project co-ordinator: Associate Professor David Bainbridge
Process the weather data for Auckland in January and February
in the given dataset (10 minute readings) and experiment with
various data mining techniques to see if a model can be
generated that predicts power consumption for Monday-Friday
(work), Saturday, and Sunday (play). Is it easier to predict the
power usage for one of these time periods? Trial having
Saturday and Sunday represented as a single entity (i.e. the
weekend) and as separate days.
The aim of this directed study project is to combine the
programming skills learnt in COMP5002 (BoPP) with the Data
Mining MOOCs that were studied earlier in the semester at
Waikato, and the JavaScript skills for web use taught in
COMP 223 (TGA) in the A semester of this year.
The central theme to this project—shared across all the projects
being run in this course—is to
investigate the relationship between power usage in New
Zealand, and chronological data (the time
of day and the time of year) and meteorological data (the
weather!) to see if any patterns exist;
more specifically, to see whether the latter information helps
predict the former. Each project
investigates a separate aspect within this theme, applying Data
Mining techniques to publicly
available data produced by both Transpower and the National
Institute of Water and Atmospheric
Research (NIWA), from which a range of visualizations will be
generated.
The key steps to the project are:
1. Undertake data cleaning and processing of a rich dataset
containing information that
captures power consumption and climatic conditions recorded
across various regions of
New Zealand.
2. Feed the processed data into Weka to undertake a data
modelling task.
3. Produce a set of visualizations that provide insight into the
generated data.
Two types of visualization will be produced: the first is focused
on showing how well the predictive
modelling is performing; the second is a more open-ended task,
with the aim of showing “something
interesting” in the data related to the project’s focus. An
example of “something interesting” could
be a time-based geographical map showing power usage in the
different regions of New Zealand
enriched with what is happening in terms of temperature in the
different regions. For further ideas
see Prof Apperley’s Data Visualization slides, available through
the STAT390 web site:
www.cs.waikato.ac.nz/~davidb/stat390/
The dataset provided for this project (also available for
download through the same web site) is in
the form of a set of Comma Separated Value (CSV) files. The
files span a mixture of years and
locations within New Zealand. While each project is different,
there is one common dimension to
how the data is to be used:
- data from 2011 and 2012 is used for training the Data Mining
models;
- data from 2013 is used for establishing the accuracy of the
models developed.
We will now go through and detail what is involved in the three
key steps to the project. The
schedule (see below) allows 1 week for each of these steps,
although it should be noted there is
some flexibility around this, as long as the final deadlines—a
presentation and a report, due in the
final week—are met. If at any point during the project you wish
to go back to an earlier step and
revise/adjust what you have done, this is not only permissible, it
is actively encouraged (!), as it
reflects an increased level of understanding. At the end of each
week, a 2–3 page “mini” report is
requested describing the work you have done that corresponds
to the relevant step in the schedule.
The intention of each mini report is to help you develop a
section of the final report. Feedback on
mini reports submitted according to the schedule will be given
to assist you in developing the final
report.
Step 1: Data processing and cleaning
One of the first things you will need to do in this project is to
process the provided dataset into a
more amenable form, reducing it down to just the data values
that are meaningful to your project.
Example C code is provided on the course web site for reading
in CSV files, breaking each line into
individual fields, and then writing out a selection of those
fields.
The code you need to write needs to go beyond this. The fields
you select will be motivated by what
type of data you have been directed to focus on for the Data
Mining step. You will also need to
develop ways of controlling which lines of the CSV files make
it through to the next stage of
processing: filtered, for example, by time or location—the exact
details again will be determined by
the task you have been assigned in your project.
There are also undefined values to be aware of. These are
typically represented as a hyphen (-) in
the CSV files. Sometimes you might find an entire column will
consist of hyphens (for the particular
lines of the data you have filtered down to), other times most of
the values will be there, with only
the occasional hyphen.
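Handling these hyphens can be as simple as the sketch below (this
is not the provided course code): treat a field that is exactly
"-" as missing, and compute summaries over the defined values
only.

```c
#include <stdlib.h>
#include <string.h>

/* A field is undefined if it is exactly a single hyphen. */
int is_missing(const char *field) {
    return strcmp(field, "-") == 0;
}

/* Average the defined values among n fields; returns how many values
   were used (0 means every field was a hyphen and *avg is untouched). */
int average_defined(char *fields[], int n, double *avg) {
    double sum = 0.0;
    int used = 0;
    for (int i = 0; i < n; i++) {
        if (!is_missing(fields[i])) {
            sum += atof(fields[i]);
            used++;
        }
    }
    if (used > 0) *avg = sum / used;
    return used;
}
```

A column that is entirely hyphens (used == 0) would simply be
dropped from the processed output.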
In preparing the way for the Data Mining step, something you
might consider doing is to merge data
fields (either in rows or columns). For example, 12 power
readings taken every 5 minutes could be
combined to provide an hourly figure instead, which would fit
more nicely with weather data
reported every hour.
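The 5-minutes-to-hourly merge suggested here can be sketched as
below. Summing the group is an assumption that suits consumption
totals; replace the sum with sum / k if the readings should be
averaged instead.

```c
#include <stddef.h>

/* Sum each group of k consecutive readings, e.g. k = 12 turns
   5-minute readings into hourly totals. Incomplete trailing groups
   are dropped. Returns the number of combined values written to out. */
size_t combine_groups(const double *in, size_t n, size_t k, double *out) {
    size_t m = 0;
    if (k == 0) return 0;
    for (size_t i = 0; i + k <= n; i += k) {
        double sum = 0.0;
        for (size_t j = 0; j < k; j++)
            sum += in[i + j];
        out[m++] = sum;
    }
    return m;
}
```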
Your 2-3 page “mini” report for this step of the project should
detail the decisions you made about how
the data needed to be processed, and how that was
accomplished.
Step 2: Data Mining with Weka
The second step to the project is to load the processed data into
Weka and start experimenting with
the data to develop a model that can predict power usage. To
reiterate what was stated above, use
data from 2011 and 2012 to train your models, and then run them
on the 2013 values (test data) to
establish how accurate the predictions are. While a technique
such as 10-fold cross-validation is a
quick and convenient way to gauge how well a model is
performing (in general)—and you may very
well use this in early stages of testing—the need of this project
is to produce a model that can go on
to be used to make predictions on other (previously unseen)
data.
It is anticipated that the Explorer tool will be the most likely
sub-system you will work with in Weka,
and within that the Classifier section; however, there are no
hard-and-fast rules here. Use what you
have learned in the Data Mining MOOCs wisely. When data is
“flying around” at speed it is easy to
overlook important details that turn what would otherwise be a
highly successful model into
garbage! Similarly, accidentally including the feature you wish
to learn in the set of attributes used
to train a model is a mistake that is easy enough to make if you
are not careful. In such cases it leads
to results that are amazingly high. If the results you are getting
seem “too good to be true” … that
might very well be exactly what is going on! In short, “know
your data.”
The key ability for this stage of the project is to be able to train
a model on the 2011 and 2012 data,
from which a run can be made against a test set (2013) with the
predictions saved in a
machine-readable form, ready for processing by Step 3 (Data
Visualization).
The “mini” report for this step of the project should provide an
overview of the different methods
you experimented with, along with the one that you established
performed the best, and the
reasons for why that was.
Step 3: Data Visualization
There are two parts to the data visualization step:
- charts which show how well the chosen Data Mining model is
performing in making predictions about power usage;
- a more open-ended visualization, produced with other
visualization software if desired: the key requirement for this
part of the project is to visually show something interesting
about the cleaned up and processed data that has been produced.
Data Model Accuracy visualized with Google Charts
Google Charts (https://developers.google.com/chart/) is a web-
based technology for presenting data
in a variety of forms. There are over 25 standard forms to
choose from. See Google’s web site for
extensive documentation, and the course web site for some
selected examples that are more closely
aligned with the needs of the Directed Study project.
Produce “Something Interesting”
For this final part of the project you might choose to visualize
something interesting that has been
produced as a result of your experimentation using Weka, but
equally it might be something that is
already present in the data produced in Step 1 of the project (no
Data Mining required).
The overall intention for this part of the project is to think back
to (and look back at, since the slides
are on the course web site!) the Data Visualization examples
given by Prof Apperley in the first week
of the project, and be inspired by this to produce a visualization
that shows “something interesting”
in the dataset you have been working with.
To achieve this, the scope for this part of the project can be
widened. For example, if the focus of
the project had been to compare extremes of latitude, using
Auckland and Invercargill as the two
extremes, then in the “something interesting” visualization it is
permissible to broaden this to other
centres: the visualization produced could be, say, a map of New
Zealand showing power-usage and
temperature data across all the main centres of population in the
country, over time. Going further,
if the visuals drawn on the map per centre make more sense if
normalized by city population, then
this information too can be added in to the dataset used to
produce the visualization.
Given such an open-ended brief, if you are at all unsure what to
attempt for this final part of the
project then please consult with me for guidance as to what is a
reasonable expectation.
As a final remark, for this visualization do not feel constrained
to working with Google Charts
(although that is a valid option). There are several interactive
Data Visualization resources on-line,
such as IBM’s ManyEyes (http://www.ibm.com/manyeyes), that
allow you to upload datasets to
their web site, from which you can then develop your
visualizations.
Schedule
- Mini-report on Step 1 submitted (5pm, G1.15); feedback on
mini-reports can be collected from the department office.
- Mini-report on Step 2 submitted (5pm, G1.15); feedback on
mini-reports can be collected from the department office.
Deadlines