1. Domain Analysis and Description
1.1. Describe domain, provide motivation
The domain of our data warehouse covers all necessary information about the students of the
Free University of Bozen-Bolzano. The university consists of 5 faculties (Computer Science,
Economics and Management, Education, Design and Art, Science and Technology). Each faculty has
its own study programs. For example, the Computer Science faculty currently offers 3 study
programs (Bachelor in Computer Science and Engineering, Master of Science in Computer Science,
PhD in Computer Science), and each study program has its own students. The university is
trilingual, so some of the students are expected to know at least three languages.
1.2. Business processes
1.2.1. Student career
The “Student career” process covers the activities a student can perform at the university
(e.g. enroll, study, graduate) together with related information kept for statistics: which
languages the student knows, which universities the student attended before, and which country
the student comes from. Information about internships is also available.
Business questions:
About enrollment
● How many students from non-European countries enrolled in 2012?
● How many local students (from South Tyrol) enrolled in 2011?
● What is the percentage of Italian students enrolled last year?
● How many students enrolled in the Computer Science faculty in 2011?
About graduation
● How many students from Asia finished a master's degree last year?
● How many of the students who enrolled in 2010 have graduated?
● What is the average time to graduation?
About studying process
● How many students are studying in the Economics master's programs?
● How many non-regular students (ERASMUS+, etc.) are studying in the Computer Science
bachelor?
● How many incoming/outgoing (mobility) students are there in 2014?
● How many students terminated (stopped) their studies in 2012?
About languages
● What is the percentage of students from abroad whose primary language at their home
university is German?
● How many students have a level of Italian higher than B1?
1.3. Bus matrix
The bus matrix shows the relations between business processes and dimensions. Different
business processes can use the same dimension when they need the same information.
Business process     Exact Date  Student  Exam  Commission  People  Action  Study program  Event  Internship  Language
Student career fact  X           X                                          X              X      X           X
Student grade fact   X           X        X     X
Card usage fact      X                                      X       X
2. Conceptual Design
The university data warehouse is divided into 3 facts: the student career fact, the student
grade (exam) fact and the card usage fact.
The student career fact provides information about the student, the student's actions at
the university (enrolled, graduated, started studies, paused studies, continued studies, stopped
studies), internships, and the student's knowledge of languages.
The exam fact stores information about the exam, the students who enrolled for the exam and
the faculty in which the student studies. It provides exam information such as status, grade,
course and faculty.
The card usage fact provides information about the activities performed with the card. Each
activity carries information about the card owner plus activity-specific details, such as the
building in which the activity happened.
2.1. Student career fact
● Dimensions
○ Exact Date. This dimension stores the time at which events happened.
The ExactDate column stores a date value. For better query performance,
the date is split into smaller parts: month, year, semester (Winter,
The schema shows that not all hierarchy nodes are mandatory. Some of them are optional
(e.g. Zip code, City, Province). The schema also contains shared hierarchies: the dimensions
Internship, Study program and Student share the information about location.
2.2. Student grade fact
● Dimensions
○ Student. Information about the student: student id, name, surname, student
type (regular, free listener [a student who attends only a few courses and does
not belong to any study program], student working at the university), birth date,
gender, nationality, native language, study program. There are also optional data
fields: email, high school, high school type, home location.
○ Exam. This dimension stores information about the exam, such as teacher, subject,
credit points, room and type.
○ Date. This dimension stores the time at which events happened. The Date column
stores a date value. For better query performance, the date is split into
smaller parts: month, year, semester (Winter, Spring), academic year, week
of year, day of year, day of month, day of week (in words).
○ Commission. This dimension stores information about the people who evaluate the exam.
● Measures
○ Passed
○ Grade
2.3. Card usage fact
● Dimensions
○ People. This dimension holds information about all people at the university who
have a card.
○ Action. This dimension stores information about all activities that can be performed
with the card. Each activity has a type and the place where it happened. The place is
split into smaller levels: object, place, sector, building. An activity can have a
product and a price.
○ Date. This dimension stores the time at which events happened. The column
stores a date value. For better query performance, the date is split into smaller
parts: month, year, semester (Winter, Spring), academic year, week of year,
day of year, day of month, day of week (in words).
● Measures
○ Count.
○ Price per unit with discount
○ Amount
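The attributes of the Date dimension described above can all be derived from a single calendar date when the dimension is loaded. The following is a minimal sketch of that derivation, not the project's actual loading code. The class name DateDimensionRow is hypothetical, the week number follows ISO-8601, and the semester and academic-year boundaries (academic year starting on 1 October, winter semester from October to February) are assumptions, because the report does not state them.

```java
import java.time.LocalDate;
import java.time.format.TextStyle;
import java.time.temporal.IsoFields;
import java.util.Locale;

// Hypothetical helper: one row of the Date dimension derived from a calendar date.
final class DateDimensionRow {
    final int year;
    final int month;
    final int weekOfYear;
    final int dayOfYear;
    final int dayOfMonth;
    final String dayOfWeek;     // day of week in words
    final String semester;      // "Winter" or "Spring"
    final String academicYear;  // e.g. "2014/2015"

    DateDimensionRow(LocalDate date) {
        year = date.getYear();
        month = date.getMonthValue();
        // ISO-8601 week number; the warehouse may number weeks differently.
        weekOfYear = date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        dayOfYear = date.getDayOfYear();
        dayOfMonth = date.getDayOfMonth();
        dayOfWeek = date.getDayOfWeek().getDisplayName(TextStyle.FULL, Locale.ENGLISH);
        // ASSUMPTION: winter semester covers October..February, spring the rest.
        boolean winter = month >= 10 || month <= 2;
        semester = winter ? "Winter" : "Spring";
        // ASSUMPTION: the academic year starts on 1 October.
        int startYear = month >= 10 ? year : year - 1;
        academicYear = startYear + "/" + (startYear + 1);
    }

    public static void main(String[] args) {
        DateDimensionRow row = new DateDimensionRow(LocalDate.of(2015, 3, 16));
        System.out.println(row.dayOfWeek + ", week " + row.weekOfYear
                + ", " + row.semester + " " + row.academicYear);
    }
}
```

Precomputing these attributes once per date is what lets the queries group by year, month or week without calling date functions on every fact row.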
3.2. Student grade fact
What is the average grade per student?
select s.studentid, s.name, s.surname, avg(g.grade)
from
jre_grade g, jre_student s
where
g.studentid = s.studentid
group by s.studentid, s.name, s.surname;
[Figure: tables Student and Grade]
3.3. Card usage fact
The People class selects its data from two different tables, student and other people, so that
student data is not stored twice.
How many paper sheets are used for printing in each month?
select e.year, e.month, sum(c.count)
from
jre_action a, jre_cardusage c, jre_exactdate e
where
a.actionid = c.actionid
and c.dateid = e.dateid
and a.actiontype = 'Print'
group by e.year, e.month;
5. Advanced Querying
● Ranking query using NTILE
Count the students who are studying at the university in 2015 and know a language at
level A or B, per language and level, and divide the results into 4 buckets ordered by
student count.
select
lal.language_name,
lal.language_level,
count(language_level) as language_count,
ntile(4) over (order by count(language_level) )
from
jre_languages_and_levels lal,
jre_exactdate ed,
jre_student_action_fact saf
where
saf.language_id = lal.language_id and
saf.date_id = ed.dateid and
saf.event_id in (6) and
ed.year = 2015 and
lal.language_level in ('A1', 'A2','B1', 'B2')
group by
ed.year,
lal.language_name,
● Windowing query using the windowing clause
How much money is spent with the card every week, and what is the accumulated amount
per week within each year?
select
d.Year, d.month, d.WEEKOFYEAR,
sum(c.amount) weekAmount,
sum(sum(c.amount)) OVER (Partition by d.year ORDER BY d.WEEKOFYEAR
asc) as accumulated
from
jre_action a, jre_exactdate d, allpeople p, jre_cardusage c
where
a.actionid = c.ACTIONID
and d.dateid = c.dateid
and p.id = c.peopleid
and a.actiontype = 'Buy' and a.place = 'Mensa'
group by d.Year, d.month, d.WEEKOFYEAR
order by d.Year, d.WEEKOFYEAR;
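The accumulated column of the query above is a running total that restarts for every year. As a minimal sketch of what the window function computes for one partition (one year), the hypothetical class RunningTotalSketch below accumulates weekly amounts in week order. It assumes one row per week and whole-number amounts, which is a simplification of the query's output.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of SUM(...) OVER (PARTITION BY year ORDER BY week)
// for a single year: each week gets the sum of all weeks up to and including it.
final class RunningTotalSketch {

    static Map<Integer, Integer> accumulate(Map<Integer, Integer> weekAmounts) {
        // Sort by week number, as ORDER BY d.WEEKOFYEAR does.
        Map<Integer, Integer> sorted = new TreeMap<>(weekAmounts);
        Map<Integer, Integer> accumulated = new TreeMap<>();
        int running = 0;
        for (Map.Entry<Integer, Integer> week : sorted.entrySet()) {
            running += week.getValue();
            accumulated.put(week.getKey(), running);
        }
        return accumulated;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> weeks = new TreeMap<>();
        weeks.put(1, 10);   // made-up amounts, for illustration only
        weeks.put(2, 5);
        weeks.put(3, 20);
        System.out.println(accumulate(weeks));
    }
}
```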
● Period-to-period comparison query (a query comparing values across time periods, e.g.,
compare sales for every week of the current year with the sales of the corresponding
weeks in the past year).
How much money did students spend with the card in the cafeteria on each date? (The amount
for each date is compared with the previous and the next date on which transactions took place.)
select
d.EXACTDATE,
sum(c.amount) currentdate,
LAG(SUM (c.amount),1) OVER (Order By d.EXACTDATE) as previousdate,
LEAD(SUM (c.amount),1) OVER (Order By d.EXACTDATE) as nextdate
from jre_action a, jre_exactdate d, allpeople p, jre_cardusage c
where
a.actionid = c.ACTIONID
and d.dateid = c.dateid
and p.id = c.peopleid
and a.actiontype = 'Buy' and a.place = 'Mensa'
and p.type = 'Student'
group by d.EXACTDATE
order by d.EXACTDATE asc;
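LAG and LEAD in the query above read the value of the neighbouring row in date order, and return NULL where no such row exists (the first row has no previous date, the last row has no next date). A minimal sketch of that behaviour, using a hypothetical class LagLeadSketch and a list of daily totals that is assumed to be already sorted by date:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of LAG(x, 1) and LEAD(x, 1) over rows ordered by date.
final class LagLeadSketch {

    // Value of the previous row, or null for the first row.
    static Integer lag(List<Integer> dailyAmounts, int row) {
        return row > 0 ? dailyAmounts.get(row - 1) : null;
    }

    // Value of the next row, or null for the last row.
    static Integer lead(List<Integer> dailyAmounts, int row) {
        return row < dailyAmounts.size() - 1 ? dailyAmounts.get(row + 1) : null;
    }

    public static void main(String[] args) {
        List<Integer> dailyAmounts = Arrays.asList(12, 7, 30); // made-up daily totals
        for (int row = 0; row < dailyAmounts.size(); row++) {
            System.out.println(dailyAmounts.get(row)
                    + " previous=" + lag(dailyAmounts, row)
                    + " next=" + lead(dailyAmounts, row));
        }
    }
}
```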
6. Query performance
The three most frequently used queries:
● How much money did students spend with the card in the cafeteria on each date? (The amount
for each date is compared with the previous and the next date on which transactions took place.)
select
d.EXACTDATE,
sum(c.amount) currentdate,
LAG(SUM (c.amount),1) OVER (Order By d.EXACTDATE) as previousdate,
LEAD(SUM (c.amount),1) OVER (Order By d.EXACTDATE) as nextdate
from jre_action a, jre_exactdate d, allpeople p, jre_cardusage c
where
a.actionid = c.ACTIONID
and d.dateid = c.dateid
and p.id = c.peopleid
and a.actiontype = 'Buy' and a.place = 'Mensa'
and p.type = 'Student'
group by d.EXACTDATE
order by d.EXACTDATE asc;
● How much money did students spend with the card in the cafeteria in a specific year, month and week?
select d.year, d.month, d.weekofyear, sum(c.amount)
from jre_action a, jre_exactdate d, allpeople p, jre_cardusage c
where
a.actionid = c.ACTIONID
and d.dateid = c.dateid
and p.id = c.peopleid
and a.actiontype = 'Buy' and a.place = 'Mensa'
and p.type = 'Student'
group by rollup(d.year, d.month, d.weekofyear)
order by d.year, d.weekofyear;
● Which activity is used most with the card in each month of the year?
select
e.month,
a.actiontype as activity,
count(a.actionid) as usedCount,
DENSE_RANK() OVER (ORDER BY count(a.actionid) desc) as rank
from
jre_action a, jre_cardusage c, jre_exactdate e
where
a.actionid = c.ACTIONID
and e.dateid = c.dateid
group by a.actiontype, e.month;
Dimensions used by the three queries:
1. ExactDate, Action, People G1={e,a,p}
2. ExactDate, Action, People G1={e,a,p}
3. ExactDate, Action G2={e,a}
The node relation diagram was created using the lattice framework.
Using the greedy algorithm, the candidate nodes for a materialized view are colored gray:
The materialized view chosen to optimize select time is G2={e,a}, because
e.actiontype as activity,
count(e.actionid) as usedCount,
DENSE_RANK() OVER (ORDER BY count(e.actionid) desc) as rank
from jre_mv_date_action_sum e
group by e.actiontype, e.month;
● Gain from materialized view
Speed tests (Query1, Query2, Query3) were run with the previously chosen queries. Each query
was run in two variants that select the same data: one uses the materialized view and the
other does not.
Results:
Tests    Time without MV   Time with MV   Time improvement (without/with)
Query1   0.176             0.109          1.6x
Query2   0.224             0.027          8.3x
Query3   0.112             0.093          1.2x
The test results show that all three queries execute faster with the materialized view than
without it. The absolute differences are small because the amount of data is small; the relative
improvement is shown in the column “Time improvement (without/with)”.
Query2 executes 8.3 times faster with the materialized view, which suggests that the view will be
very useful with large data.
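The “Time improvement” column is the ratio of the time without the materialized view to the time with it. As a small worked check of the table, the hypothetical class Speedup below recomputes the ratios from the measured times and rounds them to one decimal place.

```java
import java.util.Locale;

// Hypothetical helper that recomputes the "Time improvement (without/with)" column.
final class Speedup {

    // Ratio of the time without the materialized view to the time with it.
    static double ratio(double timeWithoutMv, double timeWithMv) {
        return timeWithoutMv / timeWithMv;
    }

    // Formats the ratio with one decimal place, e.g. "8.3x".
    static String format(double ratio) {
        return String.format(Locale.ROOT, "%.1fx", ratio);
    }

    public static void main(String[] args) {
        System.out.println("Query1 " + format(ratio(0.176, 0.109))); // 1.6x
        System.out.println("Query2 " + format(ratio(0.224, 0.027))); // 8.3x
        System.out.println("Query3 " + format(ratio(0.112, 0.093))); // 1.2x
    }
}
```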
● Cost of the materialized view
The cost is the space used to store the precomputed data.
The materialized view uses data from the dimensions Action and ExactDate and from the fact
table Cardusage. The view stores data that is already calculated.
In the worst case, the number of rows that can be generated in the materialized view every day is:
number of rows per day = number of all people * number of existing activities.
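To get a feeling for this bound, the worst-case formula can be evaluated for concrete sizes. The sketch below uses the hypothetical class MaterializedViewSize; the figures in main (4000 people, 20 activities, 365 days) are made-up illustration values, not numbers from the project.

```java
// Hypothetical helper that evaluates the worst-case size formula of the materialized view.
final class MaterializedViewSize {

    // Worst case per day: every person performs every existing activity.
    static long worstCaseRowsPerDay(long people, long activities) {
        return people * activities;
    }

    // Worst case accumulated over a number of days.
    static long worstCaseRows(long people, long activities, long days) {
        return worstCaseRowsPerDay(people, activities) * days;
    }

    public static void main(String[] args) {
        // ASSUMPTION: illustration values only, not taken from the report.
        long people = 4000;
        long activities = 20;
        System.out.println("rows per day:  " + worstCaseRowsPerDay(people, activities));
        System.out.println("rows per year: " + worstCaseRows(people, activities, 365));
    }
}
```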
Tasks
● Task: Instant temporal aggregation
● Question: What is the average salary?
● Algorithm:
First cycle
○ Mapper
■ The input is a record from the data set
■ Select the values: salary, start time and end time
■ Go through each time instance between start and end and create an output pair
[time instance; salary]
○ Reducer
■ The input key is a time instance and the value is the list of salaries
■ Go through all salaries, sum them and count how many salaries were
summed
■ Calculate the average value by dividing the sum by the count
Second cycle
○ Mapper
■ Get the time as key and the average salary as value
■ The output key is the salary and the value is the time
○ Reducer
■ The input key is a salary and the input value is the list of all time instances
■ The list is sorted and consecutive time instances are grouped into intervals
The algorithm output is all average salaries sorted in ascending order and, for each salary,
all intervals in which it applies.
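The two cycles described above can be tried out without a Hadoop cluster. The following is a minimal in-memory sketch, not the project's MapReduce code: the hypothetical class TemporalAverageSketch uses sorted maps to stand in for the shuffle phase, and it assumes the record format personId;salary;start;end and integer division for the average, as in the report's data sets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the two MapReduce cycles.
final class TemporalAverageSketch {

    // Cycle 1: expand every record to its time instances, then average per instance.
    static Map<Integer, Integer> averagePerInstant(List<String> records) {
        Map<Integer, List<Integer>> salariesByInstant = new TreeMap<>();
        for (String record : records) {
            String[] parts = record.split(";"); // personId;salary;start;end
            int salary = Integer.parseInt(parts[1]);
            int start = Integer.parseInt(parts[2]);
            int end = Integer.parseInt(parts[3]);
            for (int t = start; t <= end; t++) {
                salariesByInstant.computeIfAbsent(t, k -> new ArrayList<>()).add(salary);
            }
        }
        Map<Integer, Integer> averages = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : salariesByInstant.entrySet()) {
            int sum = 0;
            for (int salary : e.getValue()) {
                sum += salary;
            }
            averages.put(e.getKey(), sum / e.getValue().size()); // integer division
        }
        return averages;
    }

    // Cycle 2: invert to salary -> time instances, then merge consecutive instances.
    static Map<Integer, String> intervalsPerAverage(Map<Integer, Integer> averages) {
        Map<Integer, List<Integer>> instantsBySalary = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(averages).entrySet()) {
            instantsBySalary.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : instantsBySalary.entrySet()) {
            result.put(e.getKey(), toIntervals(e.getValue()));
        }
        return result;
    }

    // Groups an ascending list of time instances into "t" or "[from-to]" pieces.
    static String toIntervals(List<Integer> sortedInstants) {
        StringBuilder out = new StringBuilder();
        int start = sortedInstants.get(0);
        int end = start;
        for (int t : sortedInstants) {
            if (t > end + 1) {          // gap: close the current interval
                append(out, start, end);
                start = t;
            }
            end = t;
        }
        append(out, start, end);        // close the last interval
        return out.toString();
    }

    private static void append(StringBuilder out, int start, int end) {
        if (out.length() > 0) {
            out.append(' ');
        }
        out.append(start == end ? String.valueOf(start) : "[" + start + "-" + end + "]");
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "0;800;1;14", "1;400;3;6", "2;300;4;7", "0;500;4;5", "0;500;7;8");
        Map<Integer, String> result = intervalsPerAverage(averagePerInstant(records));
        for (Map.Entry<Integer, String> e : result.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

The records in main are the five input lines of the report's test example, so the printed result can be compared with the output listed there.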
Functions (code)
Cycle 1
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Mapper: emits one pair [time instance; salary] for every instance in [start, end].
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Input line format: personId;salary;start;end
    String[] parts = value.toString().split(";");
    IntWritable salary = new IntWritable(Integer.parseInt(parts[1]));
    int start = Integer.parseInt(parts[2]);
    int end = Integer.parseInt(parts[3]);
    for (int x = start; x <= end; x++) {
        context.write(new Text(Integer.toString(x)), salary);
    }
}

// Reducer: averages all salaries that are valid at one time instance (integer division).
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    context.write(key, new IntWritable(sum / count));
}
Cycle 2
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Mapper: swaps the pair so that the average salary becomes the key.
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Input line format (output of cycle 1): timeInstance<TAB>averageSalary
    String[] parts = value.toString().split("\t");
    Text salary = new Text(parts[1]);
    IntWritable time = new IntWritable(Integer.parseInt(parts[0]));
    context.write(salary, time);
}

// Reducer: sorts the time instances of one salary and merges consecutive ones into intervals.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    List<Integer> list = new ArrayList<Integer>();
    for (IntWritable value : values) {
        list.add(value.get());
    }
    Collections.sort(list);
    StringBuilder s = new StringBuilder();
    int start = list.get(0);
    int end = start;
    for (int value : list) {
        if (value > end + 1) {
            // Gap found: write the finished interval and start a new one.
            if (start == end) s.append(" ").append(start);
            else s.append(" [").append(start).append("-").append(end).append("]");
            start = value;
        }
        end = value;
    }
    // Write the last open interval.
    if (start == end) s.append(" ").append(start);
    else s.append(" [").append(start).append("-").append(end).append("]");
    // The result is written as the key; 999 is only a placeholder value.
    context.write(new Text(key.toString() + s), new IntWritable(999));
}
Test example
Input:
0;800;1;14
1;400;3;6
2;300;4;7
0;500;4;5
0;500;7;8
Output after first map reduce cycle:
1 800
10 800
11 800
12 800
13 800
14 800
2 800
3 600
4 500
5 500
6 500
7 533
8 650
9 800
Output after second cycle:
500 [4-6]
533 7
600 3
650 8
800 [1-2] [9-14]
Speed tests:
Figure 1. Graph of the MapReduce calculations, where the x axis is the data size in kilobytes
and the y axis is the time in seconds.
Data set      200Kb    400Kb    600Kb    800Kb    1Mb
Equal data    7.9      9.9      12.4     15.5     19.4
Random data   213.4    268.8    338.7    426.7    537.7
Seq data      262.6    335.5    426.1    541.7    687.3
Worst data    2261.9   2940.5   3822.7   4969.5   6460.3
Table 1 shows the speed test results for the different data sets.
● The data sets were taken from the course datasets:
http://www.inf.unibz.it/dis/teaching/ADMT/proj/data/
All data sets have 4 columns: the first is the person id, the second the salary, the third
the begin timestamp and the fourth the end timestamp. The sizes of the data sets were 200Kb,
400Kb, 600Kb, 800Kb and 1Mb. The types of data sets were:
● Equal data. This data set uses the same time value in ‘timestamp begin’ and
‘timestamp end’ in a row: