ADTMreport

Advanced Data Management
Technologies
Project Module 1 – Data Warehouse

Jonas Monkevičius
Rokas Mačiulaitis
Evija Urtāne

2015

1. Domain Analysis and Description
1.1. Describe domain, provide motivation

The domain of our data warehouse covers the information needed about students of the
Free University of Bozen-Bolzano. The university consists of 5 faculties (Computer Science,
Economics and Management, Education, Design and Art, Science and Technology). Each faculty
has its own study programs. For example, the Computer Science faculty currently offers 3 study
programs (Bachelor in Computer Science and Engineering, Master of Science in Computer
Science, PhD in Computer Science), and each study program has its own students. The
university is trilingual, so some of the students need to know at least three languages.
1.2. Business processes
1.2.1. Student career
The process “Student career” covers the activities a student can perform at the university (e.g.
enroll, study, graduate) together with related statistical information about the student, such as
which languages the student knows, which universities the student attended before, and which
country the student comes from. It also provides information about internships.
Business questions:
About enrollment
● How many students from non-European countries enrolled in 2012?
● How many local students (from South Tyrol) enrolled in 2011?
● What is the percentage of Italian students among those enrolled last year?
● How many students enrolled in the Computer Science faculty in 2011?
About graduation
● How many students from Asia finished a master's degree last year?
● How many of the students who enrolled in 2010 have graduated?
● What is the average time to graduation?
About studying process
● How many students are studying in the Economics master's programs?
● How many non-regular students (ERASMUS+, etc.) are studying in the Computer Science
bachelor?
● How many incoming/outgoing (mobility) students were there in 2014?
● How many students terminated (stopped) their studies in 2012?
About languages
● What is the percentage of students from abroad whose primary language at their home
university is German?
● How many students have an Italian language level higher than B1?
● What percentage of students from South Tyrol chose English as their primary language at
the Bolzano university from 2002 to 2014?
Dimensions:
Event, internship, student, languages and levels, exact date, study program
Measures:
Number of students, internship grade
1.2.2. Grades
The process “Grades” covers activities related to exams (e.g. a student takes an exam and
passes or fails it, or does not show up for it).
Business questions:
● What is the average grade per student/ course/ lecturer/ study program/ faculty?
● What is the percentage of passed/ failed/ no-show exams per student/ course?
Dimensions:
Exam, Student, Date, Commission
Measures:
Passed, Grade
1.2.3. Card usage
The process “Card usage” covers all activities people can perform with the university card (e.g.
buying food/coffee/snacks, printing, scanning, copying, opening doors, borrowing items, loading
money onto the card).
Business questions:
● How much money do students spend in the cafeteria per week?
● Which activity is used most, by students and by teachers? (Printing, buying in cafeteria/
uniBar/ coffee machine/ snack machine, library, door opening, loans for the Design faculty)
● How many sheets of paper are used for printing each month?
● What percentage of students eat in the cafeteria?
Dimensions:
People, Action, Date
Measures:
Count, Price per unit with discount, Amount

1.3. Bus matrix
The bus matrix shows the relations between business processes and dimensions. Different
business processes can use the same dimension when they need the same information.

Business process      Exact Date  Student  Exam  Commission  People  Action  Study program  Event  Internship  Language
Student career fact   X           X        -     -           -       -       X              X      X           X
Student grade fact    X           X        X     X           -       -       -              -      -           -
Card usage fact       X           -        -     -           X       X       -              -      -           -
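
The same matrix can be held as a small data structure, which makes the shared dimensions easy to list. The Java sketch below (assuming Java 8 or later) is only an illustration: the class and method names are made up, and it assumes that the "Date" dimension named for the grade and card usage processes is the same Exact Date dimension used by the student career process.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of the project code.
final class BusMatrix {
    // Dimensions per business process, as listed in sections 1.2.1 to 1.2.3.
    // Assumption: "Date" and "Exact Date" are one conformed dimension.
    static Map<String, Set<String>> matrix() {
        Map<String, Set<String>> m = new LinkedHashMap<>();
        m.put("Student career", new TreeSet<>(Arrays.asList(
                "Exact Date", "Student", "Study program", "Event", "Internship", "Language")));
        m.put("Student grade", new TreeSet<>(Arrays.asList(
                "Exact Date", "Student", "Exam", "Commission")));
        m.put("Card usage", new TreeSet<>(Arrays.asList(
                "Exact Date", "People", "Action")));
        return m;
    }

    // A dimension is shared when more than one process uses it.
    static Set<String> sharedDimensions(Map<String, Set<String>> m) {
        Map<String, Integer> uses = new TreeMap<>();
        for (Set<String> dims : m.values()) {
            for (String d : dims) {
                uses.merge(d, 1, Integer::sum);
            }
        }
        Set<String> shared = new TreeSet<>();
        for (Map.Entry<String, Integer> e : uses.entrySet()) {
            if (e.getValue() > 1) {
                shared.add(e.getKey());
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(sharedDimensions(matrix())); // prints [Exact Date, Student]
    }
}
```

Under that assumption only Exact Date and Student are used by more than one process, so these are the dimensions whose keys and attributes have to agree across facts.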

2. Conceptual Design

The university data warehouse is divided into 3 facts: student career fact, student grade
(exam) fact, card usage fact.
The student career fact provides information about the student, the student's actions at
the university (enrolled, graduated, started studies, paused studies, continued studies, stopped
studies), internships, and the student's knowledge of languages.
The exam fact stores information about exams, the students who registered for an exam,
and the faculty in which each student studies. It provides exam information such as status,
grade, course and faculty.
The card usage fact provides information about the activities performed with the card.
Each activity carries information about the card owner and activity-specific details, such as
the building in which the activity happened.
2.1. Student career fact

● Dimensions
○ Exact Date. This dimension stores the time at which events happened.
The ExactDate column stores a value of type date. For better query
performance, the date is also split into smaller parts: month, year,
semester (Winter, Spring), academic year, week of year, day of year, day
of month, day of week (in words).
○ Student. Information about the student: student id, name, surname,
student type (regular, free listener [a student who attends only a few
courses and does not belong to any study program], student working at
the university), birth date, gender, nationality, native language, study
program. There are also optional data fields: email, high school, high
school type, home location.
○ Study Program. This dimension stores information about the study
program. Its data fields are: university, faculty, study program, curriculum
(learning plan code), study type (full time studies, only one course),
degree type (bachelor, master, PhD), location (location of the faculty).
○ Event. This dimension lists the types of events available at the
university: enrolled, graduated, knows language, started internship,
completed internship, started studies, paused studies, continued studies,
stopped studies, studying, studying in Italian, studying in German,
studying in English. The dimension also stores a description of each
event.
○ Internship. Information about the internship: the grade given by the
supervisor in the organization after the internship is completed. Other
fields: company (organization name), internship type (winter, summer),
university supervisor name, university supervisor surname, organization
supervisor name, organization supervisor surname, internship location.
○ Languages And Levels. This dimension records which languages a student
knows and at which level. Its data fields are: language code (3-letter
code (ITA, ENG, GER, ...)), language name, language description, language
level.
● Measures
○ Students count, grade of the internship.

The schema shows that not all hierarchy nodes are mandatory; some are optional (e.g. Zip
code, City, Province). The schema also contains shared hierarchies: the dimensions Internship,
Study program and Student share the location information.
2.2. Student grade fact
● Dimensions
○ Student. Information about the student: student id, name, surname, student
type (regular, free listener [a student who attends only a few courses and does
not belong to any study program], student working at the university), birth date,
gender, nationality, native language, study program. There are also optional data
fields: email, high school, high school type, home location.
○ Exam. This dimension stores information about the exam, such as teacher,
subject, credit points, room and type.
○ Date. This dimension stores the time at which events happened. The Date
column stores a value of type date. For better query performance, the date is
also split into smaller parts: month, year, semester (Winter, Spring), academic
year, week of year, day of year, day of month, day of week (in words).
○ Commission. This dimension stores information about the people who evaluate
the exam.
● Measures
○ Passed
○ Grade

2.3. Card usage fact
● Dimensions
○ People. This dimension holds information about all people at the university who
have a card.
○ Action. This dimension stores information about all activities that can be
performed with the card. Each activity has a type and the place where it
happened. The place is split into smaller parts: object, place, sector, building.
An activity can have a product and a price.
○ Date. This dimension stores the time at which events happened. The column
stores a value of type date. For better query performance, the date is also split
into smaller parts: month, year, semester (Winter, Spring), academic year, week
of year, day of year, day of month, day of week (in words).
● Measures
○ Count.
○ Price per unit with discount
○ Amount

3. Logical Design
3.1. Student career fact

How many students are studying in the Economics master's programs?
select count(distinct student_id)
from
  jre_student_action_fact a, jre_studyprogram s, jre_event e
where
  a.study_program_id = s.studyprogramid
  and a.event_id = e.event_id
  and e.event_type = 'studying'
  and university = 'Free University of Bozen'
  and faculty = 'Economics and Management'
  and degreetype = 'Master';

StudyProgram

Event

StudentActionFact

3.2. Student grade fact

What is the average grade per student?
select s.name, s.surname, avg(g.grade)
from
jre_grade g, jre_student s
where
g.studentid = s.studentid
group by s.name, s.surname;

Student

Grade

3.3. Card usage fact

The People dimension reads its data from two different tables, students and other people, so
that the student data is not stored twice.

How many paper sheets are used for printing in each month?
select e.year, e.month, sum(count)
from
  jre_action a, jre_cardusage c, jre_exactdate e
where
  a.actionid = c.actionid
  and c.dateid = e.dateid
  and a.actiontype = 'Print'
group by e.year, e.month;

Action

ExactDate

CardUsage

4. Implementation

● One query that uses the ROLLUP, CUBE or GROUPING SETS operator

How much money did students spend with the card in the cafeteria per year, month and week?

select d.year, d.month, d.weekofyear, sum(c.amount)
from jre_action a, jre_exactdate d, allpeople p, jre_cardusage c
where
a.actionid = c.ACTIONID
and d.dateid = c.dateid
and p.id = c.peopleid
and a.actiontype = 'Buy' and a.place = 'Mensa'
and p.type = 'Student'
group by rollup(d.year, d.month, d.weekofyear)
order by d.year, d.weekofyear;
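
To illustrate what ROLLUP adds to a plain GROUP BY, the following Java sketch (Java 8 or later, hypothetical class name, made-up sample amounts) computes the same four grouping levels in memory: (year, month, week), (year, month), (year) and the grand total. A "*" stands for a rolled-up column, where Oracle would return NULL.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of GROUP BY ROLLUP(year, month, weekofyear).
final class RollupSketch {
    // Each row is {year, month, weekOfYear, amount}.
    static Map<String, Long> rollup(int[][] rows) {
        Map<String, Long> totals = new TreeMap<>();
        for (int[] r : rows) {
            long amount = r[3];
            // Keep the first 3, 2, 1 and 0 grouping columns in turn.
            for (int keep = 3; keep >= 0; keep--) {
                StringBuilder key = new StringBuilder();
                for (int i = 0; i < 3; i++) {
                    if (i > 0) {
                        key.append('/');
                    }
                    key.append(i < keep ? String.valueOf(r[i]) : "*");
                }
                totals.merge(key.toString(), amount, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        int[][] rows = {
            {2014, 1, 1, 10}, {2014, 1, 2, 20}, {2014, 2, 5, 5}, {2015, 1, 1, 7}
        };
        for (Map.Entry<String, Long> e : rollup(rows).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

For the sample rows the subtotal "2014/*/*" is 35 and the grand total "*/*/*" is 42, which correspond to the extra rows that ROLLUP appends to the per-week result.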

● One query that uses the GROUPING_ID and/or GROUP_ID function.

How many students are studying in each language in each faculty?

select
  decode(grouping_id(e.event_type), 1, 'All languages', e.event_type) language,
  decode(grouping_id(sp.faculty), 1, 'All faculty', sp.faculty) faculty,
  count(saf.fact_id) students_count,
  grouping_id(e.event_type, sp.faculty) grouping_id
from
  jre_studyprogram sp,
  jre_event e,
  jre_student_action_fact saf
where
  saf.study_program_id = sp.studyprogramid and
  saf.event_id = e.event_id and
  saf.event_id in (20,22,24)
group by cube (
  e.event_type,
  sp.faculty)
order by
  e.event_type,
  sp.faculty;

5. Advanced Querying

● Ranking query using NTILE

Divide the students who are studying at the university in 2015 and know a language at level A
or B into 4 buckets, and order the result by student count.

select
  lal.language_name,
  lal.language_level,
  count(language_level) as language_count,
  ntile(4) over (order by count(language_level)) as bucket
from
  jre_languages_and_levels lal,
  jre_exactdate ed,
  jre_student_action_fact saf
where
  saf.language_id = lal.language_id and
  saf.date_id = ed.dateid and
  saf.event_id in (6) and
  ed.year = 2015 and
  lal.language_level in ('A1', 'A2','B1', 'B2')
group by
  ed.year,
  lal.language_name,
lal.language_level
order by
  language_count desc;
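
NTILE(4) numbers the ordered rows with buckets 1 to 4 so that bucket sizes differ by at most one, with any remainder rows going to the first buckets. A minimal Java sketch of that assignment, with illustrative names that do not come from the project:

```java
import java.util.Arrays;

// Hypothetical illustration of how NTILE(n) assigns bucket numbers.
final class NtileSketch {
    // Returns the bucket number (1-based) for each of rowCount ordered rows.
    static int[] ntile(int rowCount, int buckets) {
        int[] result = new int[rowCount];
        int base = rowCount / buckets;   // minimum rows per bucket
        int extra = rowCount % buckets;  // the first 'extra' buckets get one more row
        int row = 0;
        for (int b = 1; b <= buckets && row < rowCount; b++) {
            int size = base + (b <= extra ? 1 : 0);
            for (int i = 0; i < size; i++) {
                result[row++] = b;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 10 ordered rows in 4 buckets: sizes 3, 3, 2, 2
        System.out.println(Arrays.toString(ntile(10, 4)));
    }
}
```

In the query above the rows being bucketed are the (language, level) groups ordered by their count, not the individual students.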

● RANK or DENSE_RANK functions

What is the average grade per student in Computer Science faculty?

SELECT
JRE_student.name, JRE_student.surname,
ROUND(AVG(JRE_grade.grade),2) AS "AVERAGE GRADE", RANK() OVER(
ORDER BY AVG(JRE_grade.grade) DESC) AS "RANK"
FROM
JRE_student, JRE_grade, JRE_studyProgram
WHERE
JRE_student.studentid=JRE_grade.studentid
AND JRE_student.studyprogramid=JRE_studyprogram.studyprogramid
AND JRE_studyprogram.faculty='Computer Science'
GROUP BY
JRE_student.name, JRE_student.surname;

What is the average grade per student and rank in faculty?
SELECT
JRE_studyprogram.faculty, JRE_student.name, JRE_student.surname,
ROUND(AVG(JRE_grade.grade),2) AS "AVERAGE GRADE", RANK() OVER(
PARTITION BY JRE_studyprogram.faculty ORDER BY AVG(JRE_grade.grade)
DESC) AS "RANK IN Faculty"
FROM
JRE_student, JRE_grade, JRE_studyprogram
WHERE
JRE_student.studentid=JRE_grade.studentid
AND JRE_student.studyprogramid=JRE_studyprogram.studyprogramid
GROUP BY
JRE_student.name, JRE_student.surname, JRE_studyprogram.faculty;

Which activity is used most with the card?

select
e.month, a.actiontype as activity, count(a.actionid) as usedCount,
DENSE_RANK() OVER (ORDER BY count(a.actionid) desc) as rank
from
jre_action a, jre_cardusage c, jre_exactdate e
where
a.actionid = c.actionid
and e.dateid = c.dateid
group by a.actiontype, e.month;
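
The difference between the two functions used in this section shows up only when values tie: RANK leaves a gap after tied rows, DENSE_RANK does not. The following Java sketch, with made-up values and hypothetical names, reproduces both numberings for a list that is already sorted in descending order:

```java
import java.util.Arrays;

// Hypothetical illustration of RANK() versus DENSE_RANK().
final class RankSketch {
    // RANK: tied rows share a rank, the next rank skips ahead by the number of ties.
    static int[] rank(int[] sortedDesc) {
        int[] r = new int[sortedDesc.length];
        for (int i = 0; i < r.length; i++) {
            boolean tie = i > 0 && sortedDesc[i] == sortedDesc[i - 1];
            r[i] = tie ? r[i - 1] : i + 1;
        }
        return r;
    }

    // DENSE_RANK: tied rows share a rank, the next rank is always one higher.
    static int[] denseRank(int[] sortedDesc) {
        int[] r = new int[sortedDesc.length];
        for (int i = 0; i < r.length; i++) {
            if (i == 0) {
                r[i] = 1;
            } else {
                r[i] = sortedDesc[i] == sortedDesc[i - 1] ? r[i - 1] : r[i - 1] + 1;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[] values = {30, 28, 28, 25};
        System.out.println(Arrays.toString(rank(values)));      // prints [1, 2, 2, 4]
        System.out.println(Arrays.toString(denseRank(values))); // prints [1, 2, 2, 3]
    }
}
```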

● Windowing query using the windowing clause

How much money is spent with the card every week, and what is the accumulated amount per
week within each year?

select
d.Year, d.month, d.WEEKOFYEAR,
sum(c.amount) weekAmount,
sum(sum(c.amount)) OVER (Partition by d.year ORDER BY d.WEEKOFYEAR
asc) as accumulated
from
jre_action a, jre_exactdate d, allpeople p, jre_cardusage c
where
a.actionid = c.actionid
and d.dateid = c.dateid
and p.id = c.peopleid
group by d.Year, d.month, d.WEEKOFYEAR
order by d.Year, d.WEEKOFYEAR;
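
The accumulated column is a running total that restarts for every year. As a rough illustration, the Java sketch below does the same over rows of {year, week, weekAmount} that are already sorted by year and week. The names and sample amounts are invented, and the sketch assumes one row per week, so it ignores the case where a week spans two months and appears twice in the query result.

```java
import java.util.Arrays;

// Hypothetical illustration of SUM(...) OVER (PARTITION BY year ORDER BY week).
final class RunningTotalSketch {
    // rows: {year, weekOfYear, weekAmount}, sorted by year and then week.
    static long[] runningTotal(int[][] rows) {
        long[] accumulated = new long[rows.length];
        long sum = 0;
        for (int i = 0; i < rows.length; i++) {
            if (i > 0 && rows[i][0] != rows[i - 1][0]) {
                sum = 0; // a new partition (year) starts
            }
            sum += rows[i][2];
            accumulated[i] = sum;
        }
        return accumulated;
    }

    public static void main(String[] args) {
        int[][] rows = {{2014, 1, 10}, {2014, 2, 5}, {2015, 1, 7}, {2015, 2, 3}};
        System.out.println(Arrays.toString(runningTotal(rows))); // prints [10, 15, 7, 10]
    }
}
```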

● Period-to-period comparison query (a query comparing values across time periods, e.g.,
compare sales for every week of the current year with the sales of the corresponding
weeks in the past year).

How much money did students spend with the card in the cafeteria on each date? (The amount
for every date is compared with the previous and the next date that had transactions.)

select
d.EXACTDATE,
sum(c.amount) currentdate,
LAG(SUM (c.amount),1) OVER (Order By d.EXACTDATE) as previousdate,
LEAD(SUM (c.amount),1) OVER (Order By d.EXACTDATE) as nextdate
from jre_action a, jre_exactdate d, allpeople p, jre_cardusage c
where
a.actionid = c.actionid
and d.dateid = c.dateid
and p.id = c.peopleid
and a.actiontype = 'Buy' and a.place = 'Mensa'
and p.type = 'Student'
group by d.EXACTDATE
order by d.EXACTDATE asc;
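
LAG and LEAD simply look one row back and one row ahead in the ordered result, returning NULL where no such row exists. A small Java sketch of that behaviour over daily totals already sorted by date (illustrative names and values only):

```java
import java.util.Arrays;

// Hypothetical illustration of LAG(x, 1) and LEAD(x, 1) over an ordered list.
final class LagLeadSketch {
    // Value of the previous row, or null for the first row.
    static Long[] lag(long[] values) {
        Long[] out = new Long[values.length];
        for (int i = 1; i < values.length; i++) {
            out[i] = values[i - 1];
        }
        return out;
    }

    // Value of the next row, or null for the last row.
    static Long[] lead(long[] values) {
        Long[] out = new Long[values.length];
        for (int i = 0; i + 1 < values.length; i++) {
            out[i] = values[i + 1];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] dailyTotals = {5, 8, 3};
        System.out.println(Arrays.toString(lag(dailyTotals)));  // prints [null, 5, 8]
        System.out.println(Arrays.toString(lead(dailyTotals))); // prints [8, 3, null]
    }
}
```

Because the query groups by date first, dates without any transaction produce no row, so "previous" and "next" mean the nearest dates that had transactions.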

6. Query performance

Three most frequently used queries:
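● How much money did students spend with the card in the cafeteria on each date? (The
amount for every date is compared with the previous and the next date that had transactions.)
select
d.EXACTDATE,
LAG(SUM (c.amount),1) OVER (Order By d.EXACTDATE) as previousdate,
LEAD(SUM (c.amount),1) OVER (Order By d.EXACTDATE) as nextdate
from jre_action a, jre_exactdate d, allpeople p, jre_cardusage c
where
a.actionid = c.actionid
and d.dateid = c.dateid
and p.id = c.peopleid
and a.actiontype = 'Buy' and a.place = 'Mensa'
and p.type = 'Student'
group by d.EXACTDATE
order by d.EXACTDATE asc;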

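● How much money did students spend with the card in the cafeteria per year, month and week?
select d.year, d.month, d.weekofyear, sum(c.amount)
from jre_action a, jre_exactdate d, allpeople p, jre_cardusage c
where
a.actionid = c.actionid
and d.dateid = c.dateid
and p.id = c.peopleid
and a.actiontype = 'Buy' and a.place = 'Mensa'
and p.type = 'Student'
group by rollup(d.year, d.month, d.weekofyear)
order by d.year, d.weekofyear;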

● Which activity is used most with the card in each month of the year?
select
e.month,
a.actiontype as activity,
count(a.actionid) as usedCount,
DENSE_RANK() OVER (ORDER BY count(a.actionid) desc) as rank
from
jre_action a, jre_cardusage c, jre_exactdate e
where
a.actionid = c.actionid
and e.dateid = c.dateid
group by a.actiontype, e.month;

Used dimensions in all three queries:
1. ExactDate, Action, People G1={e,a,p}
2. ExactDate, Action, People G1={e,a,p}
3. ExactDate, Action G2={e,a}

The node relation diagram was created using the lattice framework.
Candidate nodes for a materialized view, selected with the greedy algorithm, are colored in gray:

We decided to materialize the view G2={e,a} to optimize select time, because
● it can be used in all queries;
● it contains less data than G1={e,a,p}.

Materialized view:
create materialized view jre_mv_date_action_sum
as
select
e.EXACTDATE, e.year, e.MONTH, e.WEEKOFYEAR, a.ACTIONTYPE, a.PLACE,
c.PEOPLEID, sum(c.amount) as amount, a.actionid
from jre_action a, jre_exactdate e, jre_cardusage c
where
  a.actionid = c.actionid
  and e.DATEID = c.DATEID
group by
e.EXACTDATE, e.year, e.MONTH, e.WEEKOFYEAR, a.ACTIONTYPE, a.PLACE,
c.PEOPLEID, a.actionid
order by e.EXACTDATE, e.year, e.MONTH, e.WEEKOFYEAR, a.ACTIONTYPE,
a.PLACE;

First query changed to:
select
c.EXACTDATE,
LAG(SUM (c.amount),1) OVER (Order By c.EXACTDATE) as previousdate,
LEAD(SUM (c.amount),1) OVER (Order By c.EXACTDATE) as nextdate
from allpeople p, jre_mv_date_action_sum c
where
p.id = c.peopleid
and c.actiontype = 'Buy' and c.place = 'Mensa'
group by c.EXACTDATE
order by c.EXACTDATE asc;

Second query changed to:
select c.year, c.month, c.weekofyear, sum(c.amount)
from jre_mv_date_action_sum c, allpeople p
where
p.id = c.peopleid
and c.actiontype = 'Buy' and c.place = 'Mensa'
group by rollup(c.year, c.month, c.weekofyear)
order by c.year, c.weekofyear;

Third query changed to:
select
e.month,
e.actiontype as activity,
count(e.actionid) as usedCount,
DENSE_RANK() OVER (ORDER BY count(e.actionid) desc) as rank
from jre_mv_date_action_sum e
group by e.actiontype, e.month;

● Gain from materialized view

Speed tests (Query1, Query2, Query3) were run on the queries chosen above. Each test compares
two queries that select the same data, one using the materialized view and one not.

Results:

Tests    Time without MV   Time with MV   Time improvement (without/with)
Query1   0.176             0.109          1.6x
Query2   0.224             0.027          8.3x
Query3   0.112             0.093          1.2x

The test results show that all three queries execute faster with the materialized view than
without it. The absolute differences are small because the data set is small; the relative gain is
shown in the column “Time improvement (without/with)”.
Query2 executes 8.3 times faster with the materialized view, which suggests that the view will
be very useful on large data.

● Cost of the materialized view
Storage space for the saved data.

The materialized view uses data from the dimensions Action and ExactDate and from the fact
table Cardusage. The view stores already calculated data.

Worst case: the number of rows that can be added to the materialized view every day is
number of rows per day = number of all people * number of existing activities.

Advanced Data Management
Technologies
Project Module 2 – Map Reduce

Jonas Monkevičius
Rokas Mačiulaitis
Evija Urtāne

2015

Tasks
● Task: Instant Temporal aggregation
● Question: What is the average salary?
● Algorithm:

First cycle
○ Mapper
■ Input is a record from the data set
■ Select the values: salary, start time and end time
■ Go through each time instance of the interval and output a pair [time
instance;salary]

○ Reducer
■ Input key is a time instance and the value is a list of salaries
■ Go through all salaries, sum them and count how many salaries are
summed
■ Calculate the average by dividing the sum by the count
Second cycle
○ Mapper
■ Input is a time as key and an average salary as value
■ Output key is the salary and the value is the time
○ Reducer
■ Input key is a salary and the input value is a list of all time instances with
that salary
■ The list is sorted and consecutive time instances are grouped into intervals
The output of the algorithm is all average salaries sorted in ascending order, each with all of its
intervals.
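
The two cycles can be tried out without a Hadoop cluster. The sketch below is a plain in-memory Java simulation (Java 8 or later) of the steps listed above, not the project's Hadoop code: the class and method names are invented, and grouping by key is done with sorted maps instead of the framework's shuffle phase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the two map reduce cycles.
final class TemporalAverage {
    // Cycle 1: emit (time instance, salary) for each instant, then average per instant.
    static SortedMap<Integer, Integer> averagePerInstant(List<String> lines) {
        Map<Integer, int[]> sumAndCount = new TreeMap<>();
        for (String line : lines) {
            String[] p = line.split(";"); // id;salary;start;end
            int salary = Integer.parseInt(p[1]);
            int start = Integer.parseInt(p[2]);
            int end = Integer.parseInt(p[3]);
            for (int t = start; t <= end; t++) {
                int[] sc = sumAndCount.computeIfAbsent(t, k -> new int[2]);
                sc[0] += salary;
                sc[1]++;
            }
        }
        SortedMap<Integer, Integer> avg = new TreeMap<>();
        for (Map.Entry<Integer, int[]> e : sumAndCount.entrySet()) {
            avg.put(e.getKey(), e.getValue()[0] / e.getValue()[1]); // integer division
        }
        return avg;
    }

    // Cycle 2: swap key and value, then merge consecutive instants into intervals.
    static SortedMap<Integer, String> intervalsPerAverage(SortedMap<Integer, Integer> avg) {
        SortedMap<Integer, List<Integer>> times = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : avg.entrySet()) {
            times.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        SortedMap<Integer, String> out = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : times.entrySet()) {
            List<Integer> ts = e.getValue(); // ascending, because avg is iterated in key order
            StringBuilder sb = new StringBuilder();
            int start = ts.get(0);
            int end = start;
            for (int i = 1; i <= ts.size(); i++) {
                if (i < ts.size() && ts.get(i) == end + 1) {
                    end = ts.get(i); // extend the current interval
                    continue;
                }
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(start == end ? String.valueOf(start) : "[" + start + "-" + end + "]");
                if (i < ts.size()) {
                    start = ts.get(i);
                    end = start;
                }
            }
            out.put(e.getKey(), sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "0;800;1;14", "1;400;3;6", "2;300;4;7", "0;500;4;5", "0;500;7;8");
        System.out.println(intervalsPerAverage(averagePerInstant(input)));
    }
}
```

Run on the five input rows of the test example in this report, the sketch yields 500 for instants 4 to 6, 533 for 7, 600 for 3, 650 for 8, and 800 for instants 1 to 2 and 9 to 14.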

Functions (code)
Cycle 1
// Needs java.io.IOException and the Hadoop types IntWritable, LongWritable and Text
// (package org.apache.hadoop.io).
// Mapper: one input line has the form "id;salary;start;end".
public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
    String[] parts = value.toString().split(";");
    IntWritable salary = new IntWritable(Integer.parseInt(parts[1]));
    int start = Integer.parseInt(parts[2]);
    int end = Integer.parseInt(parts[3]);

    // Emit one [time instance;salary] pair for every instant of the interval.
    for (int x = start; x <= end; x++) {
        context.write(new Text(Integer.toString(x)), salary);
    }
}

// Reducer: the key is a time instance, the values are all salaries valid at that instant.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    // Integer division: the average is truncated (e.g. 1600 / 3 gives 533).
    context.write(key, new IntWritable(sum / count));
}

Cycle 2
// Also needs java.util.ArrayList, java.util.Collections and java.util.List.
// Mapper: reads the first cycle's output lines (time, tab, average) and swaps key and value.
public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
    String[] parts = value.toString().split("\t");
    Text salary = new Text(parts[1]);
    IntWritable time = new IntWritable(Integer.parseInt(parts[0]));
    context.write(salary, time);
}

// Reducer: the key is an average salary, the values are all instants with that average.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
    List<Integer> list = new ArrayList<Integer>();
    for (IntWritable value : values) {
        list.add(value.get());
    }
    Collections.sort(list);

    StringBuilder s = new StringBuilder();
    boolean first = true; // replaces the "start == 0" check, which failed for time instance 0
    int start = 0;
    int end = 0;
    for (int value : list) {
        if (first) {
            start = value;
            end = value;
            first = false;
        } else if (value == end + 1) {
            end = value; // extend the current interval
        } else if (value > end + 1) {
            // Gap found: close the current interval and start a new one.
            s.append(start == end ? " " + start : " [" + start + "-" + end + "]");
            start = value;
            end = value;
        }
    }
    s.append(start == end ? " " + start : " [" + start + "-" + end + "]");
    // The interval list is written as part of the key; 999 is only a placeholder value.
    context.write(new Text(key.toString() + " " + s), new IntWritable(999));
}

Test example
Input:

0;800;1;14
1;400;3;6
2;300;4;7
0;500;4;5
0;500;7;8

Output after first map reduce cycle:

1 800
10 800
11 800
12 800
13 800
14 800
2 800
3 600
4 500
5 500
6 500
7 533
8 650
9 800

Output after second cycle:

500  [4-6]
533  7
600  3
650  8
800  [1-2] [9-14]

Speed tests:

Figure 1. Map reduce calculation times, where the x axis is the data size in kilobytes and the
y axis is the time in seconds.

Data set      200Kb    400Kb    600Kb    800Kb    1Mb
Equal data    7.9      9.9      12.4     15.5     19.4
Random data   213.4    268.8    338.7    426.7    537.7
Seq data      262.6    335.5    426.1    541.7    687.3
Worst data    2261.9   2940.5   3822.7   4969.5   6460.3

Table 1. Speed test results for the different data sets.

● The data sets were taken from the course datasets:
http://www.inf.unibz.it/dis/teaching/ADMT/proj/data/
All data sets have 4 columns: the first is the person id, the second the salary, the third the
begin timestamp and the fourth the end timestamp. The data set sizes were 200Kb, 400Kb,
600Kb, 800Kb and 1Mb. The data set types were:
● Equal data. Every row of this data set has the same ‘timestamp begin’ value and the
same ‘timestamp end’ value:
0;383;24048;24059
1;886;24048;24059
…; ...;   ...     ;   ...
9;421;24048;24059

● Random data. In this data set ‘timestamp begin’ and ‘timestamp end’ are not
ordered by time, but the difference between begin and end is not big:
0;383;886;1663
1;915;593;1728
…; …; … ; ...
9;123;67;1202

● Seq data. In this data set the ‘timestamp begin’ and ‘timestamp end’ columns
are ordered from 0 to the maximum time:
0;383;0;199
1;886;200;399
.;...    ; …..; ….
0;362;2000;2199

● Worst data. As the name suggests, calculating the results for this data set takes
a lot of time. Here ‘timestamp begin’ increases and ‘timestamp end’ decreases
from row to row, respectively. In addition, the interval between begin and end is
very large:
0;383;12000;40000000
1;886;12001;39999999
.; …. ; ….     ; …     ...
9;421;12009;39999991

In conclusion, the data set with equal data was processed very quickly, within seconds.
Sequential and random data took roughly the same time, far less than the worst data set,
which took by far the longest to process.
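
One factor behind these differences is visible in the first-cycle mapper: it writes one pair per time instance, so the work per input row grows with the length of the interval. The Java sketch below, with a hypothetical helper name, counts those pairs for the first sample row of each data set shown above. It only illustrates this one factor and is not meant to account for the measured times on its own.

```java
// Hypothetical helper: counts the pairs the first-cycle mapper emits for one row.
final class EmittedPairs {
    // A row has the form "id;salary;start;end"; the mapper loops from start to end inclusive.
    static long pairsForRow(String line) {
        String[] p = line.split(";");
        long start = Long.parseLong(p[2]);
        long end = Long.parseLong(p[3]);
        return end - start + 1;
    }

    public static void main(String[] args) {
        String[] samples = {
            "0;383;24048;24059",    // equal data
            "0;383;886;1663",       // random data
            "0;383;0;199",          // seq data
            "0;383;12000;40000000"  // worst data
        };
        for (String row : samples) {
            System.out.println(row + " -> " + pairsForRow(row) + " pairs");
        }
    }
}
```

For these rows the counts are 12, 778, 200 and 39988001, which matches the observation that the equal data set is by far the cheapest and the worst data set by far the most expensive.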

The following software was used to run the map reduce jobs:
● Hadoop framework v 1.2.1
● Java SDK 1.6
● Eclipse IDE for debugging and testing

Source code is available at:
https://github.com/jmonkevicius/admt/blob/master/WordCount1.java