Data Warehouse for Professional Soccer Organization
(Designing and Implementing Data Warehouse)
Team 5:
Rasmeet Kaur
Sagar Singh
Sunitha Shyam
Sayali Thakur
IS 6480/4480: Data Warehousing Version 1.4
21st March 2017
Data Warehouse for Professional Soccer Organization | Team 4
P a g e 2 | 34
Contents
EXECUTIVE SUMMARY
INTRODUCTION
1.1 Vision and Objectives for the Organization
1.2 Product/Services Provided by the Organization
1.3 How Data Warehousing or Business Analytics (DW/BA) fits into the organization's vision and objectives
PRIORITIZING REQUIREMENTS
LOGICAL DIMENSIONAL MODEL FOR DATA WAREHOUSE
Fig 1: Dimensional Model/Logical Design
PHYSICAL DESIGN, ETL PROCESSES AND DEPLOYMENT
Figure 2: ETL process of the data warehouse life cycle
Figure 3: Steps to populate the dimension tables
Figure 4: Steps to populate the fact table
CREATING OLAP CUBE AND PRESENTING THE DATA FOR ANALYSIS
Figure 5: The schema built in Schema Workbench
Figure 6: The Jpivot created from the fact cube
DATA ANALYSIS
SQL QUERY SAMPLE USED FOR THE REPORTS
TABLEAU VISUALIZATION
DATA ANALYSIS SUMMARY AND CONCLUSION
REFERENCES
APPENDIX
Detailed hours spent on different project tasks by each team member
EXECUTIVE SUMMARY
The use of data in soccer, the most popular sport in the world, has grown enormously in recent
years. The sport is now realizing the power of performance analysis, using objective
information gathered more efficiently to gain a fair competitive advantage. The widespread use
of advanced data has, for example, enabled detailed analysis of the passing moves that lead to
goals. This analysis has yielded a valuable insight: most goals are scored following a chain of
three or fewer uninterrupted passes. Collected data can also help recreate past situations that
led to successful outcomes. These principles remain the basis of effective analysis in soccer.
Within a few short years there has been a great shift, with the majority of top-level soccer
clubs adopting data warehousing techniques for even the most basic analysis. These techniques
give them access to high-level knowledge, in the form of important statistics, at the click of
a button. The first phase of strategic analysis focuses on how the available data is used and
implemented. Basic use of data is no longer enough to maintain a competitive advantage. In the
modern game, the use of data is all about finding the extra 1%: the detail that can exploit
even the slightest weakness in the opposition and make the difference between winning and
losing. Rather than simply applying data to tactical performance, objective information is now
used throughout the organization to enhance efficiency and develop processes that keep it as
well prepared as possible, from the pitch to the boardroom.
In this group project, we analyze ways to increase fan satisfaction by building a strong
offensive team of soccer players without significantly affecting revenue. Looking at the
dataset, it is evident that acquiring excellent players and winning games with them has an
impact on fan loyalty and on revenue growth. For better results, the data sets need to be
integrated and fed into the data warehouse, where they are processed to extract information
that supports building a physical model for further analysis. To achieve this goal, we start by
creating dimension tables and fact tables that provide insight into the parameters affecting
fan satisfaction without largely affecting revenue.
INTRODUCTION
The Professional Soccer Organization is commonly known as the United States Soccer Federation
(USSF). It is the official governing body of the sport of soccer in the United States. It is
headquartered in Chicago, Illinois, and is a member of FIFA. It oversees US amateur and
professional soccer, including the men’s, women’s, youth, beach soccer, futsal, and Paralympic
national teams.
1.1 Vision and Objectives for the Organization:
Vision:
“Since the start of organization, the mission statement has been clear and simple: to make
soccer, in all its forms, a pre-eminent sport in the United States and to continue the
development of soccer at all recreational and competitive levels.”
Objectives:
In the context of improving team performance, the Professional Soccer Organization is most
concerned with the following strategic objectives:
• Winning games is strongly associated with increased revenue and fan loyalty.
• Having players who are fit and compete aggressively throughout the game has a large impact
on fan loyalty and fan satisfaction.
• High offensive production has a high impact on fan satisfaction and a small impact on
revenue.
• Having a chance to win each game (regardless of outcome) has a small impact on revenue and
fan loyalty.
• Having marquee players healthy and playing in games has a medium impact on fan satisfaction,
fan loyalty, and revenue.
1.2 Product/Services Provided by the Organization:
The Professional Soccer Organization has developed an operation named the “Value-Creating
Process Concept”. It provides the following services:
• Managing player personnel strategy, which includes:
o Acquiring or releasing players according to their performance.
o Developing the existing players.
• Selecting and preparing players for the season.
• Managing player personnel tactics and game/opponent tactics, which includes:
o Preparing players for the final match.
o Managing the match throughout.
o Entertaining fans to gain fan loyalty.
• Managing players’ injuries throughout the process.
• Managing players’ fitness throughout the process.
1.3 How Data Warehousing or Business Analytics (DW/BA) fits into the organization's vision and objectives
The Professional Soccer Organization focuses on two main objectives: increasing revenue and
increasing fan loyalty or satisfaction. DW/BA supports both because it provides a structured
plan for reaching an objective. DW/BA planning has three phases:
• Strategic Alignment: This phase identifies what is needed to achieve the objective. In our
project, the objective is to increase fan loyalty, so this phase allows the organization to
gather needs and requirements by interviewing players and coaches about what it takes to
win games and the hearts of their fans.
• Feasibility Analysis: This phase helps the organization determine whether the chosen
approach to building a fan following is likely to succeed. In essence, it identifies the
risks involved in going ahead. The analysis has three major components:
o Technical Feasibility: Uses historical data to indicate whether the techniques applied
to improve players’ tactics and skills will result in winning the majority of games.
o Economic Feasibility: Indicates whether the money spent on players’ health, fitness,
and coaching is worthwhile. If the facilities provided to the players produce no
positive results on the field, they are an economic loss to the organization.
o Organizational Feasibility: Lets the analytics staff know whether the plan to reach the
goal is supported by the executives.
• Project Plan Task Analysis: This phase produces a list of primary responsibilities,
involvement, and reporting of results for the input provided by the coaches, the team
players, and the fans’ responses, respectively.
PRIORITIZING REQUIREMENTS
Given the scope of the project, we limited our requirements to high offensive production,
which has a high impact on fan satisfaction and a small impact on revenue. We therefore
decided to concentrate on the following six outcomes in a game:
1. The red card count for a team
2. The yellow card count for a team
3. The number of fouls committed by both teams
4. The number of goals scored by the team
5. The number of penalty shots taken by the team
6. The number of times the goalkeeper saved a shot on goal by the opposing team
To identify these outcomes, we analyzed the given input data files.
GIVEN INPUT DATASETS:
• Event dataset:
o File name: mls-data-for-DW-class_event_vx.x.tab
o Description:
- This data comes from “camera-based data collection system 1”
- Each observation (i.e. record) is an event in the game
- Event times are at the sub-second level
- Some events have a single player; others have two players involved
• Player location dataset:
o File name: mls-data-for-DW-class_location_vx.x.tab
o Description:
- This data comes from “camera-based data collection system 2” (i.e. a different system
from the event data)
- Each observation (i.e. record) is a player’s x,y-coordinate at a given time in the game
- Player locations are captured at the second level
The location tab file contained ~840,000 records, whereas the event tab file contained ~90,000
records. We narrowed the data down according to our priorities.
First, we imported the two input files into the tables location_tab and event_tab,
respectively. We then saw that not all records in location_tab had associated records in
event_tab, so we performed a left join with event_tab as the left table, which keeps every
event record.
After joining the tables, we exported the resulting table to a new csv file, which we used for
all of the analysis.
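The left join described above can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module with a handful of made-up rows. The simplified columns and the join key (game, player, whole second) are assumptions, since the report does not state which columns were used to match events to locations.

```python
import sqlite3

# Hypothetical, simplified columns: the real .tab files have more fields, and
# the join key (game_id, player, second) is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event_tab (game_id INTEGER, player TEXT, second INTEGER, event_type TEXT);
    CREATE TABLE location_tab (game_id INTEGER, player TEXT, second INTEGER, x REAL, y REAL);
""")
conn.executemany(
    "INSERT INTO event_tab VALUES (?, ?, ?, ?)",
    [(1, "player168", 10, "Foul"), (1, "player44", 25, "Goal")],
)
conn.executemany(
    "INSERT INTO location_tab VALUES (?, ?, ?, ?, ?)",
    [(1, "player168", 10, -45.0, 3.5), (1, "player168", 11, -44.0, 3.0)],
)

# event_tab is the left table, so every event is kept; x and y come back as
# NULL (None in Python) when no location record matches the event.
joined = conn.execute("""
    SELECT e.game_id, e.player, e.second, e.event_type, l.x, l.y
    FROM event_tab e
    LEFT JOIN location_tab l
      ON l.game_id = e.game_id AND l.player = e.player AND l.second = e.second
    ORDER BY e.second
""").fetchall()

for row in joined:
    print(row)
```

Because event times are recorded at the sub-second level and locations at the second level, a real join would first need to bring both to the same granularity, for example by truncating event times to whole seconds.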
LOGICAL DIMENSIONAL MODEL FOR DATA WAREHOUSE
A logical design is abstract and conceptual; it does not yet deal with physical implementation
details. Logical design helps define the type of information required to build a data
warehouse. One technique for modeling an organization’s logical information requirements is
entity-relationship modeling, which involves identifying the entities of importance, their
properties (called attributes), and how the entities relate to one another. In dimensional
modeling, instead of discovering atomic units of information (entities and attributes), we
focus on identifying which information belongs to a central fact table and which information
belongs to its associated dimension tables.
Here, logical design refers to dimensional modeling, which aims to present data in a
structure that is:
• Intuitive for data access: SQL queries are easy to write, understand, and implement.
• High performance for data access: The design allows queries to return results quickly,
ideally at the speed of thought.
For our project, the dimensional model is shown in Fig 1.
Fig 1: Dimensional Model/Logical Design
The dimensional model consists of two kinds of tables:
• Fact table: Attributes in a fact table are the measurements for analysis or the contents of
reports.
• Dimension table: Attributes in dimension tables are the constraints on the measurements or
the headers in reports.
To build the data warehouse, we identified six dimensions and one fact table associated with
those dimensions. The dimensions are:
1. Time_in_game_dimension: Captures the time elapsed in the game. It includes two attributes:
a. time_from_zero: the time in seconds from the start of the game
b. period: indicates whether it is the first or second half of the game
2. Game_dimension: Captures the two teams playing the game. It includes two attributes:
a. v_team_gen: the visiting team
b. h_team_gen: the home team
3. Player_dimension: Captures the players involved in an event and their respective teams. It
includes four attributes:
a. Player_name_1_gen: Player one involved in the event
b. Player_name_2_gen: Player two involved in the event
c. Player_1_team_gen: The team player one belongs to
d. Player_2_team_gen: The team player two belongs to
4. Event_type_dimension: Captures the events that occur in the game. It includes one
attribute, event_type, which indicates the particular event in the game, such as a red card,
yellow card, or goal.
5. Position_dimension: Captures the position of the players of the two teams while playing the
game. It includes two attributes:
a. X: Represents the x coordinate of the player
b. Y: Represents the y coordinate of the player
6. Game_date_dimension: Captures the dates of the games. It includes four attributes:
a. Game_date_gen: Represents the date on which the game was held
b. Weekend: Captures whether the game was held on a weekend
c. Month: Captures the month in which the game was held
d. Day_of_week: Captures the day of the week on which the game was held
The fact table, soccer_fact, holds a reference to the primary key of each of the six
dimensions.
All the dimensions have a 1:n identifying relationship with the fact table, except the event
type dimension, which has a 1:n non-identifying relationship.
Using the forward engineering method in MySQL, we created the dimension and fact tables in
the schema dw_mls1 according to our logical structure.
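As a rough illustration of the star schema described above, the following sketch creates the six dimension tables and the fact table in an in-memory SQLite database through Python's sqlite3 module. It is not the DDL generated by forward engineering: the SQLite dialect, the column types, and the key names time_key and position_key are assumptions made for this example, while the other key names follow the report queries.

```python
import sqlite3

# Sketch of the star schema. Column types and the key names time_key and
# position_key are assumptions; the other key names follow the report queries.
DDL = """
CREATE TABLE time_in_game_dimension (
    time_key INTEGER PRIMARY KEY, time_from_zero INTEGER, period INTEGER);
CREATE TABLE game_dimension (
    game_key INTEGER PRIMARY KEY, v_team_gen TEXT, h_team_gen TEXT);
CREATE TABLE player_dimension (
    player_key INTEGER PRIMARY KEY,
    player_name_1_gen TEXT, player_name_2_gen TEXT,
    player_1_team_gen TEXT, player_2_team_gen TEXT);
CREATE TABLE event_type_dimension (
    event_key INTEGER PRIMARY KEY, event_type TEXT);
CREATE TABLE position_dimension (
    position_key INTEGER PRIMARY KEY, x REAL, y REAL);
CREATE TABLE game_date_dimension (
    game_date_key INTEGER PRIMARY KEY, game_date_gen INTEGER,
    weekend INTEGER, month INTEGER, day_of_week TEXT);
-- The fact table holds one foreign key per dimension.
CREATE TABLE soccer_fact (
    time_key INTEGER NOT NULL REFERENCES time_in_game_dimension(time_key),
    game_key INTEGER NOT NULL REFERENCES game_dimension(game_key),
    player_key INTEGER NOT NULL REFERENCES player_dimension(player_key),
    event_key INTEGER REFERENCES event_type_dimension(event_key),
    position_key INTEGER NOT NULL REFERENCES position_dimension(position_key),
    game_date_key INTEGER NOT NULL REFERENCES game_date_dimension(game_date_key));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
fact_columns = [row[1] for row in conn.execute("PRAGMA table_info(soccer_fact)")]
print(tables)
print(fact_columns)
```

In this sketch the event key is left nullable as one possible way to mirror the non-identifying relationship; the actual MySQL definition may differ.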
PHYSICAL DESIGN, ETL PROCESSES AND DEPLOYMENT
Once the logical model, which outlines the structure of the data warehouse, is complete, the
next step is to implement it: create and populate the dimension and fact tables in the schema.
Complexity increases from the conceptual to the logical to the physical model. This is why we
start with the conceptual data model, to understand at a high level what the different
entities in our data are and how they relate to one another. We then move on to the logical
data model, to understand the details of our data without worrying about how they will be
implemented. Finally, we build the physical data model, so that we know exactly how to
implement the data model in the database of choice.
Figure 2: ETL process of the data warehouse life cycle
Technology used: Using the Pentaho Data Integration tool, we created steps to populate the
dimension tables and the fact table.
Description of data: The .csv file containing the cleaned soccer match data was used as the
input from which all the required fields were extracted. The combination lookup functionality
was used to select the attributes for the respective dimension tables. In addition, a
calculator was used to extract the month of the game from the game date, and a value mapper
was used to determine whether the day is a weekend. These fields were added to the
game_date_dimension table. Upon launching this transformation, the dimension tables of the
schema were populated.
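The derived date attributes can be illustrated outside Pentaho with a few lines of Python. This is a minimal sketch, not the transformation itself: the function name is hypothetical, and storing the month as a number, the weekend flag as a boolean, and the date as an integer such as 20140614 (the form used in the report queries) are assumptions.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def game_date_attributes(game_date: date) -> dict:
    """Derive the game_date_dimension attributes from a game date.

    Hypothetical helper: the output formats are assumptions for illustration.
    """
    weekday_index = game_date.weekday()  # Monday is 0, Sunday is 6
    return {
        "game_date_gen": int(game_date.strftime("%Y%m%d")),
        "month": game_date.month,
        "day_of_week": DAY_NAMES[weekday_index],
        "weekend": weekday_index >= 5,  # Saturday or Sunday
    }

attrs = game_date_attributes(date(2014, 6, 14))
print(attrs)
```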
Figure 3: Steps to populate the dimension tables
To populate the fact table, the same .csv file was used as input, and combination lookup was
used to look up the primary keys of the associated dimensions. The looked-up primary keys of
the dimension tables were used as the input for the soccer_fact table. Upon launching this
transformation, the soccer_fact table was populated with all the primary key values.
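The idea behind the lookup can be sketched in plain Python: each distinct combination of dimension attributes receives a surrogate key the first time it is seen, and every input row is turned into a fact row made of those keys. The class and the sample rows below are hypothetical and only approximate what the Pentaho steps do.

```python
class DimensionLookup:
    """Get-or-create surrogate keys for one dimension (hypothetical helper)."""

    def __init__(self):
        self._keys = {}

    def key_for(self, *attribute_values):
        # Assign the next integer key the first time a combination is seen.
        if attribute_values not in self._keys:
            self._keys[attribute_values] = len(self._keys) + 1
        return self._keys[attribute_values]


event_dim = DimensionLookup()
game_dim = DimensionLookup()

# Made-up input rows standing in for the cleaned .csv file.
rows = [
    {"event_type": "Foul", "v_team_gen": "team18", "h_team_gen": "team2"},
    {"event_type": "Goal", "v_team_gen": "team18", "h_team_gen": "team2"},
    {"event_type": "Foul", "v_team_gen": "team18", "h_team_gen": "team2"},
]

# Each fact row stores only the surrogate keys of its dimensions.
fact_rows = [
    {
        "event_key": event_dim.key_for(r["event_type"]),
        "game_key": game_dim.key_for(r["v_team_gen"], r["h_team_gen"]),
    }
    for r in rows
]
print(fact_rows)
```

Repeated attribute combinations reuse the key assigned earlier, which is why the fact table can reference a small dimension table instead of repeating the attribute values.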
Figure 4: Steps to populate the fact table
CREATING OLAP CUBE AND PRESENTING THE DATA FOR ANALYSIS
An OLAP cube is a technology that stores data in an optimized way to provide quick responses
to various types of complex queries by using dimensions and measures. “Cube” is another name
for a dimensional model: each cube represents one fact table and several dimension tables.
The model should be useful for reporting on and analyzing the data in the fact table.
Using Schema Workbench, we created a new schema containing the dimensional model details and
a cube covering all the dimensions. After building the cube, we created a JPivot view to see
the count of events across all the dimensions. This shows how each event type is associated
with each game, team, and player, which supports the analysis of our main objective, better
fan satisfaction. The JPivot view was filtered by team: Team 2 was selected for analysis, and
its foul, red card, and yellow card events were tracked to observe its offensive production.
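The kind of result the pivot view provides, a count measure sliced by dimensions, can be imitated in a few lines of Python. This sketch does not use Schema Workbench or JPivot; the event list is made up and the function name is hypothetical.

```python
from collections import Counter

# Made-up (team, event_type) pairs standing in for rows of the fact table.
events = [
    ("team2", "Foul"),
    ("team18", "Foul"),
    ("team18", "Red Card"),
    ("team2", "Foul"),
    ("team18", "Yellow Card"),
]

# The "cube": an event count for every (team, event_type) cell.
cube = Counter(events)

def slice_by_team(cube, team):
    """Return the event counts for one team (hypothetical helper)."""
    return {event_type: n for (t, event_type), n in cube.items() if t == team}

team2_counts = slice_by_team(cube, "team2")
team18_counts = slice_by_team(cube, "team18")
print(team2_counts)
print(team18_counts)
```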
Figure 5: The schema built in Schema Workbench
Figure 6: The Jpivot created from the fact cube
DATA ANALYSIS
To keep the analysis focused, we restricted it to the match played on a single game date,
‘2014-06-14’.
Game date: 2014-06-14
Home team: Team 2
Visiting team: Team 18
Team which won: Team 18
Further analysis of the goals, red cards, yellow cards, fouls, goalkeeper saves, and penalty
shots was carried out using the Pentaho Report Designer tool. Reports were generated
separately for Team 2 and Team 18 for easier comparison.
Foul count of Team 2:
Foul count of Team 18:
Team 2 did not receive any red cards during the game, whereas Team 18 received 1 red card.
Team 2 was not awarded any penalty shots, whereas Team 18 was awarded 1 penalty shot.
Yellow card count of Team 2:
Yellow card count of Team 18:
Goals secured by both Team 2 and Team 18:
Count of goalkeeper saves for both teams:
SQL QUERY SAMPLE USED FOR THE REPORTS:
Foul count for Team 18
-- The non-aggregated columns are listed in group by so that the query
-- is also valid when MySQL's ONLY_FULL_GROUP_BY mode is enabled.
select gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type,
count(pd.player_1_team_gen) as foul_count
from soccer_fact f
join game_date_dimension gdd on f.game_date_key = gdd.game_date_key
join game_dimension_1 gd on f.game_key = gd.game_key
join event_dimension ed on f.event_key = ed.event_key
join player_dimension pd on f.player_key = pd.player_key
where gd.h_team_gen = 'team2' and ed.event_type = 'Foul'
and gdd.game_date_gen = 20140614 and pd.player_1_team_gen = 'team18'
group by gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type
;
Goal count for Team 2
-- The non-aggregated columns are listed in group by so that the query
-- is also valid when MySQL's ONLY_FULL_GROUP_BY mode is enabled.
select gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type,
count(pd.player_1_team_gen) as goal_count
from soccer_fact f
join game_date_dimension gdd on f.game_date_key = gdd.game_date_key
join game_dimension_1 gd on f.game_key = gd.game_key
join event_dimension ed on f.event_key = ed.event_key
join player_dimension pd on f.player_key = pd.player_key
where gd.h_team_gen = 'team2' and ed.event_type = 'Goal'
and gdd.game_date_gen = 20140614 and pd.player_1_team_gen = 'team2'
group by gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type
;
Red card count for Team 18
-- The non-aggregated columns are listed in group by so that the query
-- is also valid when MySQL's ONLY_FULL_GROUP_BY mode is enabled.
select gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type,
count(pd.player_1_team_gen) as red_card_count
from soccer_fact f
join game_date_dimension gdd on f.game_date_key = gdd.game_date_key
join game_dimension_1 gd on f.game_key = gd.game_key
join event_dimension ed on f.event_key = ed.event_key
join player_dimension pd on f.player_key = pd.player_key
where gd.h_team_gen = 'team2' and ed.event_type = 'Red Card'
and gdd.game_date_gen = 20140614 and pd.player_1_team_gen = 'team18'
group by gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type
;
Joins were performed on the soccer_fact, game_date_dimension, game_dimension_1,
event_dimension, and player_dimension tables to fetch the records.
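Since the report queries differ only in the event type and the team, they can be expressed as one parametrized query. The sketch below runs that query against a tiny in-memory SQLite database with made-up rows, using the table and column names from the queries above. The function name event_count is hypothetical, and SQLite stands in for MySQL here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal tables with only the columns the report queries touch; rows are made up.
conn.executescript("""
    CREATE TABLE game_date_dimension (game_date_key INTEGER PRIMARY KEY, game_date_gen INTEGER);
    CREATE TABLE game_dimension_1 (game_key INTEGER PRIMARY KEY, h_team_gen TEXT, v_team_gen TEXT);
    CREATE TABLE event_dimension (event_key INTEGER PRIMARY KEY, event_type TEXT);
    CREATE TABLE player_dimension (player_key INTEGER PRIMARY KEY, player_1_team_gen TEXT);
    CREATE TABLE soccer_fact (game_date_key INTEGER, game_key INTEGER,
                              event_key INTEGER, player_key INTEGER);
    INSERT INTO game_date_dimension VALUES (1, 20140614);
    INSERT INTO game_dimension_1 VALUES (1, 'team2', 'team18');
    INSERT INTO event_dimension VALUES (1, 'Foul'), (2, 'Goal');
    INSERT INTO player_dimension VALUES (1, 'team2'), (2, 'team18');
    INSERT INTO soccer_fact VALUES (1, 1, 1, 2), (1, 1, 1, 2), (1, 1, 2, 1), (1, 1, 1, 1);
""")

def event_count(conn, game_date, home_team, event_type, team):
    """Count events of one type for one team in one game (hypothetical helper)."""
    sql = """
        SELECT count(*)
        FROM soccer_fact f
        JOIN game_date_dimension gdd ON f.game_date_key = gdd.game_date_key
        JOIN game_dimension_1 gd ON f.game_key = gd.game_key
        JOIN event_dimension ed ON f.event_key = ed.event_key
        JOIN player_dimension pd ON f.player_key = pd.player_key
        WHERE gd.h_team_gen = ? AND ed.event_type = ?
          AND gdd.game_date_gen = ? AND pd.player_1_team_gen = ?
    """
    return conn.execute(sql, (home_team, event_type, game_date, team)).fetchone()[0]

fouls_team18 = event_count(conn, 20140614, "team2", "Foul", "team18")
goals_team2 = event_count(conn, 20140614, "team2", "Goal", "team2")
print(fouls_team18, goals_team2)
```

Binding the values as parameters instead of editing the SQL text keeps one query definition for all the reports and avoids quoting mistakes.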
TABLEAU VISUALIZATION
To visualize and better understand the analysis, we used Tableau.
1. Player positioning
The sheet depicts the positioning of the players of team15 and team4. Team15 clearly has more
players in defensive positions, with only 4 playing in attacking positions, whereas in team4
roughly 50% of the players are playing defensively and 50% are attacking. The team15 players
have a defensive mindset, whereas the team4 players have an attacking mindset.
When we plot the goals scored by both teams, we find that team4 scored more goals (9 goals)
than team15 (7 goals).
This suggests that a balanced team (i.e. 50% attacking and 50% defensive) is more likely to
win matches, and thereby to increase revenue and fan satisfaction.
2. Comparison of the goalkeepers
The dashboard depicts the X coordinate of two players from two different teams: player 168 of
team2 on the left and player44 of team10 on the right. From the X coordinate one can infer
that both players are goalkeepers, since most of the time they do not move beyond the 30
mark, which is roughly the distance of the D line from the center.
In the first sheet, the goalkeeper's position ranges from -50 to -40, whereas in the second
sheet the goalkeeper's position ranges from 30 to 40, which is relatively higher up the
pitch.
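The reasoning above, inferring a goalkeeper from how far a player's X coordinate stays from the center, can be sketched as a simple rule. The samples below are made up, the function names are hypothetical, and the threshold of 30 together with a coordinate system centered on the middle of the pitch are assumptions taken from the description.

```python
def x_range(samples):
    """Return (min_x, max_x) over a player's (x, y) location samples."""
    xs = [x for x, _y in samples]
    return min(xs), max(xs)

def looks_like_goalkeeper(samples, min_abs_x=30.0):
    """True when every sampled |x| is at least min_abs_x (hypothetical rule).

    Assumes x is measured from the center of the pitch, so a player who never
    comes closer to the center than min_abs_x stays near one goal.
    """
    return all(abs(x) >= min_abs_x for x, _y in samples)

# Made-up samples shaped like the two ranges described in the text.
keeper_a = [(-50.0, 0.0), (-45.5, 2.0), (-40.0, -1.0)]
keeper_b = [(30.0, 1.0), (36.0, -2.0), (40.0, 0.5)]
midfielder = [(-10.0, 5.0), (12.0, -8.0), (35.0, 3.0)]

print(x_range(keeper_a), looks_like_goalkeeper(keeper_a))
print(x_range(keeper_b), looks_like_goalkeeper(keeper_b))
print(x_range(midfielder), looks_like_goalkeeper(midfielder))
```

A rule on real tracking data would likely need to tolerate occasional samples outside the band, for example by requiring that most samples, rather than all of them, satisfy the threshold.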
Also, when we plot the number of saves each goalkeeper makes, we find that the team10
goalkeeper made more saves (16 saves) than the team2 goalkeeper (8 saves).
We conclude that playing higher up the pitch, in other words playing offensively, decreases
the chances of conceding a goal, thereby increasing fan satisfaction.
3. Winning home games
The sheet shows the positions from which the goals were scored in a match between team15 and
team10, in which team15 was the host and won the match 2-1. When a team wins a game on its
home ground, it increases fan satisfaction and revenue.
DATA ANALYSIS SUMMARY AND CONCLUSION:
From the analysis, Team 18 played more offensively, receiving more red cards and yellow cards
and committing more fouls than Team 2. Team 18 scored 3 goals and won the game, whereas Team
2 scored only 2. Based on these observations, Team 18 would have had higher fan satisfaction
than Team 2 and would eventually generate more revenue in its upcoming matches because of its
performance. Its players will also be in demand among the stakeholders who own the clubs, who
would want to buy the team or the players with the highest fan satisfaction. In this way, the
revenue of a team would increase.
ISSUES:
Revenue was not calculated numerically here, because calculating revenue requires data on the
number of people who attended the matches, ticket prices, bidding prices for players or
teams, the worth of a club, a club's capacity to buy a player, and so on. Revenue was
therefore inferred from the team’s performance.
REFERENCES
http://www.smartdatacollective.com/bernardmarr/332906/how-big-data-and-analytics-are-changing-football
http://www.ussoccer.com/about/
The Data Warehouse Toolkit (3rd edition), Ralph Kimball and Margy Ross
https://www.google.com/
APPENDIX
Table explaining alignment of course material with the project:
S.No. | Project Topic | Course Topic | Alignment
1. | How a Data Warehousing or Business Analytics fits into organization vision and objectives? | Data Warehouse Life Cycle - Planning Phase | To meet the objectives, this project requires strategic alignment, feasibility study and project plan task analysis, which are defined by the DWL planning phase.
2. | Prioritizing requirements | Data Warehouse Life Cycle - Requirements Phase | To prioritize requirements for the project, we need to identify and collect the needs, for which the DWL requirements phase shows the steps and various ways.
3. | Logical Dimensional Model | Data Warehouse Life Cycle - Logical Design Phase | Fact tables and dimensional tables of the project are created by referring to this phase.
4. | Physical Design | Data Warehouse Life Cycle - Physical Design Phase | Basic theory is related to DWL physical design, and implementation is done using Pentaho Data Integration.
5. | ETL process | Data Warehouse Life Cycle - Data Integration Phase | Implementation is done using Pentaho Data Integration.
Detailed hours spent on different project tasks by each team member:
Date | Team Member | Hours Spent | Description of Work | Additional Comments
03/01/2017 - 03/21/2017 | Sunitha | 60 hours | 1. Worked along with Sagar, Rasmeet and Sayali in designing of dimension model along with the team and created EER in workbench and created the schema. 2. Worked on creating the reports using Pentaho report designer. 3. Worked on adding parts of analysis in the summary report along with Rasmeet. 4. Worked along with Sayali in creating the Jpivot. |
03/01/2017 - 03/21/2017 | Sagar Singh | 60 hours | 1. Worked along with Rasmeet, Sayali and Sunitha for Data Cleaning and importing of the dataset and joining of the tables. 2. Worked on Tableau along with Rasmeet to show visualization and analysis. |
03/01/2017 - 03/21/2017 | Sayali Thakur | 60 hours | 1. Worked along with Sunitha and Sagar in analyzing the data, cleaning and design dimension model. 2. Worked on creating the physical model using Pentaho Data Integration tool. 3. Worked on creating the OLAP cube using Schema Workbench tool. 4. Worked on adding content to the summary report. |
03/01/2017 - 03/21/2017 | Rasmeet Kaur | 60 hours | 1. Worked along with Sagar, Sunitha and Sayali in analyzing the data and business. 2. Worked on creating the powerpoint presentation and project summary report. 3. Worked along with Sagar in creating Tableau reports. |

Data warehouse Soccer Project

  • 1.
    Data Warehouse forProfessional Soccer Organization (Designing and Implementing Data Warehouse) Team 5: Rasmeet Kaur Sagar Singh Sunitha Shyam Sayali Thakur IS 6480/4480: Data Warehousing Version 1.4 21st March 2017
  • 2.
    Data Warehouse forProfessional Soccer Organization | Team 4 P a g e 2 | 34 Contents EXECUTIVE SUMMARY..................................................................................................................................3 INTRODUCTION.............................................................................................................................................4 1.1 Vision and Objectives for the Organization: .......................................................................................4 1.2 Product/Services Provided by the Organization:................................................................................4 1.3 How a Data Warehousing or Business Analytics(DW/BA) fits into organization vision and objectives? ................................................................................................................................................5 PRIORTIZING REQUIREMENTS ......................................................................................................................6 LOGICAL DIMENSIONAL MODEL FOR DATA WAREHOUSE ...........................................................................7 Fig 1: Dimensional Model/Logical Design.............................................................................................8 PHYSICAL DESIGN, ETL PROCESSES AND DEPLOYMENT ...............................................................................9 Figure 2: ETL process of the data warehouse life cycle ........................................................................9 Figure 3: Steps to populate the dimension tables..............................................................................10 Figure 4: Steps to populate the fact table ..........................................................................................11 CREATING OLAP CUBE AND PRESENTING THE DATA FOR ANALYSIS..........................................................12 Figure 5: The schema built in Schema 
Workbench.............................................................................12 Figure 6: The Jpivot created from the fact cube.................................................................................13 DATA ANALYSIS...........................................................................................................................................14 SQL QUERY SAMPLE USED FOR THE REPORTS:.......................................................................................24 TABLEAU VISUALIZATION........................................................................................................................25 DATA ANALYSIS SUMMARY AND CONCLUSION:.........................................................................................30 REFERENCES................................................................................................................................................31 APPENDIX....................................................................................................................................................32 Detailed hours spent on different project tasks by each team member: ................................................33
  • 3.
    EXECUTIVE SUMMARY

    The use of data in soccer, the most popular sport in the world, has grown rapidly in recent years. The sport is now realizing the power of performance analysis: objective information, gathered efficiently, gives a club a fair competitive advantage. Analysis of the passing moves that lead to goals has already returned a valuable insight: most goals are scored following a chain of three or fewer passes. Collected data can also help recreate past situations that led to successful outcomes. These principles remain the basis of effective analysis in soccer.

    Within a few short years there has been a great shift. Most top-level clubs have adopted data warehousing techniques for even the most basic analysis, which gives them access to important statistics at the click of a button. The first phase of strategic analysis focuses on how the available data is used and implemented.

    Basic use of data is no longer enough to maintain a competitive advantage. In the modern game, the use of data is about finding the extra 1%: the detail that can exploit even the slightest weakness in the opposition and make the difference between winning and losing. Rather than applying data only to tactical performance, clubs now use objective information throughout the organization to improve efficiency and to be as well prepared as possible, from the pitch to the boardroom.

    In our group project, we analyze ways to increase fan satisfaction by assembling a strong offensive team of soccer players without much impact on revenue. The dataset suggests that acquiring excellent players and winning games with them affects both fan loyalty and revenue. To get better results, the data sets need to be integrated and loaded into a data warehouse, where they can be processed to extract the information needed to build a physical model. To achieve this goal, we start by building dimension tables and a fact table that give insight into the parameters that affect fan satisfaction without largely affecting revenue.
  • 4.
    INTRODUCTION

    The Professional Soccer Organization is commonly known as the United States Soccer Federation (USSF). It is the official governing body of the sport of soccer in the United States. It is headquartered in Chicago, Illinois, and is a member of FIFA. It focuses on US amateur and professional soccer, including the men's, women's, youth, beach soccer, futsal, and Paralympic national teams.

    1.1 Vision and Objectives for the Organization:

    Vision: “Since the start of organization, the mission statement has been clear and simple: to make soccer, in all its forms, a pre-eminent sport in the United States and to continue the development of soccer at all recreational and competitive levels.”

    Objectives: To improve the performance of its teams, the Professional Soccer Organization concentrates on the following strategic objectives:
    • Winning games is largely associated with increasing revenue and fan loyalty.
    • Having players fit and competing aggressively throughout the game has a large impact on fan loyalty and fan satisfaction.
    • Having high offensive production has a high impact on fan satisfaction and a small impact on revenue.
    • Having a chance to win each game (regardless of outcome) has a small impact on revenue and fan loyalty.
    • Having marquee players healthy and playing in games has a medium impact on fan satisfaction, fan loyalty and revenue.

    1.2 Product/Services Provided by the Organization:

    The Professional Soccer Organization has developed an operation named the “Value-Creating Process Concept”. It provides the following services:
    • Managing players' personnel strategy, which includes:
        o Acquiring or releasing players according to their performance.
        o Developing the existing players.
    • Selecting and preparing players for the season.
    • Managing players' personnel tactics and game/opponent tactics, which includes:
        o Preparing players for the final match.
        o Managing the match throughout.
        o Entertaining fans to gain fan loyalty.
    • Managing injuries of the players throughout the process.
  • 5.
    • Managing fitness of the players throughout the process.

    1.3 How a Data Warehousing or Business Analytics (DW/BA) fits into organization vision and objectives?

    The Professional Soccer Organization focuses on two main objectives: increasing revenue and increasing fan loyalty or satisfaction. DW/BA supports both because it provides a structured plan for reaching an objective. DW/BA planning has three phases:

    • Strategic Alignment: Identifies the needs that must be met to achieve the objective. In our project the objective is to increase fan loyalty, so this phase lets the organization identify the needs and requirements, by interviewing players and coaches, for winning both games and the hearts of the fans.
    • Feasibility Analysis: Helps the organization determine whether the chosen approach to building a fan following is likely to succeed. It identifies the risks involved in proceeding. The analysis has three major components:
        o Technical Feasibility: Uses historical data to show whether the techniques applied to improve the players' tactics and skills are likely to result in more wins.
        o Economic Feasibility: Shows whether the money spent on the health, fitness and coaching of the players is worthwhile. If the facilities provided to the players produce no positive results on the field, they are an economic loss to the organization.
        o Organizational Feasibility: Lets the analytics team know whether the plan for reaching the goal is supported by the executives.
    • Project Plan Task Analysis: Produces a list of primary responsibilities, the people involved, and how results are reported back, based on the input from coaches and players and on the fans' responses.
  • 6.
    PRIORITIZING REQUIREMENTS

    Given the scope of the project, we limited our requirements to high offensive production, which has a high impact on fan satisfaction and a small impact on revenue. We decided to concentrate on the following six outcomes in a game:
    1. The red card count for a team
    2. The yellow card count for a team
    3. The number of fouls committed by both teams
    4. The number of goals scored by the team
    5. The number of penalty shots by the team
    6. The number of times the goalkeeper saved a shot on goal by the opposing team

    To identify these, we analyzed the input data files we were given.

    GIVEN INPUT DATASETS:
    • Event dataset:
        • File name: mls-data-for-DW-class_event_vx.x.tab
        • Description:
            • This data comes from “camera-based data collection system 1”
            • Each observation (i.e. record) is an event in the game
            • Event times are at the sub-second level
            • Some events have a single player; others have two players involved
    • Player location dataset:
        • File name: mls-data-for-DW-class_location_vx.x.tab
        • Description:
            • This data comes from “camera-based data collection system 2” (i.e. a different system from the event data)
            • Each observation (i.e. record) is a player's x,y-coordinate at a given time in the game
            • Player locations are captured at the second level

    The location file contained ~840,000 records, whereas the event file contained ~90,000 records. We narrowed the data down according to our priorities. First, we imported the two files into the tables location_tab and event_tab, respectively. We then found that not every record in location_tab had an associated record in event_tab, so we performed a left join with event_tab as the left table. We exported the joined result to a new .csv file and used that file for all subsequent analysis.
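The left join described above can be sketched as follows. This is only an illustration under stated assumptions: the report does not give the join columns, so game_id, time_from_zero, player, x and y are hypothetical column names, the rows are made-up sample data, and event times are assumed to have been truncated to whole seconds so that they can match the location data. SQLite is used here in place of MySQL so the sketch is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event_tab (game_id INTEGER, time_from_zero INTEGER,
                            player TEXT, event_type TEXT);
    CREATE TABLE location_tab (game_id INTEGER, time_from_zero INTEGER,
                               player TEXT, x REAL, y REAL);
""")

# Made-up sample rows (assumption, not taken from the project data).
conn.executemany("INSERT INTO event_tab VALUES (?, ?, ?, ?)", [
    (1, 10, "player_7", "Foul"),
    (1, 25, "player_9", "Goal"),
])
conn.executemany("INSERT INTO location_tab VALUES (?, ?, ?, ?, ?)", [
    (1, 10, "player_7", 12.5, -3.0),
    (1, 11, "player_7", 13.0, -2.5),  # no event at second 11, so not in the result
])

# event_tab is the left table: every event is kept, with a location if one matches.
joined = conn.execute("""
    SELECT e.game_id, e.time_from_zero, e.player, e.event_type, l.x, l.y
    FROM event_tab AS e
    LEFT JOIN location_tab AS l
      ON  l.game_id = e.game_id
      AND l.time_from_zero = e.time_from_zero
      AND l.player = e.player
    ORDER BY e.time_from_zero
""").fetchall()

for row in joined:
    print(row)
```

Keeping event_tab on the left means the result has one row per event, which matches the project's focus on counting events; location rows without an event are dropped, and events without a location get empty coordinates.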
  • 7.
    LOGICAL DIMENSIONAL MODEL FOR DATA WAREHOUSE

    A logical design is abstract and conceptual; it does not yet deal with the details of the physical implementation. Logical design helps define the type of information required to build a data warehouse. One technique for modeling an organization's logical information requirements is entity-relationship modeling, which involves identifying the entities of importance, the properties of these entities (called attributes), and how the entities relate to one another.

    In dimensional modeling, instead of discovering atomic units of information (entities and attributes), we focus on identifying which information belongs to a central fact table and which information belongs to its associated dimension tables. Logical design here therefore means dimensional modeling, which presents data in a structure that is:
    • Intuitive for data access: SQL queries are easy to write, understand and implement.
    • High performance for data access: The design allows queries to return results quickly, ideally at the speed of thought.

    For our project, the dimensional model can be presented as:
  • 8.
    Fig 1: Dimensional Model/Logical Design

    The dimensional model consists of two kinds of tables:
    • Fact table: Attributes in fact tables are measurements for analysis or contents in reports.
    • Dimension table: Attributes in dimension tables are constraints for the measurements or headers in reports.

    To build the data warehouse, we identified six dimensions and one fact table associated with those dimensions. The dimensions are:

    1. Time_in_game_dimension: Captures the time spent by the teams playing the game. It includes two attributes:
        a. time_from_zero: the time in seconds from the start of the game
        b. period: indicates whether it is the first half or the second half of the game
    2. Game_dimension: Captures the details of the two teams playing the game. It includes two attributes:
        a. v_team_gen: Team one playing the game
        b. h_team_gen: Team two playing the game
    3. Player_dimension: Captures the details of the players of the two teams and their respective teams. It includes four attributes:
        a. Player_name_1_gen: Player one involved in the event
        b. Player_name_2_gen: Player two involved in the event
        c. Player_1_team_gen: The team player one belongs to
        d. Player_2_team_gen: The team player two belongs to
    4. Event_type_dimension: Captures the details of the events that occur in the game. It includes one attribute, event_type, which indicates the particular event in the game, such as red card, yellow card, goal, etc.
    5. Position_dimension: Captures the position of the players of the two teams while playing the game. It includes two attributes:
        a. X: Represents the x coordinate of the player
        b. Y: Represents the y coordinate of the player
    6. Game_date_dimension: Captures the details of the dates of the games. It includes four attributes:
        a. Game_date_gen: Represents the date on which the game was held
        b. Weekend: Captures whether the game was held on a weekend
        c. Month: Captures the month in which the game was held
        d. Day_of_week: Captures the day of the week on which the game was held

    Fact table: The soccer_fact fact table includes a reference to the primary key of each of the six dimensions.

    All the dimensions have a 1:n identifying relationship with the fact table, except the event dimension, which has a 1:n non-identifying relationship. Using the forward engineering feature of MySQL Workbench, we created the dimension and fact tables in the schema dw_mls1 according to our logical structure.
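The structure described above can be sketched as a star schema. This is a simplified illustration, not the DDL generated by the project: it uses SQLite rather than MySQL, the column types are guesses, and the key names time_key and position_key are hypothetical (game_date_key, game_key, event_key and player_key are the names that appear in the report's SQL queries).

```python
import sqlite3

# One table per dimension from the list above, plus the fact table that
# references each dimension's primary key. Types are assumptions.
DDL = """
CREATE TABLE time_in_game_dimension (
    time_key INTEGER PRIMARY KEY, time_from_zero INTEGER, period INTEGER);
CREATE TABLE game_dimension (
    game_key INTEGER PRIMARY KEY, v_team_gen TEXT, h_team_gen TEXT);
CREATE TABLE player_dimension (
    player_key INTEGER PRIMARY KEY,
    player_name_1_gen TEXT, player_name_2_gen TEXT,
    player_1_team_gen TEXT, player_2_team_gen TEXT);
CREATE TABLE event_type_dimension (
    event_key INTEGER PRIMARY KEY, event_type TEXT);
CREATE TABLE position_dimension (
    position_key INTEGER PRIMARY KEY, x REAL, y REAL);
CREATE TABLE game_date_dimension (
    game_date_key INTEGER PRIMARY KEY, game_date_gen TEXT,
    weekend INTEGER, month INTEGER, day_of_week TEXT);
CREATE TABLE soccer_fact (
    time_key      INTEGER REFERENCES time_in_game_dimension(time_key),
    game_key      INTEGER REFERENCES game_dimension(game_key),
    player_key    INTEGER REFERENCES player_dimension(player_key),
    event_key     INTEGER REFERENCES event_type_dimension(event_key),
    position_key  INTEGER REFERENCES position_dimension(position_key),
    game_date_key INTEGER REFERENCES game_date_dimension(game_date_key));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
fact_columns = [row[1] for row in conn.execute("PRAGMA table_info(soccer_fact)")]
print(tables)
print(fact_columns)
```

The fact table here holds only foreign keys, so every measure in the reports is a count of fact rows, which is consistent with the event counts used later in the analysis.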
  • 9.
    PHYSICAL DESIGN, ETL PROCESSES AND DEPLOYMENT

    Once the logical model that defines the structure of the data warehouse is complete, the next step is to implement it: to create and populate the dimension and fact tables in the schema. Complexity increases from conceptual to logical to physical. This is why we start with the conceptual data model, to understand at a high level what the entities in our data are and how they relate to one another. We then move on to the logical data model, to understand the details of our data without worrying about how they will be implemented. Finally, the physical data model tells us exactly how to implement the model in the database of choice.

    Figure 2: ETL process of the data warehouse life cycle

    Technology used: Using the Pentaho Data Integration tool, we created steps to populate the dimension tables and the fact table.

    Description of data: The .csv file containing the cleaned data of the soccer matches was used as the input from which all the required fields were extracted. A combination lookup step was used to select the attributes for each dimension table. In addition, a Calculator step was used to extract the month of the game from the game date, and a Value Mapper step was used to determine whether the day is a weekend. These fields were added to the game_date_dimension table. When this transformation was run, the dimension tables of the schema were populated.
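The derived attributes of game_date_dimension can be illustrated outside Pentaho. The sketch below is plain Python, not Pentaho Data Integration code, and the function name game_date_attributes is hypothetical; it computes the same kind of fields that the Calculator and Value Mapper steps are described as producing.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def game_date_attributes(game_date: date) -> dict:
    """Derive the game_date_dimension attributes from a single game date."""
    day_of_week = DAY_NAMES[game_date.weekday()]  # weekday(): Monday is 0
    return {
        "game_date_gen": game_date.isoformat(),
        "month": game_date.month,
        "day_of_week": day_of_week,
        "weekend": day_of_week in ("Saturday", "Sunday"),
    }

row = game_date_attributes(date(2014, 6, 14))
print(row)
```

For the game date analyzed later in the report, 2014-06-14, this yields month 6 and a Saturday, so the weekend flag is set.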
  • 10.
    Figure 3: Steps to populate the dimension tables

    To populate the fact table, the same .csv file was used as the input, and combination lookup steps were used to look up the primary keys of the associated dimensions. The keys returned by these lookups were used as the input to the soccer_fact table. When this transformation was run, the soccer_fact table was populated with all the primary key values.
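The idea behind the key lookups can be sketched in a few lines. This is a plain-Python imitation of the behavior, not Pentaho code: the class DimensionLookup is hypothetical, the sample events are made up, and only two of the six dimensions are shown for brevity. Each distinct combination of attribute values receives one surrogate key, and each input row is turned into a fact row of keys.

```python
class DimensionLookup:
    """Assigns one surrogate key per distinct combination of attribute values."""

    def __init__(self):
        self._keys = {}

    def key_for(self, *attributes):
        # Return the existing key, or create the next one for a new combination.
        if attributes not in self._keys:
            self._keys[attributes] = len(self._keys) + 1
        return self._keys[attributes]

    def __len__(self):
        return len(self._keys)


# Made-up input rows standing in for records of the cleaned .csv file.
events = [
    {"h_team_gen": "team2", "v_team_gen": "team18", "event_type": "Foul"},
    {"h_team_gen": "team2", "v_team_gen": "team18", "event_type": "Goal"},
    {"h_team_gen": "team2", "v_team_gen": "team18", "event_type": "Foul"},
]

game_dim = DimensionLookup()
event_dim = DimensionLookup()

fact_rows = [
    {
        "game_key": game_dim.key_for(e["h_team_gen"], e["v_team_gen"]),
        "event_key": event_dim.key_for(e["event_type"]),
    }
    for e in events
]
print(fact_rows)
```

Because repeated combinations reuse their key, the dimension tables stay small while the fact table keeps one row per event.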
  • 11.
    Figure 4: Steps to populate the fact table
  • 12.
    CREATING OLAP CUBE AND PRESENTING THE DATA FOR ANALYSIS

    An OLAP cube stores data in an optimized way so that it can respond quickly to various types of complex queries using dimensions and measures. A dimensional model is also called a cube. Each cube represents one fact table and several dimension tables, and should support reporting and analysis of the data in the fact table.

    Using Schema Workbench, we created a new schema containing the details of the dimensional model and built a cube with all the dimensions. After building the cube, we created a JPivot view to see the count of the various events across all the dimensions. This shows how each event type is associated with each game, team and player, which supports the analysis of our main objective, better fan satisfaction. The JPivot view was filtered by team. Team 2 was chosen for analysis, and its foul, red card and yellow card events were tracked to observe its offensive production.

    Figure 5: The schema built in Schema Workbench
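What the cube computes can be illustrated with a small sketch. This is not Mondrian or JPivot code; it is a plain-Python stand-in that counts fact rows for every combination of two dimensions, including the "ALL" totals that a pivot view shows. The rows are made-up sample data and build_cube is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

DIMENSIONS = ("team", "event_type")

# Made-up fact rows (team, event_type); not taken from the project data.
facts = [
    ("team2", "Foul"), ("team2", "Foul"), ("team2", "Yellow Card"),
    ("team18", "Foul"), ("team18", "Red Card"),
]

def build_cube(rows):
    """Count rows for each cell, rolling each dimension up to 'ALL' as well."""
    cube = Counter()
    positions = range(len(DIMENSIONS))
    for row in rows:
        for size in range(len(DIMENSIONS) + 1):
            for kept in combinations(positions, size):
                cell = tuple(row[i] if i in kept else "ALL" for i in positions)
                cube[cell] += 1
    return cube

cube = build_cube(facts)
print(cube[("team2", "Foul")])   # fouls by team2
print(cube[("team2", "ALL")])    # all events by team2
print(cube[("ALL", "Foul")])     # fouls across both teams
```

Filtering the pivot view by team, as described above, corresponds to reading only the cells whose first coordinate is that team.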
  • 13.
    Figure 6: The JPivot created from the fact cube
  • 14.
    DATA ANALYSIS

    To keep the analysis focused, we concentrated on the match played on a single game date, 2014-06-14.

    Game date: 2014-06-14
    Home team: Team 2
    Visiting team: Team 18
    Team which won: Team 18

    Goals, red cards, yellow cards, fouls, goalkeeper saves and penalty shots were then analyzed using the Pentaho Report Designer tool. Reports were generated separately for Team 2 and Team 18 for better understanding.

    Foul count of Team 2:
  • 15.
    Foul count of Team 18:
  • 16.
    Team 2 did not receive any red cards during the game, and Team 18 received 1 red card.
  • 17.
    Team 2 did not get any penalty shots, whereas Team 18 got 1 penalty shot.
  • 18.
    Yellow card count of Team 2:
  • 19.
    Yellow card count of Team 18:
  • 20.
    Goals secured by both Team 2 and Team 18:
  • 22.
    Count of the goalkeeper saves for both teams:
  • 24.
    SQL QUERY SAMPLE USED FOR THE REPORTS:

    Foul count for Team 18

        -- count fouls committed by team18 players in the 2014-06-14 game
        select gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type,
               count(pd.player_1_team_gen)
        from soccer_fact f
        join game_date_dimension gdd on f.game_date_key = gdd.game_date_key
        join game_dimension_1 gd on f.game_key = gd.game_key
        join event_dimension ed on f.event_key = ed.event_key
        join player_dimension pd on f.player_key = pd.player_key
        where gd.h_team_gen = 'team2'
          and ed.event_type = 'Foul'
          and gdd.game_date_gen = 20140614
          and pd.player_1_team_gen = 'team18'
        group by gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type;

    Goal count for Team 2

        -- count goals scored by team2 players in the 2014-06-14 game
        select gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type,
               count(pd.player_1_team_gen)
        from soccer_fact f
        join game_date_dimension gdd on f.game_date_key = gdd.game_date_key
        join game_dimension_1 gd on f.game_key = gd.game_key
        join event_dimension ed on f.event_key = ed.event_key
        join player_dimension pd on f.player_key = pd.player_key
        where gd.h_team_gen = 'team2'
          and ed.event_type = 'Goal'
          and gdd.game_date_gen = 20140614
          and pd.player_1_team_gen = 'team2'
        group by gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type;

    Red card count for Team 18

        -- count red cards received by team18 players in the 2014-06-14 game
        select gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type,
               count(pd.player_1_team_gen)
        from soccer_fact f
        join game_date_dimension gdd on f.game_date_key = gdd.game_date_key
        join game_dimension_1 gd on f.game_key = gd.game_key
        join event_dimension ed on f.event_key = ed.event_key
        join player_dimension pd on f.player_key = pd.player_key
        where gd.h_team_gen = 'team2'
          and ed.event_type = 'Red Card'
          and gdd.game_date_gen = 20140614
          and pd.player_1_team_gen = 'team18'
        group by gdd.game_date_gen, gd.h_team_gen, gd.v_team_gen, ed.event_type;
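The per-team, per-event queries above differ only in their filter values, so the same counts could be obtained from one grouped query. The sketch below is an assumption-laden illustration rather than the query used for the reports: it runs on SQLite instead of MySQL, uses a cut-down schema with only two of the dimensions, loads made-up rows, and event_counts is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event_dimension (event_key INTEGER PRIMARY KEY, event_type TEXT);
    CREATE TABLE player_dimension (player_key INTEGER PRIMARY KEY, player_1_team_gen TEXT);
    CREATE TABLE soccer_fact (event_key INTEGER, player_key INTEGER);
    -- Made-up sample rows (assumption, not project data).
    INSERT INTO event_dimension VALUES (1, 'Foul'), (2, 'Goal');
    INSERT INTO player_dimension VALUES (1, 'team2'), (2, 'team18');
    INSERT INTO soccer_fact VALUES (1, 1), (1, 2), (1, 2), (2, 2);
""")

def event_counts(connection, event_types):
    """Return (team, event_type, count) rows for the requested event types."""
    placeholders = ", ".join("?" for _ in event_types)
    sql = f"""
        SELECT pd.player_1_team_gen, ed.event_type, COUNT(*)
        FROM soccer_fact AS f
        JOIN event_dimension AS ed ON f.event_key = ed.event_key
        JOIN player_dimension AS pd ON f.player_key = pd.player_key
        WHERE ed.event_type IN ({placeholders})
        GROUP BY pd.player_1_team_gen, ed.event_type
        ORDER BY pd.player_1_team_gen, ed.event_type
    """
    return connection.execute(sql, list(event_types)).fetchall()

counts = event_counts(conn, ["Foul", "Goal"])
for team, event_type, n in counts:
    print(team, event_type, n)
```

Passing the event types as bound parameters avoids editing the SQL text for each report, and grouping by team returns both teams' counts in a single result.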
  • 25.
    Joins were performed on the soccer_fact, game_date_dimension, game_dimension, event_dimension and player_dimension tables to fetch the records.

    TABLEAU VISUALIZATION

    We used Tableau to better visualize and understand the analysis.

    1. Player positioning

    The sheet depicts the positioning of the players of team15 and team4. Most of team15's players are positioned defensively, with only 4 playing in attacking positions. In team4, by comparison, almost 50% of the players are playing defensively and 50% are attacking. The team15 players have a defensive mindset, whereas the team4 players have an attacking mindset.
  • 26.
    When we plot the goals scored by both teams, we find that team4 scored more goals (9 goals) than team15 (7 goals). This suggests that a balanced team (i.e. 50% attacking and 50% defensive) is more likely to win matches, and in turn to increase revenue and fan satisfaction.
  • 27.
    2. Comparison of the goalkeepers

    The dashboard depicts the X coordinate of two players from two different teams: player 168 of team2 on the left and player 44 of team10 on the right. From the X coordinate one can infer that both players are goalkeepers, since most of the time they do not move beyond the 30 mark, which is about the distance of the D line from the center. In the first sheet, the goalkeeper's X coordinate ranges from -50 to -40, whereas in the second sheet it varies between 30 and 40, which is relatively higher up the pitch.
  • 28.
    When we plot the number of saves each goalkeeper makes, we find that the team10 goalkeeper made more saves (16 saves) than the team2 goalkeeper (8 saves). From this we conclude that moving higher up the pitch, in other words playing more offensively, decreases the chances of conceding a goal and thereby increases fan satisfaction.
  • 29.
    3. Winning home game

    The sheet shows the positions from which the goals were scored in a match between team15 and team10, in which team15 was the host and won the match 2-1. When a team wins at its home ground, it increases fan satisfaction and revenue.
  • 30.
    DATA ANALYSIS SUMMARY AND CONCLUSION:

    The analysis shows that Team 18 played more offensively than Team 2, accumulating more red cards, yellow cards and fouls. Team 18 scored 3 goals and won the game, while Team 2 scored only 2. From this we infer that Team 18 would have had more fan satisfaction than Team 2 and, because of its performance, would eventually generate more revenue in its upcoming matches. Its players would also be in demand among the stakeholders who own the clubs, who would want to buy the team or the players with the highest fan satisfaction. In this way, the revenue of a team would increase.

    ISSUES: Revenue was not calculated as a numeric value here. To calculate revenue we would need data on the number of people who attended the matches and the ticket prices, as well as data on bidding prices for players or teams, the worth of a club, its capacity to buy a player, etc. Revenue was therefore assumed based on the team's performance.
  • 31.
    REFERENCES
    • http://www.smartdatacollective.com/bernardmarr/332906/how-big-data-and-analytics-are-changing-football
    • http://www.ussoccer.com/about/
    • Ralph Kimball and Margy Ross, The Data Warehouse Toolkit (3rd edition)
    • https://www.google.com/
  • 32.
    APPENDIX

    Table explaining alignment of course material with the project:

    S.No. | Project Topic | Course Topic | Alignment
    1 | How a Data Warehousing or Business Analytics fits into organization vision and objectives? | Data Warehouse Life Cycle: Planning Phase | To meet the objectives, this project requires strategic alignment, a feasibility study and project plan task analysis, which are defined by the DWL planning phase.
    2 | Prioritizing requirements | Data Warehouse Life Cycle: Requirements Phase | To prioritize requirements for the project, we need to identify and collect the needs; the DWL requirements phase shows the steps and the various ways to do this.
    3 | Logical Dimensional Model | Data Warehouse Life Cycle: Logical Design Phase | The fact table and dimension tables of the project were created by referring to this phase.
    4 | Physical Design | Data Warehouse Life Cycle: Physical Design Phase | The basic theory relates to DWL physical design; implementation was done using Pentaho Data Integration.
    5 | ETL process | Data Warehouse Life Cycle: Data Integration Phase | Implementation was done using Pentaho Data Integration.
  • 33.
    Detailed hours spent on different project tasks by each team member:

    Date | Team Member | Hours Spent | Description of Work | Additional Comments
    03/01/2017 – 03/21/2017 | Sunitha | 60 hours | 1. Worked with Sagar, Rasmeet and Sayali on designing the dimensional model, created the EER diagram in Workbench and created the schema. 2. Worked on creating the reports using Pentaho Report Designer. 3. Worked with Rasmeet on adding parts of the analysis to the summary report. 4. Worked with Sayali on creating the JPivot view. |
    03/01/2017 – 03/21/2017 | Sagar Singh | 60 hours | 1. Worked with Rasmeet, Sayali and Sunitha on data cleaning, importing the dataset and joining the tables. 2. Worked on Tableau with Rasmeet to show visualization and analysis. |
    03/01/2017 – 03/21/2017 | Sayali Thakur | 60 hours | 1. Worked with Sunitha and Sagar on analyzing the data, cleaning it and designing the dimensional model. 2. Worked on creating the physical model using the Pentaho Data Integration tool. 3. Worked on creating the OLAP cube using the Schema Workbench tool. 4. Worked on adding content to the summary report. |
    03/01/2017 – 03/21/2017 | Rasmeet Kaur | 60 hours | 1. Worked with Sagar, Sunitha and Sayali on analyzing the data and the business. 2. Worked on creating the PowerPoint presentation and the project summary report. 3. Worked with Sagar on creating the Tableau reports. |