Over four sprints, we tackled the ANAP 2020 Challenge (https://www.datascience.net/fr/challenge/28/details):
Analysis and specification sprint
Business intelligence sprint
Data mining sprint
Big data sprint
PEBI Report: ANAP 2020
Prepared by:
Ben Amor Hela
Gassab Ali
Gasmi Cyrine
Rekik Habib
Zaïbi Chaima
Academic year: 2016-2017
PE BI Project
Table of Contents
General Introduction..................................................................................................................... 7
Chapter 1...................................................................................................................................... 9
The Project Context....................................................................................................................... 9
Introduction:............................................................................................................................10
1. Problem statement..............................................................................................................10
2. Presentation of the host Organization................................................................................11
3. Case study :......................................................................................................................12
4. Solution............................................................................................................................12
Conclusion...............................................................................................................................13
Chapter 2.....................................................................................................................................14
Analysis and specification of requirements....................................................................................14
Introduction.............................................................................................................................15
1. Identification of Actors......................................................................................................15
2. Functional Requirements...................................................................................................15
3. Non Functional Requirements............................................................................................15
4. Methodology....................................................................................................................16
Conclusion...............................................................................................................................16
Chapter 3.....................................................................................................................................17
Sprint Business Intelligence ..........................................................................................................17
Introduction.............................................................................................................................18
1. Extracting data..................................................................................................................18
2. Design..............................................................................................................................20
3. Loading data.....................................................................................................................22
3.3. Description of FactAnap2020......................................................................................22
3.4. Description of Fact_Qualite........................................................................................22
3.5. Description of Fact_Finance........................................................................................23
3.6. Description of Fact_Process........................................................................................23
3.7. Description of Fact_RH...............................................................................................24
3.8. Description of Dimensions..........................................................................................24
4. Analytical Processing.........................................................................................................27
5. Reporting.........................................................................................................................27
6. Used technologies.............................................................................................................29
Conclusion...............................................................................................................................30
Chapter 4.....................................................................................................................................31
Sprint Data Mining.......................................................................................................................31
Introduction.............................................................................................................................32
1. Descriptive analysis...........................................................................................................32
2. Predictive analysis.............................................................................................................40
Conclusion...............................................................................................................................47
3. Time series analysis...........................................................................................................48
Chapter 5.....................................................................................................................................60
Sprint Big Data.............................................................................................................................60
Chapter 6.....................................................................................................................................62
Marketing Strategy.......................................................................................................................62
Introduction.............................................................................................................................63
1. Strategic Analysis (SWOT)..................................................................................................63
Conclusion...............................................................................................................................65
Table of Figures
Figure 1 Methodology scrum ........................................................................................................16
Figure 2 EL...................................................................................................................................19
Figure 3 EL Hopitaux_ODS ............................................................................................................19
Figure 4 EL Data_ODS...................................................................................................................19
Figure 5 Design ............................................................................................................................20
Figure 6 ETL .................................................................................................................................21
Figure 7 The structure of the Data WareHouse ..............................................................................21
Figure 8 FactAnap2020.................................................................................................................22
Figure 9 Fact_Qualite ...................................................................................................................23
Figure 10 Fact_Finance.................................................................................................................23
Figure 11 Fact_Process.................................................................................................................24
Figure 12 Fact_RH ........................................................................................................................24
Figure 13 Year Dimension.............................................................................................................25
Figure 14 RS Dimension................................................................................................................25
Figure 15 Age Dimension..............................................................................................................26
Figure 16 Patients origin Dimension..............................................................................................26
Figure 17 DA Dimension ...............................................................................................................26
Figure 18 Reporting 1...................................................................................................................28
Figure 19 Reporting 2...................................................................................................................28
Figure 20 Reporting 3...................................................................................................................28
Figure 21 Importing data from the Data Warehouse.......................................................................33
Figure 22 Descriptive statistics......................................................................................................33
Figure 23 Kmeans commands 1.....................................................................................................35
Figure 24 Kmeans commands 2.....................................................................................................35
Figure 25 Single matrix commands................................................................................................36
Figure 26 Pairs Graph ...................................................................................................................37
Figure 27 ACP / K-Means combination...........................................................................................37
Figure 28 Plot Acp Graph..............................................................................................................38
Figure 29 CAH..............................................................................................................................38
Figure 30 Dendrogram..................................................................................................................39
Figure 31 Circle of correlations Commands....................................................................................39
Figure 32 function to eliminate NA values......................................................................................40
Figure 33 Import Data ..................................................................................................................40
Figure 34 Summary commands.....................................................................................................41
Figure 35 10000 lines selection command .....................................................................................41
Figure 36 center and reduce.........................................................................................................41
Figure 37 Prediction using linear regression...................................................................................42
Figure 38 StepAic.........................................................................................................................43
Figure 39 cbind command ............................................................................................................44
Figure 40 Residuals vs Fitted graph...............................................................................................45
Figure 41 Normal Q-Q graph.........................................................................................................46
Figure 42 Neural network.............................................................................................................46
Figure 43 Neural network Graph...................................................................................................47
Figure 44 importing data commands.............................................................................................48
Figure 45 Correlation analysis ......................................................................................................49
Figure 46 Mcor Commands...........................................................................................................49
Figure 47 Pearson correlation.......................................................................................................49
Figure 48 corrplot command.........................................................................................................50
Figure 49 corrplot graph...............................................................................................................50
Figure 50 temporal series commands............................................................................................50
Figure 51 temporal series graph....................................................................................................51
Figure 52 acf command ................................................................................................................51
Figure 53 graph series ..................................................................................................................52
Figure 54 correlogram graph.........................................................................................................53
Figure 55 Time series commands ..................................................................................................53
Figure 56 decompose commands..................................................................................................53
Figure 57 Decomposition of additive time series............................................................................54
Figure 58 Decomposition of additive time series............................................................................55
Figure 59 Decomposition of additive time series............................................................................56
Figure 60 Prevision commands......................................................................................................56
Figure 61 Summary(fit) result........................................................................................................57
Figure 62 Acf residuals graph........................................................................................................57
Figure 63 plot forecast..................................................................................................................57
Figure 64 Forecasts graph.............................................................................................................58
Figure 65 SWOT graph
Abstract
This document summarizes our work on the integration project carried out at ESPRIT around the ANAP 2020 challenge. We were fortunate during this period to work on a BI project.
The aim of this work is to build a decision-making layer that meets the needs expressed by the ANAP 2020 decision makers, giving them visibility into the true state of purchases and other operations.
KEYWORDS: Data Warehouse, Business Intelligence, data integration, Open Source solutions
General Introduction
Business intelligence is data-driven decision making.
It is the practice of taking large amounts of corporate data and turning it into usable information.
This practice enables companies to derive analyses that can be turned into profitable actions.
The process of converting corporate data into usable information goes beyond data collection and
crunching, into how companies can gain from big data and data mining.
This means that business intelligence is not confined to technology: it includes the business
processes and data analysis procedures that facilitate the collection of big data. It is time
consuming and involves various factors such as data models, data sources, data warehouses,
and business models, among others.
Big data and data mining are distinct concepts, although both involve the use of large data sets
to help businesses or clients make better decisions. The two concepts, however, come into play
at different stages of this operation.
The term big data can be defined simply as large data sets that outgrow simple databases and
data handling architectures. For example, data that cannot be easily handled in Excel
spreadsheets may be referred to as big data.
Data mining relates to the process of going through large sets of data to identify relevant or
pertinent information. Businesses often collect large data sets that may be automatically
collected. However, decision makers need access to smaller, more specific pieces of data and
use data mining to identify specific data that may help their businesses make better leadership
and management decisions.
Various software packages and analytical tools can be used for data mining. The process can
be automated or be done manually. Data mining allows individual workers to send specific
queries for information to archives and databases so that they can obtain targeted or specific
results.
Many successful companies have invested large amounts of money in business intelligence
tools and data warehousing technologies. They believe that up-to-date, accurate and integrated
information about their supply chain, products and customers is essential to their survival.
In the parts below, a detailed explanation will be presented to illustrate our BI solution to the
ANAP 2020 project.
Chapter 1: The Project Context
Introduction:
Information systems and technologies have become the foundation of modern organizations.
They touch, without exception, all activities and all areas of the business, both internally and
externally.
Businesses have found in these new information-processing systems effective tools for
innovating and improving their working methods.
In this sense, a new professional environment has emerged, one that values organizations both
for the quality of their contract management and for the richness of their information systems,
in particular through the optimal exploitation of the latter, thereby reducing costs and delays
while improving responsiveness and quality.
It is in this light that ANAP 2020 expressed the need for a reporting platform, whose objective
is to predict future goals and the means required to reach them, and to identify market
challenges, competitive pressures and evolving technologies.
In this chapter, we start by presenting ANAP 2020, then study an existing solution, and finally
specify the solution we propose to provide the most effective service.
1. Problem statement
The healthcare industry is on the brink of transformation.
There are many reasons for that. Growth in the healthcare industry is at an all-time high, and
healthcare organizations are seeking new ways to improve operating efficiency and reduce
costs.
The recent and unprecedented changes occurring in healthcare have sent organizations
scrambling to extract critical information from the mountains of disparate data they possess so
that they can drive optimal performance. That’s where a data warehouse comes in. A data
warehouse is a must-have commodity for any organization seeking to do the following:
● Understand and manage patient populations.
● Support and defend clinical decisions.
● Allocate scarce resources.
● Reduce waste.
● Improve quality of care.
● Optimize clinical, financial and operational performance.
In this context, ANAP holds a large mass of data from heterogeneous sources, which is hard
for decision makers to interpret and understand.
2. Presentation of the host Organization
ANAP is a public agency that assists health and medical-social institutions in improving the
service provided to patients and users by developing recommendations and tools to optimize
their management and organization.
ATIH is a public agency responsible for collecting and analyzing health facility data: activity,
costs, organization and quality of care, finances, human resources...
It carries out studies on hospital costs and takes part in the financing mechanisms of health institutions.
ANAP and ATIH, within the framework of their respective missions, agree on an essential
fact: a considerable amount of data is produced by the different actors of health. This wealth
and the variety of information collected open up great prospects for exploitation in terms of
understanding the health system and its evolution.
Opening up the data can multiply the capacity for exploitation and analysis, and thus make
the most of the richness of this information.
The ANAP-ATIH 2020 Project is an opportunity to raise awareness on this issue and it has
three main objectives:
● Promote a fun and attractive initiative to teach the issues involved in exploiting health data,
● Demonstrate the value and potential of data exploitation for health policy at a local or
national level, and provide a concrete illustration of the major issue of adapting the
organization of care to the increase in chronic pathologies,
● Explore the ability of data scientists from different backgrounds to contribute to the
resolution of problems based on better exploitation of data.
3. Case study:
Analyzing existing similar projects is crucial in order to better understand the context and start
implementing our solution. In this part we have chosen to work on LifeBridge Health.
LifeBridge Health is a regional health care organization based in northwest Baltimore and its
surrounding counties. LifeBridge Health consists of Sinai Hospital of Baltimore, Northwest
Hospital, Carroll Hospital, Levindale Hebrew Geriatric Center and Hospital, LifeBridge Health
& Fitness, hundreds of primary care and specialty physicians throughout the region, and many
affiliated health-related partners.
As one of the largest, most comprehensive and most highly respected providers of health-related
services to the people of the northwest Baltimore region, LifeBridge Health advocates
preventive services, wellness and fitness services, and programs to educate and support the
communities it serves. Back in 2008, LifeBridge was one of the first healthcare providers to
adopt Cerner's PowerInsight data warehouse solution, which is based on SAP BusinessObjects.
4. Solution
In order to solve ANAP 2020's problems, we established a business intelligence solution after
studying the project and the different sorts of data available.
During this part of the project we:
● Started by centralizing the data
● Assisted decision-makers in decision-making
Conclusion
In this chapter we provided an overview of the context of our project and the solution we
proposed. In the next chapter, we present the analysis and specification of requirements.
Chapter 2: Analysis and specification of the requirements
Introduction
This chapter is dedicated to detailing the features, design and realization of our solution. We
also illustrate the methodology adopted for carrying out the project.
1. Identification of Actors
The actors of a system are the stakeholders who interact directly with the system according to
their roles, through the interfaces from which they decide on their actions.
We have two main actors:
- An administrator:
Able to maintain the solution in case of breakdown
Tracks the sources of errors
Has the ability to monitor the decision makers' different analyses.
- A decision maker:
Consults the solution's dashboard for decision support.
2. Functional Requirements
Process requirements describe what our solution does. They relate the entities and attributes
from the data requirements to the users' needs, and are stated in a manner that lets the reader
see broad concepts decomposed into layers of increasing detail.
The key goal of ANAP 2020 is to forecast the medium-term evolution of the importance of chronic
disease management for health care facilities. To do this, we used historical data available
as Open Data, with the Hospi Diag and ScanSanté reporting platforms in the forefront.
3. Non Functional Requirements
The system we develop has to be:
● Evolutive: The evolution of information is a constraint that must be taken into account.
● Operational: The system must be responsive and guarantee absolute reliability.
● User friendly: The solution must be easy to use and to maintain.
● Efficient: It has to consume the minimum of resources and give the greatest results.
● Maintainable: It has to be easy to maintain in case of a breakdown.
4. Methodology
We used the SCRUM methodology during the realization of our project.
The Scrum approach to project management enables software development organizations to
prioritize the work that matters most and break it down into manageable chunks. Scrum is about
collaborating and communicating both with the people who are doing the work and the people
who need the work done. It’s about delivering often and responding to feedback, increasing
business value by ensuring that customers get what they actually want.
Figure 1 Methodology scrum
Conclusion
The needs assessment is an essential step in a Data Warehouse project.
In fact, this study lets us decide how to design the data warehouse architecture.
Once the needs are identified, the Data Warehouse modeling can begin. This modeling is
presented in the next chapter.
Chapter 3: Sprint Business intelligence
Introduction
Now that we have identified our requirements, actors and modeling method, we begin by
defining the concept of Business Intelligence: simply put, it is a computer-based technique
used to spot, dig out and analyze business data.
In this chapter, we explain how we applied Business Intelligence tools in the development of
our project.
1. Extracting data
We started by collecting ANAP’s data from its information systems.
In order to preserve our data structure we used an “ODS”.
An operational data store (or "ODS") is a database designed to integrate data from multiple
sources for additional operations on the data. Unlike a master data store, the data is not passed
back to operational systems.
Because the data originate from multiple sources, the integration often involves cleaning,
resolving redundancy and checking against business rules for integrity. An ODS is usually
designed to contain low-level or atomic (indivisible) data (such as transactions and prices) with
limited history that is captured "real time" or "near real time" as opposed to the much greater
volumes of data stored in the data warehouse generally on a less-frequent basis.
We used ODS tools to extract data from the ANAP information systems and store it in
dimensions to build our ODS.
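As a minimal sketch of this extract-and-load step (the file layout, column names and the in-memory SQLite database standing in for the project's actual SQL Server ODS are all illustrative assumptions, not the project's real code):

```python
import csv
import io
import sqlite3

# In-memory database standing in for the ODS (the project used SQL Server).
ods = sqlite3.connect(":memory:")
ods.execute("""
    CREATE TABLE stg_hopitaux (
        finess     TEXT,     -- hospital identifier
        annee      INTEGER,  -- year of the record
        nb_sejours INTEGER   -- number of stays
    )
""")

# A tiny sample standing in for an extracted open-data CSV file.
sample_csv = io.StringIO(
    "finess;annee;nb_sejours\n"
    "750000001;2014;1200\n"
    "750000001;2015;1350\n"
)

# Extract rows and load them unchanged into the staging table:
# the ODS keeps low-level data close to the source structure.
reader = csv.DictReader(sample_csv, delimiter=";")
rows = [(r["finess"], int(r["annee"]), int(r["nb_sejours"])) for r in reader]
ods.executemany("INSERT INTO stg_hopitaux VALUES (?, ?, ?)", rows)

count = ods.execute("SELECT COUNT(*) FROM stg_hopitaux").fetchone()[0]
print(count)  # 2
```

Keeping the staging table close to the source layout is what lets later cleaning and integrity checks run against the raw values.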
2. Design
During this step, we design the Data Warehouse that our ETL process will feed.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area.
For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example,
source A and source B may have different ways of identifying a product, but in a data
warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even older data from a data warehouse.
This contrasts with a transaction system, where often only the most recent data is kept.
For example, a transaction system may hold the most recent address of a customer,
where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data
in a data warehouse should never be altered.
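The customer-address example above can be sketched as follows; the table layouts and dates are hypothetical, and an in-memory SQLite database stands in for the real systems:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Transaction system: one row per customer, overwritten on change.
db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, address TEXT)")
db.execute("INSERT INTO customer VALUES (1, 'Old Street')")
db.execute("UPDATE customer SET address = 'New Avenue' WHERE id = 1")

# Data warehouse: one row per customer *and* validity period, never updated,
# so the full address history remains queryable (time-variant, non-volatile).
db.execute("""CREATE TABLE dim_customer
              (id INTEGER, address TEXT, valid_from TEXT)""")
db.execute("INSERT INTO dim_customer VALUES (1, 'Old Street', '2015-01-01')")
db.execute("INSERT INTO dim_customer VALUES (1, 'New Avenue', '2016-06-01')")

current = db.execute("SELECT address FROM customer WHERE id = 1").fetchone()[0]
history = db.execute(
    "SELECT COUNT(*) FROM dim_customer WHERE id = 1").fetchone()[0]
print(current, history)  # New Avenue 2
```

The transaction system keeps only the latest address, while the warehouse dimension keeps both rows: inserts only, no updates.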
Figure 5 Design
First of all, we created the Data Warehouse in SQL Server Management Studio to define the
structure of each dimension and fact table.
We used the fact constellation schema because we need an architecture in which multiple fact
tables share many dimension tables.
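A minimal sketch of such a fact constellation, using an in-memory SQLite database in place of SQL Server (the column lists are illustrative; only the table names follow the report):

```python
import sqlite3

dw = sqlite3.connect(":memory:")

dw.executescript("""
    -- Shared dimensions.
    CREATE TABLE Dim_Year (year_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE Dim_RS   (rs_id   INTEGER PRIMARY KEY, finess TEXT, name TEXT);

    -- Two fact tables referencing the SAME dimensions: this sharing of
    -- dimension tables between several facts is what makes the schema a
    -- fact constellation rather than a single star.
    CREATE TABLE FactAnap2020 (
        year_id     INTEGER REFERENCES Dim_Year(year_id),
        rs_id       INTEGER REFERENCES Dim_RS(rs_id),
        total_stays INTEGER,
        mco_stays   INTEGER
    );
    CREATE TABLE Fact_Finance (
        year_id INTEGER REFERENCES Dim_Year(year_id),
        rs_id   INTEGER REFERENCES Dim_RS(rs_id),
        margin  REAL
    );
""")

tables = [r[0] for r in dw.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Because every fact table carries the same `year_id` and `rs_id` keys, a measure from any fact can be analyzed along the same dimensions.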
Figure 6 ETL
Figure 7 The structure of the Data WareHouse
3. Loading data
In this step, we are going to integrate, load and store data in the Data Warehouse.
3.3. Description of FactAnap2020
This table contains the foreign keys of all the dimensions, such as the year, age, RS (raison
sociale) and patients origin dimensions. It also contains the principal measures our work is based
on: the total number of stays and the number of MCO stays (Medicine, Surgery, Obstetrics).
Figure 8 FactAnap2020
3.4. Description of Fact_Qualite
This table measures the means used to fight nosocomial infections in institutions, through four
indicators:
ICALIN: Composite Indicator of Activities for the Control of Nosocomial Infections,
ICSHA: Indicator of Consumption of Hydro-Alcoholic Solutions,
ICATB: Composite Index of Good Use of Antimicrobials,
SURVISO: Indicator of the Surveillance of Surgical Site Infections.
Figure 9 Fact_Qualite
3.5. Description of Fact_Finance
This table contains all the information of the financial indicators of hospitals (margin, debt,
financial need ...). The following information is analyzed:
Medical doctors, including doctors (except anesthetists), including Surgeons (excluding
gynecologists and obstetricians), including Anesthetists, Obstetricians.
Figure 10 Fact_Finance
3.6. Description of Fact_Process
This indicator compares the institution's average length of stay (DMS) in medicine to the
standardized value for its case mix, obtained by applying the reference DMS of each medical GHM.
It synthesizes the over- or under-performance of the institution's medical organization in
medicine (excluding outpatient care).
Figure 11 Fact_Process
3.7. Description of Fact_RH
This information provides a global vision of the institution's medical human resources, broken
down by discipline.
Figure 12 Fact_RH
3.8. Description of Dimensions
Year Dimension:
The year dimension is the only dimension that is systematically present in any data warehouse,
because in practice every data warehouse is a time series.
Figure 13 Year Dimension
RS Dimension:
RS stands for raison sociale (legal name). This dimension contains the names of the hospitals
together with their FINESS number, which serves as an identifier.
Figure 14 RS Dimension
Age Dimension:
The age dimension tells whether a patient's age is above or under 75 years.
Figure 15 Age Dimension
Patients origin Dimension:
The provenance dimension contains the origin place of the patients.
Figure 16 Patients origin Dimension
DA Dimension (Activity Domain):
Figure 17 DA Dimension
4. Analytical Processing
In this step, we used SQL Server Analysis Services to build an OLAP cube on top of the Data
Warehouse, so that the measures can be analyzed along the dimensions defined above.
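The kind of slicing an OLAP cube provides can be illustrated with a pandas pivot over toy data (the figures below are invented; the project's actual cube was built with SSAS, not pandas):

```python
import pandas as pd

# Toy rows shaped like the FactAnap2020 measures joined to their dimensions
# (all values are made up for illustration).
fact = pd.DataFrame({
    "year":        [2014, 2014, 2015, 2015],
    "age_group":   ["<75", ">=75", "<75", ">=75"],
    "total_stays": [100, 40, 120, 55],
})

# A pivot over (year x age_group) mimics slicing the cube along two dimensions.
cube = fact.pivot_table(index="year", columns="age_group",
                        values="total_stays", aggfunc="sum")
print(cube.loc[2015, ">=75"])  # 55
```

Each cell of the pivot is one aggregated measure at one coordinate of the two chosen dimensions, which is exactly the mental model behind cube browsing.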
5. Reporting
Reporting means collecting and presenting data so that it can be analyzed.
Reporting is the necessary prerequisite of analysis; as such, it should be viewed in light of the
goal of making data understandable and ready for easy, efficient and accurate analysis. It involves:
● Collecting and presenting data ready to be analyzed, including historical data that can
be tracked over time.
● Empowering end-users with the knowledge to become experts in their area of business.
● Having the underlying figures to back up actions and explain decisions.
6. Used technologies
SQL Server Data Tools: transforms database development by introducing a ubiquitous, declarative model that spans all the phases of database development inside Visual Studio. We used the SSDT Transact-SQL design capabilities to build, debug, maintain, and refactor databases. We worked with a database project, and also directly with a connected database instance, on- or off-premises.
SQL Server Integration Services: a platform for building enterprise-level data integration and data transformation solutions. We used Integration Services to solve complex business problems by copying or downloading files, sending e-mail messages in response to events, updating data warehouses, cleaning and mining data, and managing SQL Server objects and data.
SQL Server Analysis Services: an online analytical processing (OLAP) and data mining tool in Microsoft SQL Server. We used SSAS to analyze and make sense of information spread out across multiple databases, and in disparate tables or files.
SQL Server Reporting Services: contains a set of graphical and scripting tools that support the development and use of rich reports in a managed environment. The tool set includes development tools, configuration and administration tools, and report viewing tools.
SQL Server Management Studio: an integrated environment for managing any SQL infrastructure, from SQL Server to SQL Database. It provides tools to deploy, monitor, and upgrade the data-tier components, such as databases and data warehouses, used in our applications, and to build queries and scripts.
Power BI: a Microsoft business analytics service that enabled us to visualize and analyze data.
Conclusion
A data warehouse maintains a copy of the information from the source transaction systems. This architecture makes it possible to congregate data from multiple sources into a single database, so that a single query engine can be used to present the data. It also improves data quality by providing consistent codes and descriptions, flagging or even fixing bad data, and makes decision-support queries easier to write.
Chapter 4: Data Mining Sprint
Introduction
Data mining is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. It is an interdisciplinary subfield of computer science.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use. Aside from the raw analysis step,
it involves database and data management aspects, data pre-processing, model and inference
considerations, interestingness metrics, complexity considerations, post-processing of
discovered structures, visualization, and online updating.
Our main goal is to:
- Optimize the use of limited resources.
- Partition data such that different classes or categories can be defined.
1. Descriptive analysis
We start by importing data from the DataWareHouse based on an SQL query.
install.packages("RODBC")  # if not already installed
library(RODBC)
dbhandle <- odbcDriverConnect('driver={SQL Server};server=.;database=DW_Anap;trusted_connection=true')
# Note: the dimension tables are joined with explicit inner joins only; listing them
# a second time in the FROM clause would have produced an unwanted cross join.
res2 <- sqlQuery(dbhandle, 'select TOP 100000
a.[Nombre de sejours/seances MCO des patients en ALD], a.[Nombre total de sejours/seances],
a.Pk_Age, a.Pk_DA, a.PK_PP,
p.P1, p.P2, p.P3, p.P4, p.P5, p.P6, p.P7, p.P8, p.P9, p.P10, p.P11, p.P12, p.P13, p.P14, p.P15,
q.Q1, q.Q2, q.Q3, q.Q4, q.Q5, q.Q6, q.Q7, q.Q8, q.Q9, q.Q10, q.Q11,
r.CI_RH1, r.CI_RH2, r.CI_RH3, r.CI_RH4, r.CI_RH5, r.CI_RH6, r.CI_RH7, r.CI_RH8, r.CI_RH9, r.CI_RH10, r.CI_RH11,
r.RH1, r.RH2, r.RH3, r.RH4, r.RH5, r.RH6, r.RH8, r.RH9, r.RH10,
f.CI_F1_D, f.CI_F2_D, f.CI_F3_D, f.CI_F4_D, f.CI_F5_D, f.CI_F6_D, f.CI_F7_D, f.CI_F8_D, f.CI_F9_D, f.CI_F10_D,
f.CI_F11_D, f.CI_F12_D, f.CI_F13_D, f.CI_F14_D, f.CI_F15_D, f.CI_F16_D, f.CI_F17_D,
f.CI_F1_O, f.CI_F2_O, f.CI_F3_O, f.CI_F4_O, f.CI_F5_O, f.CI_F6_O, f.CI_F7_O, f.CI_F8_O, f.CI_F9_O, f.CI_F10_O,
f.CI_F11_O, f.CI_F12_O, f.CI_F13_O, f.CI_F14_O, f.CI_F15_O, f.CI_F16_O, f.CI_F17_O,
f.F1_D, f.F2_D, f.F3_D, f.F4_D, f.F5_D, f.F6_D, f.F7_D, f.F8_D, f.F9_D, f.F10_D, f.F11_D, f.F12_D,
f.F1_O, f.F2_O, f.F3_O, f.F4_O, f.F5_O, f.F6_O, f.F7_O, f.F8_O, f.F9_O, f.F10_O, f.F11_O, f.F12_O,
a.PK_annee, a.PK_RS,
an.Annee, ag.[Age (1 >75 ans,0 <= 75 ans)], rss.rs, da.[Domaines d activites]
from dbo.Fact_ANAP2020 as a
inner join dbo.FactProcess p on a.PK_annee = p.FK_annee and a.PK_RS = p.FK_RS
inner join dbo.FactQualitee q on a.PK_annee = q.FK_annee and a.PK_RS = q.FK_RS
inner join dbo.RH1 r on a.PK_annee = r.FK_annee and a.PK_RS = r.FK_RS
inner join dbo.FactFinances f on a.PK_annee = f.FK_annee and a.PK_RS = f.FK_RS
inner join dbo.DimDA da on da.Pk_DA = a.Pk_DA
inner join dbo.Dim_RS rss on rss.PK_RS = a.PK_RS
inner join dbo.DimAge ag on ag.Pk_Age = a.Pk_Age
inner join dbo.Dim_Annee an on an.PK_annee = a.PK_annee')
Provide descriptive statistics:
Figure 22 Descriptive statistics
As we can see in the figure above, we have 5 quantitative variables and 4 qualitative variables. Each quantitative variable has a maximum value and a minimum value.
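A minimal sketch of how such descriptive statistics can be produced in R (the toy data frame here stands in for the real table `res2` returned by the query; the column names are illustrative):

```r
# stand-in for the imported table; the real res2 comes from the sqlQuery call
res2 <- data.frame(nb_jours_mco   = c(10, 25, 3),
                   nb_jours_total = c(40, 60, 12),
                   age            = c(0, 1, 1),
                   annee          = c(2012, 2013, 2014))
summary(res2)  # min, quartiles, mean and max of each quantitative variable
str(res2)      # type of each column (quantitative vs qualitative)
```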
Clustering:
Clustering is the process of making a group of abstract objects into classes of similar objects.
In order to center and reduce the data, we start by constructing a function "centrage_reduction" that centers and reduces a column; it is applied to all the active (quantitative) variables with apply(...).
Figure 21 Importing data from the DataWareHouse
To do this, we propose the following column standardization function:
centrage_reduction <- function(x)
{
  return((x - mean(x)) / sqrt(var(x)))
}
Obtaining the centred and reduced data table:
res.cr <- apply(res[, 1:4], 2, centrage_reduction)
apply(res.cr, 2, mean)
apply(res.cr, 2, var)
Interpretation of the column means and variances:
After centering and reducing, every column has mean 0 and variance 1, which puts all the variables on a comparable scale. In our case the variance is indeed 1 for each column, so we can perform the classification and obtain the desired results.
K-Means:
The K-means algorithm in R aims to divide the data into groups (classes) so as to minimize the distances between the points and the centers of each class.
It takes as arguments a data set and the desired number of groups, and implements an algorithm to arrive at this classification. The centers are randomly initialized, the distance separating them from each of the cloud points is calculated, and the points are grouped according to the distance separating them from each center.
Now the K-means algorithm is run on the centred and reduced variables. We propose to build a partition of two groups, limited to 40 iterations, using the function kmeans(...).
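A minimal sketch of this call (the random matrix below stands in for the real centred-reduced table `res.cr`):

```r
set.seed(1)
# stand-in for the centred-reduced table res.cr built above
res.cr <- scale(matrix(rnorm(400), ncol = 4))
# partition into 2 groups, limited to 40 iterations
km <- kmeans(res.cr, centers = 2, iter.max = 40)
groupe <- km$cluster   # membership vector, reused in the plots below
print(km$size)         # number of observations in each class
```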
Interpretation of the membership groups:
For the interpretation of the groups, the conditional averages of the original active variables
are calculated. They are collected in a single matrix using the following commands:
Figure 25 Single matrix commands
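One way to compute these conditional means is with aggregate; the sketch below uses illustrative stand-in data, not the report's actual matrix commands:

```r
set.seed(1)
res    <- data.frame(matrix(rnorm(400), ncol = 4))  # stand-in for the active variables
groupe <- kmeans(scale(res), centers = 2)$cluster
# conditional means of the original variables, gathered in a single matrix
m <- aggregate(res, by = list(groupe = groupe), FUN = mean)
print(m)
```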
To project the points according to their membership group, in the planes formed by the pairs of variables, R shows all its power. The command used is "pairs"; the result is rich in lessons: the variables are for the most part highly correlated, and almost all pairs of variables make it possible to distinguish the groups:
> pairs(res[, 1:4], pch = 21, bg = c("red", "blue")[groupe])
Figure 26 Pairs Graph
ACP / K-Means combination:
In order to locate the groups more easily, it is proposed to project the points onto the first factorial plane of the Principal Component Analysis (ACP).
To do this, use the following command lines:
acp<- princomp(res.cr,cor=T,scores=T)
print(acp)
print(acp$sdev^2)
print(acp$loadings[,1]*acp$sdev[1])
plot(acp$scores[,1],acp$scores[,2],type="p",pch=21,col=c("red","blue")[groupe])
Figure 27 ACP / K-Means combination
The graph obtained is below:
Figure 28 Plot Acp Graph
For the hierarchical clustering (CAH), we first compute the distance matrix, using the specified distance measure, between the rows of the data array.
Figure 29 CAH
Now we are going to plot the dendrogram to see how many clusters we have and how the observations are distributed among them.
Figure 30 Dendrogram
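A hedged sketch of such a CAH in base R, using dist and hclust on stand-in data (the real input is the centred-reduced table):

```r
set.seed(1)
res.cr <- scale(matrix(rnorm(200), ncol = 4))  # stand-in for the centred-reduced data
d   <- dist(res.cr, method = "euclidean")      # distance matrix between the rows
cah <- hclust(d, method = "ward.D2")           # agglomerative hierarchical clustering
plot(cah, main = "Dendrogram")                 # dendrogram of the repartition
groupes <- cutree(cah, k = 2)                  # cut the tree into 2 clusters
```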
To obtain a circle of correlations:
plot(acp$scores[,1], acp$scores[,2], type = "p", pch = 21, col = c("red","blue")[groupe])
Figure 31 Circle of correlations Commands
We have also developed a function to eliminate NA values; it is presented in the predictive analysis section below.
2. Predictive analysis
Linear regression:
Basic import, display of the data table, and renaming of the columns.
Figure 33 Import Data
deleteNA = function(data)
{
  # keep only the rows that have no NA in the first four columns
  # (the original data[-i, ] calls discarded their result, so nothing was removed)
  keep = rep(TRUE, nrow(data))
  for (i in 1:nrow(data))
  {
    if (is.na(data[i, 1]) || is.na(data[i, 2]) || is.na(data[i, 3]) || is.na(data[i, 4]))
    {
      keep[i] = FALSE
    }
  }
  return(data[keep, ])
}
type = deleteNA(res2)
Figure 32 function to eliminate NA values
Figure 34 Summary commands
Since the dataset is very large and R is limited in the number of rows it can comfortably handle, we selected 10,000 rows with this piece of code.
Figure 35 10000 lines selection command
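A sketch of such a selection (the stand-in data frame below replaces the real query result `res2`):

```r
set.seed(1)
# stand-in for the full query result res2 (the real one has ~100,000 rows)
res2 <- data.frame(x = rnorm(20000), y = rnorm(20000))
res  <- res2[1:10000, ]   # keep only the first 10,000 rows
nrow(res)                 # 10000
```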
We will then center and reduce the data in order to bring the points closer together.
Figure 36 center and reduce
Next, we apply the function lm, which computes the linear regression of a numerical dependent variable as a function of the explanatory variables. In our case, lm is applied to the target variable.
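A minimal sketch of this lm call, on synthetic data; the variable names nb_jours_mco and nb_jours_total follow the coefficients reported below, and the data frame is an illustrative stand-in:

```r
set.seed(1)
# synthetic stand-in for the learning data
donnees <- data.frame(nb_jours_mco   = rnorm(100, 20, 5),
                      nb_jours_total = rnorm(100, 50, 10))
donnees$cible <- 0.01 + 0.0155 * donnees$nb_jours_total -
                 0.0008 * donnees$nb_jours_mco + rnorm(100, 0, 0.01)
# linear regression of the target on the explanatory variables
m <- lm(cible ~ nb_jours_mco + nb_jours_total, data = donnees)
summary(m)   # intercept and the two coefficients
```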
Prediction using linear regression:
Figure 37 Prediction using linear regression
Two coefficients for the different characteristics were found:
Target = 0.01044276292 - 0.0007731274 * nb_jours_mco + 0.0155397031 * nb_jours_total
with:
α0 = 0.01044276292: the intercept; this term is not taken into consideration.
α1 = -0.0007731274: the coefficient of the variable nb_jours_mco.
α2 = 0.0155397031: the coefficient of the variable nb_jours_total.
To better explain the result found above, and to deduce a more reliable conclusion, we will apply the reduced model.
Model selection relies on the AIC criterion, obtained with the function AIC(object, k = 2), where object is the fitted model and k equals 2 by default.
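A hedged sketch of this stepwise selection with stepAIC from the MASS package, on the same kind of synthetic stand-in data as above:

```r
library(MASS)
set.seed(1)
donnees <- data.frame(nb_jours_mco = rnorm(100), nb_jours_total = rnorm(100))
donnees$cible <- 0.01 + 0.0155 * donnees$nb_jours_total + rnorm(100, 0, 0.01)
m  <- lm(cible ~ nb_jours_mco + nb_jours_total, data = donnees)
m2 <- stepAIC(m, direction = "both")  # keeps the model with the lowest AIC
AIC(m2)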
Figure 38 StepAIC
The best model is the one with the lowest AIC, which in our case is AIC = -39769.93.
Hence we obtain a reduced model composed of only one variable: nb_jours_total.
The reduced model neglected nb_jours_mco, so we deduce the new target function:
Target = 0.01044276292 + 0.0155397031 * nb_jours_total (total number of days)
with:
β0 = 0.01044276292: the intercept.
β1 = 0.0155397031: the coefficient of the variable nb_jours_total.
Note: the farther a coefficient is from zero, the more relevant the variable.
For a better comparison, we apply the function cbind, which concatenates the two vectors: data_paa1$m, from the learning set, and the model m, which is the reduced model, in order to see whether the reduced model is close to the learning model.
Figure 39 cbind command
This function allowed us to compare the general model with the reduced model. We note that the target values of the two models are very close, confirming what we already mentioned: an appropriate reduced model can be derived from the general model.
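A minimal sketch of such a side-by-side comparison with cbind, on stand-in data (the names donnees, observe and predit are illustrative):

```r
set.seed(1)
donnees <- data.frame(nb_jours_total = rnorm(50))
donnees$cible <- 0.01 + 0.0155 * donnees$nb_jours_total + rnorm(50, 0, 0.001)
m <- lm(cible ~ nb_jours_total, data = donnees)
# observed target next to the fitted values of the reduced model
comparaison <- cbind(observe = donnees$cible, predit = fitted(m))
head(comparaison)   # the two columns should be very close
```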
Figure 40 Residuals vs Fitted Graph
It can be seen from this figure that the plot function gives a more or less linear point cloud; we can identify the atypical (non-standard) observations that are very far from the regression line, such as lines 9395, 1067 and 9067. Pressing Enter displays the next diagnostic plots.
Figure 41 Normal Q-Q graph
It is also noted that observation 3950 is atypical.
Neural network
library(readxl)
anap = read.csv("C:/Users/LENOVO/Documents/Visual Studio 2010/Projects/ANAP_2020_Final/Data/data2.csv", sep = ";", dec = ".")
library(MASS)
str(anap)
anap = anap[0:1000, c(5:8, 9)]
library(nnet)
# network options: size = 2, rang = 0.1, decay = 1, maxit = 500 (package nnet)
res = nnet(cible1 ~ ., data = anap, size = 2, rang = 0.1, decay = 1, maxit = 50)
res = nnet(cible1 ~ ., data = anap, size = 2, rang = 0.1, decay = 1, maxit = 500)
# we tried 50 iterations and then 500; the results converge to 13.98
library(NeuralNetTools)
plotnet(res)   # plot the network structure
# prediction from the neural network output
pred.nnet <- predict(res, anap)
# confusion matrix (only meaningful for a categorical target, not quantitative data)
Mc <- table(anap$cible1, predict(res, anap))
class(Mc)
Mc
Figure 42 Neural network
Figure 43 Neural network Graph
Conclusion
The techniques of Data Mining led us to use a set of algorithms, methods and technologies in a specific area, which in our case is health. In the decision-making process, it helped us demonstrate the interest and potential of exploiting data in the service of a local or national health policy, and to give a concrete illustration of it: a major factor in adapting the organization of the institutions to the increase in chronic pathologies. It also led us to promote a playful and attractive initiative to raise awareness of the stakes of exploiting health data.
Thus, we carried out the time analysis, which will enable us to predict future values of the target variable.
3. Time series analysis
In order to facilitate the data mining part, we decided to study the stability and the variation of our data over time. Our goal in the Data Mining part is to predict the variable "cible 1". To accomplish this task, we devoted our efforts to studying the data and its correlation with our target variable.
Connecting and importing data from the data warehouse:
Figure 44 importing data commands
Successful import of the 1,048,575 data warehouse rows.
Correlation analysis
We want to measure the intensity of the link between our target variable and the quantitative
data.
Figure 46 Mcor Commands
We used the Pearson correlation, which normalizes the covariance by the standard deviations.
Figure 47 Pearson correlation
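A minimal sketch of this correlation matrix (the data frame and its column names are illustrative stand-ins for the imported variables):

```r
set.seed(1)
# stand-in for the imported quantitative variables
dat <- data.frame(cible = rnorm(100))
dat$NBSejourMCO   <- 2 * dat$cible + rnorm(100, 0, 0.5)
dat$NBSejourTotal <- rnorm(100)
# Pearson correlation matrix, ignoring rows with missing values
mcor <- cor(dat, method = "pearson", use = "complete.obs")
round(mcor, 2)   # correlation of each variable with the target
```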
The results obtained show an increasing correlation between:
Finess
Année
NBSejourTotal
NBSjourMCO
Age
Figure 45 Correlation analysis
Figure 48 corrplot command
To complement the cor method, we use the corrplot library to visualize the result.
Figure 49 corrplot graph
The correlation between the target variable and age is clear, followed by a slight correlation
between the target variable and NBSejourTotal, as well as the target variable and
NBSejourMCO.
Figure 50 temporal series commands
In order to verify this correlation in time, we will create a temporal series. After obtaining the
interval of years from the Summary function we create our temporal series s1.
Figure 51 temporal series graph
We notice that our data follow the same pace over time, but we must examine the ACF to see whether our data are correlated over time.
Figure 52 acf command
The autocorrelation of a series refers to the fact that, in a temporal or spatial series, the measurement of a phenomenon at a time t can be correlated with the preceding measurements.
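A minimal sketch of this ACF test in base R, on a stand-in series built to have autocorrelation:

```r
set.seed(1)
x <- cumsum(rnorm(100))       # stand-in series with strong autocorrelation
a <- acf(x, main = "ACF")     # bars beyond the blue dashed lines are significant
head(a$acf)                   # autocorrelation at lags 0, 1, 2, ...
```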
Figure 53 graph series
Focusing on our target variable, we note that over time there is no correlation between age and our target variable, nor between NBSejourTotal and our target variable; on the other hand, there is a strong correlation between NBSejourMCO and our target variable. The blue line indicates the critical threshold beyond which the autocorrelation is considered significant.
To prove that our series is strongly correlated with itself, we plot it with a lag of 3 years.
Figure 54 correlogram graph
We obtained a correlogram which proves that our series is very correlated with itself and
therefore we must decompose it and study its seasonality.
Figure 55 Time series commands
Figure 56 decompose commands
We will create two time series, one for the target variable and the other for NBSejourMCO. We will then decompose them.
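A hedged sketch of such a decomposition with ts and decompose, on a synthetic monthly series (the dates and values are illustrative, not the report's data):

```r
set.seed(1)
# stand-in monthly series: trend + seasonality + noise over 8 years
x <- ts(10 + 0.05 * (1:96) + sin(2 * pi * (1:96) / 12) + rnorm(96, 0, 0.2),
        start = c(2008, 1), frequency = 12)
dec <- decompose(x)   # additive decomposition: trend + seasonal + random
plot(dec)             # one panel per component
```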
Figure 57 Decomposition of additive time series
There is a trend and a seasonality for the time series of the target variable.
There is a trend and a seasonality for the time series of the variable NBSéjourMCO.
Figure 59 Decomposition of additive time series
This demonstrates the correlation, seasonality and trend of our two curves.
Forecasting
In this part, we will try to predict our target variable.
Figure 60 Prevision commands
We will use the Arima model from the forecast library for the prediction.
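A hedged sketch of such a fit using auto.arima (assuming the forecast package is installed; the random-walk series is a stand-in for the real data):

```r
library(forecast)   # assumed installed
set.seed(1)
x    <- ts(cumsum(rnorm(96)), start = c(2008, 1), frequency = 12)
fit  <- auto.arima(x)           # selects the ARIMA order by AIC
prev <- forecast(fit, h = 24)   # predict the next two years
plot(prev)
```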
Figure 61 Summary(fit) result
The lower the AIC, the better the model; here the AIC is negative.
We then check whether any autocorrelation remains in the residuals of our model.
Figure 62 Acf residuals graph
No value exceeds the blue line, which indicates the critical threshold beyond which the autocorrelation is considered significant; hence there is no correlation left in the residuals.
Figure 63 plot forecast
We use the Box-Pierce test because it is the most efficient with strongly correlated data.
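A minimal sketch of this test with the base-R Box.test function, applied here to a stand-in residual vector:

```r
set.seed(1)
res_model <- rnorm(100)   # stand-in for the model residuals
bt <- Box.test(res_model, lag = 10, type = "Box-Pierce")
print(bt)
# a large p-value means no significant autocorrelation in the residuals
```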
Figure 64 Forecast graph
Our target was predicted for the next two years (in blue) and the correct values from the test
dataset are in green.
1. Used technologies
R: During the "Data Mining and Time Series" sprint we used R, which is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
Chapter 6: Marketing strategy
Introduction
The economy is not without friction. Supply and demand do not meet effortlessly: consumers have to make an effort to look for goods that satisfy their needs, and companies have to find customers who value the goods they produce. Marketing organizes this meeting and facilitates the exchange, from the transaction to the relationship.
Indeed, people often have a fairly narrow view of marketing. Many think, for example, that
marketing is limited to sales or advertising. But marketing brings together a much wider range
of activities that link an organization to its market. Marketing is also a managerial philosophy
that places the consumer at the heart of the company's concerns. With a marketing approach,
the company's success depends on understanding and satisfying the needs of the consumer.
Therefore, all our intention is based on the customer and his expectations while taking into
consideration all the key elements that will guarantee a total and reciprocal satisfaction of the
customer and our company.
In this chapter, we will show you a complete marketing study of our project.
1. Analyse of needs
We note that the costs for chronic diseases are very high, about 5 times higher than for a normal patient, and about 25% of people with a chronic disease have some form of activity restriction.
As we see in this figure, chronic illnesses are not about to decrease. In contrast, injuries and communicable diseases are in a stabilization phase, and a decrease is expected by 2020.
2. Strategic Analysis (SWOT)
SWOT (Strengths - Weaknesses - Opportunities - Threats) is a strategic analysis tool that combines the study of the strengths and weaknesses of an organization, a territory, a sector, etc. with the opportunities and threats of its environment, in order to help define a development strategy. The aim of the analysis is to take both internal and external factors into account in the strategy, maximizing the potential of the strengths and opportunities, thereby identifying the key success factors, and minimizing the effects of the weaknesses and threats, in order to gain a competitive advantage.
Figure 65 Strategic Analysis (SWOT)
Conclusion
Now, after this marketing study, we have a clear and concise action plan that will allow us, on the one hand, to position ourselves vis-à-vis our competitors in order to judge our product objectively and, on the other hand, to set the objectives we want to achieve realistically, so as not to be surprised at the end.
General Conclusion
Using the data available to a company to give it added value: this is the challenge of modern business.
In this context, and in order to solve recurring problems in the decision-making process, the Anap 2020 project was launched to build a data warehouse allowing the establishment of a reliable and efficient decision-support system.
Throughout our design and construction work, we tried to follow a mixed approach, combining two well-known approaches in the data warehouse field, namely the "needs analysis" approach and the "data sources" approach. This allowed us to meet the expectations and needs of the users while making the most of the data generated by the operational systems, in order to anticipate unexpressed needs.
Finally, before citing the prospects of the project, we can say that the Anap 2020 project allowed us to acquire very good experience and to evolve in a field that was previously largely unknown to us, namely decision-support systems.
We can mention the following perspectives and developments:
Follow up on the current deployment.
Extend the system to other operational areas, including human resources.
Make the reports available on mobile portals, whatever the OS, so that the relevant information is available anywhere and anytime.