SlideShare a Scribd company logo
1 of 67
Rapport PEBI: Anap
2020
Realized by:
 Ben Amor Hela
 Gassab Ali
 Gasmi Cyrine
 Rekik Habib
 Zaïbi Chaima
College year:2016-2017
Projet Pe Bi
Table of Content
General Introduction..................................................................................................................... 7
Chapter 1...................................................................................................................................... 9
The Project Context....................................................................................................................... 9
Introduction:............................................................................................................................10
1. State of the problematic....................................................................................................10
2. Presentation of the host Organization................................................................................11
3. Case study :......................................................................................................................12
4. Solution............................................................................................................................12
Conclusion...............................................................................................................................13
Chapter 2.....................................................................................................................................14
Analysis and specification of requirements....................................................................................14
Introduction.............................................................................................................................15
1. Identification of Actors......................................................................................................15
2. Functional Requirements...................................................................................................15
3. Non Functional Requirements............................................................................................15
4. Methodology....................................................................................................................16
Conclusion...............................................................................................................................16
Chapter 3.....................................................................................................................................17
Sprint Business Intelligence ..........................................................................................................17
Introduction.............................................................................................................................18
1. Extracting data..................................................................................................................18
2. Design..............................................................................................................................20
3. Loading data.....................................................................................................................22
3.3. Description of FactAnap2020......................................................................................22
3.4. Description of Fact_Qualite........................................................................................22
3.5. Description of Fact_Finance........................................................................................23
3.6. Description of Fact_Process........................................................................................23
3.7. Description of Fact_RH...............................................................................................24
3.8. Description of Dimensions..........................................................................................24
4. Analytical Processing.........................................................................................................27
5. Reporting.........................................................................................................................27
6. Used technologies.............................................................................................................29
Conclusion...............................................................................................................................30
Chapter 4.....................................................................................................................................31
Sprint Data Mining.......................................................................................................................31
Introduction.............................................................................................................................32
1. Descriptive analysis...........................................................................................................32
2. Predictive analysis.............................................................................................................40
Conclusion...............................................................................................................................47
3. Time series analysis...........................................................................................................48
Chapter 5.....................................................................................................................................60
Sprint Big Data.............................................................................................................................60
Chapter 6.....................................................................................................................................62
Marketing Strategy.......................................................................................................................62
Introduction.............................................................................................................................63
1. Strategic Analysis (SWOT)..................................................................................................63
Conclusion...............................................................................................................................65
Table of Figures
Figure 1 Methodology scrum ........................................................................................................16
Figure 2 El....................................................................................................................................19
Figure 3 EL Hopitaux_ODS ............................................................................................................19
Figure 4 EL Data_ODS...................................................................................................................19
Figure 5 Design ............................................................................................................................20
Figure 6 ETL .................................................................................................................................21
Figure 7 The structure of the Data WareHouse ..............................................................................21
Figure 8 FactAnap2020.................................................................................................................22
Figure 9 Fact_Qualite ...................................................................................................................23
Figure 10 Fact_Finance.................................................................................................................23
Figure 11 Fact_Process.................................................................................................................24
Figure 12 Fact_RH ........................................................................................................................24
Figure 13 Year Dimension.............................................................................................................25
Figure 14 RS Dimension................................................................................................................25
Figure 15 Age Dimension..............................................................................................................26
Figure 16 Patients origin Dimension..............................................................................................26
Figure 17 DA Dimension ...............................................................................................................26
Figure 18 Reporting 1...................................................................................................................28
Figure 19 Reporting 2...................................................................................................................28
Figure 20 Reporting 3...................................................................................................................28
Figure 21 Importing datafrom the DataWareHouse.......................................................................33
Figure 22 Descriptive statistics......................................................................................................33
Figure 23 Kmeans commands 1.....................................................................................................35
Figure 24 Kmeans commands 2.....................................................................................................35
Figure 25 Single matrix commands................................................................................................36
Figure 26 Pairs Graph ...................................................................................................................37
Figure 27 ACP / K-Means combination...........................................................................................37
Figure 28 Plot Acp Graph..............................................................................................................38
Figure 29 CAH..............................................................................................................................38
Figure 30 Dendogram...................................................................................................................39
Figure 31 Circle of correlations Commands....................................................................................39
Figure 32 function to eliminate NA values......................................................................................40
Figure 33 Import Data ..................................................................................................................40
Figure 34 Summary commands.....................................................................................................41
Figure 35 10000 lines selection command .....................................................................................41
Figure 36 center and reduce.........................................................................................................41
Figure 37 Prediction using linear regression...................................................................................42
Figure 38 StepAic.........................................................................................................................43
Figure 39 cbind command ............................................................................................................44
Figure 40 Residuals vs Filtted Graph..............................................................................................45
Figure 41 Normal Q-Q graph.........................................................................................................46
Figure 42 Neural network.............................................................................................................46
Figure 43 Neural network Graph...................................................................................................47
Figure 44 importing data commands.............................................................................................48
Figure 45 Correlation analysis ......................................................................................................49
Figure 46 Mcor Commands...........................................................................................................49
Figure 47 Pearson correlation.......................................................................................................49
Figure 48 corrplot comand............................................................................................................50
Figure 49 corrplot graph...............................................................................................................50
Figure 50 temporal series commands............................................................................................50
Figure 51 temporal series graph....................................................................................................51
Figure 52 acf command ................................................................................................................51
Figure 53 graph series ..................................................................................................................52
Figure 54 correlogram graph.........................................................................................................53
Figure 55 Time series commands ..................................................................................................53
Figure 56 decompose commands..................................................................................................53
Figure 57 Decomposition of additive time series............................................................................54
Figure 58 Decomposition of additive time series............................................................................55
Figure 59 Decomposition of additive time series............................................................................56
Figure 60 Prevision commands......................................................................................................56
Figure 61 Summary(fit) result........................................................................................................57
Figure 62 Acf residuals graph........................................................................................................57
Figure 63 plot forecast..................................................................................................................57
Figure 64 Forcats grap ..................................................................................................................58
Figure 65 Swot graph..........................................................................Error! Bookmark not defined.
Abstract
This document is a summary of our work in the framework of the integration project carried
out within ESPRIT in conjunction with the company ANAP2020. We have been fortunate
during this period to work on a BI project.
The aim of this work is to create a decision-making layer to meet the needs expressed by
ANAP2020 makers to gain visibility into the true state of purchases, etc...
KEYWORDS: Data Warehouse, Business Intelligence, data integration, Open Source solutions
General Introduction
Business intelligence is a data driven decision-making.
It is the practice of taking large amounts of corporate data and turning it to usable information.
This practice enables companies to derive analysis that can be used to make profitable actions.
The process of converting corporate data to usable information goes beyond data collection and
crunching, into how companies can gain from big data and data mining.
This means that business intelligence is not confined to technology, it includes the business
processes and data analysis procedures that facilitate the collection of big data. It is time
consuming and involves various factors such as data models, data sources, data warehouses,
and business models among others.
Big data and data mining are completely different concepts. However, both concepts involve
the use of large data sets to handle the collection or reporting of data that helps businesses or
clients make better decisions. However, the two concepts are used in two different elements of
this operation.
The term big data can be defined simply as large data sets that outgrow simple databases and
data handling architectures. For example, data that cannot be easily handled in Excel
spreadsheets may be referred to as big data.
Data mining relates to the process of going through large sets of data to identify relevant or
pertinent information. Businesses often collect large data sets that may be automatically
collected. However, decision makers need access to smaller, more specific pieces of data and
use data mining to identify specific data that may help their businesses make better leadership
and management decisions.
Various software packages and analytical tools can be used for data mining. The process can
be automated or be done manually. Data mining allows individual workers to send specific
queries for information to archives and databases so that they can obtain targeted or specific
results.
Many successful companies have invested large amounts of money in business intelligence’s
tools and data warehousing technologies. They believe that the updated, accurate and integrated
information from their supply chain, products and customers are essential to their survival.
In the parts below, a detailed explanation will be presented to illustrate our BI solution to the
ANAP 2020 project.
Chapter 1
The Project Context
Chapter 1: The project concept
Introduction:
Systems and information technologies have become the bases of modern organizations today.
They touch, without exception, all activities and all areas of the business, both internally and
externally.
Current businesses found in these new information processing systems effective tools to
innovate in terms of improving working methods.
In this sense, a new professional environment has been created, a context that values both bodies
by the quality of the management of contracts, and the richness of their information systems,
particularly through the optimal exploitation the latter, thus reducing costs, delays and
improving responsiveness and quality.
It is in this light that ANAP 2020 expressed a need for setting up a platform of reporting: an
optical objective, which is to predict the future goals and the means you will need to also
identify market challenges the competitive pressures and evolving technologies.
In this chapter we will start by presenting ANAP 2020 then we will present a study of an
existing solution and finally specify the proposed solution to provide the most effective service.
1. State of the problematic
The healthcare industry is on the brink of transformation.
There are many reasons for that. Growth in the healthcare industry is at an all-time high, and
healthcare organizations are seeking new ways to improve operating efficiency and reduce
costs.
The recent and unprecedented changes occurring in healthcare have sent organizations
scrambling to extract critical information from the mountains of disparate data they possess so
that they can drive optimal performance. That’s where a data warehouse comes in. A data
warehouse is a must-have commodity for any organization seeking to do the following:
 Understand and manage patient populations.
 Support and defend clinical decisions.
 Allocate scarce resources.
 Reduce waste.
 Improve quality of care.
 Optimize clinical, financial and operational performance.
In this context, ANAP has a large mass of data, from heterogeneous sources which are hard
to interpret and to understand by decision makers.
2. Presentation of the host Organization
ANAP is a public agency that assists health and medical-social institutions in improving the
service provided to patients and users by developing recommendations and tools to optimize
their management and organization.
ATIH is a public agency responsible for collecting and analyzing health facility data: activity,
costs, organization and quality of care, finances, human resources...
It carries out studies on hospital costs and participates in the mechanisms Institutions.
ANAP and ATIH, within the framework of their respective missions, agree on an essential
fact: a considerable amount of data is produced by the different actors of health. This wealth
and the variety of information collected open up great prospects for exploitation in terms of
understanding the health system and its evolution.
Opening the data can make it possible to multiply the operating and analysis capacity and thus
to make the most of the richness of this information.
The ANAP-ATIH 2020 Project is an opportunity to raise awareness on this issue and it has
three main objectives:
● Promote a fun and attractive initiative to teach the issues of the exploitation of health data,
● Demonstrate the value and potential of data exploitation for health policy at a local or
national level and provide a concrete illustration of this major issue of adapting the
organization of The increase of chronic pathologies,
● Explore the ability of data scientists from different backgrounds to contribute to the
resolution of problems based on better exploitation of data.
3. Case study:
Analyzing existing similar projects is crucial in order to better understand the context and start
implementing our solution. In this part we have chosen to work on LifeBridge Health.
LifeBridge Health is a regional health care organization based in northwest Baltimore and its
surrounding counties. LifeBridge Health consists of Sinai Hospital of Baltimore, Northwest
Hospital, Carroll Hospital, Levindale Hebrew Geriatric Center and Hospital, LifeBridge Health
& Fitness, hundreds of primary care and specialty physicians throughout the region, and many
affiliated health-related partners.
As one of the largest, most comprehensive and most highly respected providers of health-related
services to the people of the northwest Baltimore region, LifeBridge Health advocates
preventive services, wellness and fitness services, and programs to educate and support the
communities it serves. Back in 2008, LifeBridge was one of the first healthcare providers to
adopt Cerner's PowerInsight data warehouse solution, which is based on SAP BusinessObjects.
4. Solution
In order to solve ANAP 2020’s problems we established a business intelligence solution after
studying the project and its available different sorts of data.
During this part of the project we:
● Started by centralizing the data
● Assisted decision-makers in decision-making
Conclusion
We tried using this chapter to provide an overview of the context of our project and the solution
we proposed. In the next chapter, we will present analysis and specification of requirements.
Chapter 2
Analysis and specification
of requirements
Chapter 2: Analysis and specification of the requirements
Introduction
This chapter is dedicated for detailing features, designing and realizing our solution. We
will illustrate also the methodology adopted for the establishment of the project.
1. Identification of Actors
The actors of a system are all stakeholders who interact directly with the system according to
their roles, so they can bypass the working interface they decide on action.
We have two main actors:
- An administrator :
 Able to maintain the solution in case of breakdown
 Tracks the sources of errors
 Has the ability to control decision makers at different analyzes.
- A decision maker:
 Consults the dashboard of the solution to have a decision support.
2. Functional Requirements
Process requirements describe what our solution do. Process requirements relate the entities
and attributes from the data requirements to the users’ needs. State the functional process
requirements in a manner that enables the reader to see broad concepts decomposed into
layers of increasing detail.
The key of Anap2020 is forecast the medium-term evolution of the importance of chronic
disease management for health care facilities. To do this, we have use historical data available
in OpenData, with the Hospi Diag * and the ScanSanté * reporting platform in the forefront.
3. Non Functional Requirements
The system we develop has to be:
 Following the evolution: The evolution of information is a constraint that
must be taken into account.
 Operational: The system must be responsive and guarantee absolute
reliability.
 User friendly: The solution must be easy to use and to maintain
 Efficient: It has to consume the minimum resources and give the greatest
results.
 Maintainable: It has to be easy to maintain in case of a breakdown.
4. Methodology
We used the SCRUM methodology during the realization of our project.
The Scrum approach to project management enables software development organizations to
prioritize the work that matters most and break it down into manageable chunks. Scrum is about
collaborating and communicating both with the people who are doing the work and the people
who need the work done. It’s about delivering often and responding to feedback, increasing
business value by ensuring that customers get what they actually want.
Figure 1 Methodology scrum
Conclusion
The needs assessment is a step more than necessary in a Data Warehouse project.
In fact, through this study, we can decide how to design the data warehouse architecture.
The needs are identified, the Data Warehouse modeling can begin. This modeling will be in the
next chapter.
Chapter 3
Sprint Business Intelligence
Chapter 3: Sprint Business intelligence
Introduction
Once we identified our requirements, actors and modeling method, we can begin by defining
the concept of Business intelligence so simply business intelligence is a computer based
technique used in spotting, digging-out, and analyzing business data.
During this chapter, we are going to explain how we applied Business intelligence tools in the
development of our project.
1. Extracting data
We started by collecting ANAP’s data from its information systems.
In order to preserve our data structure we used an “ODS”.
An operational data store (or "ODS") is a database designed to integrate data from multiple
sources for additional operations on the data. Unlike a master data store, the data is not passed
back to operational systems.
Because the data originate from multiple sources, the integration often involves cleaning,
resolving redundancy and checking against business rules for integrity. An ODS is usually
designed to contain low-level or atomic (indivisible) data (such as transactions and prices) with
limited history that is captured "real time" or "near real time" as opposed to the much greater
volumes of data stored in the data warehouse generally on a less-frequent basis.
We used ODS tools in order to extract data from ANAP information system in order to storage
data into dimensions to build our ODS.
Figure 2 El
Figure 3 EL Hopitaux_ODS
Figure 4 EL Data_ODS
2. Design
During this step we are going to supply our Data Warehouse.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process.
 Subject-Oriented: A data warehouse can be used to analyze a particular subject area.
For example, "sales" can be a particular subject.
 Integrated: A data warehouse integrates data from multiple data sources. For example,
source A and source B may have different ways of identifying a product, but in a data
warehouse, there will be only a single way of identifying a product.
 Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even older data from a data warehouse.
This contrasts with a transactions system, where often only the most recent data is kept.
For example, a transaction system may hold the most recent address of a customer,
where a data warehouse can hold all addresses associated with a customer.
 Non-volatile: Once data is in the data warehouse, it will not change. So, historical data
in a d ata warehouse should never be altered.
Figure 5 Design
First of all, we created the Data Warehouse in SQL Server Management Studio to define the
structure of each dimensions and facts.
We used the fact constellation schema because we need an architecture that has multiple fact
tables that shares many dimension tables.
Figure 6 ETL
Figure 7 The structure of the Data WareHouse
3. Loading data
In this step, we are going to integrate, load and store data in the Data Warehouse.
3.3. Descriptionof FactAnap2020
This table contains the foreign keys of all the dimensions such asyear dimension, age, rs ( social reason)
and patients origin dimension. It contains also the principle measures that our work is based on which
are Total number of stay and number of MCO stay (Medicine, Surgery, Obstetrics).
Figure 8 FactAnap2020
3.4. Descriptionof Fact_Qualite
This table measures the means used to fight nosocomial infections in institutions, through four
Indicators:
ICALIN: Composite Indicator of Activities for the Control of Nosocomial Infections,
ICSHA Indicator of Consumption of Hydro-Alcoholic Solutions,
ICATB Index Composite of Good Use of Antimicrobials
SURVISO Indicator of realization of a Surveillance of the Infections of the Operating Site.
Figure 9 Fact_Qualite
3.5. Descriptionof Fact_Finance
This table contains all the information of the financial indicators of hospitals (margin, debt,
financial need ...). The following information is analyzed:
Medical doctors, including doctors (except anesthetists), including Surgeons (excluding
gynecologists and obstetricians), including Anesthetists, Obstetricians.
Figure 10 Fact_Finance
3.6. Descriptionof Fact_Process
This indicator compares the DMS of medicine from the institution to the standardized one of its case
mix to which are applied the reference MDS of each medical GHM.
It synthesizes the over or under performance of the medical organization of the institution in medicine
(out-of-pocket).
Figure 11 Fact_Process
3.7. Descriptionof Fact_RH
This information makes it possible to have a global vision of the medical human resources of
the institution, declined by discipline
Figure 12 Fact_RH
3.8. Descriptionof Dimensions
 Year Dimension:
The year dimension is the only dimension that is systematically in any data warehouse, because
in practice any data warehouse is a time series.
Figure 13 Year Dimension
 RS Dimension:
The RS dimension is “social reason”. It is the dimension that contains the names of the
hospitals and their Finess which is an Id.
Figure 14 RS Dimension
 Age Dimension:
The age dimension tells if the age is above or under 75 year.
Figure 15 Age Dimension
 Patients origin Dimension:
The provenance dimension contains the origin place of the patients.
Figure 16 Patients origin Dimension
 DA Dimension (Activity Domain):
Figure 17 DA Dimension
4. Analytical Processing
// manquante olap et calculate w ka ri9
5. Reporting
Reporting means collecting and presenting data so that it can be analyzed.
Reporting is the necessary prerequisite of analysis; as such, it should be viewed in light of the
goal of making data understandable and ready for easy, efficient and accurate analysis.
 Collecting and presenting data ready to be analyzed, including historical data that can
be tracked over time.
 Empowering end-users with the knowledge to become experts in their area of business
 Having the underlying figures to back up actions and explain decisions
Figure 18 Reporting 1
//ne9sin les interprétations
Figure 19 Reporting 2
Figure 20 Reporting 3
6. Used technologies
SQL Server Data Tool : It transforms database development by introducing a
ubiquitous, declarative model that spans all the phases of database development inside Visual
Studio. We used SSDT Transact-SQL design capabilities to build, debug, maintain, and refactor
databases. We worked with a database project, and also directly with a connected database
instance on or off-premise.
SQL Server Integration Services : is a platform for building enterprise-level
data integration and data transformations solutions. We used Integration Services to solve
complex business problems by copying or downloading files, sending e-mail messages in
response to events, updating data warehouses, cleaning and mining data, and managing SQL
Server objects and data.
SQL Server Analysis Server : is an online analytical and transactional
processing (OLAP) and data mining tool in Microsoft SQL Server.We used SSAS as a tool by
organizations to analyze and make sense of information possibly spread out across multiple
databases, and in disparate tables or files.
SQL Server Reporting Services : It contains a set of graphical and scripting tools
that support the development and use of rich reports in a managed environment. The tool set
includes development tools, configuration and administration tools, and report viewing tools.
This topic gives a brief overview of each tool in Reporting Services and how it can be accessed.
SQl Server Management Studio : is an integrated environment for managing any SQL
infrastructure, from SQL Server to SQL Database. It provides tools to deploy, monitor, and
upgrade the data-tier components, such as databases and data warehouses used in our
applications, and to build queries and scripts.
Power BI : is a Microsoft business analytics service that enabled for us to visualize
and analyze data.
Conclusion
A data warehouse maintains a copy of information from the source transaction systems. This
architectural complexity provides the opportunity to congregate data from multiple sources into
a single database so a single query engine can be used to present data also mainly to improve
data quality, by providing consistent codes and descriptions, flagging or even fixing bad data
and make decision–support queries easier to write
Chapter 4
Sprint Data Mining
Chapter 4: Sprint Data mining
Introduction
Data mining is the computing process of discovering patterns in large data sets involving
methods at the intersection of artificial intelligence, machine learning, statistics, and database
systems. It is an interdisciplinary subfield of computer science
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use. Aside from the raw analysis step,
it involves database and data management aspects, data pre-processing, model and inference
considerations, interestingness metrics, complexity considerations, post-processing of
discovered structures, visualization, and online updating.
Our main goal is to:
- Optimize the use of limited resources.
- Partition data such that different classes or categories can be defined.
1. Descriptive analysis
We start by importing data from the DataWareHouse based on an SQL query.
install.packages("RODBC") library(RODBC) dbhandle <- odbcDriverConnect('driver={SQL
Server};server=.;database=DW_Anap;trusted_connection=true')
res2 <- sqlQuery(dbhandle, 'select TOP 100000
a.[Nombre de sejours/seances MCO des patients en ALD],a.[Nombre total de sejours/seances],
a.Pk_Age,a.Pk_DA,a.PK_PP,p.P1,p.P10,p.P11,p.P12,p.P13,p.P14,p.P15,p.P2,p.P3,p.P4,p.P5,p.P6,p.P
7,p.P8,p.P9 ,q.Q1,q.Q2,q.Q3,q.Q4,q.Q5,q.Q6,q.Q7,q.Q8,q.Q9,q.Q10,q.Q11
,r.CI_RH1,r.CI_RH2,r.CI_RH3,r.CI_RH4,r.CI_RH5,r.CI_RH6,r.CI_RH7,r.CI_RH8,r.CI_RH9,r.CI_R
H10,r.CI_RH11 ,r.RH1,r.RH2,r.RH3,r.RH4,r.RH5,r.RH6,r.RH8,r.RH9,r.RH10
,f.CI_F1_D,f.CI_F2_D,f.CI_F3_D,f.CI_F4_D,f.CI_F5_D,f.CI_F6_D,f.CI_F7_D,f.CI_F8_D,f.CI_F9_
D,f.CI_F10_D
,f.CI_F11_D,f.CI_F12_D,f.CI_F13_D,f.CI_F14_D,f.CI_F15_D,f.CI_F16_D,f.CI_F17_D,f.CI_F1_O,f.
CI_F2_O,f.CI_F3_O,f.CI_F4_O,f.CI_F5_O,f.CI_F6_O,f.CI_F7_O,f.CI_F8_O,f.CI_F9_O,f.CI_F10_O
,f.CI_F11_O,f.CI_F12_O,f.CI_F13_O,f.CI_F14_O,f.CI_F15_O,f.CI_F16_O,f.CI_F17_O,f.F1_D,f.F2
_D,f.F3_D,f.F4_D,f.F5_D,f.F6_D,f.F7_D,f.F8_D,f.F9_D,f.F10_D
,f.F11_D,f.F12_D,f.F1_O,f.F2_O,f.F3_O,f.F4_O,f.F5_O,f.F6_O,f.F7_O,f.F8_O,f.F9_O,f.F10_O,a.PK
_annee,a.PK_RS ,f.F11_O,f.F12_O ,an.Annee,ag.[Age (1 >75 ans,0 <= 75 ans)],rss.rs,da.[Domaines d
activites]
from dbo.DimDA,dbo.Dim_RS,dbo.DimAge,dbo.Dim_Annee,dbo.Fact_ANAP2020 as a
inner join dbo.FactProcess p on a.PK_annee=p.FK_annee and a.PK_RS=p.FK_RS
inner join dbo.FactQualitee q on a.PK_annee=q.FK_annee and a.PK_RS=q.FK_RS
inner join dbo.RH1 r on a.PK_annee=r.FK_annee and a.PK_RS=r.FK_RS
inner join dbo.FactFinances f on a.PK_annee=f.FK_annee and a.PK_RS=f.FK_RS
inner join dbo.DimDA da on da.Pk_DA = a.Pk_DA
inner join dbo.Dim_RS rss on rss.PK_RS = a.PK_RS
inner join dbo.DimAge ag on ag.Pk_Age = a.Pk_Age
inner join dbo.Dim_Annee an on an.PK_annee = a.PK_annee'
Figure
 Provide descriptive statistics:
Figure 22 Descriptive statistics
As we can see on the picture below, we have 5 quantitatifs variables and 4 qualitative
variables.
Each quantitative values have a maximum value and a minimum value.
 Clustering:
Clustering is the process of making a group of abstract objects into classes of similar objects.
In order to center and reduce the data, we must start by constructing a Function
"centrage_reduction" that centers and reduces a column, applied to all active (quantitative)
variables with apply (...... ..)
Figure 21 Importing data from the DataWareHouse
To do this, we propose to perform the following column standardization function:
centrage_reduction<- function(x)
{
return((x-mean(x))/sqrt(var(x)))
}
Obtaining the centred and reduced data table
res.cr <- apply(res[,1:4],2,centrage_reduction)
apply(res.cr,2,mean)
apply(res.cr,2,var)
 Interpretation of column mean and variance:
The closer the variance is, the more homogeneous the population is.
In our case the variance = 1 so we can do the classification and get the desired results.
(If the variance is close to 0, it may be concluded that the population is heterogeneous.)
 K-Means:
The K meansinRaimsto divide the dataintogroups(classes)soastominimize the distancesbetween
the points and the centers of each class.
It takes as argument a database and the desired number of groups, it implements an algorithm to
arrive atthisclassification.The centersare randomlyprojectedandthe distance separatingthemfrom
each of the cloudpointsiscalculatedandthe pointsare groupedaccordingto the distance separating
them from each center.
Now, the K-Means algorithm is started on the centred and reduced variables.
We propose to design a partition of two groups, limited to 40 iterations.
 The principle of using the function "kmeans (...)"
Figure 23 Kmeans commands 1
Figure 24 Kmeans commands 2
 Interpretation of the membership groups:
For the interpretation of the groups, the conditional averages of the original active variables
are calculated. They are collected in a single matrix using the following commands:
Figure 25 Single matrix commands
To project the points illustrated according to their group of belonging, in the planes formed by
the pairs of variables, R shows all its power. The command used is "peers" the result is rich of
lessons: the variables are for the most part highly correlated, almost all pairs of variables make
it possible to distinguish the groups:
>pairs(res[,1:4],pch=21,bg=c("red","blue")[groupe])
Figure 26 Pairs Graph
 ACP / K-Means combination:
In order to find a tool allowing to locate the groups well, it is proposed to project the points in
the first factorial plane of the Analysis in Principal Components.
To do this, use the following command lines:
acp<- princomp(res.cr,cor=T,scores=T)
print(acp)
print(acp$sdev^2)
print(acp$loadings[,1]*acp$sdev[1])
plot(acp$scores[,1],acp$scores[,2],type="p",pch=21,col=c("red","blue")[groupe])
Figure 27 ACP / K-Means combination
The graph obtained is below:
Figure 28 Plot Acp Graph
CAH calculates and returns the distance matrix calculated using the specified distance
measurement to calculate the distances between the rows of a data array.
Figure 29 CAH
Now we are going to plot the dendogram to see how many clusters we have and how is their
repartition
Figure 30 Dendogram
To obtain a circle of correlations
We have also developed a function to eliminate NA values
plot(acp$scores[,1],acp$scores[,2],type="p",pch=21,col=c("red","blue")[groupe])
Figure 31 Circle of correlations Commands
2. Predictive analysis
 Linear regression :
Basic import, display of data table and change of column names.
Figure 33 Import Data
deleteNA=function(data)
{
for(i in 1: nrow(data))
{
if(is.na(data[i,1]))
{
data[-i,]
i=i+1
}
else if(is.na(data[i,2]))
{
data[-i,]
i=i+1
}
else if(is.na(data[i,3]))
{
data[-i,]
i=i+1
}
else if(is.na(data[i,4]))
{
data[-i,]
i=i+1
}
}
return (data)
}
type=deleteNA(res2)
Figure 32 function to eliminate NA values
Figure 34 Summary commands
Since we have a very large number of data and R is limited in number of lines we have chosen
10000 lines with this piece of code.
Figure 35 10000 lines selection command
We will then center and reduce the data in order to reconcile the points.
Figure 36 center and reduce
In the following, we will apply the function lm which makes it possible to calculate the
regression
Linear of a numerical dependent variable as a function of the explanatory variables.
In our case, the function lm is applied to the target variable which is the target.
 Predictionusinglinearregression :
Figure 37 Prediction using linear regression
Two coefficients of the different characteristics were found.
Two coefficients of the different characteristics were found....
Target = 0.01044276292 – 0.0007731274 * nb_jours_mco +0.0155397031 * nb_jours_total
with :
α 0 : 0.01044276292: cette variable n’est pas prise en considération.
α 1 = -0.0007731274 : le coefficient de la variable nb_jours_mco.
α 2= 0.0155397031: le coefficient de la variable nb_jours_total.
To better explain the result found above, and to deduce a more reliable conclusion we will
apply the reduced model.
The criterion used for the selection of the model will be used. This criterion is obtained by using
the AIC function (object, k =?), with the object it is the model and k by default equal to 2.
Figure 38 StepAic
The best model is the one with the lowest AIC, which is in our case AIC -39769.93.
Hence obtaining a reduced model composed only of a variable which is: nb_jours_total.
The reduced model neglected nb_jours_mco, we deduce the new target function:
Target = 0.01044276292 + 0.0155397031 * Total number of days.
with:
β 0 : 0. 01044276292
β 1 : 0.0155397031 : le coefficient de la variable nb_jours_total.
Note: The higher the value of the coefficient of the zero plus the variable is relevant.
For a better comparison, we will apply the function cbind which allows the concatenation of
the two vectors data_paa1 $ m which is the learning warehouse and the model m which is the
reduced model, in order to see if the model is close to the model d 'learning
Figure 39 cbind command
This function allowed us to compare the general model with the model each time.
It is noted that the target values of the two models are very close, confirming what we have
already mentioned that an appropriate model can be derived according to the general model.
Figure 40 Residuals vs Filtted Graph
It can be seen from this figure that the plot function gives us a more or less linear point cloud
form, we can extract the lines of atypical (non-standard) targets that are very far from the right
such as: Line 9395, 1067 and 9067
By clicking on Enter we get other clouds of points.
Figure 41 Normal Q-Q graph
It is also noted that the line of target 3950 is an atypical model.
 Neural network
library(readxl)
anap=read.csv("C:/Users/LENOVO/Documents/Visual Studio
2010/Projects/ANAP_2020_Final/Data/data2.csv",sep=";",dec=".")
library(MASS)
str(anap)
anap=anap[0:1000,c(5:8,9)]
library(nnet)
#options du réseau :size = 2, rang = 0.1, decay=1, maxit=500, package: nnet
res=nnet(cible1 ~ ., data = anap, size = 2, rang = 0.1, decay=1, maxit=50);
res=nnet(cible1 ~ ., data = anap, size = 2, rang = 0.1, decay=1, maxit=500);
#on a essayer 50 itération et puis 500 itétrations les résultats converge vers 13.98
library(NeuralNetTools)
plotnet(res, struct = struct)
#prediction sur le résultat du réseau de neurones
pred.nnet<- predict(res,anap)
#confusion matrix /// on ne peux pas des données quanti
Mc<-table(anap$cible1, predict(res,anap));
class(Mc)
Mc
Figure 42 Neural network
Figure 43 Neural network Graph
Conclusion
The techniques and use of Data Mining leads us to use a set of algorithms, methods and
technologies in a specific area that is health in our case. In the decision-making process, it has
helped us to demonstrate the interest and potential of data exploitation in the service of a local
or national health policy and to give a concrete illustration of it A major factor in the adaptation
of the organization of the institutions to the increase in chronic pathologies, and led us to
promote a playful and attractive initiative to educate the stakes of the exploitation of health
data.
Thus, we have carried out the time analysis which will enable us to predict future achievements
using the target variable.
3. Time series analysis
In order to facilitate the data mining part, we decided to study the stability and the variation of
our data over time. Our goal in the Data Mining part is to predict the variable “cible 1”. To
accomplish this task, we devoted our efforts in studying the data and it’s correlation with our
target variable.
Connectingand importingdata from the data warehouse:
Figure 44 importing data commands
Successful Import of the 1048575 Data Warehouse Lines
 Correlation analysis
We want to measure the intensity of the link between our target variable and the quantitative
data.
Figure 46 Mcor Commands
We used the Pearson correlation to have standard deviations.
Figure 47 Pearson correlation
The results obtained show an increasing correlation between:
 Finess
 Année
 NBSejourTotal
 NBSjourMCO
 Age
Figure 45 Correlation analysis
Figure 48 corrplot comand
To support the method cor we use the library corrplot to schematize the result.
Figure 49 corrplot graph
The correlation between the target variable and age is clear, followed by a slight correlation
between the target variable and NBSejourTotal, as well as the target variable and
NBSejourMCO.
Figure 50 temporal series commands
In order to verify this correlation in time, we will create a temporal series. After obtaining the
interval of years from the Summary function we create our temporal series s1.
Figure 51 temporal series graph
We notice that our data have the same pace in time but we must test the Acf to see if our data
are correlated over time.
Figure 52 acf command
The autocorrelation of a series refers to the fact that in a temporal or spatial series, the
measurement of a phenomenon at a time t can be correlated with the preceding measurements.
Figure 53 graph series
We will focus on our target variable, we note that over time, there is no correlation between age
and our target variable, between NBSejourTotal and our target variable, on the other hand there
is a strong correlation NBSejourMCO and our target variable, in fact the blue line indicates the
critical threshold beyond which the autocorrelation is considered significant.
To proof that our series is very correlated with itself we carry out a plot in lag of 3 years.
Figure 54 correlogram graph
We obtained a correlogram which proves that our series is very correlated with itself and
therefore we must decompose it and study its seasonality.
Figure 55 Time series commands
Figure 56 decompose commands
We will create two time series, one for the target variable and the other for NBSejourMCO. We
will then decompose them.
Figure 57 Decomposition of additive time series
There is a trend and a seasonality for the time series of the target variable.
Figure 58 Decomposition of additive time series
There is a trend and a seasonality for the time series of the variable NBSéjourMCO.
Figure 59 Decomposition of additive time series
Demonstration of the correlation, seasonality and tendency for our two curves.
 Prevision
In this part, we will try to predict our target variable.
Figure 60 Prevision commands
We will use Arima from the Forecast library for prediction.
Figure 61 Summary(fit) result
With an AIC <0 the model is good.
We try to see if there is a correlation between our model and the residuals.
Figure 62 Acf residuals graph
No value that exceeds the blue line indicates the critical threshold beyond which the
autocorrelation is considered significant, hence no correlation with the residuals.
Figure 63 plot forecast
We use the box-type Box-Pierce because it is the most efficient algorithm with strongly
correlated data.
Figure 64 Forcats grap
Our target was predicted for the next two years (in blue) and the correct values from the test
dataset are in green.
1. Used technologies
R : During the Sprint «Data Mining and Temporary Series » we have used a «R » which
is a language and environment for statistical computing and graphics. It is a GNU project which
is similar to the S language and environment which was developed at Bell Laboratories
(formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be
considered as a different implementation of S. There are some important differences, but much
code written for S runs unaltered under R.
Chapter 5
Sprint Big Data
Chapter 5: Sprint Big Data
Chapter 6
Marketing Strategy
Chapter 6: Marketing strategy
Introduction
The economy is not without friction. Supply and demand are not so easily encountered that
consumers have to make efforts to look for goods that satisfy their needs, companies have to
find customers who value the goods they produce and produce. Marketing organizes this
meeting and facilitates the exchange of the transaction to the relationship.
Indeed, people often have a fairly narrow view of marketing. Many think, for example, that
marketing is limited to sales or advertising. But marketing brings together a much wider range
of activities that link an organization to its market. Marketing is also a managerial philosophy
that places the consumer at the heart of the company's concerns. With a marketing approach,
the company's success depends on understanding and satisfying the needs of the consumer.
Therefore, all our intention is based on the customer and his expectations while taking into
consideration all the key elements that will guarantee a total and reciprocal satisfaction of the
customer and our company.
In this chapter, we will show you a complete marketing study of our project.
1. Analyse of needs
We Note that costs for chronic diseases are very high, which is 5 times more than a normal
patient. And about 25 % of people with a chronic disease have some of activity restriction…
As we see in this figure chronic illness are not ready to decrease. In contrary of injuries and
communicable Disease they are in stabilization phase and we will see a decrease for 2020.
2. Strategic Analysis (SWOT)
SWOT (Strengths - Weaknesses - Opportunities - Threats) is a strategic analysis tool that
combines the study of the strengths and weaknesses of an organization, a territory, a sector ...
Opportunities and threats of its environment, in order to help define a development strategy.
The aim of the analysis is to take into account in the strategy both internal and external factors,
maximizing the potentials of the forces and opportunities and thus identify the key factors for
success and minimizing the effects of weaknesses and threats to gain the competitive advantage.
Figure 65 Strategic Analysis (SWOT)
Conclusion
Now, after this marketing study, we have a clear and concise plan of action that will allow us
on the one hand to position ourselves vis-à-vis our competitors in order to judge our product in
an objective way, On the other hand, to set well our objectives that we want to achieve by being
realistic so as not to be surprised at the end.
General Conclusion
Use the data available to the company to give them added value, this is the challenge of modern
business.
In this context, and in order to solve recurring problems in the process of decision making,
Anap 2020 initiated the project to build a data warehouse to allow the establishment of a reliable
and efficient decision-making system.
Throughout our work in design and construction, we tried to follow a mixed approach,
combining thus between two known approaches in the field of Data Warehouse, namely the
"Needs Analysis" approach and the approach "Data sources". This allowed to meet the
expectations and needs of users while making the most of data generated by operating systems
to anticipate unexpressed needs.
Finally, and before citing the prospects of the project, we can say that this project Anap 2020
has allowed us to acquire a very good experience and evolve in an area that was more us at least
unknown namely field of decision-making systems.
We can mention the following perspectives and developments:
 Follow the current deployment
 Extending the system to other operating systems including human resource.
 Making the report available on mobile portals defeated OS for that information is
relevant to wear anywhere and anytime.
Netographie
 https://www.datascience.net/fr/challenge/28/details
 http://www.anap.fr/participez-a-notre-action/toute-lactu/detail/actualites/challenge-
data-science-anap-atih-2020-mieux-anticiper-laugmentation-des-maladies-chroniques/
 https://fr.cloudera.com/
 https://cran.r-project.org/
  • 1. Rapport PEBI: Anap 2020 Realized by:  Ben Amor Hela  Gassab Ali  Gasmi Cyrine  Rekik Habib  Zaïbi Chaima College year:2016-2017 Projet Pe Bi
  • 2. Table of Content General Introduction..................................................................................................................... 7 Chapter 1...................................................................................................................................... 9 The Project Context....................................................................................................................... 9 Introduction:............................................................................................................................10 1. State of the problematic....................................................................................................10 2. Presentation of the host Organization................................................................................11 3. Case study :......................................................................................................................12 4. Solution............................................................................................................................12 Conclusion...............................................................................................................................13 Chapter 2.....................................................................................................................................14 Analysis and specification of requirements....................................................................................14 Introduction.............................................................................................................................15 1. Identification of Actors......................................................................................................15 2. Functional Requirements...................................................................................................15 3. Non Functional Requirements............................................................................................15 4. Methodology....................................................................................................................16 Conclusion...............................................................................................................................16 Chapter 3.....................................................................................................................................17 Sprint Business Intelligence ..........................................................................................................17 Introduction.............................................................................................................................18 1. Extracting data..................................................................................................................18 2. Design..............................................................................................................................20 3. Loading data.....................................................................................................................22 3.3. Description of FactAnap2020......................................................................................22 3.4. Description of Fact_Qualite........................................................................................22 3.5. Description of Fact_Finance........................................................................................23 3.6. 
Description of Fact_Process........................................................................................23 3.7. Description of Fact_RH...............................................................................................24 3.8. Description of Dimensions..........................................................................................24 4. Analytical Processing.........................................................................................................27 5. Reporting.........................................................................................................................27 6. Used technologies.............................................................................................................29 Conclusion...............................................................................................................................30
  • 3. Chapter 4.....................................................................................................................................31 Sprint Data Mining.......................................................................................................................31 Introduction.............................................................................................................................32 1. Descriptive analysis...........................................................................................................32 2. Predictive analysis.............................................................................................................40 Conclusion...............................................................................................................................47 3. Time series analysis...........................................................................................................48 Chapter 5.....................................................................................................................................60 Sprint Big Data.............................................................................................................................60 Chapter 6.....................................................................................................................................62 Marketing Strategy.......................................................................................................................62 Introduction.............................................................................................................................63 1. Strategic Analysis (SWOT)..................................................................................................63 Conclusion...............................................................................................................................65
  • 4. Table of Figures Figure 1 Methodology scrum ........................................................................................................16 Figure 2 El....................................................................................................................................19 Figure 3 EL Hopitaux_ODS ............................................................................................................19 Figure 4 EL Data_ODS...................................................................................................................19 Figure 5 Design ............................................................................................................................20 Figure 6 ETL .................................................................................................................................21 Figure 7 The structure of the Data WareHouse ..............................................................................21 Figure 8 FactAnap2020.................................................................................................................22 Figure 9 Fact_Qualite ...................................................................................................................23 Figure 10 Fact_Finance.................................................................................................................23 Figure 11 Fact_Process.................................................................................................................24 Figure 12 Fact_RH ........................................................................................................................24 Figure 13 Year Dimension.............................................................................................................25 Figure 14 RS Dimension................................................................................................................25 Figure 15 Age Dimension..............................................................................................................26 Figure 16 Patients origin Dimension..............................................................................................26 Figure 17 DA Dimension ...............................................................................................................26 Figure 18 Reporting 1...................................................................................................................28 Figure 19 Reporting 2...................................................................................................................28 Figure 20 Reporting 3...................................................................................................................28 Figure 21 Importing datafrom the DataWareHouse.......................................................................33 Figure 22 Descriptive statistics......................................................................................................33 Figure 23 Kmeans commands 1.....................................................................................................35 Figure 24 Kmeans commands 2.....................................................................................................35 Figure 25 Single matrix commands................................................................................................36 Figure 26 Pairs Graph 
...................................................................................................................37 Figure 27 ACP / K-Means combination...........................................................................................37 Figure 28 Plot Acp Graph..............................................................................................................38 Figure 29 CAH..............................................................................................................................38 Figure 30 Dendogram...................................................................................................................39 Figure 31 Circle of correlations Commands....................................................................................39 Figure 32 function to eliminate NA values......................................................................................40 Figure 33 Import Data ..................................................................................................................40 Figure 34 Summary commands.....................................................................................................41 Figure 35 10000 lines selection command .....................................................................................41 Figure 36 center and reduce.........................................................................................................41 Figure 37 Prediction using linear regression...................................................................................42 Figure 38 StepAic.........................................................................................................................43 Figure 39 cbind command ............................................................................................................44 Figure 40 Residuals vs Filtted Graph..............................................................................................45 Figure 41 Normal Q-Q graph.........................................................................................................46 Figure 42 Neural network.............................................................................................................46 Figure 43 Neural network Graph...................................................................................................47 Figure 44 importing data commands.............................................................................................48 Figure 45 Correlation analysis ......................................................................................................49 Figure 46 Mcor Commands...........................................................................................................49
  • 5. Figure 47 Pearson correlation.......................................................................................................49 Figure 48 corrplot comand............................................................................................................50 Figure 49 corrplot graph...............................................................................................................50 Figure 50 temporal series commands............................................................................................50 Figure 51 temporal series graph....................................................................................................51 Figure 52 acf command ................................................................................................................51 Figure 53 graph series ..................................................................................................................52 Figure 54 correlogram graph.........................................................................................................53 Figure 55 Time series commands ..................................................................................................53 Figure 56 decompose commands..................................................................................................53 Figure 57 Decomposition of additive time series............................................................................54 Figure 58 Decomposition of additive time series............................................................................55 Figure 59 Decomposition of additive time series............................................................................56 Figure 60 Prevision commands......................................................................................................56 Figure 61 Summary(fit) result........................................................................................................57 Figure 62 Acf residuals graph........................................................................................................57 Figure 63 plot forecast..................................................................................................................57 Figure 64 Forcats grap ..................................................................................................................58 Figure 65 Swot graph..........................................................................Error! Bookmark not defined.
  • 6. Abstract This document is a summary of our work in the framework of the integration project carried out within ESPRIT in conjunction with the company ANAP2020. We have been fortunate during this period to work on a BI project. The aim of this work is to create a decision-making layer to meet the needs expressed by ANAP2020 makers to gain visibility into the true state of purchases, etc... KEYWORDS: Data Warehouse, Business Intelligence, data integration, Open Source solutions
  • 7. General Introduction Business intelligence is a data driven decision-making. It is the practice of taking large amounts of corporate data and turning it to usable information. This practice enables companies to derive analysis that can be used to make profitable actions. The process of converting corporate data to usable information goes beyond data collection and crunching, into how companies can gain from big data and data mining. This means that business intelligence is not confined to technology, it includes the business processes and data analysis procedures that facilitate the collection of big data. It is time consuming and involves various factors such as data models, data sources, data warehouses, and business models among others. Big data and data mining are completely different concepts. However, both concepts involve the use of large data sets to handle the collection or reporting of data that helps businesses or clients make better decisions. However, the two concepts are used in two different elements of this operation. The term big data can be defined simply as large data sets that outgrow simple databases and data handling architectures. For example, data that cannot be easily handled in Excel spreadsheets may be referred to as big data. Data mining relates to the process of going through large sets of data to identify relevant or pertinent information. Businesses often collect large data sets that may be automatically collected. However, decision makers need access to smaller, more specific pieces of data and use data mining to identify specific data that may help their businesses make better leadership and management decisions. Various software packages and analytical tools can be used for data mining. The process can be automated or be done manually. Data mining allows individual workers to send specific queries for information to archives and databases so that they can obtain targeted or specific results.
  • 8. Many successful companies have invested large amounts of money in business intelligence’s tools and data warehousing technologies. They believe that the updated, accurate and integrated information from their supply chain, products and customers are essential to their survival. In the parts below, a detailed explanation will be presented to illustrate our BI solution to the ANAP 2020 project.
  • 10. Chapter 1: The project concept Introduction: Systems and information technologies have become the bases of modern organizations today. They touch, without exception, all activities and all areas of the business, both internally and externally. Current businesses found in these new information processing systems effective tools to innovate in terms of improving working methods. In this sense, a new professional environment has been created, a context that values both bodies by the quality of the management of contracts, and the richness of their information systems, particularly through the optimal exploitation the latter, thus reducing costs, delays and improving responsiveness and quality. It is in this light that ANAP 2020 expressed a need for setting up a platform of reporting: an optical objective, which is to predict the future goals and the means you will need to also identify market challenges the competitive pressures and evolving technologies. In this chapter we will start by presenting ANAP 2020 then we will present a study of an existing solution and finally specify the proposed solution to provide the most effective service. 1. State of the problematic The healthcare industry is on the brink of transformation. There are many reasons for that. Growth in the healthcare industry is at an all-time high, and healthcare organizations are seeking new ways to improve operating efficiency and reduce costs. The recent and unprecedented changes occurring in healthcare have sent organizations scrambling to extract critical information from the mountains of disparate data they possess so that they can drive optimal performance. That’s where a data warehouse comes in. A data warehouse is a must-have commodity for any organization seeking to do the following:  Understand and manage patient populations.  Support and defend clinical decisions.
  • 11.  Allocate scarce resources.  Reduce waste.  Improve quality of care.  Optimize clinical, financial and operational performance. In this context, ANAP has a large mass of data, from heterogeneous sources which are hard to interpret and to understand by decision makers. 2. Presentation of the host Organization ANAP is a public agency that assists health and medical-social institutions in improving the service provided to patients and users by developing recommendations and tools to optimize their management and organization. ATIH is a public agency responsible for collecting and analyzing health facility data: activity, costs, organization and quality of care, finances, human resources... It carries out studies on hospital costs and participates in the mechanisms Institutions. ANAP and ATIH, within the framework of their respective missions, agree on an essential fact: a considerable amount of data is produced by the different actors of health. This wealth and the variety of information collected open up great prospects for exploitation in terms of understanding the health system and its evolution. Opening the data can make it possible to multiply the operating and analysis capacity and thus to make the most of the richness of this information. The ANAP-ATIH 2020 Project is an opportunity to raise awareness on this issue and it has three main objectives: ● Promote a fun and attractive initiative to teach the issues of the exploitation of health data, ● Demonstrate the value and potential of data exploitation for health policy at a local or
  • 12. national level and provide a concrete illustration of this major issue of adapting the organization of The increase of chronic pathologies, ● Explore the ability of data scientists from different backgrounds to contribute to the resolution of problems based on better exploitation of data. 3. Case study: Analyzing existing similar projects is crucial in order to better understand the context and start implementing our solution. In this part we have chosen to work on LifeBridge Health. LifeBridge Health is a regional health care organization based in northwest Baltimore and its surrounding counties. LifeBridge Health consists of Sinai Hospital of Baltimore, Northwest Hospital, Carroll Hospital, Levindale Hebrew Geriatric Center and Hospital, LifeBridge Health & Fitness, hundreds of primary care and specialty physicians throughout the region, and many affiliated health-related partners. As one of the largest, most comprehensive and most highly respected providers of health-related services to the people of the northwest Baltimore region, LifeBridge Health advocates preventive services, wellness and fitness services, and programs to educate and support the communities it serves. Back in 2008, LifeBridge was one of the first healthcare providers to adopt Cerner's PowerInsight data warehouse solution, which is based on SAP BusinessObjects. 4. Solution In order to solve ANAP 2020’s problems we established a business intelligence solution after studying the project and its available different sorts of data. During this part of the project we: ● Started by centralizing the data ● Assisted decision-makers in decision-making
  • 13. Conclusion We tried using this chapter to provide an overview of the context of our project and the solution we proposed. In the next chapter, we will present analysis and specification of requirements.
  • 14. Chapter 2 Analysis and specification of requirements
  • 15. Chapter 2: Analysis and specification of the requirements Introduction This chapter is dedicated for detailing features, designing and realizing our solution. We will illustrate also the methodology adopted for the establishment of the project. 1. Identification of Actors The actors of a system are all stakeholders who interact directly with the system according to their roles, so they can bypass the working interface they decide on action. We have two main actors: - An administrator :  Able to maintain the solution in case of breakdown  Tracks the sources of errors  Has the ability to control decision makers at different analyzes. - A decision maker:  Consults the dashboard of the solution to have a decision support. 2. Functional Requirements Process requirements describe what our solution do. Process requirements relate the entities and attributes from the data requirements to the users’ needs. State the functional process requirements in a manner that enables the reader to see broad concepts decomposed into layers of increasing detail. The key of Anap2020 is forecast the medium-term evolution of the importance of chronic disease management for health care facilities. To do this, we have use historical data available in OpenData, with the Hospi Diag * and the ScanSanté * reporting platform in the forefront. 3. Non Functional Requirements The system we develop has to be:  Following the evolution: The evolution of information is a constraint that must be taken into account.  Operational: The system must be responsive and guarantee absolute reliability.  User friendly: The solution must be easy to use and to maintain
  • 16.  Efficient: It has to consume the minimum resources and give the greatest results.  Maintainable: It has to be easy to maintain in case of a breakdown. 4. Methodology We used the SCRUM methodology during the realization of our project. The Scrum approach to project management enables software development organizations to prioritize the work that matters most and break it down into manageable chunks. Scrum is about collaborating and communicating both with the people who are doing the work and the people who need the work done. It’s about delivering often and responding to feedback, increasing business value by ensuring that customers get what they actually want. Figure 1 Methodology scrum Conclusion The needs assessment is a step more than necessary in a Data Warehouse project. In fact, through this study, we can decide how to design the data warehouse architecture. The needs are identified, the Data Warehouse modeling can begin. This modeling will be in the next chapter.
  • 18. Chapter 3: Sprint Business intelligence Introduction Once we identified our requirements, actors and modeling method, we can begin by defining the concept of Business intelligence so simply business intelligence is a computer based technique used in spotting, digging-out, and analyzing business data. During this chapter, we are going to explain how we applied Business intelligence tools in the development of our project. 1. Extracting data We started by collecting ANAP’s data from its information systems. In order to preserve our data structure we used an “ODS”. An operational data store (or "ODS") is a database designed to integrate data from multiple sources for additional operations on the data. Unlike a master data store, the data is not passed back to operational systems. Because the data originate from multiple sources, the integration often involves cleaning, resolving redundancy and checking against business rules for integrity. An ODS is usually designed to contain low-level or atomic (indivisible) data (such as transactions and prices) with limited history that is captured "real time" or "near real time" as opposed to the much greater volumes of data stored in the data warehouse generally on a less-frequent basis. We used ODS tools in order to extract data from ANAP information system in order to storage data into dimensions to build our ODS.
  • 19. Figure 2 El Figure 3 EL Hopitaux_ODS Figure 4 EL Data_ODS
  • 20. 2. Design During this step we are going to supply our Data Warehouse. A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.  Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.  Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.  Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.  Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a d ata warehouse should never be altered. Figure 5 Design
  • 21. First of all, we created the Data Warehouse in SQL Server Management Studio to define the structure of each dimensions and facts. We used the fact constellation schema because we need an architecture that has multiple fact tables that shares many dimension tables. Figure 6 ETL Figure 7 The structure of the Data WareHouse
  • 22. 3. Loading data In this step, we are going to integrate, load and store data in the Data Warehouse. 3.3. Descriptionof FactAnap2020 This table contains the foreign keys of all the dimensions such asyear dimension, age, rs ( social reason) and patients origin dimension. It contains also the principle measures that our work is based on which are Total number of stay and number of MCO stay (Medicine, Surgery, Obstetrics). Figure 8 FactAnap2020 3.4. Descriptionof Fact_Qualite This table measures the means used to fight nosocomial infections in institutions, through four Indicators: ICALIN: Composite Indicator of Activities for the Control of Nosocomial Infections, ICSHA Indicator of Consumption of Hydro-Alcoholic Solutions, ICATB Index Composite of Good Use of Antimicrobials SURVISO Indicator of realization of a Surveillance of the Infections of the Operating Site.
  • 23. Figure 9 Fact_Qualite 3.5. Descriptionof Fact_Finance This table contains all the information of the financial indicators of hospitals (margin, debt, financial need ...). The following information is analyzed: Medical doctors, including doctors (except anesthetists), including Surgeons (excluding gynecologists and obstetricians), including Anesthetists, Obstetricians. Figure 10 Fact_Finance 3.6. Descriptionof Fact_Process This indicator compares the DMS of medicine from the institution to the standardized one of its case mix to which are applied the reference MDS of each medical GHM. It synthesizes the over or under performance of the medical organization of the institution in medicine (out-of-pocket).
  • 24. Figure 11 Fact_Process 3.7. Descriptionof Fact_RH This information makes it possible to have a global vision of the medical human resources of the institution, declined by discipline Figure 12 Fact_RH 3.8. Descriptionof Dimensions  Year Dimension: The year dimension is the only dimension that is systematically in any data warehouse, because in practice any data warehouse is a time series.
  • 25. Figure 13 Year Dimension  RS Dimension: The RS dimension is “social reason”. It is the dimension that contains the names of the hospitals and their Finess which is an Id. Figure 14 RS Dimension  Age Dimension: The age dimension tells if the age is above or under 75 year.
  • 26. Figure 15 Age Dimension  Patients origin Dimension: The provenance dimension contains the origin place of the patients. Figure 16 Patients origin Dimension  DA Dimension (Activity Domain): Figure 17 DA Dimension
  • 27. 4. Analytical Processing // manquante olap et calculate w ka ri9 5. Reporting Reporting means collecting and presenting data so that it can be analyzed. Reporting is the necessary prerequisite of analysis; as such, it should be viewed in light of the goal of making data understandable and ready for easy, efficient and accurate analysis.  Collecting and presenting data ready to be analyzed, including historical data that can be tracked over time.  Empowering end-users with the knowledge to become experts in their area of business  Having the underlying figures to back up actions and explain decisions
  • 28. Figure 18 Reporting 1 //ne9sin les interprétations Figure 19 Reporting 2 Figure 20 Reporting 3
  • 29. 6. Used technologies SQL Server Data Tool : It transforms database development by introducing a ubiquitous, declarative model that spans all the phases of database development inside Visual Studio. We used SSDT Transact-SQL design capabilities to build, debug, maintain, and refactor databases. We worked with a database project, and also directly with a connected database instance on or off-premise. SQL Server Integration Services : is a platform for building enterprise-level data integration and data transformations solutions. We used Integration Services to solve complex business problems by copying or downloading files, sending e-mail messages in response to events, updating data warehouses, cleaning and mining data, and managing SQL Server objects and data. SQL Server Analysis Server : is an online analytical and transactional processing (OLAP) and data mining tool in Microsoft SQL Server.We used SSAS as a tool by organizations to analyze and make sense of information possibly spread out across multiple databases, and in disparate tables or files. SQL Server Reporting Services : It contains a set of graphical and scripting tools that support the development and use of rich reports in a managed environment. The tool set includes development tools, configuration and administration tools, and report viewing tools. This topic gives a brief overview of each tool in Reporting Services and how it can be accessed. SQl Server Management Studio : is an integrated environment for managing any SQL infrastructure, from SQL Server to SQL Database. It provides tools to deploy, monitor, and upgrade the data-tier components, such as databases and data warehouses used in our applications, and to build queries and scripts. Power BI : is a Microsoft business analytics service that enabled for us to visualize and analyze data.
  • 30. Conclusion A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to congregate data from multiple sources into a single database so a single query engine can be used to present data also mainly to improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data and make decision–support queries easier to write
  • 32. Chapter 4: Sprint Data mining Introduction Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. It is an interdisciplinary subfield of computer science The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Our main goal is to: - Optimize the use of limited resources. - Partition data such that different classes or categories can be defined. 1. Descriptive analysis We start by importing data from the DataWareHouse based on an SQL query. install.packages("RODBC") library(RODBC) dbhandle <- odbcDriverConnect('driver={SQL Server};server=.;database=DW_Anap;trusted_connection=true') res2 <- sqlQuery(dbhandle, 'select TOP 100000 a.[Nombre de sejours/seances MCO des patients en ALD],a.[Nombre total de sejours/seances], a.Pk_Age,a.Pk_DA,a.PK_PP,p.P1,p.P10,p.P11,p.P12,p.P13,p.P14,p.P15,p.P2,p.P3,p.P4,p.P5,p.P6,p.P 7,p.P8,p.P9 ,q.Q1,q.Q2,q.Q3,q.Q4,q.Q5,q.Q6,q.Q7,q.Q8,q.Q9,q.Q10,q.Q11 ,r.CI_RH1,r.CI_RH2,r.CI_RH3,r.CI_RH4,r.CI_RH5,r.CI_RH6,r.CI_RH7,r.CI_RH8,r.CI_RH9,r.CI_R H10,r.CI_RH11 ,r.RH1,r.RH2,r.RH3,r.RH4,r.RH5,r.RH6,r.RH8,r.RH9,r.RH10 ,f.CI_F1_D,f.CI_F2_D,f.CI_F3_D,f.CI_F4_D,f.CI_F5_D,f.CI_F6_D,f.CI_F7_D,f.CI_F8_D,f.CI_F9_ D,f.CI_F10_D ,f.CI_F11_D,f.CI_F12_D,f.CI_F13_D,f.CI_F14_D,f.CI_F15_D,f.CI_F16_D,f.CI_F17_D,f.CI_F1_O,f. CI_F2_O,f.CI_F3_O,f.CI_F4_O,f.CI_F5_O,f.CI_F6_O,f.CI_F7_O,f.CI_F8_O,f.CI_F9_O,f.CI_F10_O ,f.CI_F11_O,f.CI_F12_O,f.CI_F13_O,f.CI_F14_O,f.CI_F15_O,f.CI_F16_O,f.CI_F17_O,f.F1_D,f.F2 _D,f.F3_D,f.F4_D,f.F5_D,f.F6_D,f.F7_D,f.F8_D,f.F9_D,f.F10_D ,f.F11_D,f.F12_D,f.F1_O,f.F2_O,f.F3_O,f.F4_O,f.F5_O,f.F6_O,f.F7_O,f.F8_O,f.F9_O,f.F10_O,a.PK _annee,a.PK_RS ,f.F11_O,f.F12_O ,an.Annee,ag.[Age (1 >75 ans,0 <= 75 ans)],rss.rs,da.[Domaines d activites] from dbo.DimDA,dbo.Dim_RS,dbo.DimAge,dbo.Dim_Annee,dbo.Fact_ANAP2020 as a inner join dbo.FactProcess p on a.PK_annee=p.FK_annee and a.PK_RS=p.FK_RS
  • 33. inner join dbo.FactQualitee q on a.PK_annee=q.FK_annee and a.PK_RS=q.FK_RS inner join dbo.RH1 r on a.PK_annee=r.FK_annee and a.PK_RS=r.FK_RS inner join dbo.FactFinances f on a.PK_annee=f.FK_annee and a.PK_RS=f.FK_RS inner join dbo.DimDA da on da.Pk_DA = a.Pk_DA inner join dbo.Dim_RS rss on rss.PK_RS = a.PK_RS inner join dbo.DimAge ag on ag.Pk_Age = a.Pk_Age inner join dbo.Dim_Annee an on an.PK_annee = a.PK_annee' Figure  Provide descriptive statistics: Figure 22 Descriptive statistics As we can see on the picture below, we have 5 quantitatifs variables and 4 qualitative variables. Each quantitative values have a maximum value and a minimum value.  Clustering: Clustering is the process of making a group of abstract objects into classes of similar objects. In order to center and reduce the data, we must start by constructing a Function "centrage_reduction" that centers and reduces a column, applied to all active (quantitative) variables with apply (...... ..) Figure 21 Importing data from the DataWareHouse
  • 34. To do this, we propose to perform the following column standardization function: centrage_reduction<- function(x) { return((x-mean(x))/sqrt(var(x))) } Obtaining the centred and reduced data table res.cr <- apply(res[,1:4],2,centrage_reduction) apply(res.cr,2,mean) apply(res.cr,2,var)  Interpretation of column mean and variance: The closer the variance is, the more homogeneous the population is. In our case the variance = 1 so we can do the classification and get the desired results. (If the variance is close to 0, it may be concluded that the population is heterogeneous.)  K-Means: The K meansinRaimsto divide the dataintogroups(classes)soastominimize the distancesbetween the points and the centers of each class. It takes as argument a database and the desired number of groups, it implements an algorithm to arrive atthisclassification.The centersare randomlyprojectedandthe distance separatingthemfrom each of the cloudpointsiscalculatedandthe pointsare groupedaccordingto the distance separating them from each center. Now, the K-Means algorithm is started on the centred and reduced variables. We propose to design a partition of two groups, limited to 40 iterations.  The principle of using the function "kmeans (...)"
  • 35. Figure 23 Kmeans commands 1 Figure 24 Kmeans commands 2
  • 36.  Interpretation of the membership groups: For the interpretation of the groups, the conditional averages of the original active variables are calculated. They are collected in a single matrix using the following commands: Figure 25 Single matrix commands To project the points illustrated according to their group of belonging, in the planes formed by the pairs of variables, R shows all its power. The command used is "peers" the result is rich of lessons: the variables are for the most part highly correlated, almost all pairs of variables make it possible to distinguish the groups: >pairs(res[,1:4],pch=21,bg=c("red","blue")[groupe])
  • 37. Figure 26 Pairs Graph  ACP / K-Means combination: In order to find a tool allowing to locate the groups well, it is proposed to project the points in the first factorial plane of the Analysis in Principal Components. To do this, use the following command lines: acp<- princomp(res.cr,cor=T,scores=T) print(acp) print(acp$sdev^2) print(acp$loadings[,1]*acp$sdev[1]) plot(acp$scores[,1],acp$scores[,2],type="p",pch=21,col=c("red","blue")[groupe]) Figure 27 ACP / K-Means combination
  • 38. The graph obtained is below: Figure 28 Plot Acp Graph CAH calculates and returns the distance matrix calculated using the specified distance measurement to calculate the distances between the rows of a data array. Figure 29 CAH
  • 39. Now we are going to plot the dendogram to see how many clusters we have and how is their repartition Figure 30 Dendogram To obtain a circle of correlations We have also developed a function to eliminate NA values plot(acp$scores[,1],acp$scores[,2],type="p",pch=21,col=c("red","blue")[groupe]) Figure 31 Circle of correlations Commands
  • 40. 2. Predictive analysis  Linear regression : Basic import, display of data table and change of column names. Figure 33 Import Data deleteNA=function(data) { for(i in 1: nrow(data)) { if(is.na(data[i,1])) { data[-i,] i=i+1 } else if(is.na(data[i,2])) { data[-i,] i=i+1 } else if(is.na(data[i,3])) { data[-i,] i=i+1 } else if(is.na(data[i,4])) { data[-i,] i=i+1 } } return (data) } type=deleteNA(res2) Figure 32 function to eliminate NA values
  • 41. Figure 34 Summary commands Since we have a very large number of data and R is limited in number of lines we have chosen 10000 lines with this piece of code. Figure 35 10000 lines selection command We will then center and reduce the data in order to reconcile the points. Figure 36 center and reduce In the following, we will apply the function lm which makes it possible to calculate the regression Linear of a numerical dependent variable as a function of the explanatory variables. In our case, the function lm is applied to the target variable which is the target.  Predictionusinglinearregression :
  • 42. Figure 37 Prediction using linear regression Two coefficients of the different characteristics were found. Two coefficients of the different characteristics were found.... Target = 0.01044276292 – 0.0007731274 * nb_jours_mco +0.0155397031 * nb_jours_total with : α 0 : 0.01044276292: cette variable n’est pas prise en considération. α 1 = -0.0007731274 : le coefficient de la variable nb_jours_mco. α 2= 0.0155397031: le coefficient de la variable nb_jours_total. To better explain the result found above, and to deduce a more reliable conclusion we will apply the reduced model. The criterion used for the selection of the model will be used. This criterion is obtained by using the AIC function (object, k =?), with the object it is the model and k by default equal to 2.
  • 43. Figure 38 StepAic The best model is the one with the lowest AIC, which is in our case AIC -39769.93. Hence obtaining a reduced model composed only of a variable which is: nb_jours_total. The reduced model neglected nb_jours_mco, we deduce the new target function: Target = 0.01044276292 + 0.0155397031 * Total number of days. with: β 0 : 0. 01044276292 β 1 : 0.0155397031 : le coefficient de la variable nb_jours_total. Note: The higher the value of the coefficient of the zero plus the variable is relevant. For a better comparison, we will apply the function cbind which allows the concatenation of the two vectors data_paa1 $ m which is the learning warehouse and the model m which is the
  • 44. reduced model, in order to see if the model is close to the model d 'learning Figure 39 cbind command This function allowed us to compare the general model with the model each time. It is noted that the target values of the two models are very close, confirming what we have already mentioned that an appropriate model can be derived according to the general model.
  • 45. Figure 40 Residuals vs Filtted Graph It can be seen from this figure that the plot function gives us a more or less linear point cloud form, we can extract the lines of atypical (non-standard) targets that are very far from the right such as: Line 9395, 1067 and 9067 By clicking on Enter we get other clouds of points.
  • 46. Figure 41 Normal Q-Q graph It is also noted that the line of target 3950 is an atypical model.  Neural network library(readxl) anap=read.csv("C:/Users/LENOVO/Documents/Visual Studio 2010/Projects/ANAP_2020_Final/Data/data2.csv",sep=";",dec=".") library(MASS) str(anap) anap=anap[0:1000,c(5:8,9)] library(nnet) #options du réseau :size = 2, rang = 0.1, decay=1, maxit=500, package: nnet res=nnet(cible1 ~ ., data = anap, size = 2, rang = 0.1, decay=1, maxit=50); res=nnet(cible1 ~ ., data = anap, size = 2, rang = 0.1, decay=1, maxit=500); #on a essayer 50 itération et puis 500 itétrations les résultats converge vers 13.98 library(NeuralNetTools) plotnet(res, struct = struct) #prediction sur le résultat du réseau de neurones pred.nnet<- predict(res,anap) #confusion matrix /// on ne peux pas des données quanti Mc<-table(anap$cible1, predict(res,anap)); class(Mc) Mc Figure 42 Neural network
  • 47. Figure 43 Neural network Graph Conclusion The techniques and use of Data Mining leads us to use a set of algorithms, methods and technologies in a specific area that is health in our case. In the decision-making process, it has helped us to demonstrate the interest and potential of data exploitation in the service of a local or national health policy and to give a concrete illustration of it A major factor in the adaptation of the organization of the institutions to the increase in chronic pathologies, and led us to promote a playful and attractive initiative to educate the stakes of the exploitation of health data. Thus, we have carried out the time analysis which will enable us to predict future achievements using the target variable.
  • 48. 3. Time series analysis In order to facilitate the data mining part, we decided to study the stability and the variation of our data over time. Our goal in the Data Mining part is to predict the variable “cible 1”. To accomplish this task, we devoted our efforts in studying the data and it’s correlation with our target variable. Connectingand importingdata from the data warehouse: Figure 44 importing data commands
  • 49. Successful Import of the 1048575 Data Warehouse Lines  Correlation analysis We want to measure the intensity of the link between our target variable and the quantitative data. Figure 46 Mcor Commands We used the Pearson correlation to have standard deviations. Figure 47 Pearson correlation The results obtained show an increasing correlation between:  Finess  Année  NBSejourTotal  NBSjourMCO  Age Figure 45 Correlation analysis
  • 50. Figure 48 corrplot comand To support the method cor we use the library corrplot to schematize the result. Figure 49 corrplot graph The correlation between the target variable and age is clear, followed by a slight correlation between the target variable and NBSejourTotal, as well as the target variable and NBSejourMCO. Figure 50 temporal series commands In order to verify this correlation in time, we will create a temporal series. After obtaining the interval of years from the Summary function we create our temporal series s1.
  • 51. Figure 51 temporal series graph We notice that our data have the same pace in time but we must test the Acf to see if our data are correlated over time. Figure 52 acf command
  • 52. The autocorrelation of a series refers to the fact that in a temporal or spatial series, the measurement of a phenomenon at a time t can be correlated with the preceding measurements. Figure 53 graph series We will focus on our target variable, we note that over time, there is no correlation between age and our target variable, between NBSejourTotal and our target variable, on the other hand there is a strong correlation NBSejourMCO and our target variable, in fact the blue line indicates the critical threshold beyond which the autocorrelation is considered significant.
  • 53. To proof that our series is very correlated with itself we carry out a plot in lag of 3 years. Figure 54 correlogram graph We obtained a correlogram which proves that our series is very correlated with itself and therefore we must decompose it and study its seasonality. Figure 55 Time series commands Figure 56 decompose commands
  • 54. We will create two time series, one for the target variable and the other for NBSejourMCO. We will then decompose them. Figure 57 Decomposition of additive time series There is a trend and a seasonality for the time series of the target variable.
  • 55. Figure 58 Decomposition of additive time series
  • 56. There is a trend and a seasonality for the time series of the variable NBSéjourMCO. Figure 59 Decomposition of additive time series Demonstration of the correlation, seasonality and tendency for our two curves.  Prevision In this part, we will try to predict our target variable. Figure 60 Prevision commands We will use Arima from the Forecast library for prediction.
  • 57. Figure 61 summary(fit) result
The lower the AIC, the better the model; the negative AIC obtained here indicates a good fit. We then check whether the residuals of our model are correlated.
Figure 62 acf residuals graph
No value exceeds the blue lines, which mark the critical threshold beyond which autocorrelation is considered significant; hence the residuals are not correlated.
Figure 63 Plot forecast
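A sketch of the residual diagnostics described here and in the Box-Pierce paragraph that follows, reusing the fitted model from above (the lag of 10 is an assumption):

    Acf(residuals(fit))                                      # no bar should cross the significance bounds
    Box.test(residuals(fit), lag = 10, type = "Box-Pierce")  # a large p-value means no residual autocorrelation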
  • 58. We use the Box-Pierce variant of the Box test because it is well suited to strongly correlated data.
Figure 64 Forecast graph
Our target was predicted for the next two years (in blue), and the correct values from the test dataset are shown in green.
1. Used technologies
R: During the sprint «Data Mining and Time Series», we used R, which is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S; there are some important differences, but much code written for S runs unaltered under R.
  • 61. Chapter 5: Sprint Big Data
  • 63. Chapter 6: Marketing strategy
Introduction
The economy is not without friction. Supply and demand do not meet effortlessly: consumers have to make an effort to find goods that satisfy their needs, and companies have to find customers who value the goods they produce. Marketing organizes this meeting and facilitates the exchange, from the transaction to the relationship. Indeed, people often have a fairly narrow view of marketing: many think, for example, that it is limited to sales or advertising. But marketing brings together a much wider range of activities that link an organization to its market. Marketing is also a managerial philosophy that places the consumer at the heart of the company's concerns: with a marketing approach, the company's success depends on understanding and satisfying the needs of the consumer. Therefore, all our attention is focused on the customer and his expectations, while taking into consideration all the key elements that will guarantee total and reciprocal satisfaction for the customer and our company. In this chapter, we present a complete marketing study of our project.
1. Analysis of needs
  • 64. We note that the costs of chronic diseases are very high, five times more than for a normal patient, and that about 25% of people with a chronic disease live with some form of activity restriction…
As the figure shows, chronic illnesses are not about to decrease. In contrast, injuries and communicable diseases are in a stabilization phase, and a decrease is expected by 2020.
2. Strategic Analysis (SWOT)
SWOT (Strengths - Weaknesses - Opportunities - Threats) is a strategic analysis tool that combines the study of the strengths and weaknesses of an organization, a territory or a sector with the opportunities and threats of its environment, in order to help define a development strategy. The aim of the analysis is to take both internal and external factors into account in the strategy, maximizing the potential of the strengths and opportunities, identifying the key success factors, and minimizing the effects of weaknesses and threats in order to gain a competitive advantage.
  • 65. Figure 65 Strategic Analysis (SWOT)
Conclusion
After this marketing study, we now have a clear and concise plan of action that allows us, on the one hand, to position ourselves with respect to our competitors and judge our product objectively and, on the other hand, to set the objectives we want to achieve while remaining realistic, so as not to be caught off guard at the end.
  • 66. General Conclusion
Using the data available to a company to give it added value: this is the challenge of modern business. In this context, and in order to solve recurring problems in the decision-making process, Anap 2020 initiated this project to build a data warehouse and so allow the establishment of a reliable and efficient decision-making system. Throughout our design and construction work, we followed a mixed approach, combining two well-known approaches in the Data Warehouse field, namely the "needs analysis" approach and the "data sources" approach. This allowed us to meet the expectations and needs of users while making the most of the data generated by the operational systems, so as to anticipate unexpressed needs. Finally, before citing the prospects of the project, we can say that the Anap 2020 project allowed us to acquire very good experience and to evolve in an area that was previously largely unknown to us, namely decision-making systems. We can mention the following perspectives and developments:
 Follow up on the current deployment
 Extend the system to other operational domains, including human resources
 Make the reports available on mobile portals, regardless of OS, so that the relevant information can be carried anywhere and anytime