Rapport pebi-anap-atih-2020
1. P a g e 1 | 51
Summary
General Introduction .................................................................................................................. 6
Project Context........................................................................................................................... 7
Introduction ............................................................................................................................ 7
Presentation of the organization............................................................................................. 7
ANAP................................................................................................................................. 7
ATIH .................................................................................................................................. 8
Presentation of the project...................................................................................................... 8
ANAP-ATIH2020 .............................................................................................................. 8
Proposed solution ................................................................................................................... 9
Methodology .......................................................................................................................... 9
Presentation of the Scrum method ..................................................................................... 9
Scrum roles....................................................................................................................... 10
Road map.............................................................................................................................. 11
Conclusion............................................................................................................................ 11
Analysis.................................................................................................................................... 12
Introduction .......................................................................................................................... 12
Functional requirements....................................................................................................... 12
Non-Functional Requirements ............................................................................................. 12
Actor identification .............................................................................................................. 13
Use case diagram.................................................................................................................. 14
Conclusion............................................................................................................................ 15
ETL........................................................................................................................................... 16
Introduction .......................................................................................................................... 16
Data Source .......................................................................................................................... 16
Dimensional modeling ......................................................................................................... 16
Choice of the model ............................................................................................................. 17
Realization........................................................................................................... 19
Used tools......................................................................................................................... 19
Extract Transform Load ................................................................................................... 20
Operational Data Store ODS............................................................................................ 20
Data warehouse ................................................................................................................ 21
OLAP ....................................................................................................................................... 24
Introduction .......................................................................................................................... 24
Used tools............................................................................................................................. 24
Schema Workbench.......................................................................................................... 24
Mondrian .......................................................................................................................... 24
Realization............................................................................................................................ 25
Reporting.................................................................................................................................. 27
Introduction .......................................................................................................................... 27
BI and Reporting .................................................................................................................. 27
Performance ..................................................................................................................... 27
Decision making............................................................................................................... 27
Delivery............................................................................................................................ 27
Used tools............................................................................................................................. 28
QlikView .......................................................................................................................... 28
JasperReports ................................................................................................................... 28
Saiku................................................................................................................................. 29
Realization............................................................................................................................ 29
Reporting using QlikView ............................................................................................... 29
KPI ................................................................................................................................... 32
Reporting using Jaspersoft ................................................................................................... 33
Conclusion............................................................................................................................ 39
Data mining.............................................................................................................................. 40
Introduction .......................................................................................................................... 40
Objective .............................................................................................................................. 40
Used tools............................................................................................................................. 40
RStudio............................................................................................................................. 40
Realization............................................................................................................................ 41
Unsupervised methods ..................................................................................................... 41
Supervised methods.......................................................................................................... 42
Generalized linear model ................................................................................................. 42
Random Forests................................................................................................................ 43
GBM: Gradient boosting model....................................................................................... 43
Model 2 GBM .................................................................................................................. 43
Deep Learning.................................................................................................................. 43
Conclusion............................................................................................................................ 43
Big Data.................................................................................................................................... 44
Introduction .......................................................................................................................... 44
Big Data................................................................................................................................ 44
Used Tools............................................................................................................................ 45
Apache Tomcat ................................................................................................................ 45
Realization...........................................................................................................
General Conclusion.................................................................................................................. 51
Illustrations
Figure 1: ANAP logo ................................................................................................................. 7
Figure 2: ATIH logo................................................................................................................... 8
Figure 3: Scrum method ........................................................................................................... 10
Figure 4: Road map.................................................................................................................. 11
Figure 5: Use case diagram ...................................................................................................... 14
Figure 6: Hospidiag interface................................................................................................... 16
Figure 7: Data warehouse schema............................................................................................ 18
Figure 8: Postgres logo............................................................................................................. 19
Figure 9: Pentaho logo ............................................................................................................. 20
Figure 10: Charging of table HD.............................................................................................. 21
Figure 11: Charging domaine activite dimension .................................................................... 21
Figure 12: Data transformation ................................................................................................ 22
Figure 13: fact_data charging................................................................................................... 22
Figure 14: fact HD charging..................................................................................................... 23
Figure 15: Casting data ............................................................................................................ 23
Figure 16: Charging data warehouse job.................................................................................. 23
Figure 17: Mondrian logo ........................................................................................................ 25
Figure 18: Cube creation with schema workbench .................................................................. 25
Figure 19: XML schema generated with Schema Workbench................................................. 26
Figure 20: QlikView logo......................................................................................................... 28
Figure 21: JasperSoft logo........................................................................................................ 28
Figure 22: Saiku logo ............................................................................................................... 29
Figure 23: Histogram of number of stay/ session .................................................................... 29
Figure 24: Pie chart of the number of ALD sessions................................................................ 30
Figure 25: Histogram of number of ALD sessions by establishment category........................ 31
Figure 26: Number of ALD sessions at EBNL ........................................................................ 31
Figure 27: Histogram of number of beds by region.................................................................. 32
Figure 28: KPIs ........................................................................................................................ 33
Figure 29: Report with JasperSoft ............................................................................................ 33
Figure 30: Top 10 establishments according to number of ALD sessions............................... 34
Figure 31: AP-HP-Seine-Saint-Denis activity domains........................................................... 35
Figure 32: AP-HP-Val-de-Marne activity domains.................................................................. 35
Figure 33: AP-HP-Hauts-de-Seine activity domains................................................................ 36
Figure 34: AP-HP-Paris activity domains ................................................................................ 36
Figure 35: Number of ALD sessions by establishment category............................................. 37
Figure 36: number of ALD sessions by age............................................................................. 38
Figure 37: TOP3-Number-Chemo-Sessions-Radio-Hemodialysis -childbirth ........................ 38
Figure 38: R studio logo........................................................................................................... 41
Figure 39: Rules with arulesViz............................................................................................... 41
Figure 40: Presentation of a rule .............................................................................................. 42
Figure 41: Apache logo............................................................................................................ 46
Figure 43: Fetching data from Twitter...................................................................................... 46
Figure 44: Fetching configuration............................................................................................ 47
Figure 45: Fetched data............................................................................................................ 47
Figure 46: Table creation and inserting data into it ................................................................. 48
Figure 47: Used words after the Real Madrid Vs Celta Vigo game ........................................ 48
Figure 48: Activity on the Real Madrid Facebook page .......................................................... 49
Figure 49: Activity on the FC Barcelone Facebook page......................................................... 50
General Introduction
Harnessing the full potential of data requires an organization-wide strategy for
organizing that data, in addition to a data science strategy. Such strategies are now commonplace
in industries such as banking and retail: banks can offer their customers targeted, needs-based
services and improved fraud protection because they collect and analyze transactional
data.
Furthermore, as healthcare organizations face increasing demand for healthcare
services, harnessing data and analytics helps organizations improve patient care, manage
chronic disease, apply adaptive treatments and reduce costs.
Healthcare is the maintenance or improvement of health via the diagnosis, treatment,
and prevention of disease, illness, injury, and other physical and mental impairments in human
beings. This field generates a huge amount of data that needs to be analyzed.
Health care is a glaring exception. Individual pieces of data can have life-or-death
importance, but many organizations fail to aggregate data effectively to gain insights into wider
care processes. Without a data science strategy, health care organizations can’t draw on
increasing volumes of data and medical knowledge in an organized, strategic way, and
individual clinicians can’t use that knowledge to improve the safety, quality, and efficiency of
the care they provide.
Hence the need to collect data and, from that collected data, the need to make good
decisions. Making good decisions requires all relevant data to be taken into
consideration.
If an organization tries to aggregate and analyze poor-quality data, it may derive useless
or even dangerous conclusions. Therefore, the data has to be structured. The best source for that
data is a well-designed data warehouse which will help in the process of making decisions,
analyzing data and even making predictions.
Project Context
1. Introduction
We cannot talk about an organization without talking about information. Nowadays,
every organization tries to have a system, such as an ERP, to manage its data.
The Hospidiag ERP is the result of cooperation between two organizations, ANAP
and ATIH.
In this chapter, we will present the ANAP-ATIH2020 project, but first we will
introduce the organizations behind it.
2. Presentation of the organization
ANAP
Created in 2009, the French National Agency for the Performance of Health and
Medico-Social Establishments (ANAP) works to improve performance within the framework
of the reform of the French health system. This public agency helps health and medico-social
establishments improve the service provided to patients and users by developing
recommendations and tools that allow them to optimize their management and their
organization.
Figure 1: ANAP logo
ATIH
Established in 2000, the Technical Agency for Hospitalization Information (ATIH)
is in charge of collecting and analyzing data from health establishments: activity, costs,
organization and quality of care, finances, human resources, and more.
It carries out studies on hospital costs and participates in the mechanisms for financing
establishments.
Figure 2: ATIH logo
3. Presentation of the project
ANAP-ATIH2020
The ageing of the French population is accompanied by a significant increase in the
number of people living with chronic diseases. The care offering, which was mainly built
around short-term care, must now evolve to integrate the need for long-term follow-up.
The ANAP and the ATIH, within the framework of their respective missions, agree on
an essential observation: a considerable quantity of data is produced by the various actors in
health. The wealth and variety of the collected information open up great prospects for
exploitation in understanding the health system and its evolution.
Opening up the data can help unlock this capacity for exploitation and analysis and
thus make the most of the wealth of this information.
The ANAP-ATIH 2020 project is an opportunity to raise broad awareness of this
question, and it pursues three main objectives:
• To promote an engaging and attractive initiative that teaches the stakes involved in
exploiting health data.
• To demonstrate the interest and potential of exploiting data in the service of a health
policy at a local or national level, and to give a concrete illustration of the major
question of adapting the organization of establishments to the increase in chronic
pathologies.
• To explore the capacity of data scientists from diverse horizons to contribute to
solving problems that rest on a better exploitation of the data.
4. Proposed solution
Based on the identified needs of the ANAP-ATIH2020 project, we propose a solution to
structure the collected data in a data warehouse, in order not only to have quick and easy access
to it but also to ensure its quality and consistency. Then we will create a cube that allows fast
analysis of the data along multiple dimensions and that will be used to create reports with
multiple tools.
5. Methodology
Presentation of the Scrum method
A successful Scrum project is much about understanding what Scrum is. Therefore, we
will try to give an overview of Scrum, which will be our methodology during the project.
Scrum is a way for teams to work together to develop a product. Product development,
using Scrum, occurs in small pieces, with each piece building upon previously created pieces.
Building products one small piece at a time encourages creativity and enables teams to
respond to feedback and change, to build exactly and only what is needed. More specifically,
Scrum is a simple framework for effective team collaboration on complex projects. Scrum
provides a small set of rules that create just enough structure for teams to be able to focus their
innovation on solving what might otherwise be an insurmountable challenge.
However, Scrum is much more than a simple framework. Scrum supports our need to
be human at work: to belong, to learn, to do, to create and be creative, to grow, to improve, and
to interact with other people. In other words, Scrum leverages the innate traits and
characteristics in people to allow them to do great things together.
Figure 3: Scrum method
Scrum roles
Building complex products for customers is an inherently difficult task. Scrum provides
structure to allow teams to deal with that difficulty. However, the fundamental process is
incredibly simple, and at its core is governed by 3 primary roles.
• Product Owner: determines what needs to be built in the next 30 days or less.
• Team: builds what is needed in 30 days (or less), and then demonstrates what it
has built. Based on this demonstration, the Product Owner determines what to
build next.
• Scrum Master: ensures this process happens as smoothly as possible, and
continually helps improve the process, the team and the product being created.
While this is an incredibly simplified view of how Scrum works, it captures the essence of this
highly productive approach for team collaboration.
6. Road map
Figure 4: Road map
7. Conclusion
In this chapter, we briefly introduced the project by presenting the organizations
concerned by the challenge. The ANAP-ATIH2020 project is based on the collection of
information from French health establishments, which is to be structured and analyzed in order
to predict the medium-term evolution of the importance of chronic disease care for health
establishments.
Analysis
1. Introduction
In this chapter, we will start by presenting both the functional and non-functional
requirements of our project. Then, we will continue with the identification of the project's
actors, and finally we will conclude with a use case diagram.
2. Functional requirements
The system that we need has to be:
• Operational
• Scalable
• User friendly
• Offering the necessary information in real time.
For this reason, the system must perform well enough to fulfill the requirements of all users.
We present in the following paragraph all the functional requirements:
• Analyze the input Data
• Analyze the output Data
• Analyze the session time for each user
• Analyze the number of connected people
• User Status
3. Non-Functional Requirements
Non-functional requirements are constraints that, if not met, may prevent the application
from operating effectively and efficiently.
a. Capacity:
The total amount of data in the selected databases represents a very large volume. In
addition, processing it further increases this initial volume. Therefore, server capacity must be
sufficient to collect all these data, allow program execution, and safeguard updates over time.
b. Integrity:
At all levels of implementation of the data warehouse, different errors will be treated
including the following:
• The handling of bad data when importing data from the different source databases.
• In order to standardize the format of the collected data, it will be necessary to convert
two-dimensional structures into three-dimensional structures. This will be a source
of errors that must be taken into account.
• Referential integrity in the database tables.
c. Quality:
• Facilitating data access and dissemination of information.
• Reliability and traceability of data.
• Human Machine Interaction as intuitive as possible
d. Simplicity:
Since some users of the application do not necessarily have great knowledge of IT, the
functionality of the solution should be understandable and easy to handle. Indeed, navigation
through the different sections shall be designed so that users find their way easily; they must
feel in control at all times.
e. Performance:
The application must meet all user requirements in an optimal way. Application
performance translates into reduced access times for the different features and acceptable data
access times, given that a relatively large data warehouse is being handled.
f. Reliability:
It must ensure content quality and relevance of information.
g. Ergonomics:
The first thing that catches users' attention is ergonomics and ease of use, so special
attention should be given to this need.
4. Actor identification
An actor represents the abstraction of a role played by external entities that interact
directly with the system under study. An actor acts on the system; it can play a different role in
each use case in which it collaborates and is usually represented by a stick figure. An actor
characterizes an outside user, or a group of users, that interacts directly with the system. In our
system, we have identified two actors: the Administrator and the Decision Maker.
• The Administrator: generates the reports and cubes and publishes them as required by
the terms and conditions.
• The Decision Maker: analyzes the generated reports.
5. Use case diagram
A use case is a methodology used in system analysis to identify, clarify, and organize
system requirements. The use case is made up of a set of possible sequences of interactions
between systems and users in a particular environment and related to a particular goal.
It consists of a group of elements (for example, classes and interfaces) that can be used
together in a way that will have an effect larger than the sum of the separate elements combined.
The use case should contain all system activities that have significance to the users. A use case
can be thought of as a collection of possible scenarios related to a particular goal, indeed, the
use case and goal are sometimes considered to be synonymous.
Figure 5: Use case diagram
a. Create Report:
Pre-condition: none
Description: An Admin can create reports
Post-condition: the report has been successfully added
b. Save Report:
Pre-condition: none.
Description: Only an admin can save reports.
Post-condition: the report has been successfully saved.
c. Edit Reports:
Pre-condition: none.
Description: An admin can edit reports.
Post-condition: the report has been successfully edited.
d. Load Report:
Pre-condition: none.
Description: An admin or decision maker can load reports.
Post-condition: the report is loaded
6. Conclusion
During this chapter, we gave an overview of the project. In the chapters, we will introduce the
that we have established over the different sprints.
ETL
1. Introduction
ETL is short for Extract, Transform and Load. As the name hints, we’ll extract data from
one or more operational databases, transform it to fit in a warehouse structure, and load the data
into the DWH.
A data warehouse is a system used to store information for data analysis and
reporting, in which the data are integrated, non-volatile, and time-variant. But first, it is
essential to define its structure: before filling the data warehouse, its design must be settled.
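The ETL flow described above can be sketched in a few lines. This is a minimal illustration using Python and an in-memory SQLite database, not the actual jobs used in the project; the table and column names are hypothetical.

```python
import sqlite3

# Hypothetical rows extracted from an operational source (illustrative names,
# not the actual Hospidiag columns).
source_rows = [
    {"etablissement": "CH Example", "region": "11", "sejours": "120"},
    {"etablissement": "CH Example", "region": "11", "sejours": "n/a"},  # bad value
]

def transform(rows):
    """Keep only rows with a numeric measure and cast it to an integer."""
    clean = []
    for r in rows:
        if r["sejours"].isdigit():
            clean.append((r["etablissement"], r["region"], int(r["sejours"])))
    return clean

def load(rows, conn):
    """Create the target warehouse table if needed and insert the clean rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sejours "
                 "(etablissement TEXT, region TEXT, sejours INTEGER)")
    conn.executemany("INSERT INTO fact_sejours VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(source_rows), conn)
print(conn.execute("SELECT COUNT(*) FROM fact_sejours").fetchone()[0])  # 1
```

The real pipeline does the same three steps at a larger scale: extraction from the source files, transformation (cleaning, casting, standardizing), then loading into the warehouse tables.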
2. Data Source
The data used in our project is offered through the Hospidiag open data source.
Figure 6: Hospidiag interface
3. Dimensional modeling
Dimensional modeling is a part of data warehouse design and results in the creation of
a dimensional model. There are two types of tables involved:
• Dimension tables are used to describe the data we want to store. For example: a
retailer might want to store the date, store, and employee involved in a specific
purchase. Each dimension table is its own category (date, employee, store) and can
have one or more attributes. For each store, we can save its location at the city, region,
state and country level. For each date, we can store the year, month, day of the month,
day of the week, etc. This is related to the hierarchy of attributes in the dimension
table.
• Fact tables contain the data we want to include in reports, aggregated based on
values within the related dimension tables. A fact table has only columns that store
values and foreign keys referencing the dimension tables. Combining all the foreign
keys forms the primary key of the fact table. For instance, a fact table could store a
number of contacts and the number of sales resulting from these contacts.
• Star Schema: It has a single fact table connected to the dimension tables, like a star. In
a star schema, only one join establishes the relationship between the fact table and any
one of the dimension tables. A star schema has one fact table associated with
numerous dimension tables and depicts a star.
• Snowflake Schema: It is an extension of the star schema. In a snowflake schema, very
large dimension tables are normalized into multiple tables. It is used when a
dimension table becomes very big. Since there are relationships between the
dimension tables in a snowflake schema, many joins are needed to fetch the data.
Each dimension table can be associated with sub-dimension tables. Performance-wise,
the star schema is good; but if memory utilization is a major concern, then the
snowflake schema is better than the star schema.
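To make the comparison concrete, here is a minimal star schema sketched with Python's built-in sqlite3 module (the tables and values are illustrative, not the report's actual model): a single fact table whose foreign keys point straight at the dimension tables, so each dimension is reachable with one join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two dimension tables and one fact table referencing them: a minimal star schema.
cur.executescript("""
CREATE TABLE dim_date  (id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_store (id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE fact_sales (
    date_id  INTEGER REFERENCES dim_date(id),
    store_id INTEGER REFERENCES dim_store(id),
    amount   REAL
);
INSERT INTO dim_date  VALUES (1, 2020, 1), (2, 2020, 2);
INSERT INTO dim_store VALUES (1, 'Paris', 'France');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 1, 250.0);
""")

# One join per dimension is enough to aggregate facts by any of its attributes.
row = cur.execute("""
    SELECT s.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_store s ON s.id = f.store_id
    GROUP BY s.city
""").fetchone()
print(row)  # ('Paris', 350.0)
```

In a snowflake variant, `dim_store` would itself reference a normalized `dim_city` or `dim_country` table, and the same query would need an extra join per level of normalization.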
4. Choice of the model
When choosing a database schema for a data warehouse, snowflake and star schemas
tend to be popular choices.
Our choice is the star schema, simply because:
First of all, with the star model, dimension analysis is easier. In addition, we do not have
dimensions that are connected directly to each other, so it is unnecessary to use the snowflake
schema. Also, this model offers ease of use, with lower query complexity, and is easy to
understand; it gives better query performance thanks to a smaller number of foreign keys and
hence a shorter query execution time. Finally, the data model follows a top-down approach.
Figure 7: Data warehouse schema
Dim_domaine_activite: contains an id, the code and the label of an activity.
Dim_etablissement: contains an id and information about an establishment, such as its
name, its category (whether it is public or private) and its activity in medicine, surgery and
obstetrics.
Dim_tranche_age: contains an id; age is divided into two classes, under 75 and over 75.
Dim_region: contains an id, a region and a department.
Dim_cia: contains an id and different indicators.
DimTemps: a calendar date dimension is attached to virtually every fact table to allow navigation of the fact table through familiar dates, months, fiscal periods, and special days on the calendar. You would never want to compute Easter in SQL; you would rather look it up in the calendar date dimension. The calendar date dimension typically has many attributes describing characteristics such as week number, month name, fiscal period, and a national holiday indicator. To facilitate partitioning, the primary key of a date dimension can be more meaningful, such as an integer representing YYYYMMDD, instead of a sequentially assigned surrogate key. However, the date dimension table then needs a special row to represent unknown or to-be-determined dates. If a smart date key is used, filtering and grouping should be based on the dimension table's attributes, not on the smart key.
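A date dimension of this kind can be generated programmatically. The sketch below (with illustrative column names, not the project's real ones) builds a small calendar with a YYYYMMDD smart key plus the special "unknown date" row.

```python
import pandas as pd

# Sketch of a calendar date dimension with a "smart" integer key (YYYYMMDD).
dates = pd.date_range("2015-01-01", "2015-01-07", freq="D")
dim_temps = pd.DataFrame({
    "id_temps": dates.strftime("%Y%m%d").astype(int),  # e.g. 20150101
    "annee": dates.year,
    "mois": dates.month,
    "nom_mois": dates.strftime("%B"),
    "num_semaine": dates.isocalendar().week.to_numpy(),
})
# Special row representing unknown or to-be-determined dates.
unknown = {"id_temps": 19000101, "annee": 1900, "mois": 1,
           "nom_mois": "unknown", "num_semaine": 0}
dim_temps = pd.concat([dim_temps, pd.DataFrame([unknown])], ignore_index=True)
print(dim_temps.head())
```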
Fact_data: the first fact table; it contains all the foreign keys, and its measures are used to perform the calculations. It is directly related to the following dimensions: dim_domaine_activite, dim_temps, dim_etablissement, dim_provenance and dim_tranche_age. Its measures are the number of MCO stays/sessions of patients in ALD, the total number of stays/sessions, and cible1, which is the variable to predict.
Fact_hd: the second fact table; it also contains all the foreign keys, and its measures are used to perform the calculations. It is directly related to the following dimensions: dimTemps, dim_cia, dim_etablissement and dim_provenance, and it contains over 160 indicators.
5. Realization
Used tools
a. PostgreSQL
It is an object-relational database management system (ORDBMS) with an emphasis
on extensibility and standards-compliance. As a database server, its primary function is to store
data securely, supporting best practices, and to allow for retrieval at the request of other
software applications. It can handle workloads ranging from small single machine applications
to large Internet-facing applications with many concurrent users. Recent versions also provide
replication of the database itself for availability and scalability.
Figure 8: Postgres logo
b. Pentaho
Pentaho is a business intelligence (BI) software company that offers open source products providing data integration, OLAP services, reporting, information dashboards, data mining and extract-transform-load (ETL) capabilities. It was founded in 2004 by five founders and is headquartered in Orlando, Florida. Pentaho was acquired by Hitachi Data Systems in 2015.
Figure 9: Pentaho logo
Extract Transform Load
The first step of a BI project is to create a central repository giving a global view of the data of each service. This repository is called a data warehouse: a system used to store information for data analysis and reporting.
This process takes place in three stages:
• Extraction of the data from one or more data sources.
• Transformation of the extracted data (cleaning and aggregation).
• Loading of the data into the destination database (the data warehouse).
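The three stages above can be sketched as follows; this is a minimal illustration (SQLite stands in for PostgreSQL, and the file and column names are invented), not the project's actual Pentaho transformation.

```python
import sqlite3
import pandas as pd

# 1. Extract: read the source data (here a stub instead of a real file).
source = pd.DataFrame({"etab": ["A", "A", "B"], "sejours": [10, 5, 7]})

# 2. Transform: aggregate the extracted data.
aggregated = source.groupby("etab", as_index=False)["sejours"].sum()

# 3. Load: write the result into the destination database.
conn = sqlite3.connect(":memory:")
aggregated.to_sql("fact_data", conn, index=False)
loaded = pd.read_sql("SELECT * FROM fact_data ORDER BY etab", conn)
print(loaded)
```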
Pentaho Data Integration prepares the data to build a complete, thought-provoking picture of your business. Using visual tools to eliminate coding and complexity, Pentaho brings big data and all other data sources within the reach of business and IT users alike (the power to access, prepare and blend all data). After loading the tables into the operational data store (ODS), we transform and load the data into our data warehouse.
Operational Data Store ODS
The ODS is a database designed to integrate data from multiple sources for additional
operations on the data. Unlike a master data store, the data is not passed back to operational
systems. It may be passed for further operations and to the data warehouse for reporting.
In the realization of the ODS, we followed the extract-load (EL) process.
First, we extracted the data from the different Hospidiag files, which cover eight years, from 2008 to 2015.
Figure 10: Loading of the HD table
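The extract-load step over the yearly files can be sketched as below; the file name pattern and columns are hypothetical (a stub row replaces the real CSV read), only the 2008-2015 loop reflects the text above.

```python
import pandas as pd

# Stack one Hospidiag file per year (2008-2015) into a single ODS table.
frames = []
for annee in range(2008, 2016):
    # df = pd.read_csv(f"hospidiag_{annee}.csv", sep=";")  # real extraction
    df = pd.DataFrame({"finess": ["750000001"], "indicateur": [1.0]})  # stub
    df["annee"] = annee  # keep track of the source year
    frames.append(df)
ods_hd = pd.concat(frames, ignore_index=True)
print(len(ods_hd))  # one stub row per year
```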
Data warehouse
In order to create the data warehouse, we extracted the data from the ODS, transformed it, and then loaded the different dimension and fact tables. The screenshots below show a part of this process.
Loading the dimensions:
Figure 11: Loading the domaine_activite dimension
Figure 12: Data transformation
Loading the fact tables:
Figure 13: Loading fact_data
Figure 14: Loading fact_HD
Figure 15: Casting data
Figure 16: Data warehouse loading job
OLAP
1. Introduction
After the realization of the data warehouse, we move on to the next step of our project: the creation of the online analytical processing (OLAP) cube.
An OLAP cube is a multidimensional database that is optimized for data warehouse and
online analytical processing (OLAP) applications. In fact, an OLAP cube is a method of storing
data in a multidimensional form, generally for reporting purposes. In OLAP cubes, data
(measures) are categorized by dimensions. These cubes are often pre-summarized across
dimensions to drastically improve query time over relational databases.
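The idea of measures pre-summarized across dimensions can be illustrated with a tiny pivot table; the dimension values below are invented, and pandas here only mimics what an OLAP engine does at scale.

```python
import pandas as pd

# Toy "cube": the measure nb_sejours summarized across two dimensions.
fact = pd.DataFrame({
    "annee": [2014, 2014, 2015, 2015],
    "categorie": ["CHR", "CLCC", "CHR", "CLCC"],
    "nb_sejours": [100, 40, 120, 50],
})
cube = fact.pivot_table(index="annee", columns="categorie",
                        values="nb_sejours", aggfunc="sum", margins=True)
print(cube)
```

The `margins=True` row and column play the role of the pre-computed grand totals a cube serves without rescanning the detail rows.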
The query language used to interact and perform tasks with OLAP cubes is
multidimensional expressions (MDX). The MDX language was originally developed by
Microsoft in the late 1990s, and has been adopted by many other vendors of multidimensional
databases.
2. Used tools
Schema Workbench
The Mondrian Schema Workbench allows you to visually create and test Mondrian OLAP cube schemas. It provides the following functionality:
• A schema editor integrated with the underlying data source for validation.
• Testing of MDX queries against the schema and database.
• Browsing of the underlying database structure.
See the Mondrian technical guide to understand schemas. Once you have the schema file, you can upload it into your application.
Mondrian
Mondrian is an OLAP (Online Analytical Processing) engine written in Java by Julian Hyde that enables the design, publishing and querying of multidimensional cubes. It executes MDX queries against data warehouses stored in an RDBMS, hence its characterization as "ROLAP" (Relational OLAP); among ROLAP engines, Mondrian is the open source reference. It makes the results accessible in an understandable format through a multidimensional presentation client-side API, usually in web mode, for example JPivot, Pentaho Analyzer, Pentaho Analysis Tool, or the Geo Analysis Tool (GAT). It uses standard OLAP modelling and can connect to any data warehouse designed according to business intelligence best practices. It is worth noting that Mondrian is the OLAP component used by most open source BI suites, including Pentaho, SpagoBI and JasperServer.
Figure 17: Mondrian logo
3. Realization
In our project, we created two cubes. The first one contains the fact table fact_data and five dimensions: dim_domaine_activite, dim_temps, dim_etablissement, dim_provenance and dim_tranche_age.
The second cube contains the fact table fact_hd, which will be analysed along the following dimensions: dimTemps, dim_cia, dim_etablissement and dim_provenance.
Figure 18: Cube creation with schema workbench
Figure 19: XML schema generated with Schema Workbench
Reporting
1. Introduction
Most companies need many different types of reports, in many cases hundreds of different types and occasionally more. Business intelligence software often includes comprehensive reporting tools that can extract and present data in many different media (an internal web page/intranet, the Internet for customers, Excel files, PDF format, etc.).
In many cases these reporting facilities are controlled by parameters that can be chosen in real time, producing a report that is run directly against the data (often a data warehouse or multidimensional data). Reporting is seen as static information retrieved from a source system such as an ERP.
Most ERP systems include these static reports as an out-of-the-box (OOTB) offering. Most of them cover areas like purchase orders, invoices, goods received, debtors' balances, inventory on hand, financial statements, vendor and customer lists, resourcing, planning, etc.
2. BI and Reporting
Performance
BI can deliver high volumes of data to a large number of users thanks to its combination of architecture, software and technologies. It is important to understand that BI does not sit within the ERP, as reporting does, so much as it sits on top of the ERP. Thus the BI environment does not impact the ERP environment, and vice versa.
Decision making
Reporting allows for short-term, reactive decision-making, as the reports are very static and a lot of manual work is needed to consolidate data across multiple dimensions. BI allows decision-making that directly affects your strategy: key performance indicators (KPIs) are set within a BI environment and can be closely tracked to monitor business performance.
Delivery
Reporting gives static, parameterized reports in a list format. BI allows for dynamic dashboards and scorecards across multiple areas within the business, providing a consolidated view in which to drill, pivot and slice-and-dice your data. BI also goes a step further and allows for predictive analytics, such as forecasting based on historical data, patterns and trends, letting you be proactive, whereas a reporting environment only allows for reactive management.
3. Used tools
QlikView
The QlikView platform lets users discover deeper insights by building their own rich,
guided analytics. Mine Big Data with this enterprise-ready solution.
Figure 20: QlikView logo
JasperReports
Jaspersoft provides reporting and analytics that can be embedded into a web or mobile application, and can also operate as a central information hub for the enterprise, delivering mission-critical information on a real-time or scheduled basis to the browser, mobile device, printer, or email inbox in a variety of file formats. JasperReports Server is optimized to share, secure, and centrally manage your Jaspersoft reports and analytic views.
Figure 21: JasperSoft logo
Saiku
Saiku is an open source Analytics Client, and serves as the UI component of the
Openbravo Analytics Module. It uses the MDX language to seamlessly interact with the
Mondrian cubes that Openbravo generates, allowing users to easily create, visualize and analyze
information in graphical and pivot table formats.
Figure 22: Saiku logo
4. Realization
Reporting using QlikView
Figure 23: Histogram of number of stay/ session
Comparing the total number of patient sessions for each category of facility with the number of ALD (long-term illness) sessions shows that all sessions in the CLCCs (cancer control centres) are sessions for patients in ALD, which is expected for this type of centre, which treats only this kind of condition.
It should be noted that non-profit institutions (EBNL) and regional hospital centres (CHR) treat many ALD patients, contrary to clinics (CLI), which are the least involved in the treatment of chronic diseases, with 45 million sessions in total for only 5 million in ALD. This can be explained by clinics targeting only patients who can afford the exorbitant prices of treatment.
Figure 24: Pie chart of the number of ALD sessions
Analyzing this graph, we note that the number of ALD sessions for the CLCC centres is most remarkable in regions 13, 44 and 94.
These regions are also among the regions with the maximum number of ALD sessions for CH and CHR institutions, as shown in the next graph.
Figure 25: Histogram of the number of ALD sessions by establishment category
The previous histogram explains the creation of the centres for the fight against cancer, which help regional and general hospitals by reducing their burden: patients with cancer are treated in other, better adapted centres.
On the other hand, the regions that stand out in terms of total number of sessions do not appear among the regions most served by the not-for-profit establishments (EBNL).
Figure 26: Number of ALD sessions at EBNL
The pie chart results can be explained by the fact that these establishments were created to compensate for the lack of facilities treating ALD in some regions.
In conclusion, we found that the regions where patients spend long periods in ALD treatment are not among the regions with a high occupancy rate of beds in medicine, surgery and obstetrics, as shown in this graph.
Figure 27: Histogram of the number of beds by region
Finally, we can conclude that the performance of the French health services is satisfactory and meets the requirements in terms of establishments and beds, which are constantly increasing.
KPI
With the KPIs, we tried to assess the degree of performance of every establishment. An establishment is considered performing if its rate is above 50%.
Figure 28: KPIs
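The KPI rule stated above (performing if the rate exceeds 50%) amounts to a simple threshold test; the establishment names and rates below are invented for illustration.

```python
# Classify establishments against the 50% performance threshold.
rates = {"Etab A": 0.62, "Etab B": 0.48, "Etab C": 0.55}  # hypothetical rates

performing = {name: rate > 0.50 for name, rate in rates.items()}
print(performing)
```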
5. Reporting using Jaspersoft
Figure 29: Report with Jaspersoft
In the previous figure, we have shown the top ten establishments in terms of the number of ALD sessions.
In this sprint, we were led to give a better representation of our data.
So, we chose to present in this report a few figures highlighting the business side of our subject.
This table lists the top ten company names (raison sociale) by department according to the number of stays in each establishment, sorted in descending order. To give good visibility into the distribution of the number of stays, there is no better way than illustrating it with a graph.
In the figure below, we can clearly see the importance of this variable in each of the institutions.
Figure 30: Top 10 establishments according to the number of ALD sessions
We observed that one establishment, AP-HP, appears several times, though not in the same department. So we focused on this company name and performed a deeper analysis of the activity domains of these institutions: AP-HP Paris, Val-de-Marne, Hauts-de-Seine and Seine-Saint-Denis.
Figure 31: AP-HP-Seine-Saint-Denis activity domains
Figure 32: AP-HP-Val-de-Marne activity domains
Figure 33: AP-HP-Hauts-de-Seine activity domains
Figure 34: AP-HP-Paris activity domains
So we can see that these institutions treat the same diseases and share almost the same areas of activity, and that the number of stays for diseases related to the activity domains of digestive, orthopedic trauma and the nervous system is very high in this business.
Figure 35: Number of ALD sessions by establishment category
This pie chart shows the distribution of establishments by category, as a percentage of days.
Figure 36: number of ALD sessions by age
This pie chart represents the segmentation of the number of stays, across all institutions, by age range.
Moreover, we are interested in representing the number of stays in the areas of chemotherapy, radiotherapy, hemodialysis and childbirth. We then deduced that the institutions most active in these areas are the following:
Figure 37: Top 3 establishments by number of chemotherapy, radiotherapy, hemodialysis and childbirth sessions
Each indicator is relative to an activity; that is how Hospidiag manages its business areas.
Finally, a small comparison between MCO and HC institutions shows whether or not the same company names appear.
In both areas, the five institutions are the same: they have the highest number of RSA in both MCO and HC.
6. Conclusion
Through this chapter, we created an OLAP cube and then used it to analyze the data with different approaches.
Data mining
7. Introduction
"Data mining" is the process of discovering significant new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technology as well as statistical and mathematical techniques.
Data mining is fairly recent, dating from the 1980s. It expanded in response to a new feature of the economic landscape: the multiplication of very large databases that were difficult to exploit for businesses without sufficient resources. It is a set of tools developed to study interactions and unknown phenomena and to explore the underlying data.
In a first step, valid and exploitable data are extracted computationally from major data sources. The use of these data then reveals the underlying correlations. Data mining applies the rules of statistics, and more specifically mathematical algorithms, to compare results and draw conclusions about correlations or links between different phenomena.
8. Objective
The performance criterion for the participants' contributions is the RMSE (Root Mean Square Error): the square root of the arithmetic mean of the squared deviations between the forecasts and the held-out target values.
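The criterion just described can be written as a small helper; this is a generic textbook implementation, not code from the project itself.

```python
import math

# RMSE = sqrt( (1/n) * sum_i (prediction_i - target_i)^2 )
def rmse(predictions, targets):
    """Root Mean Square Error between forecasts and held-out targets."""
    n = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)

print(rmse([0.0, 0.0], [3.0, 4.0]))  # sqrt((9 + 16) / 2) = sqrt(12.5)
```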
9. Used tools
RStudio
RStudio is a free and open-source integrated development environment (IDE) for R, a
programming language for statistical computing and graphics.
RStudio is available in two editions: RStudio Desktop, where the program is run locally
as a regular desktop application; and RStudio Server, which allows accessing RStudio using a
web browser while it is running on a remote Linux server. Prepackaged distributions of RStudio
Desktop are available for Windows, macOS, and Linux.
Figure 38: R studio logo
10.Realization
Unsupervised methods
Association rules are if/then statements that help uncover relationships between
seemingly unrelated data in a relational database or other information repository.
An association rule has two parts, an antecedent (if) and a consequent (then). An
antecedent is an item found in the data. A consequent is an item that is found in combination
with the antecedent.
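A rule's reliability is usually quantified by its support and confidence; the tiny sketch below computes both for a hypothetical rule over invented transactions (the project itself used the R arules/arulesViz packages).

```python
# Rule "if antecedent then consequent" over made-up transactions.
transactions = [
    {"age<75", "CH", "short_ALD"},
    {"age<75", "CH", "short_ALD"},
    {"age<75", "CLCC"},
    {"age>=75", "CH", "short_ALD"},
]
antecedent, consequent = {"age<75", "CH"}, {"short_ALD"}

both = sum(1 for t in transactions if antecedent | consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)
support = both / len(transactions)  # P(antecedent and consequent)
confidence = both / ante            # P(consequent | antecedent)
print(support, confidence)          # 0.5 1.0
```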
Figure 39: Rules with arulesViz
Figure 40: Presentation of a rule
From the association rules obtained, we can consider this rule the most reliable: patients of CH DU HAUT BURGEY who are under 75 years old are likely to have a small number of ALD sessions.
Supervised methods
In order to predict the variable cible1, we first selected the most significant variables. Then we applied several predictive approaches: multiple linear regression, random forest, Generalized Boosted Regression Models (GBM) and deep learning, and finally selected the best model based on its RMSE score.
Generalized linear model
> h2o.performance(regression.model)
H2ORegressionMetrics: glm
** Reported on training data. **
MSE: 0.002698627
RMSE: 0.05194831
MAE: 0.03549875
RMSLE: 0.04100089
Mean Residual Deviance : 0.002698627
R^2 : 0.9074501
Null Deviance :53298.98
Null D.o.F. :1827898
Residual Deviance :4932.817
Residual D.o.F. :1827878
AIC :-5624648
Random Forests
> h2o.performance(rforest.model)
H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.001330922
RMSE: 0.0364818
MAE: 0.01778781
RMSLE: 0.02686152
Mean Residual Deviance : 0.001330922
GBM: Gradient boosting model
> h2o.performance (gbm.model)
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 0.0004069792
RMSE: 0.02017373
MAE: 0.009795494
RMSLE: 0.01485119
Mean Residual Deviance : 0.0004069792
Model 2 GBM
> h2o.performance (gbm.model)
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 0.0009652376
RMSE: 0.03106827
MAE: 0.01555565
RMSLE: 0.02287432
Deep Learning
> h2o.performance(dlearning.model)
H2ORegressionMetrics: deeplearning
** Reported on training data. **
** Metrics reported on temporary training frame with 9999 samples **
MSE: 0.0007069083
RMSE: 0.02658775
MAE: 0.01316837
RMSLE: 0.01945228
Mean Residual Deviance : 0.0007069083
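The model selection comes down to picking the minimum of the RMSE values reported above, which can be sketched as:

```python
# RMSE scores copied from the h2o outputs above; keep the lowest.
rmse_scores = {
    "GLM": 0.05194831,
    "Random Forest": 0.0364818,
    "GBM": 0.02017373,
    "GBM (model 2)": 0.03106827,
    "Deep Learning": 0.02658775,
}
best = min(rmse_scores, key=rmse_scores.get)
print(best, rmse_scores[best])  # GBM 0.02017373
```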
11.Conclusion
After building several regression models, we concluded that the Gradient Boosting Model (GBM) is the most powerful: it has the lowest Root Mean Square Error (RMSE), with a value of 0.02017373.
Big Data
12.Introduction
In this chapter, we give an overview of big data and then show the process of extracting and analyzing the collected data.
13.Big Data
Big data is a term for data sets that are so large or complex that traditional data
processing applications are inadequate. Challenges include analysis, capture, data curation,
search, sharing, storage, transfer, visualization, querying, updating and information privacy.
The term often refers simply to the use of predictive analytics or certain other advanced
methods to extract value from data, and seldom to a particular size of data set.
Accuracy in big data may lead to more confident decision making, and better decisions
can result in greater operational efficiency, cost reduction and reduced risk.
Big Data has become of major interest to the IT world.
• In general, the term refers to relatively new types of data (video, images, sound, etc.) that generate large files.
• It also covers large sets of small volumes of data (comments on social networking websites, photos of the seabed, images from traffic cameras) that take on their meaning when combined.
• In most cases, these big data sets experience rapid growth, and some modest data sets will grow to become big data.
Big Data is characterized by:
Volume refers to the vast amounts of data generated every second. Just think of all the
emails, twitter messages, photos, video clips, sensor data etc. we produce and share every
second. We are not talking Terabytes but Zettabytes or Brontobytes. On Facebook alone we
send 10 billion messages per day, click the "like" button 4.5 billion times and upload 350 million
new pictures each and every day. If we take all the data generated in the world between the
beginning of time and 2008, the same amount of data will soon be generated every minute! This
increasingly makes data sets too large to store and analyze using traditional database technology. With big data technology we can now store and use these data sets with the help of distributed systems, where parts of the data are stored in different locations and brought together by software.
Velocity refers to the speed at which new data is generated and the speed at which data
moves around. Just think of social media messages going viral in seconds, the speed at which
credit card transactions are checked for fraudulent activities, or the milliseconds it takes trading
systems to analyze social media networks to pick up signals that trigger decisions to buy or sell
shares. Big data technology allows us now to analyze the data while it is being generated,
without ever putting it into databases.
Variety refers to the different types of data we can now use. In the past we focused on
structured data that neatly fits into tables or relational databases, such as financial data (e.g.
sales by product or region). In fact, 80% of the world’s data is now unstructured, and therefore
can’t easily be put into tables (think of photos, video sequences or social media updates). With
big data technology we can now harness different types of data (structured and unstructured)
including messages, social media conversations, photos, sensor data, video or voice recordings
and bring them together with more traditional, structured data.
Veracity refers to the messiness or trustworthiness of the data. With many forms of big
data, quality and accuracy are less controllable (just think of Twitter posts with hash tags,
abbreviations, typos and colloquial speech as well as the reliability and accuracy of content)
but big data and analytics technology now allows us to work with this type of data. The volumes
often make up for the lack of quality or accuracy.
Value: Then there is another V to take into account when looking at Big Data: Value!
It is all well and good having access to big data but unless we can turn it into value it is useless.
So, you can safely argue that 'value' is the most important V of Big Data. It is important that
businesses make a business case for any attempt to collect and leverage big data. It is so easy
to fall into the buzz trap and embark on big data initiatives without a clear understanding of
costs and benefits.
14.Used Tools
Apache Tomcat
Tomcat is an application server from the Apache Software Foundation that executes Java servlets and renders web pages that include JavaServer Pages coding. Described as a "reference implementation" of the Java Servlet and JavaServer Pages specifications, Tomcat is the result of an open collaboration of developers and is available from the Apache website in both binary and source versions.
Tomcat can be used either as a standalone product with its own internal web server or together with other web servers, including Apache, Netscape Enterprise Server, Microsoft Internet Information Server (IIS), and Microsoft Personal Web Server. Tomcat requires a Java Runtime Environment conforming to JRE 1.1 or later. Tomcat is one of several open source collaborations that were collectively known as Jakarta.
Figure 41: Apache logo
Realization
First of all, we streamed data from Twitter; then we filtered the data to keep only the significant records; and finally we analyzed these data.
Figure 42: Fetching data from Twitter
Figure 43: Fetching configuration
Figure 44: Fetched data
Figure 45: Creating a table and inserting data into it
Figure 46: Used words after the Real Madrid Vs Celta Vigo game
We also streamed data from Facebook using the "Rfacebook" package. The data that we streamed concerns the comments after the 18 May 2017 game between Real Madrid and Celta Vigo.
Figure 47: Activity on the Real Madrid Facebook page
Figure 48: Activity on the FC Barcelona Facebook page
Conclusion
While working with these data, we fetched them and then analyzed the streamed data based on specific filters.
General Conclusion
These past months of work have allowed us to place ourselves in a professional context and to work as a team on a project of great magnitude. Our internship was particularly formative from a technical perspective: we strengthened our foundations in SQL, discovered MDX, and above all discovered the world of business intelligence and reporting.
Furthermore, this internship is a continuation of what we learned in previous years in software development, and it allowed us to become familiar with new concepts such as business intelligence.
We discovered how important reporting and data analysis are in the answers that any application provides to its customers. Although it took us some time to acquire the notions of BI and reporting, and the language barrier sometimes added to the difficulty, we gradually adapted to a new working environment and new technologies. We also gained knowledge of data mining, machine learning and big data.
Finally, the internship was rewarding, as all our achievements were put into production.