Contents

General Introduction
Project Context
    Introduction
    Presentation of the organization
        ANAP
        ATIH
    Presentation of the project
        ANAP-ATIH2020
    Proposed solution
    Methodology
        Presentation of the Scrum method
        Scrum roles
    Road map
    Conclusion
Analysis
    Introduction
    Functional requirements
    Non-Functional Requirements
    Actor identification
    Use case diagram
    Conclusion
ETL
    Introduction
    Data Source
    Dimensional modeling
    Choice of the model
    Realization
        Used tools
        Extract Transform Load
        Operational Data Store ODS
        Data warehouse
OLAP
    Introduction
    Used tools
        Schema Workbench
        Mondrian
    Realization
Reporting
    Introduction
    BI and Reporting
        Performance
        Decision making
        Delivery
    Used tools
        QlikView
        JasperReports
        Saiku
    Realization
        Reporting using QlikView
        KPI
    Reporting using Jaspersoft
    Conclusion
Data mining
    Introduction
    Objective
    Used tools
        RStudio
    Realization
        Unsupervised methods
        Supervised methods
            Generalized linear model
            Random Forests
            GBM: Gradient boosting model
            Model 2 GBM
            Deep Learning
    Conclusion
Big Data
    Introduction
    Big Data
    Used Tools
        Apache Tomcat
    Realization
    Conclusion
General Conclusion

Illustrations

Figure 1: ANAP logo
Figure 2: ATIH logo
Figure 3: Scrum method
Figure 4: Road map
Figure 5: Use case diagram
Figure 6: Hospidiag interface
Figure 7: Data warehouse schema
Figure 8: Postgres logo
Figure 9: Pentaho logo
Figure 10: Loading of table HD
Figure 11: Loading the domaine_activite dimension
Figure 12: Data transformation
Figure 13: Loading fact_data
Figure 14: Loading fact_HD
Figure 15: Casting data
Figure 16: Data warehouse loading job
Figure 17: Mondrian logo
Figure 18: Cube creation with Schema Workbench
Figure 19: XML schema generated with Schema Workbench
Figure 20: QlikView logo
Figure 21: JasperSoft logo
Figure 22: Saiku logo
Figure 23: Histogram of number of stays/sessions
Figure 24: Pie chart of the number of ALD sessions
Figure 25: Histogram of number of ALD sessions by establishment category
Figure 26: Number of ALD sessions at EBNL
Figure 27: Histogram of number of beds by region
Figure 28: KPIs
Figure 29: Report with JasperSoft
Figure 30: Top 10 establishments according to ALD sessions number
Figure 31: AP-HP-Seine-Saint-Denis activity domains
Figure 32: AP-HP-Val-de-Marne activity domains
Figure 33: AP-HP-Hauts-de-Seine activity domains
Figure 34: AP-HP-Paris activity domains
Figure 35: Number of ALD sessions by establishment category
Figure 36: Number of ALD sessions by age
Figure 37: Top 3 establishments by number of chemotherapy, radiotherapy, hemodialysis and childbirth sessions
Figure 38: RStudio logo
Figure 39: Rules with arulesViz
Figure 40: Presentation of a rule
Figure 41: Apache logo
Figure 42: Fetching data from Twitter
Figure 43: Fetching configuration
Figure 44: Fetched data
Figure 45: Table creation and inserting data into it
Figure 46: Words used after the Real Madrid vs Celta Vigo game
Figure 47: Activity on the Real Madrid Facebook page
Figure 48: Activity on the FC Barcelona Facebook page
General Introduction
Harnessing the full potential of data requires an organization-wide strategy for
organizing this data, in addition to a data science strategy. Such strategies are now commonplace
in most industries such as banking and retail. Banks can offer their customers targeted needs-
based services and improved fraud protection because they collect and analyze transactional
data.
Furthermore, as healthcare organizations face increasing demand for healthcare
services, harnessing data and analytics helps organizations improve patient care, manage
chronic disease, apply adaptive treatments and reduce costs.
Healthcare is the maintenance or improvement of health via the diagnosis, treatment,
and prevention of disease, illness, injury, and other physical and mental impairments in human
beings. This field generates a huge amount of data that needs to be analyzed.
Health care, however, remains a glaring exception. Individual pieces of data can have life-or-death
importance, but many organizations fail to aggregate data effectively to gain insights into wider
care processes. Without a data science strategy, health care organizations can’t draw on
increasing volumes of data and medical knowledge in an organized, strategic way, and
individual clinicians can’t use that knowledge to improve the safety, quality, and efficiency of
the care they provide.
Hence the need to collect data, and from that collected data, the need to
make good decisions. Making good decisions requires all relevant data to be taken into
consideration.
If an organization tries to aggregate and analyze poor-quality data, it may derive useless
or even dangerous conclusions. Therefore, the data has to be structured. The best source for that
data is a well-designed data warehouse which will help in the process of making decisions,
analyzing data and even making predictions.
Project Context
1. Introduction
We can’t talk about an organization without talking about information. Nowadays,
every organization tries to have a system, such as an ERP, to manage its data.
The Hospidiag ERP is the result of the cooperation between two organizations, ANAP
and ATIH.
In this chapter, we will present the ANAP-ATIH2020 project, but first we will
start by presenting the organizations behind it.
2. Presentation of the organization
ANAP
Created in 2009, the National Agency for the Performance Support of Health and
Medico-Social Establishments (ANAP) works to improve performance within the framework
of the reform of the French health system. This public agency helps health and medico-social
establishments improve the service provided to patients and users by developing
recommendations and tools that allow them to optimize their management and organization.
Figure 1: ANAP logo
ATIH
Established in 2000, the Technical Agency for Hospital Information (ATIH) is in
charge of collecting and analyzing data from health establishments: activity, costs,
organization and quality of care, finances, human resources, and so on.
It carries out studies on hospital costs and participates in the mechanisms for financing
establishments.
Figure 2:ATIH logo
3. Presentation of the project
ANAP-ATIH2020
The ageing of the French population comes with a significant increase in the
number of people living with chronic diseases. The offer of care, which was mainly built
around short-term care, must now evolve to integrate the needs of long-term follow-up.
ANAP and ATIH, within the framework of their respective missions, agree on an
essential observation: a considerable quantity of data is produced by the various health actors.
The wealth and variety of the collected information open up great prospects for exploitation,
in particular for understanding the health system and its evolution.
Opening up this data can greatly increase the capacity for exploitation and analysis,
and thus make the best of the wealth of this information.
The ANAP-ATIH2020 project is an opportunity to raise broad awareness of this
question, and it pursues three main objectives:
• To promote an engaging and attractive initiative that educates about the stakes of
exploiting health data.
• To demonstrate the interest and potential of exploiting data in the service of a health
policy at a local or national level, and to give a concrete illustration of the major
question of adapting the organization of establishments to the increase in chronic
pathologies.
• To explore the capacity of data scientists from diverse backgrounds to contribute to
solving problems that rest on a better exploitation of the data.
4. Proposed solution
Based on the identified needs of ANAP-ATIH2020, we propose a solution that
structures the collected data in a data warehouse, in order not only to provide quick and easy
access to it but also to ensure its quality and consistency. We will then create a cube that
allows fast analysis of the data along multiple dimensions and that will be used to create
reports with multiple tools.
5. Methodology
Presentation of the Scrum method
A successful Scrum project is largely about understanding what Scrum is. Therefore, we
will give an overview of Scrum, which will be our methodology during the project.
Scrum is a way for teams to work together to develop a product. Product development,
using Scrum, occurs in small pieces, with each piece building upon previously created pieces.
Building products one small piece at a time encourages creativity and enables teams to
respond to feedback and change, to build exactly and only what is needed. More specifically,
Scrum is a simple framework for effective team collaboration on complex projects. Scrum
provides a small set of rules that create just enough structure for teams to be able to focus their
innovation on solving what might otherwise be an insurmountable challenge.
However, Scrum is much more than a simple framework. Scrum supports our need to
be human at work: to belong, to learn, to do, to create and be creative, to grow, to improve, and
to interact with other people. In other words, Scrum leverages the innate traits and
characteristics in people to allow them to do great things together.
Figure 3: Scrum method
Scrum roles
Building complex products for customers is an inherently difficult task. Scrum provides
structure to allow teams to deal with that difficulty. However, the fundamental process is
incredibly simple, and at its core is governed by 3 primary roles.
• Product Owners: determine what needs to be built in the next 30 days or less.
• Team: build what is needed in 30 days (or less), and then demonstrate what they
have built. Based on this demonstration, the Product Owner determines what to
build next.
• Scrum Master: ensure this process happens as smoothly as possible, and
continually help improve the process, the team and the product being created.
While this is an incredibly simplified view of how Scrum works, it captures the essence of this
highly productive approach for team collaboration.
6. Road map
Figure 4: Road map
7. Conclusion
In this chapter, we briefly introduced the project by presenting the organizations
concerned by the challenge. ANAP-ATIH2020 is based on collecting information from
French health establishments, which is to be structured and analyzed in order to predict the
medium-term evolution of the importance of chronic-disease care for health establishments.
Analysis
1. Introduction
In this chapter, we start by presenting both the functional and non-functional
requirements of our project. Then, we continue with the identification of the project's actors,
and finally we conclude with a use case diagram.
2. Functional requirements
The system that we need has to be:
• Operational
• Scalable
• User friendly
• Offering the necessary information in real time.
For this reason, the system must perform well enough to fulfill the requirements of all users.
The functional requirements are the following:
• Analyze the input Data
• Analyze the output Data
• Analyze the session time for each user
• Analyze the number of connected people
• User Status
3. Non-Functional Requirements
Non-functional requirements are the constraints that, if unmet, may prevent the application
from operating effectively and efficiently.
a. Capacity:
The total amount of data within the selected databases represents a very large volume. In
addition, processing further increases this initial volume. Therefore, server capacity must be
sufficient to collect all this data, allow program execution, and preserve updates over time.
b. Integrity:
At all levels of the implementation of the data warehouse, different errors will be handled,
including the following:
• Handling bad data when importing from the different source databases.
• In order to standardize the format of the collected data, it will be necessary to convert
two-dimensional structures into three-dimensional structures, which will be a source
of errors that must be taken into account.
• Referential integrity in the database tables.
c. Quality:
• Facilitating data access and dissemination of information.
• Reliability and traceability of data.
• Human-machine interaction as intuitive as possible.
d. Simplicity:
Since some users of the application do not necessarily have great knowledge of IT, the
functionality of the solution should be understandable and easy to handle. Navigation through
the different sections shall be designed so that the user finds their way easily; they must feel
in control at all times.
e. Performance:
The application must meet all user requirements in an optimal way. Performance
translates into reduced access times to the different features and an acceptable data access
time, given that the application handles a relatively large data warehouse.
f. Reliability:
It must ensure content quality and relevance of information.
g. Ergonomics:
The first thing that catches users' attention is ergonomics and ease of use, so special
attention should be given to this need.
4. Actor identification
An actor represents the abstraction of a role played by external entities that interact
directly with the system under study. An actor acts on the system; it plays a different role in
each use case in which it collaborates and is usually represented by a stick figure. An actor
characterizes an outside user, or a group of users, that interacts directly with the system. In
our system, we have identified two actors: the Administrator and the Decision Maker.
• The Administrator: generates the reports and cubes and publishes them according to the
required terms and conditions.
• The Decision Maker: analyzes the generated reports.
5. Use case diagram
A use case is a methodology used in system analysis to identify, clarify, and organize
system requirements. The use case is made up of a set of possible sequences of interactions
between systems and users in a particular environment and related to a particular goal.
It consists of a group of elements (for example, classes and interfaces) that can be used
together in a way that will have an effect larger than the sum of the separate elements combined.
The use case should contain all system activities that have significance to the users. A use case
can be thought of as a collection of possible scenarios related to a particular goal, indeed, the
use case and goal are sometimes considered to be synonymous.
Figure 5: Use case diagram
a. Create Report:
Pre-condition: none
Description: An Admin can create reports
Post-condition: the report has been successfully added
b. Save Report:
Pre-condition: none.
Description: Only an admin can save reports.
Post-condition: the report has been successfully saved.
c. Edit Reports:
Pre-condition: none.
Description: An admin edits reports.
Post-condition: the edited report has been successfully saved.
d. Load Report:
Pre-condition: none.
Description: An admin or decision maker can load reports.
Post-condition: the report is loaded
6. Conclusion
In this chapter, we gave an overview of the project's requirements. In the following
chapters, we will present the work we established over the different sprints.
ETL
1. Introduction
ETL is short for Extract, Transform and Load. As the name hints, we’ll extract data from
one or more operational databases, transform it to fit in a warehouse structure, and load the data
into the DWH.
A data warehouse is a system used to store information for data analysis and
reporting, in which the data is integrated, non-volatile and historized. Before filling the data
warehouse, it is therefore essential to define its structure.
2. Data Source
The data used in our project is offered through the Hospidiag open data source.
Figure 6: Hospidiag interface
3. Dimensional modeling
Dimensional modeling is a part of data warehouse design that results in the creation of the
dimensional model. There are two types of tables involved:
• Dimension tables are used to describe the data we want to store. For example, a
retailer might want to store the date, store, and employee involved in a specific
purchase. Each dimension table is its own category (date, employee, store) and can
have one or more attributes. For each store, we can save its location at the city, region,
state and country level. For each date, we can store the year, month, day of the month,
day of the week, etc. This is related to the hierarchy of attributes in the dimension
table.
• Fact tables contain the data we want to include in reports, aggregated based on
values within the related dimension tables. A fact table has only columns that store
measure values and foreign keys referencing the dimension tables. Combining all the
foreign keys forms the primary key of the fact table. For instance, a fact table could
store the number of contacts and the number of sales resulting from these contacts.
Two schema types are commonly used to arrange these tables:
• Star schema: it has a single fact table connected to the dimension tables like a star. In
a star schema, a single join establishes the relationship between the fact table and any
one of the dimension tables.
• Snowflake schema: it is an extension of the star schema in which very large
dimension tables are normalized into multiple tables; it is used when a dimension
table becomes very big. In a snowflake schema, since there are relationships between
the dimension tables, many joins are needed to fetch the data, and every dimension
table can be associated with sub-dimension tables. Performance-wise, the star schema
is good, but if memory utilization is a major concern, the snowflake schema is better.
4. Choice of the model
When choosing a database schema for a data warehouse, snowflake and star schemas
tend to be popular choices.
We chose the star schema for several reasons.
First of all, with the star model, dimension analysis is easier. In addition, we have no
dimensions that are connected directly to each other, so the snowflake schema is unnecessary.
The model also offers ease of use, with lower query complexity that is easy to understand,
and better query performance: fewer foreign keys mean shorter (faster) query execution.
Finally, the data model follows a top-down approach.
Figure 7: Data warehouse schema
Dim_domaine_activite: contains an id, the code and the label of an activity.
Dim_etablissement: contains an id and information regarding an establishment, such as its
name, its category (whether it is public or private) and its activity in medicine, surgery and
obstetrics.
Dim_tranche_age: contains an id; the age is divided into two classes, under 75 and over 75.
Dim_region: contains an id, a region and a department.
Dim_cia: contains an id and different indicators.
DimTemps: calendar date dimensions are attached to virtually every fact table to allow
navigation of the fact table through familiar dates, months, fiscal periods, and special days on
the calendar. You would never want to compute Easter in SQL, but rather want to look it up in
the calendar date dimension. The calendar date dimension typically has many attributes
describing characteristics such as week number, month name, fiscal period, and national
holiday indicator. To facilitate partitioning, the primary key of a date dimension can be more
meaningful, such as an integer representing YYYYMMDD, instead of a sequentially-assigned
surrogate key. However, the date dimension table needs a special row to represent unknown or
to-be-determined dates. If a smart date key is used, filtering and grouping should be based on
the dimension table’s attributes, not the smart key.
Fact_data: this is the first fact table; it contains the foreign keys and the measures used to
perform the calculations. It is directly related to the following dimensions:
dim_domaine_activite, dim_temps, dim_etablissement, dim_provenance, dim_tranche_age.
Its measures are the number of MCO stays/sessions of patients in ALD (nombre de
séjours/séances MCO des patients en ALD), the total number of stays/sessions (nombre total
de séjours/séances) and cible1, which is the variable to predict.
Fact_hd: the second fact table; it likewise contains foreign keys and measures, is directly
related to the dimensions dimTemps, dim_cia, dim_etablissement and dim_provenance, and
contains over 160 indicators.
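To make the model concrete before the implementation, here is a minimal sketch of what the two central tables could look like as PostgreSQL DDL issued from R through DBI/RPostgres. The column sets are simplified assumptions for illustration, not the full Hospidiag schema.

library(DBI)

# Connect to PostgreSQL (all connection parameters are placeholders)
con <- dbConnect(RPostgres::Postgres(), dbname = "dwh",
                 host = "localhost", user = "etl", password = "secret")

# One simplified dimension; the other dimensions are declared the same way
dbExecute(con, "
  CREATE TABLE dim_etablissement (
    id_etablissement SERIAL PRIMARY KEY,
    nom              TEXT,     -- establishment name
    categorie        TEXT      -- category: public or private
  )")

# Simplified fact table: foreign keys to the dimensions plus the measures
dbExecute(con, "
  CREATE TABLE fact_data (
    id_etablissement INT REFERENCES dim_etablissement (id_etablissement),
    id_temps         INT,      -- would reference dim_temps, declared likewise
    id_domaine       INT,      -- would reference dim_domaine_activite
    id_tranche_age   INT,      -- would reference dim_tranche_age
    nb_sejours_ald   NUMERIC,  -- measure: ALD stays/sessions
    nb_sejours_total NUMERIC,  -- measure: total stays/sessions
    cible1           NUMERIC   -- variable to predict
  )")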
5. Realization
Used tools
a. PostgreSQL
It is an object-relational database management system (ORDBMS) with an emphasis
on extensibility and standards-compliance. As a database server, its primary function is to store
data securely, supporting best practices, and to allow for retrieval at the request of other
software applications. It can handle workloads ranging from small single machine applications
to large Internet-facing applications with many concurrent users. Recent versions also provide
replication of the database itself for availability and scalability.
Figure 8: Postgres logo
b. Pentaho
Pentaho is a business intelligence (BI) software company that offers open source products
providing data integration, OLAP services, reporting, information dashboards, data mining
and extract, transform, load (ETL) capabilities. It was founded in 2004 by five founders and
is headquartered in Orlando, Florida. Pentaho was acquired by Hitachi Data Systems in 2015.
Figure 9: Pentaho logo
Extract Transform Load
The first step of a BI project is to create a central repository giving a global vision of the data
of each service. This repository is called the data warehouse: a system used to store
information for data analysis and reporting.
The ETL process takes place in three stages:
• Extraction of data from one or more data sources.
• Transformation and aggregation of the data.
• Loading of the data into the destination database (the data warehouse).
Pentaho Data Integration prepares the data to create a complete, thought-provoking image of
your business. Using visual tools to eliminate coding and complexity, Pentaho brings Big
Data and data sources within the reach of business and IT users alike ("power to access,
prepare and blend all data"). After loading the tables into the operational data store (ODS),
we transform and load the data into our data warehouse.
Operational Data Store ODS
The ODS is a database designed to integrate data from multiple sources for additional
operations on the data. Unlike a master data store, the data is not passed back to operational
systems. It may be passed for further operations and to the data warehouse for reporting.
In the realization of the ODS, we followed the Extract-Load (EL) process.
First, we started by extracting data from the different hospidiag files that represent 8
years from 2008 to 2015.
Figure 10: Loading of table HD
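As a rough, illustrative equivalent of this Pentaho job, the extract-load step could be written in R as follows. The file names and the semicolon-separated format are assumptions about the Hospidiag exports, and con is the DBI connection from the earlier sketch.

# Extract-Load: one Hospidiag file per year, 2008 to 2015
for (year in 2008:2015) {
  hd <- read.csv2(sprintf("hospidiag_%d.csv", year),
                  stringsAsFactors = FALSE)  # read.csv2 reads ;-separated files
  hd$annee <- year                           # keep the year with each row
  dbWriteTable(con, "ods_hd", hd, append = TRUE)  # raw landing table in the ODS
}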
Data warehouse
In order to create the data warehouse, we extracted the data from the ODS, transformed
it, and then loaded the different dimensions and fact tables. The screenshots below show part
of this process.
Loading the dimensions:
Figure 11: Loading the domaine_activite dimension
Figure 12: Data transformation
Loading the fact tables:
Figure 13: Loading fact_data
Figure 14: Loading fact_HD
Figure 15: Casting data
Figure 16: Data warehouse loading job
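Outside Pentaho, the same transform-and-load step can be expressed as set-based SQL issued from R. The table and column names follow the simplified sketches above; they are assumptions, not the exact job shown in the figures.

# Transform and load: populate fact_data from the ODS landing table
dbExecute(con, "
  INSERT INTO fact_data (id_etablissement, id_temps, id_domaine,
                         id_tranche_age, nb_sejours_ald, nb_sejours_total)
  SELECT e.id_etablissement, t.id_temps, d.id_domaine, a.id_tranche_age,
         CAST(o.nb_sejours_ald AS NUMERIC),   -- the casting step of Figure 15
         CAST(o.nb_sejours_total AS NUMERIC)
  FROM ods_hd o
  JOIN dim_etablissement    e ON e.nom     = o.etablissement
  JOIN dim_temps            t ON t.annee   = o.annee
  JOIN dim_domaine_activite d ON d.code    = o.domaine
  JOIN dim_tranche_age      a ON a.libelle = o.tranche_age")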
OLAP
1. Introduction
After the realization of the data warehouse, we move on to the next step of our
project: the creation of the Online Analytical Processing (OLAP) cube.
An OLAP cube is a multidimensional database that is optimized for data warehouse and
online analytical processing (OLAP) applications. In fact, an OLAP cube is a method of storing
data in a multidimensional form, generally for reporting purposes. In OLAP cubes, data
(measures) are categorized by dimensions. These cubes are often pre-summarized across
dimensions to drastically improve query time over relational databases.
The query language used to interact and perform tasks with OLAP cubes is
multidimensional expressions (MDX). The MDX language was originally developed by
Microsoft in the late 1990s, and has been adopted by many other vendors of multidimensional
databases.
2. Used tools
Schema Workbench
The Mondrian Schema Workbench allows you to visually create and test Mondrian
OLAP cube schemas. It provides the following functionality:
• A schema editor integrated with the underlying data source for validation.
• Testing MDX queries against the schema and database.
• Browsing the underlying database structure.
See the Mondrian technical guide to understand schemas. Once you have the schema file,
you can upload it into your application.
Mondrian
Mondrian is an OLAP (Online Analytical Processing) engine written in Java by Julian
Hyde that enables the design, publishing and querying of multidimensional cubes. It executes
MDX queries on data warehouses hosted in relational databases, hence its characterization as
"ROLAP" (Relational OLAP); in ROLAP terms, Mondrian is the open source reference.
It exposes the results in an understandable format through multidimensional presentation
clients, usually Web-based, such as JPivot, Pentaho Analyzer, Pentaho Analysis Tool, and
Geo Analysis Tool (GAT). It uses standard OLAP modelling and can connect to any data
warehouse designed according to the rules of the art of business intelligence. It is interesting
to note that Mondrian is the OLAP component used by most open source BI suites, including
Pentaho, SpagoBI and JasperServer.
Figure 17: Mondrian logo
3. Realization
In our project, we created two cubes. The first one contains the fact table fact_data
and five different dimensions: dim_domaine_activite, dim_temps, dim_etablissement,
dim_provenance and dim_tranche_age.
The second cube contains the fact table fact_hd, which is analysed along the following
dimensions: dimTemps, dim_cia, dim_etablissement and dim_provenance.
Figure 18: Cube creation with Schema Workbench
Figure 19: XML schema generated with Schema Workbench
Reporting
1. Introduction
Most companies need many different types of reports, in many cases hundreds of
them and occasionally more. Business Intelligence software often provides comprehensive
reporting tools that can extract and present data in many different media types, such as an
internal Web page/intranet, the Internet (for customers), Excel files or PDF format.
In many cases these reporting facilities are controlled by parameters that can be
chosen in real time, producing a report that is run directly against the data (often a data
warehouse or multidimensional data). Reporting, by contrast, is often seen as static
information retrieved from a source system such as an ERP.
Most ERP systems ship these static reports as part of the package, as an out-of-the-box
(OOTB) offering. Most of them cover areas like purchase orders, invoices, goods received,
debtor balances, inventory on hand, financial statements, vendor and customer lists,
resourcing, planning, etc.
2. BI and Reporting
Performance
BI can deliver high volumes of data to a large number of users thanks to its
combination of architecture, software and technologies. It is important to understand that BI
does not sit within the ERP, as reporting does, so much as it sits on top of the ERP. Thus, the
BI environment does not impact the ERP environment, and vice versa.
Decision making
Reporting allows for short-term, reactive decision making, as the reports are very
static and a lot of manual work is needed to consolidate data across multiple dimensions.
BI allows decision-making that directly affects your strategy. Key performance indicators (KPI)
are set within a BI environment and can be closely tracked, to monitor business performance.
Delivery
Reporting gives static, parameterized reports in a list format. BI allows for dynamic
dashboards and scorecards across multiple areas within the business, providing a consolidated
view in which you can drill, pivot and slice-and-dice your data. BI also goes a step further
and allows for predictive analytics, such as forecasting based on historical data, patterns and
trends, letting you be proactive, whereas a reporting environment only allows for reactive
management.
3. Used tools
QlikView
The QlikView platform lets users discover deeper insights by building their own rich,
guided analytics. Mine Big Data with this enterprise-ready solution.
Figure 20: QlikView logo
JasperReports
JasperSoft provides reporting and analytics that can be embedded into a web or mobile
application, as well as operate as a central information hub for the enterprise, delivering
mission-critical information on a real-time or scheduled basis to the browser, mobile device,
printer, or email inbox in a variety of file formats. JasperServer is optimized to share, secure,
and centrally manage your JasperSoft reports and analytic views.
Figure 21: JasperSoft logo
Saiku
Saiku is an open source Analytics Client, and serves as the UI component of the
Openbravo Analytics Module. It uses the MDX language to seamlessly interact with the
Mondrian cubes that Openbravo generates, allowing users to easily create, visualize and analyze
information in graphical and pivot table formats.
Figure 22: Saiku logo
4. Realization
Reporting using QlikView
Figure 23: Histogram of number of stays/sessions
Analyzing the total number of patient sessions for each category of facility against
ALD (long-duration disease) sessions shows that all sessions in the CLCCs (cancer control
centers) are sessions for patients in ALD, which is normal for this type of center, since it
treats only this kind of affection.
It should be noted that non-profit institutions (EBNL) and regional hospital centers
(CHR) treat ALD patients heavily, contrary to clinics (CLI), which are the least involved in
the treatment of chronic diseases, with 45 million sessions in total for only 5 million in ALD.
This can be explained by clinics targeting only patients who can afford to pay the high prices
of treatments.
Figure 24: Pie chart of the number of ALD sessions
If we analyze this graph, we note that the number of ALD sessions for the CLCC centers
is the most remarkable in regions 13, 44 and 94.
These regions are also among the regions with the maximum number of ALD sessions
for CH and CHR institutions, as shown in the following graph.
Figure 25: Histogram of number of ALD sessions by establishment category
The previous histogram explains the creation of the centers for the fight against cancer,
which help regional hospitals and hospital centers by reducing their burden: cancer patients
are treated in other, better-adapted centers.
On the other hand, the regions that are remarkable in terms of total number of
sessions do not appear among the regions most served by the not-for-profit establishments
(EBNL).
Figure 26: Number of ALD sessions at EBNL
The pie chart results can be explained by the fact that these establishments are created to
compensate for the lack of facilities treating ALD in some regions.
In conclusion, we found that the regions where patients spend long periods in ALD
treatment are not among the regions with a high occupancy rate of beds in medicine, surgery
and obstetrics, as shown in the following graph.
Figure 27: Histogram of number of beds by region
Finally, we can conclude that the performance of French health services is satisfactory
and meets the requirements in terms of establishments and beds, which are constantly increasing.
KPI
With the KPIs, we tried to assess the degree of performance of every establishment. An
establishment is considered performing if its rate is above 50%.
Figure 28: KPIs
5. Reporting using Jaspersoft
Figure 29: Report with JasperSoft
In the previous figure, we show the top ten establishments in terms of number of
ALD sessions.
In this sprint, we aim to give a better representation of our data. So, we have chosen
to present a few figures in this report, highlighting the business side of our subject.
This table lists the ten corporate names (raisons sociales) by department, sorted in
descending order by the number of stays in each establishment. To offer good visibility into
the distribution of the number of stays, there is no better way than illustrating this table with
a graph.
In the figure below, we can clearly see the importance of this variable in each of the
institutions.
Figure 30: Top 10 establishments according to ALD sessions number
We observed that one establishment, AP-HP, repeats several times, but not in the
same department. So we focused on this corporate name and performed a deeper
representation of the activity domains of these institutions, located in Paris, Val-de-Marne,
Hauts-de-Seine and Seine-Saint-Denis.
Figure 31: AP-HP-Seine-Saint-Denis activity domains
Figure 32: AP-HP-Val-de-Marne activity domains
Figure 33: AP-HP-Hauts-de-Seine activity domains
Figure 34: AP-HP-Paris activity domains
We can see that these institutions treat the same diseases and share almost the same
activity domains, and that the number of stays for diseases related to the digestive,
orthopedic-trauma and nervous-system activity domains is very high in this group.
Figure 35: Number of ALD sessions by establishment category
This pie chart shows the distribution of ALD sessions by establishment category, in percentages.
Figure 36: Number of ALD sessions by age
This pie chart represents the segmentation of the number of stays across all
institutions according to age range.
Moreover, we are interested in representing the number of stays in the chemotherapy,
radiotherapy, hemodialysis and childbirth areas. We then deduced that the institutions that
deal the most with these areas are the following:
Figure 37: Top 3 establishments by number of chemotherapy, radiotherapy, hemodialysis and childbirth sessions
Each indicator is relative to an activity; that is how Hospidiag manages its business
areas.
We also made a small comparison between the MCO and HC areas to see whether the
same corporate names appear or not. In both areas, the top five institutions are the same: they
have the highest number of RSA in both MCO and HC.
6. Conclusion
In this chapter, we created an OLAP cube and then used it to analyze the data with
different reporting approaches.
Data mining
1. Introduction
“Data mining” is the process of discovering meaningful new correlations, patterns and
trends by sifting through large amounts of data stored in repositories, using pattern
recognition technologies as well as statistical and mathematical techniques.
Data mining is fairly recent, dating from the 1980s. It expanded to cope with a new
reality of the economic scene, namely the multiplication of very large databases that were
difficult for businesses without sufficient resources to use. It is a set of tools developed to
study interactions and unknown phenomena and to explore the underlying data.
This is also the literal meaning of the term “data mining”. It is found in the
management of human resources as well as in sectors such as retail. In a first step, valid data
that can be exploited is computationally extracted from major data sources. Eventually, the
use of this data detects the underlying correlations. Data mining uses the rules of statistics,
and more specifically mathematical algorithms, to compare all the results and conclude on
correlations or links between different phenomena.
2. Objective
The performance criterion of the participants' contributions is the RMSE (Root Mean
Square Error).
This is the square root of the arithmetic mean of the squared deviations between the
forecasts and the held target values.
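In symbols, with y_i the held target values, ŷ_i the forecasts and n the number of observations:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \]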
3. Used tools
RStudio
RStudio is a free and open-source integrated development environment (IDE) for R, a
programming language for statistical computing and graphics.
RStudio is available in two editions: RStudio Desktop, where the program is run locally
as a regular desktop application; and RStudio Server, which allows accessing RStudio using a
web browser while it is running on a remote Linux server. Prepackaged distributions of RStudio
Desktop are available for Windows, macOS, and Linux.
Figure 38: RStudio logo
4. Realization
Unsupervised methods
Association rules are if/then statements that help uncover relationships between
seemingly unrelated data in a relational database or other information repository.
An association rule has two parts, an antecedent (if) and a consequent (then). An
antecedent is an item found in the data. A consequent is an item that is found in combination
with the antecedent.
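The rules visualized below were mined with the arules and arulesViz packages. A minimal sketch of how such rules can be produced follows; the data frame df and its columns are assumptions standing in for a discretized extract of the warehouse.

library(arules)
library(arulesViz)

df[] <- lapply(df, factor)       # apriori works on categorical items
trans <- as(df, "transactions")

# Mine if/then rules above minimum support and confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.8))

inspect(head(sort(rules, by = "lift"), 5))  # the strongest rules first
plot(rules, method = "graph")               # graph view, as in Figure 39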
Figure 39: Rules with arulesViz
Figure 40: Presentation of a rule
From the association rules obtained, we can consider this rule the most reliable:
patients of CH DU HAUT BURGEY who are under 75 years old are likely to have a small
number of ALD sessions.
Supervised methods
In order to predict the variable cible1, we first selected the most significant variables.
Then we used different predictive approaches, namely a generalized linear model (multiple
linear regression), random forests, gradient boosting models (GBM) and deep learning, and
finally we selected the best model based on the RMSE score.
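All four model families were trained with the h2o package, whose performance reports are reproduced below. Here is a condensed sketch of the training code; training_df and the predictor selection are assumptions, with cible1 as the target.

library(h2o)
h2o.init()

train <- as.h2o(training_df)        # prepared extract of the warehouse
y <- "cible1"                       # variable to predict
x <- setdiff(colnames(train), y)    # remaining columns as predictors

glm_m <- h2o.glm(x, y, training_frame = train, family = "gaussian")
rf_m  <- h2o.randomForest(x, y, training_frame = train, ntrees = 100)
gbm_m <- h2o.gbm(x, y, training_frame = train, ntrees = 100)
dl_m  <- h2o.deeplearning(x, y, training_frame = train)

# Compare the models on RMSE, as in the reports below
sapply(list(glm_m, rf_m, gbm_m, dl_m),
       function(m) h2o.rmse(h2o.performance(m)))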
Generalized linear model
> h2o.performance(regression.model)
H2ORegressionMetrics: glm
** Reported on training data. **
MSE: 0.002698627
RMSE: 0.05194831
MAE: 0.03549875
RMSLE: 0.04100089
Mean Residual Deviance : 0.002698627
R^2 : 0.9074501
Null Deviance :53298.98
Null D.o.F. :1827898
Residual Deviance :4932.817
Residual D.o.F. :1827878
AIC :-5624648
Random Forests
> h2o.performance(rforest.model)
H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.001330922
RMSE: 0.0364818
MAE: 0.01778781
RMSLE: 0.02686152
Mean Residual Deviance : 0.001330922
GBM: Gradient boosting model
> h2o.performance (gbm.model)
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 0.0004069792
RMSE: 0.02017373
MAE: 0.009795494
RMSLE: 0.01485119
Mean Residual Deviance : 0.0004069792
Model 2 GBM
> h2o.performance (gbm.model)
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 0.0009652376
RMSE: 0.03106827
MAE: 0.01555565
RMSLE: 0.02287432
Deep Learning
> h2o.performance(dlearning.model)
H2ORegressionMetrics: deeplearning
** Reported on training data. **
** Metrics reported on temporary training frame with 9999 samples **
MSE: 0.0007069083
RMSE: 0.02658775
MAE: 0.01316837
RMSLE: 0.01945228
Mean Residual Deviance : 0.0007069083
5. Conclusion
After creating several regression models, we came to the conclusion that the gradient
boosting model (GBM) is the most powerful: it has the minimal Root Mean Square Error
(RMSE), with a value of 0.02017373.
Big Data
1. Introduction
In this chapter, we will give an overview of Big Data and then show the process of
extracting and analyzing data.
2. Big Data
Big data is a term for data sets that are so large or complex that traditional data
processing applications are inadequate. Challenges include analysis, capture, data curation,
search, sharing, storage, transfer, visualization, querying, updating and information privacy.
The term often refers simply to the use of predictive analytics or certain other advanced
methods to extract value from data, and seldom to a particular size of data set.
Accuracy in big data may lead to more confident decision making, and better decisions
can result in greater operational efficiency, cost reduction and reduced risk.
Big Data has become of major interest to the IT world.
• In general, the term refers to relatively new types of data (video, images,
sound, etc.) that generate large files.
• It also means large sets of small volumes of data (comments on websites of
social networks, photos of the seabed, images from traffic cameras) that take
their meaning when combined.
• In most cases, this Big Data experiences rapid growth, and some modest
data sets will grow to become Big Data.
Big Data is characterized by:
Volume refers to the vast amounts of data generated every second. Just think of all the
emails, twitter messages, photos, video clips, sensor data etc. we produce and share every
second. We are not talking Terabytes but Zettabytes or Brontobytes. On Facebook alone we
send 10 billion messages per day, click the "like" button 4.5 billion times and upload 350 million
new pictures each and every day. If we take all the data generated in the world between the
beginning of time and 2008, the same amount of data will soon be generated every minute! This
increasingly makes data sets too large to store and analyze using traditional database
technology. With big data technology we can now store and use these data sets with the help of
distributed systems, where parts of the data are stored in different locations and brought together
by software.
Velocity refers to the speed at which new data is generated and the speed at which data
moves around. Just think of social media messages going viral in seconds, the speed at which
credit card transactions are checked for fraudulent activities, or the milliseconds it takes trading
systems to analyze social media networks to pick up signals that trigger decisions to buy or sell
shares. Big data technology allows us now to analyze the data while it is being generated,
without ever putting it into databases.
Variety refers to the different types of data we can now use. In the past we focused on
structured data that neatly fits into tables or relational databases, such as financial data (e.g.
sales by product or region). In fact, 80% of the world’s data is now unstructured, and therefore
can’t easily be put into tables (think of photos, video sequences or social media updates). With
big data technology we can now harness different types of data (structured and unstructured)
including messages, social media conversations, photos, sensor data, video or voice recordings
and bring them together with more traditional, structured data.
Veracity refers to the messiness or trustworthiness of the data. With many forms of big
data, quality and accuracy are less controllable (just think of Twitter posts with hash tags,
abbreviations, typos and colloquial speech as well as the reliability and accuracy of content)
but big data and analytics technology now allows us to work with this type of data. The volumes
often make up for the lack of quality or accuracy.
Value: Then there is another V to take into account when looking at Big Data: Value!
It is all well and good having access to big data but unless we can turn it into value it is useless.
So, you can safely argue that 'value' is the most important V of Big Data. It is important that
businesses make a business case for any attempt to collect and leverage big data. It is so easy
to fall into the buzz trap and embark on big data initiatives without a clear understanding of
costs and benefits.
3. Used Tools
Apache Tomcat
Tomcat is an application server from the Apache Software Foundation that executes
Java servlets and renders web pages that include JavaServer Pages (JSP) coding. Described as
the "reference implementation" of the Java Servlet and JavaServer Pages specifications, Tomcat
is the result of an open collaboration of developers and is available from the Apache website
in both binary and source versions.
Tomcat can be used either as a standalone product with its own internal web server or
together with other web servers, including Apache, Netscape Enterprise Server, Microsoft
Internet Information Server (IIS) and Microsoft Personal Web Server. Tomcat requires a Java
Runtime Environment (JRE) that conforms to JRE 1.1 or later. Tomcat is one of several
open source collaborations collectively known as Jakarta.
Figure 41: Apache logo
Realization
First of all, we streamed data from Twitter; then we filtered the data to keep only significant
records; finally, we analyzed the resulting data, as illustrated in the figures and the sketch below.
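The report's actual fetching code is not reproduced here; as a rough illustration of this step, the following is a minimal sketch in R assuming the twitteR package, with placeholder API credentials, search query and filter threshold rather than the project's real configuration.

library(twitteR)

# Placeholder credentials -- replace with real Twitter API keys
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")

# Fetch recent tweets about the game (the query is an assumption)
tweets    <- searchTwitter("Real Madrid Celta Vigo", n = 1000, lang = "en")
tweets_df <- twListToDF(tweets)   # convert the list of tweets to a data frame

# Simple filtering step: keep only tweets that were retweeted at least once
tweets_df <- subset(tweets_df, retweetCount > 0)
head(tweets_df$text)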
Figure 42: Fetching data from Twitter
Figure 43: Fetching configuration
Figure 44: Fetched data
Figure 45: Table creation and inserting data into it
Figure 46: Used words after the Real Madrid Vs Celta Vigo game
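Word counts like those in Figure 46 could be derived along the following lines; this is a minimal sketch assuming the tm and wordcloud packages and reusing the hypothetical tweets_df data frame from the previous sketch.

library(tm)
library(wordcloud)

# Build and clean a corpus from the tweet texts
corpus <- VCorpus(VectorSource(tweets_df$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Count word frequencies across all tweets
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Plot the most frequent words, as in the figure above
wordcloud(names(freq), freq, max.words = 50)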
We also streamed data from Facebook using the “Rfacebook” package. The data that we
streamed concerns the comments posted after the 18 May 2017 game between Real Madrid
and Celta Vigo, as sketched below.
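A minimal sketch of this step, assuming the Rfacebook package; the app credentials, page names and date window below are placeholders, not the project's actual values.

library(Rfacebook)

# Placeholder app credentials -- replace with a real Facebook app
token <- fbOAuth(app_id = "YOUR_APP_ID", app_secret = "YOUR_APP_SECRET")

# Posts published on both club pages around the game (page names assumed)
rm_posts  <- getPage("RealMadrid",  token, n = 100,
                     since = "2017/05/18", until = "2017/05/20")
fcb_posts <- getPage("fcbarcelona", token, n = 100,
                     since = "2017/05/18", until = "2017/05/20")

# Comments under the most-commented Real Madrid post
top_post <- rm_posts$id[which.max(rm_posts$comments_count)]
comments <- getPost(top_post, token, n.comments = 500)$comments
head(comments$message)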
Figure 47: Activity on the Real Madrid Facebook page
Figure 48: Activity on the FC Barcelona Facebook page
Conclusion
In this chapter, we fetched streamed data from Twitter and Facebook, filtered it according
to specific criteria and analyzed the resulting data sets.
General Conclusion
These past months of work have allowed us to place ourselves in a professional context
and to work as a team on a project of great magnitude. Our internship was particularly formative
from a technical standpoint: we strengthened our foundations in SQL, discovered MDX and,
above all, discovered the world of Business Intelligence and reporting.
Furthermore, this internship is a continuation of what we learned in previous years about
software development, and it allowed us to become familiar with new concepts such as Business
Intelligence.
We discovered how important reporting and data analysis are to the service that any
application provides to its customers. Although it took us some time to acquire the notions of
BI and reporting, and although the language barrier sometimes added to the difficulty, we
gradually adapted and got used to a new working environment and new technologies. We also
gained knowledge of data mining, machine learning and big data.
Finally, the internship was rewarding, as all of our achievements were put into production.

More Related Content

What's hot

PRM601 Final Project_Magana_J
PRM601 Final Project_Magana_JPRM601 Final Project_Magana_J
PRM601 Final Project_Magana_JJerry P. Maga
 
Final design report
Final design reportFinal design report
Final design reportpeymanabaee
 
HSK 3 Intensive Reading for Advance Learner V2009 H31005 汉语水平考试三级模拟考题 - Exam-...
HSK 3 Intensive Reading for Advance Learner V2009 H31005 汉语水平考试三级模拟考题 - Exam-...HSK 3 Intensive Reading for Advance Learner V2009 H31005 汉语水平考试三级模拟考题 - Exam-...
HSK 3 Intensive Reading for Advance Learner V2009 H31005 汉语水平考试三级模拟考题 - Exam-...LEGOO MANDARIN
 
Nft s explained 2022
Nft s explained 2022Nft s explained 2022
Nft s explained 2022Nikoevil
 
Abap coding standards
Abap coding standardsAbap coding standards
Abap coding standardssurendra1579
 

What's hot (6)

896405 - HSSE_v03
896405 - HSSE_v03896405 - HSSE_v03
896405 - HSSE_v03
 
PRM601 Final Project_Magana_J
PRM601 Final Project_Magana_JPRM601 Final Project_Magana_J
PRM601 Final Project_Magana_J
 
Final design report
Final design reportFinal design report
Final design report
 
HSK 3 Intensive Reading for Advance Learner V2009 H31005 汉语水平考试三级模拟考题 - Exam-...
HSK 3 Intensive Reading for Advance Learner V2009 H31005 汉语水平考试三级模拟考题 - Exam-...HSK 3 Intensive Reading for Advance Learner V2009 H31005 汉语水平考试三级模拟考题 - Exam-...
HSK 3 Intensive Reading for Advance Learner V2009 H31005 汉语水平考试三级模拟考题 - Exam-...
 
Nft s explained 2022
Nft s explained 2022Nft s explained 2022
Nft s explained 2022
 
Abap coding standards
Abap coding standardsAbap coding standards
Abap coding standards
 

Similar to Rapport pebi-anap-atih-2020

Plan Bee Chitral Reporting Period Update - Creating an Enabling Environment f...
Plan Bee Chitral Reporting Period Update - Creating an Enabling Environment f...Plan Bee Chitral Reporting Period Update - Creating an Enabling Environment f...
Plan Bee Chitral Reporting Period Update - Creating an Enabling Environment f...Hashoo Foundation USA
 
Open Textbook Basics of Project Management
Open Textbook Basics of Project ManagementOpen Textbook Basics of Project Management
Open Textbook Basics of Project Managementssuserf45585
 
Law Firm Management Project for HND of SQA
Law Firm Management Project for HND of SQALaw Firm Management Project for HND of SQA
Law Firm Management Project for HND of SQAYeeMonNyuntWin
 
Brunel MSc Thesis-NA Stogiannos-Final
Brunel MSc Thesis-NA Stogiannos-FinalBrunel MSc Thesis-NA Stogiannos-Final
Brunel MSc Thesis-NA Stogiannos-FinalAlex Stogiannos
 
IMechE Report Final_Fixed
IMechE Report Final_FixedIMechE Report Final_Fixed
IMechE Report Final_FixedAmit Ramji ✈
 
IMechE Report Final_Fixed
IMechE Report Final_FixedIMechE Report Final_Fixed
IMechE Report Final_FixedAmit Ramji ✈
 
10 guidetoproposalwriting
10 guidetoproposalwriting10 guidetoproposalwriting
10 guidetoproposalwritingPlut Kumkm Aceh
 
SSTRM - StrategicReviewGroup.ca - Workshop 2: Power/Energy and Sustainability...
SSTRM - StrategicReviewGroup.ca - Workshop 2: Power/Energy and Sustainability...SSTRM - StrategicReviewGroup.ca - Workshop 2: Power/Energy and Sustainability...
SSTRM - StrategicReviewGroup.ca - Workshop 2: Power/Energy and Sustainability...Phil Carr
 
Virtual Classroom System for Women`s University in Africa
Virtual Classroom System for Women`s University in AfricaVirtual Classroom System for Women`s University in Africa
Virtual Classroom System for Women`s University in Africatarrie chagwiza
 
Cloud Computing Security (Final Year Project) by Pavlos Stefanis
Cloud Computing Security (Final Year Project) by Pavlos StefanisCloud Computing Security (Final Year Project) by Pavlos Stefanis
Cloud Computing Security (Final Year Project) by Pavlos StefanisPavlos Stefanis
 
Emergency Planning Independent Study 235.b
Emergency Planning  Independent Study 235.b  Emergency Planning  Independent Study 235.b
Emergency Planning Independent Study 235.b MerrileeDelvalle969
 
Emergency planning independent study 235.b
Emergency planning  independent study 235.b  Emergency planning  independent study 235.b
Emergency planning independent study 235.b ronak56
 
Cobre - Gestión de Activos – Guía para la aplicación de la norma 55001
Cobre - Gestión de Activos – Guía para la aplicación de la norma 55001Cobre - Gestión de Activos – Guía para la aplicación de la norma 55001
Cobre - Gestión de Activos – Guía para la aplicación de la norma 55001Horacio Felauto
 
Albpm60 studio reference_guide
Albpm60 studio reference_guideAlbpm60 studio reference_guide
Albpm60 studio reference_guideVibhor Rastogi
 
Current e manual
Current e manualCurrent e manual
Current e manualAYM1979
 

Similar to Rapport pebi-anap-atih-2020 (20)

Plan Bee Chitral Reporting Period Update - Creating an Enabling Environment f...
Plan Bee Chitral Reporting Period Update - Creating an Enabling Environment f...Plan Bee Chitral Reporting Period Update - Creating an Enabling Environment f...
Plan Bee Chitral Reporting Period Update - Creating an Enabling Environment f...
 
FYPFINAL
FYPFINALFYPFINAL
FYPFINAL
 
Open Textbook Basics of Project Management
Open Textbook Basics of Project ManagementOpen Textbook Basics of Project Management
Open Textbook Basics of Project Management
 
MBA Dissertation Thesis
MBA Dissertation ThesisMBA Dissertation Thesis
MBA Dissertation Thesis
 
Law Firm Management Project for HND of SQA
Law Firm Management Project for HND of SQALaw Firm Management Project for HND of SQA
Law Firm Management Project for HND of SQA
 
Brunel MSc Thesis-NA Stogiannos-Final
Brunel MSc Thesis-NA Stogiannos-FinalBrunel MSc Thesis-NA Stogiannos-Final
Brunel MSc Thesis-NA Stogiannos-Final
 
IMechE Report Final_Fixed
IMechE Report Final_FixedIMechE Report Final_Fixed
IMechE Report Final_Fixed
 
IMechE Report Final_Fixed
IMechE Report Final_FixedIMechE Report Final_Fixed
IMechE Report Final_Fixed
 
10 guidetoproposalwriting
10 guidetoproposalwriting10 guidetoproposalwriting
10 guidetoproposalwriting
 
Engineering
EngineeringEngineering
Engineering
 
SSTRM - StrategicReviewGroup.ca - Workshop 2: Power/Energy and Sustainability...
SSTRM - StrategicReviewGroup.ca - Workshop 2: Power/Energy and Sustainability...SSTRM - StrategicReviewGroup.ca - Workshop 2: Power/Energy and Sustainability...
SSTRM - StrategicReviewGroup.ca - Workshop 2: Power/Energy and Sustainability...
 
Virtual Classroom System for Women`s University in Africa
Virtual Classroom System for Women`s University in AfricaVirtual Classroom System for Women`s University in Africa
Virtual Classroom System for Women`s University in Africa
 
Cloud Computing Security (Final Year Project) by Pavlos Stefanis
Cloud Computing Security (Final Year Project) by Pavlos StefanisCloud Computing Security (Final Year Project) by Pavlos Stefanis
Cloud Computing Security (Final Year Project) by Pavlos Stefanis
 
Emergency Planning Independent Study 235.b
Emergency Planning  Independent Study 235.b  Emergency Planning  Independent Study 235.b
Emergency Planning Independent Study 235.b
 
Emergency planning independent study 235.b
Emergency planning  independent study 235.b  Emergency planning  independent study 235.b
Emergency planning independent study 235.b
 
Montero Dea Camera Ready
Montero Dea Camera ReadyMontero Dea Camera Ready
Montero Dea Camera Ready
 
Cobre - Gestión de Activos – Guía para la aplicación de la norma 55001
Cobre - Gestión de Activos – Guía para la aplicación de la norma 55001Cobre - Gestión de Activos – Guía para la aplicación de la norma 55001
Cobre - Gestión de Activos – Guía para la aplicación de la norma 55001
 
Albpm60 studio reference_guide
Albpm60 studio reference_guideAlbpm60 studio reference_guide
Albpm60 studio reference_guide
 
Current e manual
Current e manualCurrent e manual
Current e manual
 
API Project Capstone Paper
API Project Capstone PaperAPI Project Capstone Paper
API Project Capstone Paper
 

Recently uploaded

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 

Recently uploaded (20)

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 

Rapport pebi-anap-atih-2020

  • 1. P a g e 1 | 51 Summary General Introduction .................................................................................................................. 6 Project Context........................................................................................................................... 7 Introduction ............................................................................................................................ 7 Presentation of the organization............................................................................................. 7 ANAP................................................................................................................................. 7 ATIH .................................................................................................................................. 8 Presentation of the project...................................................................................................... 8 ANAP-ATIH2020 .............................................................................................................. 8 Proposed solution ................................................................................................................... 9 Methodology .......................................................................................................................... 9 Presentation of the Scrum method ..................................................................................... 9 Scrum roles....................................................................................................................... 10 Road map.............................................................................................................................. 11 Conclusion............................................................................................................................ 11 Analysis.................................................................................................................................... 12 Introduction .......................................................................................................................... 12 Functional requirements....................................................................................................... 12 Non-Functional Requirements ............................................................................................. 12 Actor identification .............................................................................................................. 13 Use case diagram.................................................................................................................. 14 Conclusion............................................................................................................................ 15 ETL........................................................................................................................................... 16 Introduction .......................................................................................................................... 16 Data Source .......................................................................................................................... 16 Dimensional modeling ......................................................................................................... 16 Choice of the model ............................................................................................................. 17
  • 2. P a g e 2 | 51 Realisation............................................................................................................................ 19 Used tools......................................................................................................................... 19 Extract Transform Load ................................................................................................... 20 Operational Data Store ODS............................................................................................ 20 Data warehouse ................................................................................................................ 21 OLAP ....................................................................................................................................... 24 Introduction .......................................................................................................................... 24 Used tools............................................................................................................................. 24 Schema Workbench.......................................................................................................... 24 Mondrian .......................................................................................................................... 24 Realization............................................................................................................................ 25 Reporting.................................................................................................................................. 27 Introduction .......................................................................................................................... 27 BI and Reporting .................................................................................................................. 27 Performance ..................................................................................................................... 27 Decision making............................................................................................................... 27 Delivery............................................................................................................................ 27 Used tools............................................................................................................................. 28 QlikView .......................................................................................................................... 28 JasperReports ................................................................................................................... 28 Saiku................................................................................................................................. 29 Realization............................................................................................................................ 29 Reporting using QlikView ............................................................................................... 29 KPI ................................................................................................................................... 32 Reporting using Jaspersoft ................................................................................................... 33 Conclusion............................................................................................................................ 
39 Data mining.............................................................................................................................. 40
  • 3. P a g e 3 | 51 Introduction .......................................................................................................................... 40 Objective .............................................................................................................................. 40 Used tools............................................................................................................................. 40 RStudio............................................................................................................................. 40 Realization............................................................................................................................ 41 Unsupervised methods ..................................................................................................... 41 Supervised methods.......................................................................................................... 42 Generalized linear model ................................................................................................. 42 Random Forests................................................................................................................ 43 GBM: Gradient boosting model....................................................................................... 43 Model 2 GBM .................................................................................................................. 43 Deep Learning.................................................................................................................. 43 Conclusion............................................................................................................................ 43 Big Data.................................................................................................................................... 44 Introduction .......................................................................................................................... 44 Big Data................................................................................................................................ 44 Used Tools............................................................................................................................ 45 Apache Tomcat ................................................................................................................ 45 Realization...................................................................................Erreur ! Signet non défini. General Conclusion.................................................................................................................. 51
  • 4. P a g e 4 | 51 Illustrations Figure 1: ANAP logo ................................................................................................................. 7 Figure 2:ATIH logo.................................................................................................................... 8 Figure 3: Scrum methode ......................................................................................................... 10 Figure 4: Road map.................................................................................................................. 11 Figure 5: Use case diagram ...................................................................................................... 14 Figure 6: Hospidiag interface................................................................................................... 16 Figure 7: Data warehouse schema............................................................................................ 18 Figure 8: Postgres logo............................................................................................................. 19 Figure 9: Pentaho logo ............................................................................................................. 20 Figure 10: Charging of table HD.............................................................................................. 21 Figure 11: Charging domaine activite dimension .................................................................... 21 Figure 12: Data transformation ................................................................................................ 22 Figure 13: fact_data charging................................................................................................... 22 Figure 14: fact HD charging..................................................................................................... 23 Figure 15: Casting data ............................................................................................................ 23 Figure 16: Charging data warehouse job.................................................................................. 23 Figure 17: Mondrian logo ........................................................................................................ 25 Figure 18: Cube creation with schema workbench .................................................................. 25 Figure 19: XML schema generated with dchema workbench.................................................. 26 Figure 20: ClikView logo......................................................................................................... 28 Figure 21: JasperSoft logo........................................................................................................ 28 Figure 22: Saiku logo ............................................................................................................... 29 Figure 23: Histogram of number of stay/ session .................................................................... 29 Figure 24: Pie chart of the number of ALD’sessions.............................................................. 30 Figure 25: Histogram of number of ALD’ sessions by establishment's category.................... 31 Figure 26 :number of ALD' sessions at EBNL ........................................................................ 31 Figure 27: histogram number of beds by regions.................................................................... 
32 Figure 28: KPIs ........................................................................................................................ 33 Figure 29: Report with JaperSoft ............................................................................................. 33 Figure 30: Top 10 establishments accorfing to Ald sessions number...................................... 34 Figure 31: AP-HP-Seine-Saint-Denis activity domains........................................................... 35 Figure 32AP-HP-ValDe-Mane activity domains:.................................................................... 35
  • 5. P a g e 5 | 51 Figure 33:AP-HP-Haut-Seine activity domains....................................................................... 36 Figure 34:AP-HPParis activity domains .................................................................................. 36 Figure 35: Number of ALD sessions by establishment category............................................. 37 Figure 36: number of ALD sessions by age............................................................................. 38 Figure 37: TOP3-Number-Chemo-Sessions-Radio-Hemodialysis -childbirth ........................ 38 Figure 38: R studio logo........................................................................................................... 41 Figure 39: Rules with arulesViz............................................................................................... 41 Figure 40: Presentation of a rule .............................................................................................. 42 Figure 41: Apache logo............................................................................................................ 46 Figure 43: Fetching data from twiter........................................................................................ 46 Figure 44: Fetching configuration............................................................................................ 47 Figure 45: Fetched data............................................................................................................ 47 Figure 46: Table creation and inserting data into it ................................................................. 48 Figure 47: Used words after the Real Madrid Vs Celta Vigo game ........................................ 48 Figure 48: Activity on the Real Madrid Facebookpage .......................................................... 49 Figure 49: Activity on the FCBarcelone facebook page......................................................... 50
  • 6. P a g e 6 | 51 General Introduction Harnessing the full potential of data requires developing an organization-wide of organizing this data in addition to a data science strategy. Such strategies are now commonplace in most industries such as banking and retail. Banks can offer their customers targeted needs- based services and improved fraud protection because they collect and analyze transactional data. Furthermore, as healthcare organizations face increasing demand for healthcare services, harnessing data and analytics helps organizations improve patient care, manage chronic disease, apply adaptive treatments and reduce costs. Healthcare is the maintenance or improvement of health via the diagnosis, treatment, and prevention of disease, illness, injury, and other physical and mental impairments in human beings. Such field contains a huge amount of data that needs to be analyzed. Health care is a glaring exception. Individual pieces of data can have life-or-death importance, but many organizations fail to aggregate data effectively to gain insights into wider care processes. Without a data science strategy, health care organizations can’t draw on increasing volumes of data and medical knowledge in an organized, strategic way, and individual clinicians can’t use that knowledge to improve the safety, quality, and efficiency of the care they provide. Hence comes the need to collect data and from that collected data comes the need to make good decisions. Making good decisions that requires all relevant data to be taking into consideration. If an organization tries to aggregate and analyze poor-quality data, it may derive useless or even dangerous conclusions. Therefore, the data has to be structured. The best source for that data is a well-designed data warehouse which will help in the process of making decisions, analyzing data and even making predictions.
  • 7. P a g e 7 | 51 Project Context 1. Introduction We can’t talk about out an organization without talking about information. Nowadays every organization tries to have a system to manipulate its data ERP. The Hospidiag ERP is the result of the cooperation between two organizations ANAP and ATIH. In this chapter, we will give the presentation of the ANAP-ATIH2020 but first we will start by presenting the organizations. 2. Presentation of the organization ANAP Present since 2009, the National Agency of Support for the Performance of the establishments of health and medical and social ANAP, comes to improve the performances within the framework of the reform of the health system in France. This agency which is public help the establishments of health and medical social to improve the service provided to the patients and to the users, by developing recommendations and tools, allowing them to optimize their management and their organization. Figure 1: ANAP logo
  • 8. P a g e 8 | 51 ATIH Established in 2000, the Technical Agency of the Information about the Hospitalization ATIH, is in charge of collecting and of analyzing the data of the establishments of health: activity, costs, organization and quality of the care, the finances, the human resources... It realizes studies on the hospitable costs and participates in the mechanisms of financing of establishments. Figure 2:ATIH logo 3. Presentation of the project ANAP-ATIH2020 The ageing of the French population comes along with an important increase of the number of people living with chronic diseases. The offer of care which mainly built itself around the coverage of the short-term care has to evolve today to integrate the needs for long- term follow-up. The ANAP and the ATIH, within the framework of their respective missions agree on an essential report: a considerable quantity of data is produced by the various actors of health. This wealth and the variety of the collected information open big perspectives of exploitation as for the understanding of the health system and its evolution. The opening of the data can allow to slow down the capacity of exploitation and analysis and so to take advantage at best of the wealth of this information. The ANAP-ATIH project 2020 constitutes an opportunity to make sensitive widely on this question and he pursues 3 main objectives:
  • 9. P a g e 9 | 51 • To promote a playful and attractive initiative to make the pedagogy of the stakes in the exploitation of the data of health. • To demonstrate the interest and the potential of an exploitation of the data in the service of a health policy to a local or national level, and to give a concrete illustration to this major question of the adaptation of the organization of establishments to the increase of the chronic pathologies. • To explore the capacity of dated scientists diverse horizons to contribute to the resolution of problems resting on a better exploitation of the data. 4. Proposed solution Based on identified needs of the ANAP-ATIH2020 we will propose a solution to structure the collected data in a data warehouse in order to have not only a quick and easy access to it but also to insure its quality and consistency. Then we will create a cube that allows fast analysis of data according to multiple dimensions and which will be used to create reports with multiple tools. 5. Methodology Presentation of the Scrum method A successful Scrum project is much about understanding what Scrum is. Therefor we will try to give an overview about Scrum which will be our methodology during the project. Scrum is a way for teams to work together to develop a product. Product development, using Scrum, occurs in small pieces, with each piece building upon previously created pieces. Building products one small piece at a time encourages creativity and enables teams to respond to feedback and change, to build exactly and only what is needed. More specifically, Scrum is a simple framework for effective team collaboration on complex projects. Scrum provides a small set of rules that create just enough structure for teams to be able to focus their innovation on solving what might otherwise be an insurmountable challenge. However, Scrum is much more than a simple framework. Scrum supports our need to be human at work: to belong, to learn, to do, to create and be creative, to grow, to improve, and to interact with other people. In other words, Scrum leverages the innate traits and characteristics in people to allow them to do great things together.
  • 10. P a g e 10 | 51 Figure 3: Scrum methode Scrum roles Building complex products for customers is an inherently difficult task. Scrum provides structure to allow teams to deal with that difficulty. However, the fundamental process is incredibly simple, and at its core is governed by 3 primary roles. • Product Owners: determine what needs to be built in the next 30 days or less. • Team: build what is needed in 30 days (or less), and then demonstrate what they have built. Based on this demonstration, the Product Owner determines what to build next. • Scrum Master: ensure this process happens as smoothly as possible, and continually help improve the process, the team and the product being created. While this is an incredibly simplified view of how Scrum works, it captures the essence of this highly productive approach for team collaboration.
  • 11. P a g e 11 | 51 6. Road map Figure 4: Road map 7. Conclusion In this chapter, we introduced briefly the project by presenting the organizations concerned by the challenge. The ANAP-ATIH2020 is based on the collect of information from the French establishments of health which is to be structured and analyzed in order to predict the medium-term evolution of the coverage care of the chronic diseases’ importance for the establishments of health.
  • 12. P a g e 12 | 51 Analysis 1. Introduction During this chapter, we will start by presenting both the functional and non-functional requirements of our project. Then, we will continue with the project actors’ identification and finally we will resume with a use case diagram. 2. Functional requirements The system that we need has to be: • Operational • Scalable • User friendly • Offering the necessary information in real time. For this reason, the system must perform to fulfill the requirements of all users. We present in the following paragraph all functional requirements: • Analyze the input Data • Analyze the output Data • Analyze the session time for each user • Analyze the number of connected people • User Status 3. Non-Functional Requirements Non-functional requirements are the gaps that may prevent the application of operate effectively and efficiently. a. Ability: The total amount of data within selected databases represents a very large volume. In addition, the treatment thereof increases about this little initial volume. Therefore, server capacity shall be sufficient to enable a collect all these data allow program execution and safeguarding updates over time. b. Integrity: At all levels of implementation of the data warehouse, different errors will be treated including the following:
  • 13. P a g e 13 | 51 • The treatment of bad data when importing data from different bases. • In order to standardize the format of the data collected, it will be necessary convert two-dimensional structures in three-dimensional structures. This will be the source of errors that must be taken into account. • Referential integrity in the database tables. c. Quality: • Facilitating data access and dissemination of information. • Reliability and traceability of data. • Human Machine Interaction as intuitive as possible d. Simplicity: Since among the users of the application, there are those who do not necessarily have great knowledge in the field of IT, the functionality of the solution should be understandable and easy to handle. Indeed, the navigation through the different sections shall be designed so that the user finds it easily; it must actually feel in control at all times. e. Performance: The application must meet all user requirements in an optimal way. The application performance results in reduced access times to different features an access time to data acceptable seen handling a warehouse relatively large data. f. Reliability: It must ensure content quality and relevance of information. g. Ergonomics: The first thing that catches the attention of users is the ergonomics and ease use, for that special attention should be given to this need. 4. Actor identification An actor represents the abstraction of a role played by external entities that interact directly with the system studied. An actor acts on the system, it plays a different role in each use case which he collaborates and is usually represented by a stick man. An actor characterizes an outside user, or a group of users that interact directly with the system. In our system, we have identified two actors: Administrator Decision maker
  • 14. P a g e 14 | 51 • The Administrator: Generate the report and cubes and publish as required of terms and conditions. • The Decision Maker: He takes analysis generated reports. 5. Use case diagram A use case is a methodology used in system analysis to identify, clarify, and organize system requirements. The use case is made up of a set of possible sequences of interactions between systems and users in a particular environment and related to a particular goal. It consists of a group of elements (for example, classes and interfaces) that can be used together in a way that will have an effect larger than the sum of the separate elements combined. The use case should contain all system activities that have significance to the users. A use case can be thought of as a collection of possible scenarios related to a particular goal, indeed, the use case and goal are sometimes considered to be synonymous. Figure 5: Use case diagram a. Create Report: Pre-condition: none Description: An Admin can create reports Post-condition: the report has been successfully added
  • 15. P a g e 15 | 51 b. Save Report: Pre-condition: none. Description: Only an admin can save reports. Post-condition: the report has been successfully saved. c. Edit Reports: Pre-condition: none. Description: An admin edit reports. Post-condition: the report has been successfully saved. d. Load Report: Pre-condition: none. Description: An admin or decision maker can load reports. Post-condition: the report is loaded 6. Conclusion During this chapter, we gave an overview of the project. In the chapters, we will introduce the that we have established over the different sprints.
  • 16. P a g e 16 | 51 ETL 1. Introduction ETL is short for Extract, Transform and Load. As the name hints, we’ll extract data from one or more operational databases, transform it to fit in a warehouse structure, and load the data into the DWH. A data warehouse is a system used to store information for use in data analysis and reporting in which the data are integrated, not volatile, and logged. But first, it is essential to define its structure. Before filling the data warehouse, the design of it is essential. 2. Data Source The data used in our project is offered throw the Hospidiag OpenData Source. Figure 6: Hospidiag interface 3. Dimensional modeling Dimensional modeling is a part of data warehouse design, results in the creation of the dimensional model. There are two types of tables involved: • Dimension tables are used to describe the data we want to store. For example: a retailer might want to store the date, store, and employee involved in a specific purchase. Each dimension table is its own category (date, employee, store) and can have one or more attributes. For each store, we can save its location at the city, region,
  • 17. P a g e 17 | 51 state and country level. For each date, we can store the year, month, day of the month, day of the week, etc. This is related to the hierarchy of attributes in the dimension table. • Fact tables contain the data we want to include in reports, aggregated based on values within the related dimension tables. A fact table has only columns that store values and foreign keys referencing the dimension tables. Combining all the foreign keys forms the primary key of the fact table. For instance, a fact table could store a number of contacts and the number of sales resulting from these contacts. • Star Schema: It has single fact table connected to dimension tables like a star. In star schema only one join establishes the relationship between the fact table and any one of the dimension tables. A star schema has one fact table and is associated with numerous dimensions’ table and depicts a star. • Snowflake Schema: It is an extension of the star schema. In snowflake schema, very large dimension tables are normalized into multiple tables. It is used when a dimensional table becomes very big. In snow flake schema since there is relationship between the dimensions Tables it has to do many joins to fetch the data. Every dimension table is associated with sub dimension table. Performance wise, star schema is good. But if memory utilization is a major concern, then snow flake schema is better than star schema. 4. Choice of the model When choosing a database schema for a data warehouse, snowflake and star schemas tend to be popular choices. Our choice is based on the model star schema simply because: First of all, with the star model, dimension analysis is easier. In addition, we do not have dimensions that are connected directly to each other so it is unnecessary to use the snowflake schema. Also, this model offers an ease of use with lower query complexity and easy to understand and it has query performance with a less number of foreign keys and hence shorter query execution time (faster). Finally, the data model is a top down approach.
  • 18. P a g e 18 | 51 Figure 7: Data warehouse schema Dim_domaine_activite: contains an id, the code and the label of an activity. Dim_etablissement: contains an id and informations regarding an establishment such as its name, category weather it’s public or private and its activity in medicine, surgery and obstetrics. Dim_tranche_age: contains an id the age is divided into two classes under 75 and over 75. Dim_region: that contains an id, a region and a department. Dim_cia: contains an id and different indicators. DimTemps: Calendar date dimension are attached to virtually every fact table to allow navigation of the fact table through familiar dates, months, fiscal periods, and special days on the calendar. You would never want to compute Easter in SQL, but rather want to look it up in the calendar date dimension. The calendar date dimension typically has many attributes describing characteristics such as week number, month name, fiscal period, and national holiday indicator. To facilitate partitioning, the primary key of a date dimension can be more meaningful, such as an integer representing YYYYMMDD, instead of a sequentially-assigned surrogate key. However, the date dimension table needs a special row to represent unknown or
  • 19. P a g e 19 | 51 to-be-determined dates. If a smart date key is used, filtering and grouping should be based on the dimension table’s attributes, not the smart key. Fact_data: This is the first fact table, contains all foreign keys and the measures are used to perform the calculation. The fact table is directly related to the following dimensions (dim_domaine_activite, dim_temps, dim_etablissement, dim_provenance, dim_tranche_age). And the measures are nombre de séjours/séances MCO des patients en ALD, nombre total de séjours/séances and cible1 wich is a variable to predict. Fact_hd: it’s the second fact table, contains all foreign keys and the measures are used to perform the calculation. The fact table is directly related to the following dimensions (dimTemps, dim_cia, dim_etablissement, dim_provenance). And it contains over 160 indicators. 5. Realisation Used tools a. PostgresSQL It is an object-relational database management system (ORDBMS) with an emphasis on extensibility and standards-compliance. As a database server, its primary function is to store data securely, supporting best practices, and to allow for retrieval at the request of other software applications. It can handle workloads ranging from small single machine applications to large Internet-facing applications with many concurrent users. Recent versions also provide replication of the database itself for availability and scalability. Figure 8: Postgres logo b. Pentaho
  • 20. P a g e 20 | 51 Is a business intelligence BI software company that offers open source products which provide data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load ETL capabilities. It was founded in 2004 by five founders and is headquartered in Orlando, Florida. Pentaho was acquired by Hitachi Data Systems in 2015 Figure 9: Pentaho logo Extract Transform Load The first step of a BI project is to create a central repository for a vision Global data of each service. This repository is called data warehouse. Data warehouse is a system used to store information for use in data analysis and reporting. This process therefore takes place in three stages: • Extraction of data from one or more data sources. • Transformation aggregated data. • Load data into the destination database (data warehouse). Pentaho Data Integration prepares the data to create a complete image of your business that causes for thought. Using visual tools to eliminate coding and complexity, Pentaho brings Big Data and data sources within the reach of business and IT users alike. (Power to access, prepare and blend all data) After charging the tables in the operational data store (ODS), we will transform and charge the data in our data warehouse. Operational Data Store ODS The ODS is a database designed to integrate data from multiple sources for additional operations on the data. Unlike a master data store, the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting. In the realization of the ODS, we followed the Extract Load Process EL.
First, we extracted the data from the different hospidiag files, which cover the 8 years from 2008 to 2015.
Figure 10: Loading of the HD table
Data warehouse
In order to create the data warehouse, we extracted the data from the ODS, transformed it and then loaded the different dimensions and fact tables. The screenshots below show a part of this process.
Loading the dimensions:
Figure 11: Loading the domaine_activite dimension
Figure 12: Data transformation
Loading the fact tables:
Figure 13: Loading fact_data
Figure 14: Loading fact_hd
Figure 15: Casting data
Figure 16: Data warehouse loading job
OLAP
1. Introduction
After the realization of the data warehouse, we move on to the next step of our project, which is the creation of the Online Analytical Processing (OLAP) cubes. An OLAP cube is a multidimensional database optimized for data warehouse and OLAP applications: a method of storing data in a multidimensional form, generally for reporting purposes. In OLAP cubes, the data (measures) are categorized by dimensions. These cubes are often pre-summarized across dimensions to drastically improve query time over relational databases. The query language used to interact and perform tasks with OLAP cubes is Multidimensional Expressions (MDX). The MDX language was originally developed by Microsoft in the late 1990s and has since been adopted by many other vendors of multidimensional databases.
2. Used tools
Schema Workbench
The Mondrian Schema Workbench allows you to visually create and test Mondrian OLAP cube schemas. It provides the following functionality:
• A schema editor integrated with the underlying data source for validation.
• Testing MDX queries against the schema and database.
• Browsing the underlying database structure.
See the Mondrian technical guide to understand schemas. Once you have the schema file, you can upload it into your application.
Mondrian
Mondrian is an OLAP (Online Analytical Processing) engine written in Java by Julian Hyde enabling the design, publishing and querying of multidimensional cubes. It executes MDX queries against data warehouses stored in a relational DBMS, hence its characterization as "ROLAP" (Relational OLAP); among ROLAP engines, Mondrian is the open source reference. It exposes its results in an understandable format through a multidimensional client-side presentation API, usually in Web mode, for example JPivot, Pentaho Analyzer, Pentaho Analysis Tool and Geo Analysis Tool (GAT). It uses standard OLAP modelling and can connect to any data warehouse designed according to the rules of the art of business intelligence. It is interesting to note that Mondrian is the OLAP component used by most open source BI suites, including Pentaho, SpagoBI and JasperServer.
Figure 17: Mondrian logo
3. Realization
In our project, we created two cubes. The first one is built on the fact table fact_data and five dimensions: dim_domaine_activite, dim_temps, dim_etablissement, dim_provenance and dim_tranche_age. The second cube is built on the fact table fact_hd, which is analysed along the following dimensions: dim_temps, dim_cia, dim_etablissement and dim_provenance.
Figure 18: Cube creation with Schema Workbench
Figure 19: XML schema generated with Schema Workbench
Reporting
1. Introduction
Most companies need many different types of reports, in many cases hundreds of them and occasionally more. Business Intelligence software often includes comprehensive reporting tools that can extract and present data through many different media (such as an internal Web page or intranet, the Internet for customers, Excel files or PDF files). In many cases these reporting facilities are controlled by parameters that can be chosen in real time to produce a report that is run directly against the data (often a data warehouse or multidimensional data).
Reporting is seen as static information retrieved from a source system such as an ERP. Most ERP systems include such static reports as an out-of-the-box (OOTB) offering. Most of these reports cover areas like purchase orders, invoices, goods received, debtor balances, inventory on hand, financial statements, vendor and customer lists, resourcing, planning, etc.
2. BI and Reporting
Performance
BI can deliver high volumes of data to a large number of users because of its combination of architecture, software and technologies. It is important to understand that BI does not sit within the ERP, as reporting does, so much as it sits on top of the ERP. Thus, the BI environment does not impact the ERP environment, and vice versa.
Decision making
Reporting allows for short-term, reactive decision making, as the reports are very static and a lot of manual work is needed to consolidate data across multiple dimensions. BI allows decision making that directly affects your strategy. Key performance indicators (KPIs) are set within a BI environment and can be closely tracked to monitor business performance.
Delivery
Reporting gives static reports, parameterized and in a list format. BI allows for dynamic dashboards and scorecards across multiple areas of the business, providing a consolidated view in which to drill, pivot and slice-and-dice your data. BI also goes a step further and allows for predictive analytics, such as forecasting based on historical data, patterns and trends, allowing you
to be proactive, whereas a reporting environment only allows for reactive management.
3. Used tools
QlikView
The QlikView platform lets users discover deeper insights by building their own rich, guided analytics and mining big data with an enterprise-ready solution.
Figure 20: QlikView logo
JasperReports
Jaspersoft provides reporting and analytics that can be embedded into a web or mobile application, or operate as a central information hub for the enterprise by delivering mission-critical information on a real-time or scheduled basis to the browser, mobile device, printer or email inbox in a variety of file formats. JasperReports Server is optimized to share, secure and centrally manage your Jaspersoft reports and analytic views.
Figure 21: Jaspersoft logo
Saiku
Saiku is an open source analytics client that serves as the UI component of the Openbravo Analytics Module. It uses the MDX language to seamlessly interact with the Mondrian cubes that Openbravo generates, allowing users to easily create, visualize and analyze information in graphical and pivot table formats.
Figure 22: Saiku logo
4. Realization
Reporting using QlikView
Figure 23: Histogram of the number of stays/sessions
Comparing the total number of patient sessions for each category of facility with the number of ALD (long-duration disease) sessions, we see that all the sessions in the CLCCs (cancer control centers) are sessions of ALD patients, which is expected for this type of center, since it treats only this kind of condition. It should also be noted that the non-profit institutions (EBNL) and the regional hospital centers (CHR) treat many ALD patients, contrary to the clinics (CLI), which are the least involved in the treatment of chronic diseases, with 45 million sessions in total for only 5 million in ALD,
which can be explained by the fact that clinics target only patients who can afford the exorbitant prices of the treatments.
Figure 24: Pie chart of the number of ALD sessions
Analyzing this graph, we note that the number of ALD sessions for the CLCC centers is most remarkable in regions 13, 44 and 94. These regions are also among the regions with the maximum number of ALD sessions for the CH and CHR institutions, as shown in the following graph.
Figure 25: Histogram of the number of ALD sessions by establishment category
The previous histogram illustrates why the centers for the fight against cancer were created: they relieve the regional hospitals and hospital centers by treating cancer patients in more specialized facilities. On the other hand, the regions that stand out in terms of total number of sessions do not appear among the regions most served by the not-for-profit establishments (EBNL).
Figure 26: Number of ALD sessions at EBNL
The pie chart can be explained by the fact that these establishments were created to compensate for the lack of facilities treating ALD in some regions. In conclusion, we found that the regions where patients spend long periods in ALD treatment are not among the regions with a high occupancy rate of beds in medicine, surgery and obstetrics, as shown in the following graph.
Figure 27: Histogram of the number of beds by region
Finally, we can conclude that the performance of the French health services is satisfactory and meets the requirements, with numbers of establishments and beds that are constantly increasing.
KPI
With the KPIs, we tried to assess the degree of performance of every establishment. An establishment is considered performing if its rate is above 50%, as sketched after the figure below.
Figure 28: KPIs
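The report does not spell out how this rate is computed; as a hedged sketch, assuming it is the share of ALD stays/sessions in each establishment's total, it could be derived in R as follows (the data frame kpi_df and its columns are hypothetical):

library(dplyr)

# kpi_df is assumed to hold, per establishment, ALD and total stay counts.
kpi <- kpi_df %>%
  group_by(etablissement) %>%
  summarise(sejours_ald   = sum(nb_sejours_ald),
            sejours_total = sum(nb_sejours_total)) %>%
  mutate(taux = sejours_ald / sejours_total,
         performant = taux > 0.5)  # 50% threshold from the report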
5. Reporting using Jaspersoft
Figure 29: Report with Jaspersoft
In the previous figure, we showed the top ten establishments in terms of the number of ALD sessions. In this sprint, we were led to give a better representation of our data.
So, we chose to present a few figures in this report, highlighting the business side of our subject. This table lists, for each department, the top ten establishments (by legal name, or "raison sociale") according to the number of stays in each establishment, with a descending sort. But to offer good visibility into the distribution of the number of stays, there is no better way than to illustrate it with a graph. In the figure below we can clearly see the importance of this variable in each of the institutions.
Figure 30: Top 10 establishments according to the number of ALD sessions
We then observed that one establishment, AP-HP, appears several times, but not in the same department. So we focused on this establishment and produced a deeper representation of the activity domains of its sites, namely AP-HP in Paris, Val-de-Marne, Hauts-de-Seine and Seine-Saint-Denis.
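The rankings behind Figures 30 to 34 boil down to a grouped top-N; here is a hypothetical dplyr sketch (the data frame sejours_df and its columns are assumptions):

library(dplyr)

# sejours_df: one row per establishment, with its department, its legal
# name (raison sociale) and its total number of stays.
top10 <- sejours_df %>%
  group_by(departement) %>%
  slice_max(nb_sejours, n = 10) %>%        # ten largest per department
  arrange(departement, desc(nb_sejours))   # descending sort, as in the report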
Figure 31: AP-HP Seine-Saint-Denis activity domains
Figure 32: AP-HP Val-de-Marne activity domains
Figure 33: AP-HP Hauts-de-Seine activity domains
Figure 34: AP-HP Paris activity domains
So, we can see that these institutions treat the same diseases and share almost the same activity domains, and that the number of stays for diseases related to the digestive,
orthopedic trauma and nervous system domains is very high in these institutions.
Figure 35: Number of ALD sessions by establishment category
This pie chart shows the distribution of establishments by category, in percentages.
Figure 36: Number of ALD sessions by age
This pie chart represents the segmentation of the number of stays of all institutions according to age range. Moreover, we were interested in the representation of the number of stays in the areas of chemotherapy, radiotherapy, hemodialysis and childbirth. We then deduced that the institutions that deal the most with these areas are the following:
Figure 37: Top 3 establishments by number of chemotherapy, radiotherapy, hemodialysis and childbirth sessions
Each indicator relates to one activity; that is how Hospidiag manages its activity domains. We also made a small comparison between MCO and HC activity to see whether the top establishments (raisons sociales) are the same or not.
In both areas, the five institutions are the same: they have the highest numbers of RSA in both MCO and HC.
6. Conclusion
Through this chapter, we created OLAP cubes and then used them to analyze the data with different approaches.
Data mining
7. Introduction
"Data mining" is the process of discovering significant new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. Data mining is fairly recent, dating from the 1980s. It expanded in response to a new feature of the economic landscape: the multiplication of very large databases that were difficult for businesses without sufficient resources to exploit. It is a set of tools developed to study interactions and unknown phenomena and to explore the underlying data, which is the literal meaning of the term: mining the data. It is used in human resources management as well as in sectors such as retail. In a first step, valid and usable data are extracted computationally from the main data sources; these data are then used to detect the underlying correlations. Data mining uses the rules of statistics, and more specifically mathematical algorithms, to compare the results and draw conclusions about correlations or links between different phenomena.
8. Objective
The performance criterion for the participants' contributions is the RMSE (Root Mean Square Error). This is the square root of the arithmetic mean of the squares of the deviations between the forecasts and the observed target values (a small R sketch of this metric follows the tools section below).
9. Used tools
RStudio
RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. RStudio is available in two editions: RStudio Desktop, where the program runs locally as a regular desktop application, and RStudio Server, which allows access to RStudio through a web browser while it runs on a remote Linux server. Prepackaged distributions of RStudio Desktop are available for Windows, macOS and Linux.
Figure 38: RStudio logo
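The RMSE criterion from the objective above can be written as a one-line R helper; y_true and y_pred below are hypothetical vectors of observed and predicted values of cible1:

# Root Mean Square Error: square root of the mean squared deviation
# between the forecasts and the observed target values.
rmse <- function(y_true, y_pred) sqrt(mean((y_pred - y_true)^2))

# Example with toy values (not project data):
rmse(c(0.10, 0.25, 0.40), c(0.12, 0.20, 0.43))  # about 0.036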
10. Realization
Unsupervised methods
Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository. An association rule has two parts: an antecedent (if) and a consequent (then). An antecedent is an item found in the data; a consequent is an item found in combination with the antecedent.
Figure 39: Rules with arulesViz
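A minimal sketch of this approach with the arules and arulesViz packages; the input data frame and its columns are hypothetical stand-ins for the discretized project data:

library(arules)
library(arulesViz)

# Hypothetical discretized data: establishment, age class, ALD session volume.
stays_df <- data.frame(
  etablissement = factor(c("CH DU HAUT BUGEY", "CHR A", "CH DU HAUT BUGEY")),
  tranche_age   = factor(c("<75", ">=75", "<75")),
  volume_ald    = factor(c("short", "long", "short")))
trans <- as(stays_df, "transactions")

# Mine if/then rules above minimum support and confidence thresholds.
rules <- apriori(trans, parameter = list(supp = 0.1, conf = 0.8))
inspect(head(sort(rules, by = "confidence")))

plot(rules, method = "graph")  # the kind of view shown in Figure 39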
Figure 40: Presentation of a rule
From the association rules obtained, we can consider the following rule as the most reliable: patients of the CH DU HAUT BUGEY who are under 75 years old are likely to have a small number of ALD sessions.
Supervised methods
In order to predict the variable cible1, we first selected the most significant variables. We then used different predictive approaches, namely multiple linear regression, random forests, Generalized Boosted Regression Models (GBM) and deep learning, and selected the best model based on its RMSE score (a sketch of this h2o workflow follows the results below).
Generalized linear model
> h2o.performance(regression.model)
H2ORegressionMetrics: glm
** Reported on training data. **
MSE: 0.002698627
RMSE: 0.05194831
MAE: 0.03549875
RMSLE: 0.04100089
Mean Residual Deviance : 0.002698627
R^2 : 0.9074501
Null Deviance : 53298.98
Null D.o.F. : 1827898
Residual Deviance : 4932.817
Residual D.o.F. : 1827878
AIC : -5624648
Random Forests
> h2o.performance(rforest.model)
H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.001330922
RMSE: 0.0364818
MAE: 0.01778781
RMSLE: 0.02686152
Mean Residual Deviance : 0.001330922
GBM: Gradient boosting model
> h2o.performance(gbm.model)
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 0.0004069792
RMSE: 0.02017373
MAE: 0.009795494
RMSLE: 0.01485119
Mean Residual Deviance : 0.0004069792
Model 2 GBM
> h2o.performance(gbm.model)
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 0.0009652376
RMSE: 0.03106827
MAE: 0.01555565
RMSLE: 0.02287432
Deep Learning
> h2o.performance(dlearning.model)
H2ORegressionMetrics: deeplearning
** Reported on training data. **
** Metrics reported on temporary training frame with 9999 samples **
MSE: 0.0007069083
RMSE: 0.02658775
MAE: 0.01316837
RMSLE: 0.01945228
Mean Residual Deviance : 0.0007069083
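The results above were produced with the h2o package; here is a hedged sketch of the workflow (the training file, the column selection and the default hyperparameters are assumptions, not the exact settings used in the project):

library(h2o)
h2o.init()

# Assumed input: a file with the selected significant variables plus cible1.
train_hex <- h2o.importFile("fact_data_selected.csv")
y <- "cible1"
x <- setdiff(colnames(train_hex), y)

regression.model <- h2o.glm(x = x, y = y, training_frame = train_hex)
rforest.model    <- h2o.randomForest(x = x, y = y, training_frame = train_hex)
gbm.model        <- h2o.gbm(x = x, y = y, training_frame = train_hex)
dlearning.model  <- h2o.deeplearning(x = x, y = y, training_frame = train_hex)

# Compare the models on RMSE, as in the results above.
sapply(list(glm = regression.model, drf = rforest.model,
            gbm = gbm.model, dl = dlearning.model), h2o.rmse)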
11. Conclusion
After creating several regression models, we came to the conclusion that the Gradient Boosting Model (GBM) is the most powerful: it has the minimal Root Mean Square Error (RMSE), with a value of 0.02017373.
Big Data
12. Introduction
In this chapter, we give an overview of big data and then show the process of extracting and analyzing social network data.
13. Big Data
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk. Big data has become of major interest to the IT world.
• In general, the term refers to relatively new types of data (video, images, sound, etc.) that generate large files.
• It also means large sets of small volumes of data (comments on social network websites, photos of the seabed, images from traffic cameras) that take on their meaning when combined.
• In most cases, these big data experience rapid growth, and some modest data sets will grow to become big data.
Big data is characterized by:
Volume refers to the vast amounts of data generated every second. Just think of all the emails, Twitter messages, photos, video clips and sensor data we produce and share every second. We are not talking terabytes but zettabytes or brontobytes. On Facebook alone we send 10 billion messages per day, click the "like" button 4.5 billion times and upload 350 million new pictures each and every day. If we take all the data generated in the world between the beginning of time and 2008, the same amount of data will soon be generated every minute! This increasingly makes data sets too large to store and analyze using traditional database technology. With big data technology we can now store and use these data sets with the help of
distributed systems, where parts of the data are stored in different locations and brought together by software.
Velocity refers to the speed at which new data is generated and the speed at which data moves around. Just think of social media messages going viral in seconds, the speed at which credit card transactions are checked for fraudulent activities, or the milliseconds it takes trading systems to analyze social media networks to pick up signals that trigger decisions to buy or sell shares. Big data technology now allows us to analyze the data while it is being generated, without ever putting it into databases.
Variety refers to the different types of data we can now use. In the past we focused on structured data that neatly fits into tables or relational databases, such as financial data (e.g. sales by product or region). In fact, 80% of the world's data is now unstructured and therefore cannot easily be put into tables (think of photos, video sequences or social media updates). With big data technology we can now harness different types of data (structured and unstructured), including messages, social media conversations, photos, sensor data, video or voice recordings, and bring them together with more traditional, structured data.
Veracity refers to the messiness or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos and colloquial speech, as well as the reliability and accuracy of content), but big data and analytics technology now allows us to work with this type of data; the volumes often make up for the lack of quality or accuracy.
Value: there is another V to take into account when looking at big data: value! It is all well and good having access to big data, but unless we can turn it into value it is useless. So you can safely argue that value is the most important V of big data. It is important that businesses make a business case for any attempt to collect and leverage big data. It is easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of the costs and benefits.
14. Used Tools
Apache Tomcat
Tomcat is an application server from the Apache Software Foundation that executes Java servlets and renders Web pages that include JavaServer Pages coding. Described as a "reference implementation" of the Java Servlet and JavaServer Pages specifications, Tomcat is the result of an open collaboration of developers and is available from the Apache Web site in both binary and source versions. Tomcat can be used either as a standalone product with its own internal Web server or together with other Web servers, including Apache, Netscape Enterprise Server, Microsoft Internet Information Server (IIS) and Microsoft Personal Web Server. Tomcat requires a Java Runtime Environment (JRE) conforming to JRE 1.1 or later. Tomcat is one of several open source collaborations collectively known as Jakarta.
Figure 41: Apache logo
Realization
First of all, we streamed data from Twitter; then we filtered it in order to keep only the significant records; finally, we analyzed the resulting data.
Figure 42: Fetching data from Twitter
Figure 43: Fetching configuration
Figure 44: Fetched data
Figure 45: Table creation and data insertion
Figure 46: Words used after the Real Madrid vs Celta Vigo game
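The Facebook extraction described next was done in R; the Twitter side could be reproduced in R as well, for example with the twitteR package (this is an illustration, not necessarily the tool behind Figures 42 to 46; the credentials are placeholders):

library(twitteR)

# OAuth credentials (placeholders).
setup_twitter_oauth("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

# Fetch tweets about the game, then keep only the significant fields.
tweets <- searchTwitter("RealMadrid OR CeltaVigo", n = 1000, lang = "en")
df <- twListToDF(tweets)
df <- df[, c("text", "created", "screenName", "retweetCount")]

# Simple word-frequency analysis, as in Figure 46.
words <- unlist(strsplit(tolower(df$text), "[^a-z]+"))
head(sort(table(words), decreasing = TRUE), 20)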
We also streamed data from Facebook using the "Rfacebook" package. The data we streamed concerns the comments and activity after the 18 May 2017 game between Real Madrid and Celta Vigo.
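A minimal sketch of this Rfacebook extraction (the access token, the page identifiers and the number of posts are assumptions):

library(Rfacebook)

token <- "XXXXXXXX"  # Graph API access token (placeholder)

# Posts and engagement on both club pages around the game.
rm_page  <- getPage("RealMadrid",  token, n = 200,
                    since = "2017/05/18", until = "2017/05/21")
fcb_page <- getPage("FCBarcelona", token, n = 200,
                    since = "2017/05/18", until = "2017/05/21")

# Daily activity (likes per day), as plotted in Figures 47 and 48.
aggregate(likes_count ~ as.Date(created_time), data = rm_page, FUN = sum)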
Figure 47: Activity on the Real Madrid Facebook page
Figure 48: Activity on the FC Barcelona Facebook page
Conclusion
In this chapter, we fetched streaming data from social networks and analyzed it on the basis of specific filters.
General Conclusion
These past months of work have allowed us to place ourselves in a professional context and to work as a team on a project of great magnitude. Our internship was particularly formative from a technical perspective: we strengthened our foundations in SQL, discovered MDX and, above all, discovered the world of Business Intelligence and reporting. Furthermore, this internship is a continuation of what we learned in previous years about software development, and it allowed us to become familiar with new concepts such as Business Intelligence. We discovered how important reporting and data analysis are in the value that any application delivers to its customers. Although it took us some time to acquire the notions of the business domain, BI and reporting, and although the language barrier sometimes added to the difficulty, we gradually adapted and got used to a new working environment and new technologies. We also gained knowledge of data mining, machine learning and big data. Finally, the internship was rewarding, as all our achievements went into production.