DATA AND BUSINESS PROCESS
INTELLIGENCE
PENTAHO PLATFORM
DEVELOPED AT:
BHAT, GANDHINAGAR-382428
DEVELOPED BY:
BHAGAT FARIDA H. SINGH SWATI
11ITUOS079 11ITUOS068
GUIDED BY:
INTERNAL GUIDE EXTERNAL GUIDE
PROF. R.S. CHHAJED MR. VIJAY PATEL
Department of Information Technology.
Faculty of Technology,
Dharmsinh Desai University,
College Road, Nadiad- 387001.
DDU (Faculty of Tech., Dept. of IT) i
CANDIDATE’S DECLARATION
We declare that the final semester report entitled “DATA AND BUSINESS PROCESS
INTELLIGENCE” is our own work conducted under the supervision of the external
guide MR. Vijay Patel, Institute for Plasma Research, Bhat, Gandhinagar and internal
guide Prof. R.S. Chhajed, Faculty of Technology, DDU, Nadiad.
We further declare that to the best of our knowledge the report for B.TECH SEM-VIII
does not contain part of the work which has been submitted either in this or any other
university without proper citation.
Farida Bhagat H.
Branch: Information Technology
Student ID: 11ITUOS079
Roll: IT-07
Singh Swati
Branch: Information Technology
Student ID: 11ITUOS068
Roll: IT-124
Submitted To:
PROF. R.S. CHHAJED,
Department of Information Technology,
Faculty of Technology,
Dharmsinh Desai University,
Nadiad
DHARMSINH DESAI UNIVERSITY
NADIAD-387001, GUJARAT
CERTIFICATE
This is to certify that the project entitled “DATA AND BUSINESS PROCESS
INTELLIGENCE” is a bonafide report of the work carried out by
1) Miss BHAGAT FARIDA H., Student ID No: 11ITUOS079
2) Miss SINGH SWATI, Student ID No: 11ITUOS068
of Department of Information Technology, semester VIII, under the guidance and
supervision for the award of the degree of Bachelor of Technology at Dharmsinh Desai
University, Gujarat. They were involved in Project training during academic year 2013-
2014.
Prof. R.S.Chhajed
HOD, Department of Information Technology,
Faculty of Technology,
Dharmsinh Desai University, Nadiad
Date:
ACKNOWLEDGEMENTS
We are grateful to Mr. Amit Srivastava (Institute for Plasma Research) for giving us
this opportunity to work under the guidance of prominent Solution Expert in the field of
Software Engineering and also providing us with the required resources at the institute.
We are also thankful to Mr. Vijay Patel (Institute for Plasma Research) for guiding us
in our project and sharing valuable knowledge with us.
It gives us immense pleasure and satisfaction in presenting this report of Project
undertaken during the 8th semester of B.Tech. As it is the first step into our professional
life, we would like to take this opportunity to express our sincere thanks to several
people, without whose help and encouragement, it would be impossible for us to carry
out this desired work.
We would like to express thanks to our Head of Department Prof. R. S. Chhajed who
gave us an opportunity to undertake this work. We are grateful to him for his guidance in
the development process.
Finally, we would like to thank all Institute for Plasma Research employees, all the faculty
members of our college, friends and family members for providing their support and
continuous encouragement throughout the project.
Thank you
Bhagat Farida H.
Singh Swati
TABLE OF CONTENTS
ABSTRACT……………………………………………………………………………....1
COMPANY PROFILE………………………………………………………………......3
LIST OF FIGURES……………………………………………………………………...4
LIST OF TABLES……………………………………………………………………….6
1. INTRODUCTION…………………..……………………………………………….7
1.1 Project Details……………………………………………………………………7
1.2 Purpose…………………………………………………………………………....7
1.3 Scope………………………………………………………………………………7
1.4 Objective………………………………………………………………………….8
1.5 Technology and Literature Review……………………………………………..8
1.5.1 Alfresco ECM……………………………………………………………8
1.5.2 Pentaho Platform………………………………………………………...9
2. PROJECT MANAGEMENT………………………………………………………10
2.1 Feasibility Study………………………………………………………………...10
2.2 Project Planning………………………………………………………………...10
2.2.1 Project Development Approach……………………………………….10
2.2.2 Project Plan…………………………………………………………..…11
2.2.3 Milestones and Deliverables...……………………………………….…12
2.2.4 Project Scheduling………………………………………………….…..13
3. SYSTEM REQUIREMENTS STUDY………………………………………..….14
3.1 User Characteristics……………..……………………………………………..14
3.2 Hardware and Software Requirements…………………………………….…14
3.2.1 Hardware Requirements……………………………………………….14
3.2.2 Software Requirements……………………………………………..…14
3.3 Constraints…………………………………………………………………...…15
3.3.1 Regulatory Policies…………………………………………………..…15
3.3.2 Hardware Limitations………………………………………………….15
3.3.3 Interfaces to Other Applications………………………………………15
3.3.4 CMIS……………………………………………………………………15
3.3.5 Parallel Operations……………………………………………………..16
3.3.6 Reliability Requirements………………………………………………16
3.3.7 Criticality of the Application………………………………………….16
3.3.8 Safety and Security Considerations………………………………...…16
4. ALFRESCO ECM SYSTEM……………………………………………………...17
4.1 Introduction……………………………………………………………………..17
4.2 Alfresco Overview………………………………………………………………17
4.3 Architecture...…………………………………………………………………...19
4.3.1 Client.……………………………………………………………………19
4.3.2 Server……………………………………………………………………19
4.4 Data Storage in Alfresco……………………………………………………….21
4.5 Relationship Diagrams…………………………………………………………21
5. TRANSFORMATION PHASE……………..…………………………………….24
5.1 Introduction…………………………………………………………………….24
5.2 Pentaho Data Integration Tool….……………………………………………..24
5.2.1 Introduction…………………………………………………………….24
5.2.2 Why Pentaho?..........................................................................................25
5.2.2.1 JasperSoft vs Pentaho vs BIRT……………………………………25
5.2.2.2 Conclusion…………………………………………………………..26
5.2.3 Components of Pentaho………………………………………………..27
5.3 Alfresco Audit Analysis and Reporting Tool………………...……………….28
5.3.1 Introduction…………………………………………………………….28
5.3.2 Working and Installation of A.A.A.R. ..................................................29
5.3.2.1 Pre Requisites……………………………………………………….30
5.3.2.2 Enabling Alfresco Audit Service…………………………………...30
5.3.2.3 Data Mart Creation and Configuration………………………...…30
5.3.2.4 PDI Repository Setting..……………………………………………31
5.3.2.5 First Import………………………………………………………....36
5.3.3 Audit Data Mart………………………………………………………...36
5.3.4 Dimension Tables……………………………………………………….37
5.4 Transformations Using Spoon…………………………………………………38
5.5 Example Transformations………..………………………………………….…38
6. REPORTING PHASE……………..………………………………………….……42
6.1 What is a Report?.……………………………………………………………...42
6.2 Pentaho Report Designer Tool….……………………………………………...42
6.2.1 Introduction……………………………………………………………..42
6.2.2 Working of Pentaho Designer………………………………………….43
6.3 Example Reports………..………………………………………………………44
7. PUBLISHING PHASE……………..………………………………………………46
7.1 Introduction………………..…………………………………………………...46
7.2 Pentaho BI Server………...…………………………………………………….46
7.2.1 Introduction……………………………………………………………..46
7.2.2 Example Published Reports……………………………………………47
7.3 Scheduling of Transformations…………….………………………………….50
8. TESTING……………..……………………………………………………………..51
8.1 Testing Strategies….…………………………………………………………....51
8.2 Testing Methods………………………………………………………………...52
8.3 Test Cases……………………………………………………………………….53
8.3.1 User Login and Functionality of Report………………………………53
8.3.2 Viewing Documents, Folders, Permissions, Audits…………………...54
9. USER MANUAL……………………………………………………………………55
9.1 Description………………………………………………………………………55
9.2 Login Page………………………………………………………………………55
9.3 View Reports……………………………………………………………………57
9.4 Scheduling………………………………………………………………………59
9.5 Administration……………………………………………………………….....62
10. LIMITATIONS AND FUTURE ENHANCEMENTS……………………………64
10.1 Limitations……………………………………………………………………..64
10.2 Future Enhancements…………………………………………………………64
11. CONCLUSION AND DISCUSSION……………………………………………...65
11.1 Self Analysis of Project Viabilities……………………………………………65
11.1.1 Self Analysis……………………………………………………………...65
11.1.2 Project Viabilities………………………………………………….…….65
11.2 Problems Encountered and Possible Solutions……………………………...65
11.3 Summary of Project Work…………………………………………………...66
12. REFERENCES…………………………………………………………………….68
ABSTRACT
Design and implement a platform for Data and Process intelligence
tool
IPR has selected Alfresco, an Enterprise Content Management (ECM) system, as its
Electronic Document and Record Management System (EDRMS). Alfresco does not
have powerful reporting functionality, and, frankly, reporting is not its job.
Unfortunately, the need for powerful reporting remains, and most of the available
answers are tricky solutions that are hard to manage and scale. Alfresco ECM has a
detailed audit service that exposes a lot of potentially useful information.
Alfresco is integrated with Activiti, a Business Process Management (BPM)
engine, which also has auditing functionality and exposes audit data related to
processes and tasks.
The Data and Process Intelligence tool (the project) will be divided into two parts. The
first part will be Alfresco Data Integration, which will provide a solution to
extract, transform, and load (ETL) data (documents/folders/processes/tasks) together
with the audit data, at a very detailed level, into a central warehouse. On top of that,
it will provide data cleansing and merging functionality and, if needed, convert the
data into an OLAP format for efficient analysis.
The second part will be the reporting functionality. The goal is a generic
reporting tool that is useful to the end user in a very easy way. The data will be
published in reports in well-known formats (PDF, Microsoft Excel, CSV, etc.) and
stored directly in Alfresco as static documents organized in folders.
To achieve this goal, Alfresco will be integrated with a powerful open source
data integration and reporting tool. The necessary data from the Alfresco
repository will be extracted, transformed, merged, and loaded into the data
warehouse, and the necessary schema transformations (for example, OLTP to OLAP)
will be applied to increase efficiency. The solution will be a scalable,
generic reporting system with an open window on the Business Intelligence
world. That said, the solution will also be suitable for publishing (static)
reports containing not only audit data coming from Alfresco but also Key
Performance Indicators (KPIs), analyses, and dashboards coming from a complete
Enterprise Data Warehouse.
COMPANY PROFILE
Institute for Plasma Research (IPR) is an autonomous physics research institute
located in Gandhinagar, India. The institute is involved in research in aspects of
plasma science including basic plasma physics, research on magnetically confined
hot plasmas and plasma technologies for industrial applications. It is a large and
leading plasma physics organization in India. The institute is mainly funded
by the Department of Atomic Energy. IPR plays a major scientific and technical
role in the Indian partnership in the international fusion energy initiative ITER
(International Thermonuclear Experimental Reactor).
IPR is now internationally recognized for its contributions to fundamental and
applied research in plasma physics and associated technologies. It has a scientific
and engineering manpower of 200 with core competency in theoretical plasma
physics, computer modeling, superconducting magnets and cryogenics, ultra high
vacuum, pulsed power, microwave and RF, computer-based control and data
acquisition and industrial, environmental and strategic plasma applications.
The Centre of Plasma Physics - Institute for Plasma Research has active
collaboration with the following Institutes/ Universities:
Bhabha Atomic Research Centre, Bombay
Raja Ramanna Centre for Advanced Technology, Indore
IPP, Juelich, Germany; IPP, Garching, Germany
Kyushu University, Fukuoka, Japan
Physical Research Laboratory, Ahmedabad
National Institute for Interdisciplinary Science and Technology, Bhubaneswar
Ruhr University Bochum, Bochum, Germany
Saha Institute of Nuclear Physics, Calcutta
St. Andrews University, UK
Tokyo Metropolitan Institute of Technology, Tokyo
University of Bayreuth, Germany; University of Kyoto, Japan.
LIST OF FIGURES
1. MVC Architecture…………….……………………………………….Fig 1.1
2. Flowchart of the project……………………………………………….Fig 2.1
3. Gantt Chart…………………………………………………………….Fig 2.2
4. Alfresco Icon…………………………………………………………...Fig 4.1
5. Uses of Alfresco ECM…………………….………………………...…Fig 4.2
6. Alfresco Architecture……………………….…………………………Fig 4.3
7. Relational Diagrams (users, documents and folders)……………..…Fig 4.4
8. Relational Diagrams (permissions)…………………………………...Fig 4.5
9. Relational Diagrams (audits)………………………………………….Fig 4.6
10. Pentaho Data Integration Icon……………………………………..…Fig 5.1
11. Pentaho Icon…………………………………………………………...Fig 5.2
12. A.A.A.R. Icon………………………………………………………….Fig 5.3
13. Working of A.A.A.R…………………………………………………..Fig 5.4
14. PDI Repository Settings Step 1……...………………………………..Fig 5.5
15. PDI Repository Settings Step 2……...………………………………..Fig 5.6
16. PDI Repository Settings Step 3……...………………………………..Fig 5.7
17. PDI Repository Settings Step 4……...………………………………..Fig 5.8
18. PDI Repository Settings Step 5……...………………………………..Fig 5.9
19. PDI Repository Settings Step 6…….………………………………..Fig 5.10
20. PDI Repository Settings Step 7….....………………………………..Fig 5.11
21. PDI Repository Settings Step 8….....………………………………..Fig 5.12
22. Audit Data Mart……………………………………………………...Fig 5.13
23. Dimension Tables…………………………………………………….Fig 5.14
24. Document Information Transformation…………...……………….Fig 5.15
25. Document Permission Transformation…………...………………...Fig 5.16
26. Folder Information Transformation…………….....……………….Fig 5.17
27. Folder Permission Transformation………………...……………….Fig 5.18
28. User Information Transformation……………….....……………….Fig 5.19
29. Pentaho Reporting Tool Icon……………………………………..…..Fig 6.1
30. Document Information Report…….……..………...………………....Fig 6.2
31. Document Permission Report……...……..………...………………....Fig 6.3
32. Folder Information Report…….……..…..………...………………....Fig 6.4
33. Folder Permission Report………….……..………...………………....Fig 6.5
34. User Information Report…………..……..………...………………....Fig 6.6
35. Pentaho BI Server Icon……………………………………….……….Fig 7.1
36. Document Information Report…….……..………...………………....Fig 7.2
37. Document Permission Report……...……..………...………………....Fig 7.3
38. Folder Information Report…..…….……..………...………………....Fig 7.4
39. Folder Permission Report.……...….……..………...………………....Fig 7.5
40. User Information Report…………..……..………...………………....Fig 7.6
41. Scheduling of Transformations…...…………………………………..Fig 7.7
42. Login Step 1………………………………………………...………….Fig 9.1
43. Login Step 2………………………………………………...………….Fig 9.2
44. Login Step 3………………………………………………...………….Fig 9.3
45. View Reports Step 1………………...…………………………………Fig 9.4
46. View Reports Step 2………………...…………………………………Fig 9.5
47. Scheduling Page………...……………………………………………...Fig 9.6
48. Administration Page……………….…………………………………..Fig 9.7
LIST OF TABLES
1. Milestones and Deliverables……….……………………………… Table 2.1
2. Project Scheduling Table…………………………………………...Table 2.2
3. Test Case 1…………………………………………………………..Table 8.1
4. Test Case 2…………………………………………………………..Table 8.2
5. Scheduling Options………………..………………………………..Table 9.1
6. Scheduling Controls………………..……………………………….Table 9.2
7. Administration Options…………………………………………….Table 9.3
INTRODUCTION
1.1 PROJECT DETAILS
The Institute for Plasma Research has selected Alfresco, an Enterprise Content
Management (ECM) system, as an Electronic Document and Record Management System
(EDRMS). Alfresco does not have powerful reporting functionality. Thus, IPR requires a
reporting tool to present the various details related to the metadata of documents and
folders (folders are used to organize documents) and the access control applied to them.
Additional analyses (such as the most active users or the most active documents in the
last week or month) are required on the audit trail data generated by Alfresco. Some Key
Performance Indicators (KPIs) need to be generated for the document review and
approval process. The possibility to create and export reports in well-known formats
(PDF, Microsoft Excel, CSV, etc.) needs to be provided. There will be a central
administrator who can configure the access rights on the reports for end users.
Additionally, end users shall be able to subscribe to reports, schedule report generation,
and have reports sent via e-mail as attachments in their preferred format.
1.2 PURPOSE
This system needs to be developed to enhance the way of looking at a traditional
document management system and to make it more user-friendly. Along with all the
standard features, a few customizations are needed for better usability of the resources.
With these powerful reporting tools, it becomes easy and secure to understand the files
and documents in the institute. The reports generated also aid decision making about
what steps should be taken next.
1.3 SCOPE
The scope of the current project is to implement a framework/deployment
architecture using a BI toolset and test it by integrating it with Alfresco. An Alfresco
data mart will be created and used for developing analysis reports related to the
document management system. The reports will be made available securely over the
Internet to the employees of the institute, collaborators, and contractors.
In the future, the generic reporting architecture implemented as part of this project will
be used and extended into a full Data Warehouse solution by integrating and merging
other data management tools of IPR. The full DW solution is out of the scope of this
project.
1.4 OBJECTIVE
The objective of this project is to ease the visibility of the document management system
and enhance decision making. Alfresco is a powerful content management system.
Unfortunately, the need for powerful reporting remains, and most of the available
answers are tricky solutions that are hard to manage and scale. To achieve this goal,
Alfresco will be integrated with a powerful open source data integration and reporting
tool. The necessary data from the Alfresco repository will be extracted, transformed,
merged, and loaded into the data warehouse, and the necessary schema transformations
(for example, OLTP to OLAP) will be applied to increase efficiency. The solution will be
a scalable, generic reporting system with an open window on the Business Intelligence
world. That said, the solution will also be suitable for publishing (static) reports
containing not only audit data coming from Alfresco but also Key Performance
Indicators (KPIs), analyses, and dashboards coming from a complete Enterprise Data
Warehouse.
1.5 TECHNOLOGY AND LITERATURE REVIEW
1.5.1 ALFRESCO ECM
An open source, Java-based Enterprise Content Management (ECM) system named
Alfresco is selected as the document repository. It uses the MVC architecture.
Model–view–controller (MVC) is a software pattern for implementing user interfaces. It
divides a given software application into three interconnected parts, so as to separate
internal representations of information from the ways that information is presented to or
accepted from the user.
Model: It consists of the application data, business rules, logic, and functions. Here,
XML is used for this part.
View: It is the output representation of information. Here, FTL (FreeMarker templates)
is used for this part.
Controller: It accepts input and converts it to commands for the model or view.
Figure 1.1 MVC Architecture
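As a minimal sketch, the three parts described above can be separated in a few lines of code. This is illustrative only and does not reflect Alfresco's actual implementation, which uses XML for the model, FTL for the view, and its own controller layer; all class and method names here are invented.

```python
class Model:
    """Holds the application data and business rules."""
    def __init__(self):
        self._docs = []

    def add_document(self, name):
        if not name:
            raise ValueError("document needs a name")
        self._docs.append(name)

    def documents(self):
        return list(self._docs)


class View:
    """Renders the model's data for the user."""
    @staticmethod
    def render(docs):
        return "Documents: " + ", ".join(docs)


class Controller:
    """Translates user input into model updates and view refreshes."""
    def __init__(self, model, view):
        self.model, self.view = model, view

    def upload(self, name):
        self.model.add_document(name)
        return self.view.render(self.model.documents())


controller = Controller(Model(), View())
print(controller.upload("report.pdf"))  # Documents: report.pdf
```

The point of the split is that the view can be swapped (HTML, PDF, plain text) without touching the model, which is exactly the property the reporting project relies on.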
1.5.2 PENTAHO PLATFORM
Pentaho is a company that offers Pentaho Business Analytics, a suite of open
source Business Intelligence (BI) products providing data integration, OLAP
services, reporting, dashboards, data mining, and ETL capabilities. Pentaho was founded
in 2004 and is headquartered in Orlando, FL, USA.
Pentaho software consists of a suite of analytics products called Pentaho Business
Analytics, providing a complete analytics software platform. This end-to-end solution
includes data integration, metadata, reporting, OLAP analysis, ad-hoc query, dashboards,
and data mining capabilities. The platform is available in two offerings: a community
edition (CE) and an enterprise edition (EE).
PROJECT MANAGEMENT
2.1 FEASIBILITY STUDY
Feasibility study includes an analysis and evaluation of a proposed project to determine if
it is technically feasible, is feasible within the estimated cost, and will be profitable.
The following software has to be installed for the project:-
1. Alfresco Enterprise Content Management
2. PostgreSQL and SQuirreL
3. Pentaho Data Integration Tool (K.E.T.T.L.E.)
4. Alfresco Audit Analysis and Reporting Tool (A.A.A.R.)
5. Pentaho Reporting Tool
6. Pentaho BI Server
The study assures that the hardware cost required for one database server plus two web
servers is acceptable and that the 500 GB of file storage for the final product is feasible.
2.2 PROJECT PLANNING
2.2.1 Project Development Approach
We have used an Agile methodology. After the feasibility study, the first task was to
create a basic flowchart charting out the flow of the project, to serve as a mind map. The
base database system is Alfresco, from which we load tables using PostgreSQL or
SQuirreL. The set of tables is condensed to create a staging data warehouse. After the
transformations on these tables with the Pentaho Data Integration tool, reports are
created with Pentaho Reporting on the BI server, according to the given requirements of
the project.
Figure 2.1 Flowchart of the Project
(The flowchart shows Alfresco ECM and, optionally, other databases feeding a staging
database warehouse through ETL logic built with Kettle; the staging warehouse feeds a
data mart, from which reports are created with the Pentaho Report Designer tool and
published to the Pentaho BI Server.)
Once the flowchart was made, we proceeded to the development part, keeping the
flowchart in mind. We started by studying the Alfresco Enterprise Content Management
system and then moved on to the Pentaho tools. We also installed PostgreSQL and
SQuirreL to work with the queries.
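The extract-transform-load flow described above can be sketched in miniature. Everything below is hypothetical sample data with invented column names; the real project performs these steps with Kettle against the Alfresco database, not with hand-written Python.

```python
import sqlite3

# Extract: rows as they might come from the source database (hypothetical sample).
source_rows = [
    ("report.pdf",  "2014-01-15", 245760),
    ("minutes.doc", "2014-02-02", 51200),
]

# Transform: derive the fields the reports need (here, size in KB).
staged = [(name, created, size // 1024) for name, created, size in source_rows]

# Load: write the staged rows into a data mart table (in-memory for this sketch).
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE dm_documents (name TEXT, created TEXT, size_kb INTEGER)")
mart.executemany("INSERT INTO dm_documents VALUES (?, ?, ?)", staged)

for row in mart.execute("SELECT name, size_kb FROM dm_documents ORDER BY name"):
    print(row)
```

In Kettle the same three steps become a table-input step, one or more transform steps, and a table-output step wired together in Spoon.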
2.2.2 Project Plan
1. Gather the definition.
2. Check whether the definition is feasible or not in given deadline.
3. Requirement gathering.
4. Study and analysis on gathered requirements.
5. Transformation Phase.
6. Reporting Phase.
7. Deployment.
2.2.3 Milestones and Deliverables
Table 2.1 Milestones and Deliverables

Phase: Abstract and System Feasibility Study
  Deliverables: Gained a complete understanding of the flow of the project
  Purpose: To be familiar with the flow of the project

Phase: Requirement Gathering, Software Installation, and Understanding of Technology
  Deliverables: Studied the ECM, its architecture, and how data is stored in the Alfresco repository
  Purpose: Getting familiar with the Alfresco platform

Phase: Study of the Platform and Its Tools
  Deliverables: Studied and used the three tools, namely the Pentaho Data Integration tool, Pentaho Report Designer, and the Pentaho BI Server
  Purpose: Better understanding of the Pentaho platform and all the tools and plug-ins associated with it

Phase: Transformation Phase
  Deliverables: Completed the transformation phase with the help of A.A.A.R., developed some custom ETL, and scheduled the transformation jobs to run at night
  Purpose: To build the staging data warehouse

Phase: Reporting Phase
  Deliverables: Made the reports according to the users' requirements
  Purpose: To complete the reporting phase

Phase: Deployment
  Deliverables: Published the reports on the server in different output types, such as PDF and CSV
  Purpose: Deploy on the Web, completing the project
2.2.4 Project Scheduling
In project management, a schedule is a listing of a project's milestones, activities,
and deliverables, usually with intended start and finish dates.
Table 2.2 Project Scheduling Table
Figure 2.2 Gantt Chart
(The Gantt chart spans 8 December to late March and covers the phases: Abstract and
Feasibility Study, Requirement Gathering, Study of the Database Management System,
Study of the Platform and Associated Tools, Transformation Phase, Reporting Phase,
and Deployment.)
SYSTEM REQUIREMENT STUDY
3.1 USER CHARACTERISTICS
This system is made available on the web, so it can be accessed from anywhere. The
users will be scientists, researchers, engineers, and other employees of the institute.
They log in with their respective credentials.
3.2 HARDWARE AND SOFTWARE REQUIREMENTS
3.2.1 Server and Client side Hardware Requirements:
RAM : 4GB
Hard-disk : 40GB
Processor: 2.4GHz
File Storage: 500GB
3.2.2 Server and Client side Software Requirements:
Windows or Linux based system
PostgreSQL Database
SQuirreL Database Client tool
Alfresco ECM
Pentaho Community Edition 5.0 (PDI, Reporting Tool, BI Server)
Alfresco Audit Analyzing and Reporting tool
Notepad++
3.3 CONSTRAINTS
3.3.1 Regulatory Policies
Regulatory policies, or mandates, limit the discretion of individuals and agencies, or
otherwise compel certain types of behavior. These policies are generally thought to be
best applied when good behavior can be easily defined and bad behavior can be easily
regulated and punished through fines or sanctions. IPR is very strict about its policies
and ensures that all employees follow them properly.
3.3.2 Hardware Limitations
To ensure the smooth working of the system, we need to meet the minimum hardware
requirements. We need at least 2GB RAM, 40GB hard disk and 2.4 GHz processor. All
these requirements are readily available. Hence, there are not really any hardware
limitations.
3.3.3 Interfaces to Other Applications
The ETL tools of BI suites generally support a number of standards-based protocols,
including ODBC, JDBC, REST, web scripts, FTP, and many more, for extracting data
from multiple sources. It is easy to integrate any data management application using the
supported input protocols. We have used the CMIS (Content Management
Interoperability Services) and JDBC protocols for Alfresco data integration. The
published reports will be integrated back into Alfresco using the HTTP protocol. Single
sign-on will be implemented by the IT department to provide transparent access to the
reports from Alfresco or any other web-based tools.
3.3.4 CMIS
CMIS (Content Management Interoperability Services) is an OASIS standard designed
for the ECM industry. It enables access to any content management repository that
implements the CMIS standard. We can consider using CMIS if an application needs
programmatic access to the content repository.
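As a sketch of such programmatic access, a CMIS query can be issued with a plain HTTP GET against the repository's CMIS endpoint. The host name below is a placeholder, and the endpoint path follows Alfresco's usual CMIS 1.1 browser-binding layout, which may differ between versions; only the SQL-like CMIS query language itself is defined by the standard.

```python
from urllib.parse import urlencode

# Placeholder host; the path is Alfresco's typical CMIS 1.1 browser-binding root.
CMIS_ENDPOINT = ("http://alfresco.example.org/alfresco/api/-default-"
                 "/public/cmis/versions/1.1/browser")

def build_query_url(statement, max_items=25):
    """Return a CMIS browser-binding query URL for the given CMIS QL statement."""
    params = urlencode({
        "cmisselector": "query",   # browser-binding selector for queries
        "q": statement,            # the CMIS QL statement itself
        "maxItems": max_items,
    })
    return f"{CMIS_ENDPOINT}?{params}"

# CMIS QL is SQL-like; cmis:document and cmis:name are standard CMIS types/properties.
url = build_query_url("SELECT cmis:objectId, cmis:name FROM cmis:document")
print(url)
# An actual request would then be sent with HTTP Basic authentication,
# e.g. via urllib.request or a library such as cmislib.
```

This is only the request-building half; parsing the JSON result set and paging through `maxItems` batches would follow the same pattern.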
3.3.5 Parallel Operations
This is a document management system in which around 300 employees will work
concurrently. They can upload a document, review it, modify it, start a workflow on it,
and even delete it. Parallel operations include allowing more than a single employee to
read a document. Workflows can be started on any document, and any document can be
part of any number of workflows. Parallel editing of a document is restricted by
providing check-in and check-out functionality.
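The check-in/check-out restriction can be pictured as a lock held per document: once a user checks a document out, a second editor is refused until it is checked back in. The toy model below is illustrative only, not Alfresco's actual implementation.

```python
class Document:
    """Toy model of check-out/check-in locking for parallel editing."""
    def __init__(self, name):
        self.name = name
        self._checked_out_by = None

    def check_out(self, user):
        if self._checked_out_by is not None:
            raise RuntimeError(
                f"{self.name} is already checked out by {self._checked_out_by}")
        self._checked_out_by = user

    def check_in(self, user):
        if self._checked_out_by != user:
            raise RuntimeError(f"{user} does not hold the lock on {self.name}")
        self._checked_out_by = None


doc = Document("policy.doc")
doc.check_out("alice")
try:
    doc.check_out("bob")       # refused while alice holds the working copy
except RuntimeError as err:
    print(err)
doc.check_in("alice")
doc.check_out("bob")           # allowed again after check-in
```

Reading, by contrast, needs no lock at all, which is why any number of employees can view the same document concurrently.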
3.3.6 Reliability Requirements
All quality hardware, software and frameworks with valid licenses are required for better
reliability.
3.3.7 Criticality of the Application
Criticality of the module was one of the concerned constraints. The system was being
developed for the users who were mainly employees of the government sector. They had
certain rigid aspects which were to be taken care during development. Any change in
pattern of their workflow would lead to extremely critical conditions. Thus this was a
matter of concern and served as one of the deep rooted constraints.
3.3.8 Safety and Security Considerations
The system provides tight security for user accounts. Accounts are secured by a
password mechanism; passwords are encrypted and stored in the database. Also, the
repository is accessible for modifications only to privileged users.
ALFRESCO ECM SYSTEM
4.1 INTRODUCTION
Figure 4.1 Alfresco Icon
Alfresco is a free enterprise content management system for both Windows and Linux
operating systems; it manages all the content within an enterprise and provides services
to manage that content.
It comes in three flavors:-
Community edition – Free software with some limitations; no clustering feature is
present. (We have used the community edition of Alfresco for this project, since we
only need to perform ETL logic on the database and do not use the advanced
functionality.)
Enterprise edition – Commercially licensed and suitable for users who require a
higher degree of functionality.
Cloud edition – A SaaS (Software as a Service) version of Alfresco.
We would be using Alfresco database as our base database from where we want to
extract information and create a warehouse. For further transformation purpose, we
would be using SQuirreL and K.E.T.T.L.E a.k.a. Pentaho Data Integration tool.
4.2 ALFRESCO OVERVIEW
There are various ways in which Alfresco can be used for storing files and folders and it
can also be used by different systems. It is basically a repository, which is a
central location where data are stored and managed.
Few of the ways in which Alfresco can be used are:
Figure 4.2 Uses of Alfresco ECM
Alfresco ECM is a useful tool to store files and folders of different types. Few of the uses
of Alfresco are:-
Document Management
Records Management
Shared drive replacement
Enterprise portals and intranets
Web Content Management
Knowledge Management
Information Publishing
Case Management
4.3 ARCHITECTURE
Alfresco has a layered architecture with mainly three parts:-
1. Alfresco Client.
2. Alfresco Content Application Server
3. Physical Storage
4.3.1 Client
Alfresco offers two primary web-based clients: Alfresco Share and Alfresco Explorer.
Alfresco Share can be deployed to its own tier separate from the Alfresco content
application server. It focuses on the collaboration aspects of content management and
streamlining the user experience. Alfresco Share is implemented using Spring Surf and
can be customized without JSF knowledge.
Alfresco Explorer is deployed as part of the Alfresco content application server. It is a
highly customizable power-user client that exposes all features of the Alfresco content
application server and is implemented using Java Server Faces (JSF).
Clients also exist for portals, mobile platforms, Microsoft Office, and the desktop. A
client often overlooked is the folder drive of the operating system, where users share
documents through a network drive. Alfresco can look and act just like a folder drive.
4.3.2 Server
The Alfresco content application server comprises a content repository and value-added
services for building ECM solutions. Two standards define the content repository: CMIS
(Content Management Interoperability Services) and JCR (Java Content Repository).
These standards provide a specification for content definition and storage, content
retrieval, versioning, and permissions. Complying with these standards provides a
reliable, scalable, and efficient implementation.
The Alfresco content application server provides the following categories of services
built upon the content repository:
1. Content services (transformation, tagging, metadata extraction)
2. Control services (workflow, records management, change sets)
3. Collaboration services (social graph, activities, wiki)
Clients communicate with the Alfresco content application server and its services through
numerous supported protocols. HTTP and SOAP offer programmatic access while CIFS,
FTP, WebDAV, IMAP, and Microsoft SharePoint protocols offer application access. The
Alfresco installer provides an out-of-the-box prepackaged deployment where the
Alfresco content application server and Alfresco Share are deployed as distinct web
applications inside Apache Tomcat.
Figure 4.3 Alfresco Architecture
At the core of the Alfresco system is a repository, supported by a server that persists
content, metadata, associations, and full-text indexes. Programming interfaces support
multiple languages and protocols, upon which developers can create custom applications
and solutions. Out-of-the-box applications provide standard solutions such as document
management and web content management.
4.4 DATA STORAGE IN ALFRESCO
There are 97 tables in the database in total, divided mainly into two parts: the Alfresco
repository tables and the activity workflow tables. The Alfresco tables are further divided
into three groups: nodes, access and properties.
1. Node is the parent class of the database, which has all identity numbers stored in
it.
2. Access tables deal with the security aspects of Alfresco, such as permissions and
last modification dates.
3. Properties tables store information about the kind of data stored: its size,
type, range, etc.
4.5 RELATIONSHIP DIAGRAMS
After studying the tables, we created the relationship diagrams of the tables using
SQuirreL.
Since the relational diagram for the Alfresco system comprises 97 tables, we selected
the vital ones, such as:
 alf_node – holds the identity of the other tables.
 alf_qname – defines a valid identifier for each and every attribute.
 alf_node_properties – connects the node and qname tables and stores all
properties of each node id.
 alf_access_control_list – specifies who can do what with an object in the
repository, i.e. gives the permission information.
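The relationship between these tables can be sketched with a toy example. The column names below are simplified assumptions; the real Alfresco schema carries many more columns (store references, ACL ids, audit columns):

```python
import sqlite3

# Hypothetical, simplified versions of three core Alfresco tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE alf_node (id INTEGER PRIMARY KEY, uuid TEXT);
CREATE TABLE alf_qname (id INTEGER PRIMARY KEY, local_name TEXT);
CREATE TABLE alf_node_properties (
    node_id INTEGER REFERENCES alf_node(id),
    qname_id INTEGER REFERENCES alf_qname(id),
    string_value TEXT
);
""")
con.execute("INSERT INTO alf_node VALUES (1, 'doc-uuid-001')")
con.execute("INSERT INTO alf_qname VALUES (10, 'name')")
con.execute("INSERT INTO alf_node_properties VALUES (1, 10, 'report.pdf')")

# alf_node_properties links a node to a qname, pairing each stored
# value with a well-defined property name.
rows = con.execute("""
    SELECT n.uuid, q.local_name, p.string_value
    FROM alf_node_properties p
    JOIN alf_node n ON n.id = p.node_id
    JOIN alf_qname q ON q.id = p.qname_id
""").fetchall()
print(rows)  # [('doc-uuid-001', 'name', 'report.pdf')]
```

This join is essentially what our relationship diagrams above capture: alf_node supplies the identity, alf_qname the attribute name, and alf_node_properties the value connecting the two.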
Figure 4.4 Relation Diagrams for users, documents and folders
Figure 4.5 Relational Diagrams for permissions
Figure 4.6 Relational Diagram for audits
TRANSFORMATION PHASE
5.1 INTRODUCTION
There are 97 tables in the Alfresco ECM System. To create a staging data warehouse, we
first have to perform the E.T.L. process, i.e. Extract, Transform and Load.
In computing, ETL refers to a process in database usage and especially in data
warehousing where it:
 Extracts data from homogeneous or heterogeneous data sources
 Transforms the data for storing it in proper format or structure for querying
and analysis
 Loads it into the final target (database, more specifically, operational data
store, data mart, or data warehouse)
Usually all three phases execute in parallel. Since data extraction takes time, a
transformation process runs while the data is being pulled, processing the already
received data and preparing it for loading; as soon as some data is ready to be loaded
into the target, the loading kicks off without waiting for the completion of the
previous phases.
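The overlap of the three phases can be sketched as a minimal in-memory pipeline. This is a toy illustration of the idea, not how any ETL tool is implemented internally:

```python
import queue, threading

# Extract -> transform -> load, wired together with queues so each stage
# can start working as soon as the previous stage emits its first row.
raw = queue.Queue()
clean = queue.Queue()
loaded = []

def extract(rows):
    for r in rows:
        raw.put(r)
    raw.put(None)  # sentinel: extraction finished

def transform():
    while (r := raw.get()) is not None:
        clean.put(r.strip().upper())  # transform rows as they arrive
    clean.put(None)

def load():
    while (r := clean.get()) is not None:
        loaded.append(r)  # loading starts before extraction completes

threads = [threading.Thread(target=extract, args=([" alpha ", " beta "],)),
           threading.Thread(target=transform),
           threading.Thread(target=load)]
for t in threads: t.start()
for t in threads: t.join()
print(loaded)  # ['ALPHA', 'BETA']
```

The sentinel value marks the end of each stream, which is what lets the downstream stages terminate without waiting on a global "extraction done" signal.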
ETL systems commonly integrate data from multiple applications (systems), typically
developed and supported by different vendors or hosted on separate computer
hardware. The disparate systems containing the original data are frequently managed
and operated by different employees. In our project though, there is only one source
from where the data is extracted i.e. Alfresco.
5.2 PENTAHO DATA INTEGRATION TOOL
5.2.1 Introduction
Pentaho Data Integration (or Kettle) delivers powerful extraction, transformation,
and loading (ETL) capabilities, using a metadata-driven approach. It prepares and
blends data to create a complete picture of the business that drives actionable
insights. The complete data integration platform delivers accurate, “analytics ready”
data to end users from any source.
Figure 5.1 Pentaho Data Integration Icon
In particular, Pentaho Data Integration is used to: extract Alfresco audit data into the Data
Mart and create the defined reports uploading them back to Alfresco.
5.2.2 Why Pentaho?
Figure 5.2 Pentaho Icon
5.2.2.1 Pentaho vs Jaspersoft vs BIRT
Pentaho and Jaspersoft both provide the unique advantage of being cost effective, but
they differ in features. Although Jaspersoft's report designer is comparatively better
than Pentaho Report Designer, the dashboard capabilities of Pentaho are better in terms
of functionality. This is because dashboard functionality is present only in the
Enterprise edition of Jaspersoft, whereas in Pentaho it is accessible in the
Community edition too.
When it comes to Extract, Transform and Load (ETL) tools, the Pentaho Data Integrator is
comparatively better, since Jaspersoft falls short on a few functions. For OLAP
analysis, the Pentaho Mondrian engine has a stronger case than Jaspersoft.
Pentaho users also have a huge set of choices in its plugin marketplace, which is similar
to the app stores of iOS and Android. To sum up, Jaspersoft's focus is more on reporting
and analysis, while Pentaho's focus is on data integration, ETL and workflow automation.
BIRT has also emerged as an important business intelligence tool for those who are
well versed in Java. BIRT is an Eclipse-based open source reporting system for web
applications, especially those based on Java and Java EE; it consists of a report
designer based on Eclipse and a runtime component that can be added to the app server. In
terms of basic functionality BIRT is on par with Pentaho and Jaspersoft, with perhaps a
slight advantage as it is based on Eclipse. As a typical BI tool it is also expected to
cover the common chart types. Although BIRT covers most of the charts, it falls short on
chart types like Ring, Waterfall, Step Area, Step, Difference, Thermometer and Survey
Scale, where Pentaho fills the gaps.
5.2.2.2 Conclusion
Unlike the previous two tools, Pentaho is a complete BI suite covering operations
from reporting to data mining. Its key component is Pentaho Reporting, which has a
rich feature set and is enterprise friendly. Its BI Server, a J2EE application, also
provides an infrastructure to run and view reports through a web-based user interface.
All three of these open source business intelligence and reporting tools provide a rich
feature set ready for enterprise use, and it is up to the end user to do a thorough
comparison and select among them. Major differences can be found in report
presentation, with a focus on web or print, or in the availability of a report server.
Pentaho distinguishes itself by being more than just a reporting tool, with a full
suite of components (data mining and integration).
Among organizations adopting Pentaho, one advantage often cited is its low integration
time and infrastructure cost compared to SAP BI and SAS BI, which are among the big
players in Business Intelligence. Along with that, the huge community support available
24/7, with active support forums, allows Pentaho users to discuss challenges and have
their questions answered while using the tool. Its unlimited visualizations and data
sources can handle any kind of data, coupled with a good tool set which has wide
applicability beyond just the base product.
5.2.3 COMPONENTS OF PENTAHO
Kettle is a set of tools and applications which allow data manipulation across multiple
sources. The main components of Pentaho Data Integration are:
 Spoon – It is a graphical tool that makes the design of an ETL process transformation
easy to create. It performs the typical data flow functions like reading, validating,
refining, transforming, writing data to a variety of different data sources and destinations.
Transformations designed in Spoon can be run with Kettle Pan and Kitchen.
 Pan – Pan is an application dedicated to run data transformations designed in Spoon.
 Chef – It is a tool to create jobs which automate complex database update
processes.
 Kitchen – It is an application which helps execute the jobs in a batch mode, usually
using a schedule which makes it easy to start and control the ETL processing.
 Carte – It is a web server which allows remote monitoring of the running Pentaho Data
Integration ETL processes through a web browser.
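As a rough illustration of how the command-line tools above are driven, the following hypothetical Python helper assembles a Kitchen command line from the /rep, /job, /dir, /user, /pass and /level options used later in this report. The helper name and structure are our own:

```python
# Hypothetical wrapper that builds (but does not run) the Kitchen command
# line for executing a repository job, mirroring the option names used by
# kitchen.bat on Windows.
def kitchen_cmd(repo, job, directory, user, password, level="Basic"):
    return ["kitchen.bat",
            f'/rep:"{repo}"', f'/job:"{job}"', f"/dir:{directory}",
            f"/user:{user}", f"/pass:{password}", f"/level:{level}"]

cmd = kitchen_cmd("AAAR_Kettle", "Get all", "/Alfresco", "admin", "admin")
print(" ".join(cmd))
```

In practice such a list could be handed to a process launcher or a scheduler, which is how the batch execution described for Kitchen is normally automated.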
5.3 ALFRESCO AUDIT ANALYSIS AND REPORTING TOOL
5.3.1 Introduction
Alfresco is one of the most widely used open source content management systems, and
though it is not part of its core functionality, getting metrics out of the Alfresco
system is crucial.
Figure 5.3 A.A.A.R. Icon
To that end, a full-fledged audit layer was built on top of Alfresco using Pentaho. The
principle is to build a data mart properly optimized for the information being extracted
from the system and to do all the discovery on top of that. This requires an ETL tool
and, once the data mart is built, Pentaho for reporting and exploration on top of that
data warehouse. This in-between tool is called AAAR – Alfresco Audit Analysis and
Reporting.
5.3.2 Working and Installation of A.A.A.R.
Alfresco Content Management System can be seen as a primary source and generates
only raw data. On the other hand, Pentaho is a pure BI environment and consists of some
suitable integration and reporting tools.
Thus, A.A.A.R. extracts audit data from the Alfresco E.C.M., stores the data in the Data
Mart, creates reports in well-known formats and publishes them again in the Alfresco
E.C.M.
Figure 5.4 Working of A.A.A.R.
Alfresco E.C.M. is, at the same time, the source and the target of the flow. As the
source of the flow, Alfresco E.C.M. is enabled with the audit service to track all
activities, with detailed information about who did what on the system, and when.
Logins (failed or successful), creation of content, creation of folders, and adding or
removing of properties or aspects are only some examples of what is tracked by the
audit service.
5.3.2.1 Prerequisites
1. Alfresco E.C.M.
2. PostGreSQL/MySQL
3. Pentaho Data Integration Tool
4. Pentaho Report Designer Tool
5.3.2.2 Enabling Alfresco Audit Service
The very first task is to activate the audit service in Alfresco by performing the following actions.
1. Stop Alfresco.
2. In '<Alfresco>/tomcat/shared/classes/alfresco-global.properties' append:
# Alfresco Audit service
audit.enabled=true
audit.alfresco-access.enabled=true
# Alfresco FTP service
## ATTENTION: Don’t do it if already enabled!
ftp.enabled=true
ftp.port=8082
3. Start Alfresco.
4. Login into Alfresco to have the very first audit data.
5.3.2.3 Data Mart Creation and Configuration
1. Open a terminal
2. For the PostgreSQL platform use:
cd <PostgreSQL bin>
psql -U postgres -f "<AAAR folder>/AAAR_DataMart.sql"
(use ‘psql.exe’ on Windows platform and ‘./psql’ on Linux based platforms)
3. Exit
4. Extract ‘reports.zip’ in the ‘data-integration’ folder. ‘reports.zip’ contains 5 files
with the ‘prpt’ extension, each containing one Pentaho Report Designer report. By
default, and to keep report production simpler, they are saved in the default folder:
‘data-integration’.
5. Update the ‘dm_dim_alfresco’ table with the proper environment settings. Each row of
the table represents one Alfresco installation, and for that reason the table is defined
with a unique row by default, as described below.
desc with value ‘Alfresco’.
login with value ‘admin’.
password with value ‘admin’.
url with value ‘http://localhost:8080/alfresco/service/api/audit/query/alfresco-access?verbose=true&limit=100000’.
is_active with value ‘Y’.
6. Update ‘dm_reports’ table with your target settings.
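The ‘url’ value stored in ‘dm_dim_alfresco’ is simply the audit REST endpoint plus two query parameters. The short sketch below rebuilds it; the host and port are assumptions matching a default local installation:

```python
from urllib.parse import urlencode

# Base audit query endpoint for the 'alfresco-access' audit application
# (host/port assume a default local Alfresco install).
base = "http://localhost:8080/alfresco/service/api/audit/query/alfresco-access"

# verbose=true returns full entry detail; limit caps the number of entries.
params = {"verbose": "true", "limit": "100000"}
url = f"{base}?{urlencode(params)}"
print(url)
```

Raising or lowering ‘limit’ trades extraction batch size against memory use during each A.A.A.R. run.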
5.3.2.4 PDI Repository Settings
The third task is to set the Pentaho Data Integration Jobs properly.
1. Open a terminal
2. For the PostgreSQL platform use:
cd <PostgreSQL bin>
psql -U postgres -f "<AAAR folder>/AAAR_Kettle.sql"
(use ‘psql.exe’ on Windows platform and ‘./psql’ on Linux based platforms)
3. Exit
4. To set the Pentaho Data Integration repository:
i. Open a new terminal.
cd <data-integration>
ii. Launch ‘Spoon.bat’ if you are on Windows platform or ‘./Spoon.sh’ if you are
on Linux based platforms.
iii. Click on the green plus to add a new repository and define a new repository
connection in the database.
Figure 5.5 Step 1
iv. Add a new database connection to the repository.
Figure 5.6 Step 2
v. If you choose a PostgreSQL platform, set the parameters described in the
image below. At the end, push the test button to check the database connection.
Figure 5.7 Step 3
vi. Set the ID and Name fields and press the ‘ok’ button. Be careful not to push the
‘create or upgrade’ button, otherwise the E.T.L. will be damaged.
Figure 5.8 Step 4
vii. Connect with the login ‘admin’ and password ‘admin’ to test the connection.
Figure 5.9 Step 5
viii. If everything succeeds, you see the Pentaho Data Integration (Kettle) panel.
From this panel, click on Tools -> Repository -> Explore.
Figure 5.10 Step 6
ix. Click on the ‘Connections’ tab and edit (the pencil icon at the top right) the
AAAR_DataMart connection. The image below shows the PostgreSQL case, but with
MySQL it is exactly the same.
Figure 5.11 Step 7
x. Modify the parameters and click the test button to check. If everything
succeeds you can close all windows. The image below shows the PostgreSQL case,
but with MySQL it is exactly the same.
Figure 5.12 Step 8
5.3.2.5 First Import
Now you are ready to get the audit data into the Data Mart, create the reports, and
publish them to Alfresco.
Open a terminal
cd <data-integration>
kitchen.bat /rep:"AAAR_Kettle" /job:"Get all" /dir:/Alfresco /user:admin /pass:admin /level:Basic
kitchen.bat /rep:"AAAR_Kettle" /job:"Report all" /dir:/Alfresco /user:admin /pass:admin /level:Basic
Finally, you can access Alfresco and look in the repository root, where the reports are
uploaded by default.
5.3.3 Audit Data Mart
On the other side of the represented flow, there is a database storing the extracted
audit data, organized in a specific Audit Data Mart. A Data Mart is a structure
that is usually oriented to a specific business line or team and, in this case, represents the
audited actions in the Alfresco E.C.M.
Figure 5.13 Audit Data Mart
5.3.4 Dimension Tables
The implemented Data Mart implements a single star schema with only one measure (the
number of audited actions) and the dimensions listed below:
1. Alfresco instances to manage multiple sources of auditing data.
2. Alfresco users with a complete name.
3. Alfresco contents complete with the repository path.
4. Alfresco actions (login, failedLogin, read, addAspect, etc.).
5. Date of the action. Groupable in day, month and year.
6. Time of the action. Groupable in minute and hour.
Figure 5.14 Dimension Tables
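A minimal sketch of this star schema, with hypothetical table and column names, shows how the single measure rolls up along the date dimension:

```python
import sqlite3

# Toy star schema mirroring the dimensions above: one fact table counting
# audited actions, joined to user, action and date dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_user   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_action (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date   (id INTEGER PRIMARY KEY, day INT, month INT, year INT);
CREATE TABLE fact_audit (user_id INT, action_id INT, date_id INT, actions INT);
""")
con.execute("INSERT INTO dim_user VALUES (1, 'admin')")
con.execute("INSERT INTO dim_action VALUES (1, 'login'), (2, 'read')")
con.execute("INSERT INTO dim_date VALUES (1, 15, 3, 2014)")
con.executemany("INSERT INTO fact_audit VALUES (?,?,?,?)",
                [(1, 1, 1, 3), (1, 2, 1, 7)])

# Roll the single measure up by month, as the 'groupable' date dimension allows.
rows = con.execute("""
    SELECT d.year, d.month, SUM(f.actions)
    FROM fact_audit f JOIN dim_date d ON d.id = f.date_id
    GROUP BY d.year, d.month
""").fetchall()
print(rows)  # [(2014, 3, 10)]
```

The same GROUP BY pattern, applied to the user or action dimensions instead of the date, gives every report variant described in the following sections.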
5.4 TRANSFORMATIONS USING SPOON
Spoon is the DI design tool component. The DI Server is a core component that
executes data integration jobs and transformations using the Pentaho Data Integration
engine. It also provides services that allow you to schedule and monitor scheduled
activities.
Drag elements onto the Spoon canvas, or choose from a rich library of more than 200
pre-built steps to create a series of data integration processing instructions.
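Conceptually, a transformation is a chain of steps, each consuming and emitting rows. The toy generator pipeline below mimics that flow (read, trim, de-duplicate); it is an illustration of the step model only, not Spoon's actual engine:

```python
# Each "step" consumes an iterable of row dicts and yields transformed rows,
# so steps can be chained exactly like boxes on the Spoon canvas.
def read_step(rows):
    yield from rows

def trim_step(rows):
    for r in rows:
        yield {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}

def filter_step(rows, key):
    seen = set()
    for r in rows:  # drop duplicate rows on a key field
        if r[key] not in seen:
            seen.add(r[key])
            yield r

source = [{"name": " report.pdf "}, {"name": "report.pdf"}]
out = list(filter_step(trim_step(read_step(source)), "name"))
print(out)  # [{'name': 'report.pdf'}]
```

Because generators are lazy, rows stream through the chain one at a time, which is the same streaming behaviour a row-based ETL engine relies on.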
5.5 EXAMPLE TRANSFORMATION
A few of the transformations we created using Spoon are listed below:
1. Document Information
2. Document Permission
3. Folder Information
4. Folder Permission
5. User Information
Figure 5.15 Document Information Transformation
Figure 5.16 Document Permission Transformation
Figure 5.17 Folder Information Transformation
Figure 5.18 Folder Permission Transformation
Figure 5.19 User Information Transformation
REPORTING PHASE
6.1 WHAT IS A REPORT?
In its most basic form, a report is a document that contains information for the reader.
Computer-generated reports refine data from various sources into a human-readable form.
Report documents make it easy to distribute specific fact-based information throughout
the company. Reports are also used by management in decision making.
6.2 PENTAHO REPORT DESIGNER TOOL
6.2.1 Introduction
Pentaho Reporting is a suite of tools for creating pixel perfect reports. With Pentaho
Reporting, we are able to transform data into meaningful information. You can create
HTML, Excel, PDF, Text or printed reports. If you are a developer, you can also produce
CSV and XML reports to feed other systems.
Figure 6.1 Pentaho Reporting Tool Icon
It helps in transforming data into meaningful information tailored to your audience,
allowing you to create pixel-perfect reports in PDF, Excel, HTML, Text, Rich Text File,
XML and CSV formats.
6.2.2 Working of Pentaho Report Designer Tool
Once the transformations are completed using K.E.T.T.L.E., we can import the transformed
data from the Data Mart into the Pentaho Report Designer tool with the help of SQL.
The Pentaho Report Designer tool has a large selection of elements (text fields, labels
etc.) and various GUI representation techniques like pie charts, tables, graphs etc.,
with which we can create our reports.
6.3 EXAMPLE REPORTS
According to the transformations done using Spoon, we created reports for the following
requirements using Pentaho Report Designer:
1. Document Information
2. Document Permission
3. Folder Information
4. Folder Permission
5. User Information
Figure 6.2 Document Information Report
Figure 6.3 Document Permission Report
Figure 6.4 Folder Information Report
Figure 6.5 Folder Permission Report
Figure 6.6 User Information Report
PUBLISHING PHASE
7.1 INTRODUCTION
After the reports are made using the designer tool, we need to publish them on the
server. Pentaho BI Server or BA Platform allows you to access business data in the form
of dashboards, reports or OLAP cubes via a convenient web interface. Additionally, it
provides an interface to administer your BI setup and schedule processes. Different
output types are also available, like PDF, HTML, CSV etc.
7.2 PENTAHO BI SERVER
7.2.1 Introduction
It is commonly referred to as the BI Platform, and was recently renamed the Business
Analytics Platform (BA Platform). It is the core software piece that hosts content
created either in the server itself through plug-ins or published to the server from
the desktop applications. Out of the box, it includes features for managing security,
running reports, displaying dashboards, report bursting, scripted business rules,
OLAP analysis and scheduling.
Figure 7.1 Pentaho BI Server Icon
The commercial plug-ins from Pentaho expand the out-of-the-box features. A few open-
source plug-in projects also expand the capabilities of the server. The Pentaho BA
Platform runs in the Apache Tomcat Java application server and can be embedded into
other Java application servers.
7.2.2 Example Published Reports
According to the reports we have created, the following reports can be deployed on the
Web:-
1. Document Information
2. Document Permission
3. Folder Information
4. Folder Permission
5. User Information
Figure 7.2 Document Information Published Report
Figure 7.3 Document Permission Published Report
Figure 7.4 Folder Information Published Report
Figure 7.5 Folder Permission Published Report
Figure 7.6 User Information Published Report
7.3 SCHEDULING OF TRANSFORMATIONS
Once the project is completed, for real-time usage the data warehouse needs to be
updated at a regular interval. For that purpose, we have to schedule our project so
that it gets updated every day, reflecting the changes done in the last 24 hours.
There are three ways to perform scheduling:
1. Using the schedule option from the action menu in Spoon.
2. Using the start element in a job, i.e. .kjb (Kettle job) files.
3. Using the task scheduler.
Usually the first method is preferred in industry, but as we are working on the
Community edition, the scheduling option is not provided. The second method is used
just for jobs and does not update transformations, so it was not suitable either.
So we scheduled the project using the task scheduler. All the transformations have
been scheduled to run daily at 11:00 am. The project has been deployed on the web and
submitted to our external guide. It will be used further by IPR on a web server for
real-time usage.
Figure 7.7 Scheduling of Transformations
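The daily 11:00 am rule handed to the task scheduler boils down to a simple next-run computation, sketched below (an illustration of the rule, not the scheduler's own code):

```python
from datetime import datetime, timedelta

# Given the current time, find the next daily 11:00 slot at which the
# scheduled transformations should run.
def next_run(now):
    run = now.replace(hour=11, minute=0, second=0, microsecond=0)
    if now >= run:          # 11:00 already passed today -> run tomorrow
        run += timedelta(days=1)
    return run

print(next_run(datetime(2014, 3, 15, 9, 30)))   # 2014-03-15 11:00:00
print(next_run(datetime(2014, 3, 15, 14, 0)))   # 2014-03-16 11:00:00
```

Any change to the recurrence (e.g. running twice a day) only needs a different candidate-slot computation in this one function.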
TESTING
8.1 TESTING STRATEGY
Data completeness: Ensures that all expected data is loaded into the target table.
1. Compare record counts between source and target and check for any rejected
records.
2. Check that data is not truncated in the columns of the target table.
3. Check that unique values are loaded into the target; no duplicate records should
exist.
4. Check boundary value analysis.
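The first two completeness checks can be expressed directly in SQL. The sketch below uses an in-memory SQLite database with made-up source and target tables:

```python
import sqlite3

# Toy source/target tables standing in for the Alfresco source and the
# Data Mart target of our ETL run.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE src (id INT, name TEXT);
CREATE TABLE tgt (id INT, name TEXT);
INSERT INTO src VALUES (1,'a'),(2,'b');
INSERT INTO tgt VALUES (1,'a'),(2,'b');
""")

# Check 1: row counts must match, otherwise records were rejected or lost.
src_n = con.execute("SELECT COUNT(*) FROM src").fetchone()[0]
tgt_n = con.execute("SELECT COUNT(*) FROM tgt").fetchone()[0]

# Check 3: no key may appear more than once in the target.
dups = con.execute(
    "SELECT id FROM tgt GROUP BY id HAVING COUNT(*) > 1").fetchall()

assert src_n == tgt_n, "rejected or missing records"
assert not dups, "duplicate keys loaded into target"
print("completeness checks passed")
```

The same two queries, pointed at the real source and target connections, are what an automated completeness test would run after every load.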
Data quality: Ensures that the ETL application correctly rejects, substitutes default
values, corrects or ignores and reports invalid data.
Data cleanness: Unnecessary columns should be deleted before loading into the staging
area.
1. Example: If a column has a name with extra spaces, we have to trim the
spaces before loading into the staging area; with the help of an expression
transformation the space will be trimmed.
2. Example: Suppose the telephone number and STD code are in different columns and
the requirement says they should be in one column; then with the help of an
expression transformation we concatenate the values into one column.
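Both cleaning examples above map to simple row operations. The sketch below uses hypothetical column names (name, std_code, phone_no):

```python
# Row-level cleaning: trim stray spaces from the name column and merge
# STD code + phone number into a single phone column.
def clean(row):
    row["name"] = row["name"].strip()
    row["phone"] = f'{row.pop("std_code")}-{row.pop("phone_no")}'
    return row

row = clean({"name": "  Swati ", "std_code": "0268", "phone_no": "2520502"})
print(row)  # {'name': 'Swati', 'phone': '0268-2520502'}
```

In Kettle the same two operations are a "String operations" (trim) step and a concatenation in an expression/calculator step, applied before the staging load.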
Data transformation: All the business logic implemented using ETL transformations
should be reflected correctly.
Integration testing: Ensures that the ETL process functions well with other upstream and
downstream processes.
User-acceptance testing: Ensures the solution meets users’ current expectations and
anticipates their future expectations.
Regression testing: Ensures existing functionality remains intact each time a new release
of code is completed.
8.2 TESTING METHODS
• Functional test: it verifies that the item is compliant with its specified business
requirements.
• Usability test: it evaluates the item by letting users interact with it, in order to verify that
the item is easy to use and comprehensible.
• Performance test: it checks that the item performance is satisfactory under typical
workload conditions.
• Stress test: it shows how well the item performs with peak loads of data and very heavy
workloads.
• Recovery test: it checks how well an item is able to recover from crashes, hardware
failures and other similar problems.
• Security test: it checks that the item protects data and maintains functionality as
intended.
• Regression test: it checks that the item still functions correctly after a change has
occurred.
8.3 TEST CASES
8.3.1 USER LOGIN AND USING THE FUNCTIONALITY OF REPORT
Description: This test will validate the user name and password, and the user will be
able to select the desired report format with the desired selection options.
Table 8.1 Test Case 1
Sr. No | Test Case | Expected Output | Actual Output | Test Case Status
1 | User logs in to his/her page | BA server should open | BA server page opens | Pass
2 | User views a report | Report should be displayed | User is able to view report | Pass
3 | User selects the output format while viewing | User must see the desired output format | Desired format of the report is displayed | Pass
4 | User filters the report view | User should see the filtered report | User is able to view the desired report | Pass
8.3.2 VIEWING DOCUMENTS, FOLDERS, PERMISSIONS, AUDITS
Description: This test case will check whether the user is able to view the data of
folders, documents, their permissions and audit data.
Table 8.2 Test Case 2
Sr. No | Test Case | Expected Output | Actual Output | Test Case Status
1 | User views the documents | Document details should be displayed | Document is seen | Pass
2 | User views the folders | Folder should be displayed | Folder is seen | Pass
3 | User views the permissions of folders and documents | Permissions must be seen by user | Permissions displayed | Pass
4 | User views the auditing data | Audit data must be displayed | Audit data is seen by user | Pass
USER MANUAL
9.1 DESCRIPTION
This manual describes the working and use of the project, so as to help end users get
familiar with its features.
Our project is divided into three levels. These are:-
1. Source Level
2. DWH Level
3. View Level
The source level is the back-end of our project, i.e. the Alfresco database. The DWH
level is PostgreSQL, used in creating our Data Mart. The view level is the Pentaho tools.
The users will be able to see the view level of the project, specifically the Pentaho
Business Analytics tool where the published reports are deployed. Once in the BA
dashboard, the user can use many functionalities of it. The functionalities are listed
below:-
1. Login Page
2. View Reports
3. Scheduling
4. Administration
9.2 LOGIN PAGE
Before using the BA server, a user has to log in to the server using his assigned user
name and password, so that the system knows which user has accessed the server and at
what time. This helps with security.
To log in, we follow the steps below:
1. We have to go to the BI server folder using the command prompt. After we have
changed the directory to the BI server, we need to start Pentaho.
Figure 9.1 Login Step 1
2. Once we start Pentaho, the system automatically runs Apache Tomcat.
Figure 9.2 Login Step 2
3. If Tomcat doesn’t find any errors, it opens the user console of the Pentaho BA
server. The user can now log in to the server using their own user name and
password.
Figure 9.3 Login Step 3
9.3 VIEW REPORTS
The main requirement of the user is to view the reports in the web browser so as to
make decisions, among various other uses. To do that, the user has to follow these steps:
1. Once we log in, the Home screen opens up as shown in the given figure. For
viewing reports, we have to select ‘Browse Files’ (1) from the drop-down list.
Figure 9.4 View Reports Step 1
2. Once we select the ‘Browse Files’ option, the console opens up the ‘Folders’
(2) in Home and the associated ‘Files’ (3) of the folder we select in the file box.
There is also a ‘Folder Actions’ (4) option provided in the console, which
helps with various functions like creating a new folder, deleting a folder etc.
3. To view a report, we have to select the report from the file box. For example, to
see the Document Permissions report, we click the docpermission-rep (5) file in
the file box. It opens the Document Permissions report (6) in the browser.
Figure 9.5 View Reports Step 2
4. We can apply filters to the report. For example, in this report we can filter and
list the documents according to permissions by selecting the appropriate
permission (7). Here, we have selected the ‘Read’ permission from the ‘select
permissions’ filter.
We can also view reports in different styles by selecting the appropriate style
from ‘Output Type’ (8). Here, we have selected the HTML (Single Page) type.
9.4 SCHEDULING
You can schedule reports to run automatically. All of your active scheduled reports
appear in the list of schedules, which you can get to by clicking the Home drop-down
menu, then the Schedules link, in the upper-left corner of the User Console page. You can
also access the list of schedules from the Browse Files page, if you have a report selected.
The list of schedules shows which reports are scheduled to run, the recurrence pattern for
the schedule, when it was last run, when it is set to run again, and the current state of the
schedule.
Figure 9.6 Scheduling Page
Table 9.1 Scheduling options
Item Name | Function
Schedules indicator | Indicates the current User Console perspective that you are using. Schedules displays a list of schedules that you create, a toolbar to work with your schedules, and a list of times that your schedules are blocked from running.
Schedule Name | Lists your schedules by the name you assign to them. Click the arrow next to Schedule Name to sort schedules alphabetically in ascending or descending order.
Repeats | Describes how often the schedule is set to run.
Source File | Displays the name of the file associated with the schedule.
Output Location | Shows the location where the scheduled report is saved.
Last Run | Shows the last time and date when the schedule was run.
Next Run | Shows the next time and date when the schedule will run again.
Status | Indicates the current status of the schedule. The state can be either Normal or Paused.
Blockout Times | Lists the times when all schedules are blocked from running.
You can edit and maintain each of your schedules by using the controls above the
schedules list, on the right end of the toolbar.
Table 9.2 Scheduling Controls
Icon Name | Function
Refresh | Refreshes the list of schedules.
Run Now | Runs the selected schedule(s) at will.
Stop Scheduled Task | Pauses a specified schedule. Use Start Scheduled Task to resume paused jobs.
Start Scheduled Task | Resumes a previously stopped schedule.
Edit Scheduled Task | Edits the details of an existing schedule.
Remove Scheduled Task | Deletes a specified schedule. If the schedule is currently running, it continues to run, but it will not run again.
9.5 ADMINISTRATION
The User Console has one unified place, called the Administration page, where people
logged in with a role that has permissions to administer security can perform system
configuration and maintenance tasks. If you see Administration in the left drop-down
menu on the User Console Home page, you can click it to reveal menu items having to do
with administration of the BA Server. If you do not have administration privileges,
Administration does not appear on the home page.
Figure 9.7 Administration Page
Table 9.3 Administration Options

1. Administration: Opens the Administration perspective of the User Console, which enables you to set up users, configure the mail server, change authentication settings on the BA Server, and install software licenses for Pentaho.
2. Users & Roles: Manages the Pentaho users or roles for the BA Server.
3. Authentication: Sets the security provider for the BA Server to either the default Pentaho Security or LDAP/Active Directory.
4. Mail Server: Sets up the outgoing email server and the account used to send reports through email.
5. Licenses: Manages Pentaho software licenses.
6. Settings: Manages settings for deleting older generated files, either manually or on a schedule.
LIMITATIONS AND FUTURE ENHANCEMENTS
10.1 LIMITATIONS
 All the data is stored in a single repository in Alfresco. If backups of this repository are not managed properly, there is a risk of data loss.
 Since the Community Edition of the Pentaho Data Integration tool provides only a limited set of functionalities, scheduling had to be done manually.
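In practice, manual scheduling can be handled by invoking PDI's command-line transformation runner, Pan, from the operating system's own scheduler (for example, cron). The helper below only assembles the Pan command line; the installation path and transformation file are hypothetical placeholders, not the actual paths used in this project.

```python
import subprocess  # needed only when actually launching Pan

def pan_command(pan="/opt/pentaho/data-integration/pan.sh",  # hypothetical path
                ktr="/opt/AAAR/Get_documents.ktr",           # hypothetical .ktr
                log_level="Basic"):
    """Assemble the command line that runs one PDI transformation with Pan."""
    return [pan, f"-file={ktr}", f"-level={log_level}"]

# A nightly crontab entry with the same invocation would look like:
#   0 2 * * * /opt/pentaho/data-integration/pan.sh -file=/opt/AAAR/Get_documents.ktr -level=Basic
cmd = pan_command()
# subprocess.run(cmd, check=True)  # uncomment on a machine with PDI installed
```

Delegating the timing to cron keeps the Community Edition workflow reproducible without the Enterprise scheduler.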
10.2 FUTURE ENHANCEMENTS
 We consolidated the 97 Alfresco tables into 29 tables in the data warehouse. This number could be reduced further in the future to improve efficiency.
 More sophisticated requirements, such as hyperlink functions and ticket generation for employees, can be implemented.
CONCLUSION AND DISCUSSION
11.1 SELF ANALYSIS OF PROJECT VIABILITIES
11.1.1 Self Analysis
We have created an information repository, i.e. a data warehouse, from the existing Alfresco database system. We have successfully installed the application, tested its performance on several fronts, and completed validation testing. The project has been accomplished in a way that incorporates the features demanded by present report-generation and decision-making requirements.
11.1.2 Project Viabilities
This project has been completed successfully and is viable for use at the Institute for Plasma Research as a tool for generating reports from the data stored in its Alfresco database. The reports are user-friendly, with strong GUI support through a host of graphical options such as pie charts, line graphs and bar charts. They make decision making easier for the management department.
11.2 PROBLEMS ENCOUNTERED AND POSSIBLE SOLUTIONS
 Alfresco was a new system that we had never used before, and for the first three to four weeks it was difficult to understand all of its functionality. It therefore took time to gain a full working knowledge of these technologies.
 The Alfresco GUI was not accessible on either of our computers, so we had to install PostgreSQL and the SQuirreL SQL client in order to work with the repository database directly.
 It took time to finalize the ETL and reporting tools. We finally settled on Pentaho over JasperSoft and BIRT.
 Pentaho is essentially a collection of tools, with each stage of our project handled by a particular tool or system. We therefore had to familiarize ourselves with a host of Pentaho tools.
 Alfresco Audit Analysis and Reporting (A.A.A.R.) did not convert many of our tables while transforming them into the data warehouse, so we had to transform those tables manually.
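The kind of manual transformation described above amounts to a plain SQL extract-and-load. The sketch below illustrates the idea using SQLite as a stand-in for the PostgreSQL staging and warehouse databases; the table and column names are illustrative only, not the actual Alfresco or A.A.A.R. schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
cur = con.cursor()

# Toy source table (Alfresco side) and warehouse dimension (illustrative names).
cur.execute("CREATE TABLE alf_node (id INTEGER, uuid TEXT, type TEXT)")
cur.execute("CREATE TABLE dim_document (doc_key INTEGER PRIMARY KEY, uuid TEXT)")
cur.executemany("INSERT INTO alf_node VALUES (?, ?, ?)",
                [(1, "a1", "cm:content"),
                 (2, "b2", "cm:folder"),
                 (3, "c3", "cm:content")])

# The manual transformation: extract document rows only, load the dimension.
cur.execute("""INSERT INTO dim_document (uuid)
               SELECT uuid FROM alf_node WHERE type = 'cm:content'""")
con.commit()

rows = cur.execute("SELECT uuid FROM dim_document ORDER BY doc_key").fetchall()
# rows -> [('a1',), ('c3',)]
```

Each table that A.A.A.R. skipped was handled with a statement of this shape, tailored to the target dimension.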
11.3 SUMMARY OF PROJECT WORK
 PROJECT TITLE
DATA AND BUSINESS PROCESS INTELLIGENCE
It is a project based on the subject of data mining. A data warehouse is created, from which data is drawn to generate user-friendly reports.
 PROJECT PLATFORM
PENTAHO
It is an open-source provider of reporting, analysis, dashboard, data mining and
workflow capabilities.
 SOFTWARE USED
Windows/Linux based system
PostgreSQL Database
SQuirreL Database
Alfresco ECM
Pentaho Community Edition 5.0 (PDI, Reporting Tool, BI Server)
Alfresco Audit Analysis and Reporting (A.A.A.R.) tool
Notepad++
 DOCUMENTATION TOOLS
VISIO 2013
WORD 2007
EXCEL 2007
 INTERNAL PROJECT GUIDE
PROF. R.S. CHHAJED
 EXTERNAL PROJECT GUIDE
MR. VIJAY PATEL
 COMPANY
INSTITUTE FOR PLASMA RESEARCH
 SUBMITTED BY
BHAGAT FARIDA H.
SINGH SWATI
 SUBMITTED TO
DHARMSINH DESAI UNIVERSITY
 PROJECT DURATION
8TH DEC 2014 TO 28TH MARCH 2015
REFERENCES
http://wiki.pentaho.com/display/Reporting/01.+Creating+Your+First+Report
http://infocenter.pentaho.com/help/index.jsp?topic=%2Freport_designer_user_guide%2Ftask_adding_hyperlinks.html
http://www.robertomarchetto.com/pentaho_report_parameter_example
http://docs.alfresco.com/4.2/concepts/alfresco-arch-about.html
http://fcorti.com/alfresco-audit-analysis-reporting/aaar-description-of-the-solution/aaar-pentaho-data-integration/
http://en.wikipedia.org/wiki/Pentaho
http://www.joyofdata.de/blog/getting-started-with-pentaho-bi-server-5-mondrian-and-saiku/
https://technet.microsoft.com/en-us/library/aa933151(v=sql.80).aspx
http://datawarehouse4u.info/OLTP-vs-OLAP.html
More Related Content

What's hot

Project.12
Project.12Project.12
Project.12GS Kosta
 
DATA MINING FRAMEWORK TO ANALYZE ROAD ACCIDENT DATA
DATA MINING FRAMEWORK TO ANALYZE ROAD ACCIDENT DATADATA MINING FRAMEWORK TO ANALYZE ROAD ACCIDENT DATA
DATA MINING FRAMEWORK TO ANALYZE ROAD ACCIDENT DATAAishwarya Saseendran
 
NIC Project Final Report
NIC Project Final ReportNIC Project Final Report
NIC Project Final ReportKay Karanjia
 
E learning resource Locator Project Report (J2EE)
E learning resource Locator Project Report (J2EE)E learning resource Locator Project Report (J2EE)
E learning resource Locator Project Report (J2EE)Chiranjeevi Adi
 
Placement management system
Placement management systemPlacement management system
Placement management systemMehul Ranavasiya
 
Mcsp 060 project guidelines july 2012
Mcsp 060 project guidelines july 2012Mcsp 060 project guidelines july 2012
Mcsp 060 project guidelines july 2012Abhishek Verma
 
Software design of library circulation system
Software design of  library circulation systemSoftware design of  library circulation system
Software design of library circulation systemMd. Shafiuzzaman Hira
 
report_FYP_Nikko_23582685
report_FYP_Nikko_23582685report_FYP_Nikko_23582685
report_FYP_Nikko_23582685Nikko Hermawan
 
SRS for student database management system
SRS for student database management systemSRS for student database management system
SRS for student database management systemSuman Saurabh
 
BIT (Building Material Retail Online Store) Project Nay Linn Ko
BIT (Building Material Retail Online Store) Project Nay Linn KoBIT (Building Material Retail Online Store) Project Nay Linn Ko
BIT (Building Material Retail Online Store) Project Nay Linn KoNay Linn Ko
 
Human Resource Management System
Human Resource Management SystemHuman Resource Management System
Human Resource Management SystemAdam Waheed
 
Facial recognition attendance system
Facial recognition attendance systemFacial recognition attendance system
Facial recognition attendance systemKuntal Faldu
 
Training and placement –innobuzz pune
Training and placement –innobuzz puneTraining and placement –innobuzz pune
Training and placement –innobuzz punediwakar sharma
 
Dormitory management system project report
Dormitory management system project reportDormitory management system project report
Dormitory management system project reportShomnath Somu
 
Bpr Project - Attendance Management System
Bpr Project - Attendance Management SystemBpr Project - Attendance Management System
Bpr Project - Attendance Management SystemGihan Timantha
 

What's hot (20)

Project.12
Project.12Project.12
Project.12
 
DATA MINING FRAMEWORK TO ANALYZE ROAD ACCIDENT DATA
DATA MINING FRAMEWORK TO ANALYZE ROAD ACCIDENT DATADATA MINING FRAMEWORK TO ANALYZE ROAD ACCIDENT DATA
DATA MINING FRAMEWORK TO ANALYZE ROAD ACCIDENT DATA
 
NIC Project Final Report
NIC Project Final ReportNIC Project Final Report
NIC Project Final Report
 
E learning resource Locator Project Report (J2EE)
E learning resource Locator Project Report (J2EE)E learning resource Locator Project Report (J2EE)
E learning resource Locator Project Report (J2EE)
 
Placement management system
Placement management systemPlacement management system
Placement management system
 
Report 2
Report 2Report 2
Report 2
 
Mcsp 060 project guidelines july 2012
Mcsp 060 project guidelines july 2012Mcsp 060 project guidelines july 2012
Mcsp 060 project guidelines july 2012
 
Software design of library circulation system
Software design of  library circulation systemSoftware design of  library circulation system
Software design of library circulation system
 
report_FYP_Nikko_23582685
report_FYP_Nikko_23582685report_FYP_Nikko_23582685
report_FYP_Nikko_23582685
 
FYP 2 REPORT AMIRUL ARIFF
FYP 2 REPORT AMIRUL ARIFFFYP 2 REPORT AMIRUL ARIFF
FYP 2 REPORT AMIRUL ARIFF
 
SRS for student database management system
SRS for student database management systemSRS for student database management system
SRS for student database management system
 
BIT (Building Material Retail Online Store) Project Nay Linn Ko
BIT (Building Material Retail Online Store) Project Nay Linn KoBIT (Building Material Retail Online Store) Project Nay Linn Ko
BIT (Building Material Retail Online Store) Project Nay Linn Ko
 
Impro
ImproImpro
Impro
 
SAD Final Assignment
SAD Final AssignmentSAD Final Assignment
SAD Final Assignment
 
Human Resource Management System
Human Resource Management SystemHuman Resource Management System
Human Resource Management System
 
Facial recognition attendance system
Facial recognition attendance systemFacial recognition attendance system
Facial recognition attendance system
 
Training and placement –innobuzz pune
Training and placement –innobuzz puneTraining and placement –innobuzz pune
Training and placement –innobuzz pune
 
Dormitory management system project report
Dormitory management system project reportDormitory management system project report
Dormitory management system project report
 
Bpr Project - Attendance Management System
Bpr Project - Attendance Management SystemBpr Project - Attendance Management System
Bpr Project - Attendance Management System
 
Srs sample
Srs sampleSrs sample
Srs sample
 

Similar to DATA AND BUSINESS PROCESS INTELLIGENCE

Android technical quiz app
Android technical quiz appAndroid technical quiz app
Android technical quiz appJagdeep Singh
 
FINAL REPORT DEC
FINAL REPORT DECFINAL REPORT DEC
FINAL REPORT DECAxis Bank
 
Student portal system application -Project Book
Student portal system application -Project BookStudent portal system application -Project Book
Student portal system application -Project BookS.M. Fazla Rabbi
 
Minor Project Synopsis on Data Structure Visualizer
Minor Project Synopsis on Data Structure VisualizerMinor Project Synopsis on Data Structure Visualizer
Minor Project Synopsis on Data Structure VisualizerRonitShrivastava057
 
Project for Student Result System
Project for Student Result SystemProject for Student Result System
Project for Student Result SystemKuMaR AnAnD
 
E learning project report (Yashraj Nigam)
E learning project report (Yashraj Nigam)E learning project report (Yashraj Nigam)
E learning project report (Yashraj Nigam)Yashraj Nigam
 
Minor project report format for 2018 2019 final
Minor project report format for 2018 2019 finalMinor project report format for 2018 2019 final
Minor project report format for 2018 2019 finalShrikantkumar21
 
Report on design and development of low cost 3d printer
Report on design and development of low cost 3d printerReport on design and development of low cost 3d printer
Report on design and development of low cost 3d printerApurva Tolia
 
TY BSc.IT Blackbook Cover page
TY BSc.IT  Blackbook   Cover pageTY BSc.IT  Blackbook   Cover page
TY BSc.IT Blackbook Cover pageAkashChauhan139
 
Study space(report)
Study space(report)Study space(report)
Study space(report)ajaycparmar
 
E filling system (report)
E filling system (report)E filling system (report)
E filling system (report)Badrul Alam
 
IRJET- Course outcome Attainment Estimation System
IRJET-  	  Course outcome Attainment Estimation SystemIRJET-  	  Course outcome Attainment Estimation System
IRJET- Course outcome Attainment Estimation SystemIRJET Journal
 
Online Helpdesk System
Online Helpdesk SystemOnline Helpdesk System
Online Helpdesk SystemJayant Gope
 

Similar to DATA AND BUSINESS PROCESS INTELLIGENCE (20)

Sport.net(2).doc
Sport.net(2).docSport.net(2).doc
Sport.net(2).doc
 
Android technical quiz app
Android technical quiz appAndroid technical quiz app
Android technical quiz app
 
FINAL REPORT DEC
FINAL REPORT DECFINAL REPORT DEC
FINAL REPORT DEC
 
Student portal system application -Project Book
Student portal system application -Project BookStudent portal system application -Project Book
Student portal system application -Project Book
 
Minor Project Synopsis on Data Structure Visualizer
Minor Project Synopsis on Data Structure VisualizerMinor Project Synopsis on Data Structure Visualizer
Minor Project Synopsis on Data Structure Visualizer
 
Online Job Portal
Online Job PortalOnline Job Portal
Online Job Portal
 
Project for Student Result System
Project for Student Result SystemProject for Student Result System
Project for Student Result System
 
E learning project report (Yashraj Nigam)
E learning project report (Yashraj Nigam)E learning project report (Yashraj Nigam)
E learning project report (Yashraj Nigam)
 
Minor project report format for 2018 2019 final
Minor project report format for 2018 2019 finalMinor project report format for 2018 2019 final
Minor project report format for 2018 2019 final
 
report-1.pdf
report-1.pdfreport-1.pdf
report-1.pdf
 
Project Report
Project ReportProject Report
Project Report
 
3 job adda doc 1
3 job adda doc 13 job adda doc 1
3 job adda doc 1
 
3 job adda doc 1
3 job adda doc 13 job adda doc 1
3 job adda doc 1
 
Report on design and development of low cost 3d printer
Report on design and development of low cost 3d printerReport on design and development of low cost 3d printer
Report on design and development of low cost 3d printer
 
AIRPORT MANAGEMENT USING FACE RECOGNITION BASE SYSTEM
AIRPORT MANAGEMENT USING FACE RECOGNITION BASE SYSTEMAIRPORT MANAGEMENT USING FACE RECOGNITION BASE SYSTEM
AIRPORT MANAGEMENT USING FACE RECOGNITION BASE SYSTEM
 
TY BSc.IT Blackbook Cover page
TY BSc.IT  Blackbook   Cover pageTY BSc.IT  Blackbook   Cover page
TY BSc.IT Blackbook Cover page
 
Study space(report)
Study space(report)Study space(report)
Study space(report)
 
E filling system (report)
E filling system (report)E filling system (report)
E filling system (report)
 
IRJET- Course outcome Attainment Estimation System
IRJET-  	  Course outcome Attainment Estimation SystemIRJET-  	  Course outcome Attainment Estimation System
IRJET- Course outcome Attainment Estimation System
 
Online Helpdesk System
Online Helpdesk SystemOnline Helpdesk System
Online Helpdesk System
 

DATA AND BUSINESS PROCESS INTELLIGENCE

  • 1. DATA AND BUSINESS PROCESS INTELLIGENCE PENTAHO PLATFORM DEVELOPED AT: BHAT, GANDHINAGAR-382428 DEVELOPED BY: BHAGAT FARIDA H. SINGH SWATI 11ITUOS079 11ITUOS068 GUIDED BY: INTERNAL GUIDE EXTERNAL GUIDE PROF. R.S. CHHAJED MR. VIJAY PATEL Department of Information Technology. Faculty of Technology, Dharmsinh Desai University, College Road, Nadiad- 387001.
  • 2. DDU (Faculty of Tech., Dept. of IT) i CANDIDATE’S DECLARATION We declare that the final semester report entitled “DATA AND BUSINESS PROCESS INTELLIGENCE” is our own work conducted under the supervision of the external guide MR. Vijay Patel, Institute for Plasma Research, Bhat, Gandhinagar and internal guide Prof. R.S. Chhajed, Faculty of Technology, DDU, Nadiad. We further declare that to the best of our knowledge the report for B.TECH SEM-VIII does not contain part of the work which has been submitted either in this or any other university without proper citation. Farida Bhagat H. Branch: Information Technology Student ID: 11ITUOS079 Roll: IT-07 Singh Swati Branch: Information Technology Student ID: 11ITUOS068 Roll: IT-124 Submitted To: PROF. R.S. CHHAJED, Department of Information Technology, Faculty of Technology, Dharmsinh Desai University, Nadiad
  • 3. DDU (Faculty of Tech., Dept. of IT) ii
  • 4. DDU (Faculty of Tech., Dept. of IT) iii DHARMSINH DESAI UNIVERSITY NADIAD-387001, GUJARAT CERTIFICATE This is to certify that the project entitled “DATA AND BUSINESS PROCESS INTELLIGENCE” is a bonafied report of the work carried out by 1) Miss BHAGAT FARIDA H., Student ID No: 11ITUOS079 2) Miss SINGH SWATI, Student ID No: 11ITUOS068 of Department of Information Technology, semester VIII, under the guidance and supervision for the award of the degree of Bachelor of Technology at Dharmsinh Desai University, Gujarat. They were involved in Project training during academic year 2013- 2014. Prof. R.S.Chhajed HOD, Department of Information Technology, Faculty of Technology, Dharmsinh Desai University, Nadiad Date:
  • 5. DDU (Faculty of Tech., Dept. of IT) iv ACKNOWLEDGEMENTS We are grateful to Mr. Amit Srivastava (Institute for Plasma Research) for giving us this opportunity to work under the guidance of prominent Solution Expert in the field of Software Engineering and also providing us with the required resources at the institute. We are also thankful to Mr. Vijay Patel (Institute for Plasma Research) for guiding us in our project and sharing valuable knowledge with us. It gives us immense pleasure and satisfaction in presenting this report of Project undertaken during the 8th semester of B.Tech. As it is the first step into our professional life, we would like to take this opportunity to express our sincere thanks to several people, without whose help and encouragement, it would be impossible for us to carry out this desired work. We would like to express thanks to our Head of Department Prof. R. S. Chhajed who gave us an opportunity to undertake this work. We are grateful to him for his guidance in the development process. Finally we would like to thank all Institute of Plasma Research employees, all the faculty members of our college, friends and family members for providing their support and continuous encouragement throughout the project. Thank you Bhagat Farida H. Singh Swati
  • 6. DDU (Faculty of Tech., Dept. of IT) v TABLE OF CONTENTS ABSTRACT……………………………………………………………………………....1 COMPANY PROFILE………………………………………………………………......3 LIST OF FIGURES……………………………………………………………………...4 LIST OF TABLES……………………………………………………………………….6 1. INTRODUCTION…………………..……………………………………………….7 1.1 Project Details……………………………………………………………………7 1.2 Purpose…………………………………………………………………………....7 1.3 Scope………………………………………………………………………………7 1.4 Objective………………………………………………………………………….8 1.5 Technology and Literature Review……………………………………………..8 1.5.1 Alfresco ECM……………………………………………………………8 1.5.2 Pentaho Platform………………………………………………………...9 2. PROJECT MANAGEMENT………………………………………………………10 2.1 Feasibility Study………………………………………………………………...10 2.2 Project Planning………………………………………………………………...10 2.2.1 Project Development Approach……………………………………….10 2.2.2 Project Plan…………………………………………………………..…11 2.2.3 Milestones and Deliverables...……………………………………….…12 2.2.4 Project Scheduling………………………………………………….…..13 3. SYSTEM REQUIREMENTS STUDY………………………………………..….14 3.1 User Characteristics……………..……………………………………………..14 3.2 Hardware and Software Requirements…………………………………….…14 3.2.1 Hardware Requirements……………………………………………….14
  • 7. DDU (Faculty of Tech., Dept. of IT) vi 3.2.2 Software Requirements……………………………………………..…14 3.3 Constraints…………………………………………………………………...…15 3.3.1 Regulatory Policies…………………………………………………..…15 3.3.2 Hardware Limitations………………………………………………….15 3.3.3 Interfaces to Other Applications………………………………………15 3.3.4 CMIS……………………………………………………………………15 3.3.5 Parallel Operations……………………………………………………..16 3.3.6 Reliability Requirements………………………………………………16 3.3.7 Criticality of the Application………………………………………….16 3.3.8 Safety and Security Considerations………………………………...…16 4. ALFRESCO ECM SYSTEM……………………………………………………...17 4.1 Introduction……………………………………………………………………..17 4.2 Alfresco Overview………………………………………………………………17 4.3 Architecture...…………………………………………………………………...19 4.3.1 Client.……………………………………………………………………19 4.3.2 Server……………………………………………………………………19 4.4 Data Storage in Alfresco……………………………………………………….21 4.5 Relationship Diagrams…………………………………………………………21 5. TRANSFORMATION PHASE……………..…………………………………….24 5.1 Introduction…………………………………………………………………….24 5.2 Pentaho Data Integration Tool….……………………………………………..24 5.2.1 Introduction…………………………………………………………….24 5.2.2 Why Pentaho?..........................................................................................25 5.2.2.1 JasperSoft vs Pentaho vs BIRT……………………………………25 5.2.2.2 Conclusion…………………………………………………………..26 5.2.3 Components of Pentaho………………………………………………..27 5.3 Alfresco Audit Analysis and Reporting Tool………………...……………….28 5.3.1 Introduction…………………………………………………………….28 5.3.2 Working and Installation of A.A.A.R. ..................................................29
  • 8. DDU (Faculty of Tech., Dept. of IT) vii 5.3.2.1 Pre Requisites……………………………………………………….30 5.3.2.2 Enabling Alfresco Audit Service…………………………………...30 5.3.2.3 Data Mart Creation and Configuration………………………...…30 5.3.2.4 PDI Repository Setting..……………………………………………31 5.3.2.5 First Import………………………………………………………....36 5.3.3 Audit Data Mart………………………………………………………...36 5.3.4 Dimension Tables……………………………………………………….37 5.4 Transformations Using Spoon…………………………………………………38 5.5 Example Transformations………..………………………………………….…38 6. REPORTING PHASE……………..………………………………………….……42 6.1 What is a Report?.……………………………………………………………...42 6.2 Pentaho Report Designer Tool….……………………………………………...42 6.2.1 Introduction……………………………………………………………..42 6.2.2 Working of Pentaho Designer………………………………………….43 6.3 Example Reports………..………………………………………………………44 7. PUBLISHING PHASE……………..………………………………………………46 7.1 Introduction………………..…………………………………………………...46 7.2 Pentaho BI Server………...…………………………………………………….46 7.2.1 Introduction……………………………………………………………...4 6 7.2.2 Example Published Reports……………………………………………47 7.3 Scheduling of Transformations…………….………………………………….50 8. TESTING……………..……………………………………………………………..51 8.1 Testing Strategies….…………………………………………………………....51 8.2 Testing Methods………………………………………………………………...52 8.3 Test Cases……………………………………………………………………….53 8.3.1 User Login and Functionality of Report………………………………53 8.3.2 Viewing Documents, Folders, Permissions, Audits…………………...54
  • 9. DDU (Faculty of Tech., Dept. of IT) viii 9. USER MANUAL……………………………………………………………………55 9.1 Description………………………………………………………………………55 9.2 Login Page………………………………………………………………………55 9.3 View Reports……………………………………………………………………57 9.4 Scheduling………………………………………………………………………59 9.5 Administration……………………………………………………………….....62 10. LIMITATIONS AND FUTURE ENHANCEMENTS……………………………64 10.1 Limitations……………………………………………………………………..64 10.2 Future Enhancements…………………………………………………………64 11. CONCLUSION AND DISCUSSION……………………………………………...65 11.1 Self Analysis of Project Viabilities……………………………………………65 11.1.1 Self Analysis……………………………………………………………...65 11.1.2 Project Viabilities………………………………………………….…….65 11.2 Problems Encountered and Possible Solutions……………………………...65 11.3 Summary of Project Work…………………………………………………...66 12. REFERENCES…………………………………………………………………….68
  • 10. Abstract DDU (Faculty of Tech., Dept. of IT) 1 ABSTRACT Design and implement a platform for Data and Process intelligence tool IPR has selected Alfresco, an Enterprise Content Management (ECM), as an Electronic Document and Record Management System (EDRMS). Alfresco do not have powerful reporting functionality and honestly, it is not its job. Unfortunately, the need for powerful reporting is still there and most of the answers are tricky solutions, quite hard to manage and scale. Alfresco ECM has a detailed audit service that exposes a lot of (potentially) useful information. Alfresco is integrated with Activiti, a Business Process Management (BPM) Engine. It also has auditing functionality and exposing the audit data related to processes and tasks. Data and Process Intelligent tool (the project) will be divided into two parts. The first part will be Alfresco Data Integration which will provided a solution to extract, transform, and load (ETL) data (document/folder/process/task) together with the audit data at a very detailed level in a central warehouse. On top of that, it will provide the data cleansing and merging functionality and if needed convert it in to OLAP format for efficient analysis. The second part will be the reporting functionality. The goal will be generic reporting tool useful to the end-user in a very easy way. The data will be published in reports in well-known formats (pdf, Microsoft Excel, csv, etc.) and stored directly in Alfresco as static documents organized in folders. To achieve above goal, Alfresco will be integrated with a powerful open source data integration and reporting tool. The necessary data from the Alfresco Repository will be extracted, transformed, merged/integrated and loaded in the data warehouse. The necessary schema transformation (for example OLTP to OLAP) will be applied to increase the efficiency. The solution will be scalable
  • 11. Abstract DDU (Faculty of Tech., Dept. of IT) 2 and generic Reporting System with an open window on the Business Intelligence world. Saying that, the solutions will be suitable also for publishing (static) reports containing not only audit data coming from Alfresco but also Key Performance Indicators (KPIs), analysis and dashboards coming from a complete Enterprise Data Warehouse.
  • 12. Company Profile DDU (Faculty of Tech., Dept. of IT) 3 COMPANY PROFILE Institute for Plasma Research (IPR) is an autonomous physics research institute located in Gandhinagar, India. The institute is involved in research in aspects of plasma science including basic plasma physics, research on magnetically confined hot plasmas and plasma technologies for industrial applications. It is a large and leading plasma physics organization in India. The institute is mainly funded by Department of Atomic Energy. IPR is playing major scientific and technical role in Indian partnership in the international fusion energy initiative ITER (International Thermonuclear Experimental Reactor). IPR is now internationally recognized for its contributions to fundamental and applied research in plasma physics and associated technologies. It has a scientific and engineering manpower of 200 with core competency in theoretical plasma physics, computer modeling, superconducting magnets and cryogenics, ultra high vacuum, pulsed power, microwave and RF, computer-based control and data acquisition and industrial, environmental and strategic plasma applications. The Centre of Plasma Physics - Institute for Plasma Research has active collaboration with the following Institutes/ Universities: Bhabha Atomic Research Centre, Bombay Raja Ramanna Centre for Advanced Technology, Indore IPP, Juelich, Germany; IPP, Garching, Germany Kyushu University, Fukuoka, Japan Physical Research Laboratory, Ahmedabad National Institute for Interdisciplinary Science and Technology, Bhubaneswar Ruhr University Bochum, Bochum, Germany Saha Institute of Nuclear Physics, Calcutta St. Andrews University, UK Tokyo Metropolitan Institute of Technology, Tokyo University of Bayreuth, Germany; University of Kyoto, Japan.
  • 13. List of Figures DDU (Faculty of Tech., Dept. of IT) 4 LIST OF FIGURES 1. MVC Architecture…………….……………………………………….Fig 1.1 2. Flowchart of the project……………………………………………….Fig 2.1 3. Gantt Chart…………………………………………………………….Fig 2.2 4. Alfresco Icon…………………………………………………………...Fig 4.1 5. Uses of Alfresco ECM…………………….………………………...…Fig 4.2 6. Alfresco Architecture……………………….…………………………Fig 4.3 7. Relational Diagrams (users, documents and folders)……………..…Fig 4.4 8. Relational Diagrams (permissions)…………………………………...Fig 4.5 9. Relational Diagrams (audits)………………………………………….Fig 4.6 10. Pentaho Data Integration Icon……………………………………..…Fig 5.1 11. Pentaho Icon…………………………………………………………...Fig 5.2 12. A.A.A.R. Icon………………………………………………………….Fig 5.3 13. Working of A.A.A.R…………………………………………………..Fig 5.4 14. PDI Repository Settings Step 1……...………………………………..Fig 5.5 15. PDI Repository Settings Step 2……...………………………………..Fig 5.6 16. PDI Repository Settings Step 3……...………………………………..Fig 5.7 17. PDI Repository Settings Step 4……...………………………………..Fig 5.8 18. PDI Repository Settings Step 5……...………………………………..Fig 5.9 19. PDI Repository Settings Step 6…….………………………………..Fig 5.10 20. PDI Repository Settings Step 7….....………………………………..Fig 5.11 21. PDI Repository Settings Step 8….....………………………………..Fig 5.12 22. Audit Data Mart……………………………………………………...Fig 5.13 23. Dimension Tables…………………………………………………….Fig 5.14 24. Document Information Transformation…………...……………….Fig 5.15 25. Document Permission Transformation…………...………………...Fig 5.16 26. Folder Information Transformation…………….....……………….Fig 5.17
  • 14. List of Figures DDU (Faculty of Tech., Dept. of IT) 5 27. Folder Permission Transformation………………...……………….Fig 5.18 28. User Information Transformation……………….....……………….Fig 5.19 29. Pentaho Reporting Tool Icon……………………………………..…..Fig 6.1 30. Document Information Report…….……..………...………………....Fig 6.2 31. Document Permission Report……...……..………...………………....Fig 6.3 32. Folder Information Report…….……..…..………...………………....Fig 6.4 33. Folder Permission Report………….……..………...………………....Fig 6.5 34. User Information Report…………..……..………...………………....Fig 6.6 35. Pentaho BI Server Icon……………………………………….……….Fig 7.1 36. Document Information Report…….……..………...………………....Fig 7.2 37. Document Permission Report……...……..………...………………....Fig 7.3 38. Folder Information Report…..…….……..………...………………....Fig 7.4 39. Folder Permission Report.……...….……..………...………………....Fig 7.5 40. User Information Report…………..……..………...………………....Fig 7.6 41. Scheduling of Transformations…...…………………………………..Fig 7.7 42. Login Step 1………………………………………………...………….Fig 9.1 43. Login Step 2………………………………………………...………….Fig 9.2 44. Login Step 3………………………………………………...………….Fig 9.3 45. View Reports Step 1………………...…………………………………Fig 9.4 46. View Reports Step 1………………...…………………………………Fig 9.5 47. Scheduling Page………...……………………………………………...Fig 9.6 48. Administration Page……………….…………………………………..Fig 9.7
  • 15. List of Tables DDU (Faculty of Tech., Dept. of IT) 6 LIST OF TABLES 1. Milestones and Deliverables……….……………………………… Table 2.1 2. Project Scheduling Table…………………………………………...Table 2.2 3. Test Case 1…………………………………………………………..Table 8.1 4. Test Case 2…………………………………………………………..Table 8.2 5. Scheduling Options………………..………………………………..Table 9.1 6. Scheduling Controls………………..……………………………….Table 9.2 7. Administration Options…………………………………………….Table 9.3
  • 16. Introduction DDU (Faculty of Tech., Dept. of IT) 7 INTRODUCTION 1.1PROJECT DETAILS Institute of Plasma Research has selected Alfresco, an Enterprise Content Management (ECM), as an Electronic Document and Record Management System (EDRMS). Alfresco do not have powerful reporting functionality. Thus, IPR requires a reporting tool to present the various details related to metadata of the documents and folders (folders are used to organize documents), access control applied on documents and folders. Additional analyses (like most active user, most active documents in last week, months etc.) are required on audit trailing data generated by the alfresco. Some Key Performance Indicators (KPI) needs to be generated of the document review and approval process. Possibilities to create and exports reports in well-known formats (pdf, Microsoft Excel, csv, etc.) needs to be provided. There will be a central administrator who has the possibility to configure the access rights on the reports for end users. Additionally end users shall have the possibility to subscribe the reports and schedule the report generation and send it via E-mail as attachment in preferred format. 1.2PURPOSE This system needs to be developed to enhance the way of looking at a traditional document management system and to make it more user-friendly. Along with all the features, we need a few customizations for the better usability of the resources. With these powerful reporting tools, it will become easy and secure to understand the files and documents in the institute. Also, it would help in decision making so as to what steps have to be taken on the basis of the reports generated. 1.3SCOPE The scope of the current project is just to implement a framework/deployment
  • 17. architecture using a BI toolset and test it by integrating it with Alfresco. An Alfresco data mart will be created and used for developing analysis reports related to the document management system. The reports will be made available securely over the internet to the employees of the institute, collaborators and contractors. In future, the generic reporting architecture implemented as part of this project will be used and extended into a full Data Warehouse solution by integrating and merging other data management tools of IPR. The full DW solution is out of the scope of this project. 1.4 OBJECTIVE The objective of this project is to improve the visibility of the document management system and enhance decision making. Alfresco is a powerful content management system. Unfortunately, the need for powerful reporting remains, and most existing answers are workarounds that are hard to manage and scale. To achieve the above goal, Alfresco will be integrated with a powerful open source data integration and reporting tool. The necessary data from the Alfresco repository will be extracted, transformed, merged/integrated and loaded into the data warehouse. The necessary schema transformation (for example OLTP to OLAP) will be applied to increase efficiency. The solution will be a scalable and generic reporting system with an open window on the Business Intelligence world. That said, the solution will also be suitable for publishing (static) reports containing not only audit data coming from Alfresco but also Key Performance Indicators (KPIs), analyses and dashboards coming from a complete Enterprise Data Warehouse. 1.5 TECHNOLOGY AND LITERATURE REVIEW 1.5.1 ALFRESCO ECM An open source, Java-based Enterprise Content Management system (ECM) named Alfresco is selected as the document repository. It uses the MVC architecture. Model–view–
  • 18. Introduction DDU (Faculty of Tech., Dept. of IT) 9 controller (MVC) is a software pattern for implementing user interfaces. It divides a given software application into three interconnected parts, so as to separate internal representations of information from the ways that information is presented to or accepted from the user. Model: It consists of application data, business rules, logic and functions. Here, XML is used for the same. View: It is the output representation of information. Here, FTL is used for the same. Controller: It accepts input and converts it to commands for the model or view. Figure 1.1 MVC Architecture 1.5.2 PENTAHO PLATFORM Pentaho is a company that offers Pentaho Business Analytics, a suite of open source Business Intelligence (BI) products which provide data integration, OLAP services, reporting, dashboard, data mining and ETL capabilities. Pentaho was founded in 2004 by five founders and is headquartered in Orlando, FL, USA. Pentaho software consists of a suite of analytics products called Pentaho Business Analytics, providing a complete analytics software platform. This end-to-end solution includes data integration, metadata, reporting, OLAP analysis, ad-hoc query, dashboards, and data mining capabilities. The platform is available in two offerings: a community edition (CE) and an enterprise edition (EE).
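The model/view/controller separation described above can be illustrated with a minimal sketch. The class and method names here are hypothetical, chosen only to mirror the three roles; Alfresco's actual stack uses XML descriptors (model), FreeMarker templates (view) and Java web scripts (controller).

```python
# Toy MVC sketch: the model owns data and business rules, the view owns
# presentation, and the controller mediates between them.

class DocumentModel:
    """Model: holds application data and business rules."""
    def __init__(self):
        self._docs = {}

    def add(self, name, size_kb):
        if size_kb <= 0:                      # a business rule lives in the model
            raise ValueError("size must be positive")
        self._docs[name] = size_kb

    def all(self):
        return dict(self._docs)


class DocumentView:
    """View: output representation of the model's data."""
    def render(self, docs):
        return "\n".join(f"{n}: {s} KB" for n, s in sorted(docs.items()))


class DocumentController:
    """Controller: turns user input into model updates and view renders."""
    def __init__(self, model, view):
        self.model, self.view = model, view

    def upload(self, name, size_kb):
        self.model.add(name, size_kb)

    def list_documents(self):
        return self.view.render(self.model.all())
```

Because the view never touches `_docs` directly, the storage format or the rendering can be swapped independently, which is exactly the separation of concerns MVC is meant to buy.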
  • 19. PROJECT MANAGEMENT 2.1 FEASIBILITY STUDY A feasibility study includes an analysis and evaluation of a proposed project to determine whether it is technically feasible, feasible within the estimated cost, and profitable. The following software has to be installed for the project: 1. Alfresco Enterprise Content Management 2. PostgreSQL and SQuirreL 3. Pentaho Data Integration Tool (K.E.T.T.L.E.) 4. Alfresco Audit Analysis and Reporting Tool (A.A.A.R.) 5. Pentaho Reporting Tool 6. Pentaho BI Server The study concluded that the hardware cost required for one database server plus two web servers is acceptable and that the 500 GB of file storage for the final product is feasible. 2.2 PROJECT PLANNING 2.2.1 Project Development Approach We have used the Agile methodology. After the feasibility study, the first task was to create a basic flowchart charting the flow of the project, to serve as a mind map. The base database system is Alfresco, from which tables are loaded using PostgreSQL or SQuirreL. The number of tables is reduced to create a staging data warehouse. After transformations on these tables using the Pentaho Data Integration tool, reports are created with Pentaho Reporting on the BI server, according to the given requirements of the project.
  • 20. Figure 2.1 Flowchart of the Project (Alfresco ECM, plus other optional databases, feeds a staging data warehouse via ETL logic built with Kettle; the Data Mart feeds reports designed in Pentaho Report Designer and published to the Pentaho BI Server) Once the flowchart was made, we proceeded to the development part keeping the flowchart in mind. We started by studying the Alfresco Enterprise Content Management System and then moved on to the Pentaho tools. We also installed PostgreSQL and SQuirreL to work with the queries. 2.2.2 Project Plan 1. Gather the definition. 2. Check whether the definition is feasible within the given deadline. 3. Requirement gathering. 4. Study and analysis of the gathered requirements. 5. Transformation phase. 6. Reporting phase. 7. Deployment.
  • 21. 2.2.3 Milestones and Deliverables Table 2.1 Milestones and Deliverables Phase / Deliverables / Purpose: Abstract and System Feasibility Study – gained a complete understanding of the flow of the project – to be familiar with the flow of the project. Requirement Gathering, Software Installation and Understanding of Technology – studied the ECM, its architecture and how data is stored in the Alfresco repository – to get familiar with the Alfresco platform. Study of the Platform and its Tools – studied and used the three tools, namely Pentaho Data Integration, Pentaho Report Designer and Pentaho BI Server – to better understand the Pentaho platform and the tools and plug-ins associated with it. Transformation Phase – completed the transformation phase with the help of A.A.A.R., developed some custom ETL and scheduled the transformation jobs to run during nights – to build the staging data warehouse. Reporting Phase – made the reports according to the user's requirements – to complete the reporting phase.
  • 22. Deployment – published the reports on the server in different output types, like PDF, CSV etc. – to deploy the system on the web and hence complete the project. 2.2.4 Project Scheduling In project management, a schedule is a listing of a project's milestones, activities, and deliverables, usually with intended start and finish dates. Table 2.2 Project Scheduling Table Figure 2.2 Gantt Chart (phases plotted from 8-Dec to 23-Mar: Abstract and Feasibility Study, Requirement Gathering, Study of Database Management System, Study of platform and associated tools, Transformation Phase, Reporting Phase, Deployment)
  • 23. SYSTEM REQUIREMENT STUDY 3.1 USER CHARACTERISTICS The system is made available on the web, so it can be accessed from anywhere. The users will be scientists, researchers, engineers and other employees of the institute. They log in with their respective credentials. 3.2 HARDWARE AND SOFTWARE REQUIREMENTS 3.2.1 Server and Client Side Hardware Requirements: RAM: 4 GB Hard disk: 40 GB Processor: 2.4 GHz File storage: 500 GB 3.2.2 Server and Client Side Software Requirements: Windows or Linux based system PostgreSQL database SQuirreL database client tool Alfresco ECM Pentaho Community Edition 5.0 (PDI, Reporting Tool, BI Server) Alfresco Audit Analysis and Reporting tool Notepad++ 3.3 CONSTRAINTS 3.3.1 Regulatory Policies
  • 24. Regulatory policies, or mandates, limit the discretion of individuals and agencies, or otherwise compel certain types of behavior. These policies are generally thought to be best applied when good behavior can be easily defined and bad behavior can be easily regulated and punished through fines or sanctions. IPR is very strict about its policies and ensures that all employees follow them properly. 3.3.2 Hardware Limitations To ensure the smooth working of the system, we need to meet the minimum hardware requirements: at least 2 GB RAM, a 40 GB hard disk and a 2.4 GHz processor. All these requirements are readily available, so there are no real hardware limitations. 3.3.3 Interfaces to Other Applications The ETL tools of BI suites generally support a number of standards-based protocols, including ODBC, JDBC, REST, web scripts, FTP and many more, for extracting data from multiple sources. It is easy to integrate any data management application using the supported input protocols. We have used the CMIS (Content Management Interoperability Services) and JDBC protocols for Alfresco data integration. The published reports will be integrated back into Alfresco over HTTP. Single sign-on will be implemented by the IT department to provide transparent access to reports from Alfresco or any other web based tools. 3.3.4 CMIS CMIS (Content Management Interoperability Services) is an OASIS standard designed for the ECM industry. It enables access to any content management repository that implements the CMIS standard. We can consider using CMIS if an application needs programmatic access to the content repository.
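Programmatic access to audit data can be sketched with the standard library alone. The endpoint path below matches the audit query URL configured later in this report; the sample payload is purely illustrative, as the exact response fields vary between Alfresco versions.

```python
import json
from urllib.parse import urlencode

# Sketch of building the Alfresco audit query URL and picking users out of a
# response. The BASE path is the web script configured in the A.A.A.R. setup;
# the payload shape used here is an assumption for illustration only.

BASE = "http://localhost:8080/alfresco/service/api/audit/query/alfresco-access"

def audit_query_url(verbose=True, limit=100000):
    """Build the audit query URL with the same parameters used by A.A.A.R."""
    return BASE + "?" + urlencode({"verbose": str(verbose).lower(),
                                   "limit": limit})

def users_from_audit(payload):
    """Return the distinct users seen in an audit response payload (JSON)."""
    data = json.loads(payload)
    return sorted({e["user"] for e in data.get("entries", []) if "user" in e})

# Hypothetical sample response, standing in for a live HTTP call.
sample = json.dumps({"count": 2, "entries": [
    {"id": 1, "user": "admin", "time": "2014-03-01T10:00:00"},
    {"id": 2, "user": "swati", "time": "2014-03-01T10:05:00"},
]})
```

In a real deployment the payload would come from an authenticated HTTP GET against `audit_query_url()` rather than a hard-coded sample.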
  • 25. 3.3.5 Parallel Operations This is a document management system in which around 300 employees will work concurrently. They can upload a document, review it, modify it, start a workflow and even delete it. Parallel operations include allowing more than a single employee to read a document. A workflow can be started on any document, and any document can be in any number of workflows. Parallel editing of a document is restricted by providing check-in and check-out functionality. 3.3.6 Reliability Requirements Quality hardware, software and frameworks with valid licenses are required for better reliability. 3.3.7 Criticality of the Application Criticality of the module was one of the main constraints. The system was being developed for users who are mainly employees of the government sector. They had certain rigid requirements which had to be taken care of during development. Any change in the pattern of their workflow would lead to extremely critical conditions. Thus this was a matter of concern and served as one of the deep-rooted constraints. 3.3.8 Safety and Security Considerations The system provides tight security for user accounts. Accounts are secured by a password mechanism; passwords are encrypted and stored in the database. Also, the repository is accessible for modifications only to some privileged users.
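The check-in/check-out restriction described in the parallel-operations constraint above can be sketched as a small lock registry: any number of users may read, but only the user who checked a document out may modify it until they check it back in. The class and method names here are hypothetical, not Alfresco API calls.

```python
# Toy check-out/check-in registry illustrating the single-editor rule.

class CheckedOutError(Exception):
    pass

class DocumentLockRegistry:
    def __init__(self):
        self._owners = {}          # doc id -> user holding the working copy

    def check_out(self, doc_id, user):
        """Claim exclusive edit rights, failing if another user holds them."""
        owner = self._owners.get(doc_id)
        if owner is not None and owner != user:
            raise CheckedOutError(f"{doc_id} is checked out by {owner}")
        self._owners[doc_id] = user

    def check_in(self, doc_id, user):
        """Release edit rights; only the current holder may do this."""
        if self._owners.get(doc_id) != user:
            raise CheckedOutError(f"{user} does not hold {doc_id}")
        del self._owners[doc_id]

    def can_edit(self, doc_id, user):
        return self._owners.get(doc_id) == user
```

Reads never consult the registry, which is what allows the 300 concurrent users mentioned above to view documents freely while edits stay serialized.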
  • 26. ALFRESCO ECM SYSTEM 4.1 INTRODUCTION Figure 4.1 Alfresco Icon Alfresco is a free enterprise content management system for both Windows and Linux operating systems, which manages all the content within an enterprise and provides services to manage this content. It comes in three flavors: Community edition – free software with some limitations; no clustering feature is present. (We have used the community edition of Alfresco for this project, since we just need to perform ETL logic on the database and do not use the advanced functionalities.) Enterprise edition – commercially licensed and suitable for users who require more advanced functionality. Cloud edition – a SaaS (Software as a Service) version of Alfresco. We use the Alfresco database as our base database from which we extract information and create a warehouse. For further transformation we use SQuirreL and K.E.T.T.L.E., a.k.a. the Pentaho Data Integration tool. 4.2 ALFRESCO OVERVIEW There are various ways in which Alfresco can be used for storing files and folders, and it can also be used by different systems. It is basically a repository: a central location where data are stored and managed.
  • 27. Alfresco ECM System DDU (Faculty of Tech., Dept. of IT) 18 Few of the ways in which Alfresco can be used are: Figure 4.2 Uses of Alfresco ECM Alfresco ECM is a useful tool to store files and folders of different types. Few of the uses of Alfresco are:- Document Management Records Management Shared drive replacement Enterprise portals and intranets Web Content Management Knowledge Management Information Publishing Case Management 4.3 ARCHITECTURE
  • 28. Alfresco ECM System DDU (Faculty of Tech., Dept. of IT) 19 Alfresco has a layered architecture with mainly three parts:- 1. Alfresco Client. 2. Alfresco Content Application Server 3. Physical Storage 4.3.1 Client Alfresco offers two primary web-based clients: Alfresco Share and Alfresco Explorer. Alfresco Share can be deployed to its own tier separate from the Alfresco content application server. It focuses on the collaboration aspects of content management and streamlining the user experience. Alfresco Share is implemented using Spring Surf and can be customized without JSF knowledge. Alfresco Explorer is deployed as part of the Alfresco content application server. It is a highly customizable power-user client that exposes all features of the Alfresco content application server and is implemented using Java Server Faces (JSF). Clients also exist for portals, mobile platforms, Microsoft Office, and the desktop. A client often overlooked is the folder drive of the operating system, where users share documents through a network drive. Alfresco can look and act just like a folder drive. 4.3.2 Server The Alfresco content application server comprises a content repository and value-added services for building ECM solutions. Two standards define the content repository: CMIS (Content Management Interoperability Services) and JCR (Java Content Repository). These standards provide a specification for content definition and storage, content retrieval, versioning, and permissions. Complying with these standards provides a reliable, scalable, and efficient implementation. The Alfresco content application server provides the following categories of services built upon the content repository:
  • 29. 1. Content services (transformation, tagging, metadata extraction) 2. Control services (workflow, records management, change sets) 3. Collaboration services (social graph, activities, wiki) Clients communicate with the Alfresco content application server and its services through numerous supported protocols. HTTP and SOAP offer programmatic access, while CIFS, FTP, WebDAV, IMAP, and Microsoft SharePoint protocols offer application access. The Alfresco installer provides an out-of-the-box prepackaged deployment in which the Alfresco content application server and Alfresco Share are deployed as distinct web applications inside Apache Tomcat. Figure 4.3 Alfresco Architecture At the core of the Alfresco system is a repository supported by a server that persists content, metadata, associations, and full text indexes. Programming interfaces support multiple languages and protocols upon which developers can create custom applications and solutions. Out-of-the-box applications provide standard solutions such as document management and web content management.
  • 30. 4.4 DATA STORAGE IN ALFRESCO There are a total of 97 tables in the database, mainly divided into two parts: the Alfresco databases and the activity workflows. The Alfresco database is further divided into three parts: nodes, access and properties. 1. Node is the parent entity of the database, holding all identity numbers. 2. Access tables deal with the security aspects of Alfresco, like permissions and last modification dates. 3. Properties tables store information about which kind of data is stored: its size, type, ranges etc. 4.5 RELATIONSHIP DIAGRAMS After studying the tables, we created the relationship diagram of the tables using SQuirreL. Since the relational diagram for the Alfresco system comprises 97 tables, we selected the vital ones, such as: alf_node – holds the identity of the other tables. alf_qname – defines a valid identifier for each attribute. alf_node_properties – connects the node and qname tables and stores all properties of each node id. alf_access_control_list – specifies who can do what with an object in the repository, i.e. gives the permission information.
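The way these three tables relate can be sketched with an in-memory SQLite database. The column lists below are trimmed to the essentials for illustration; the real Alfresco schema has many more columns per table.

```python
import sqlite3

# Simplified sketch of the alf_node / alf_qname / alf_node_properties join:
# properties hang off a node, and each property's meaning is identified by a
# qname row.

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE alf_node (id INTEGER PRIMARY KEY);
CREATE TABLE alf_qname (id INTEGER PRIMARY KEY, local_name TEXT);
CREATE TABLE alf_node_properties (
    node_id  INTEGER REFERENCES alf_node(id),
    qname_id INTEGER REFERENCES alf_qname(id),
    string_value TEXT
);
INSERT INTO alf_node VALUES (1);
INSERT INTO alf_qname VALUES (10, 'name'), (11, 'creator');
INSERT INTO alf_node_properties VALUES (1, 10, 'report.pdf'), (1, 11, 'admin');
""")

# alf_node_properties is the bridge: join through it to read a node's
# properties by human-readable qname.
rows = cur.execute("""
    SELECT q.local_name, p.string_value
    FROM alf_node n
    JOIN alf_node_properties p ON p.node_id = n.id
    JOIN alf_qname q ON q.id = p.qname_id
    WHERE n.id = 1
    ORDER BY q.local_name
""").fetchall()
```

This node-to-qname join is the pattern the transformation phase repeats, in Kettle, to flatten properties into report-friendly rows.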
  • 31. Alfresco ECM System DDU (Faculty of Tech., Dept. of IT) 22 Figure 4.4 Relation Diagrams for users, documents and folders Figure 4.5 Relational Diagrams for permissions
  • 32. Alfresco ECM System DDU (Faculty of Tech., Dept. of IT) 23 Figure 4.6 Relational Diagram for audits
  • 33. TRANSFORMATION PHASE 5.1 INTRODUCTION There are 97 tables in the Alfresco ECM system. To create a staging data warehouse, we first have to perform E.T.L. logic, i.e. Extract, Transform and Load. In computing, ETL refers to a process in database usage, and especially in data warehousing, that: extracts data from homogeneous or heterogeneous data sources; transforms the data into the proper format or structure for querying and analysis; and loads it into the final target (a database, more specifically an operational data store, data mart, or data warehouse). Usually the three phases execute in parallel: since data extraction takes time, a transformation process runs on the already received data while more is being pulled, and as soon as some data is ready for the target, loading kicks off without waiting for the completion of the previous phases. ETL systems commonly integrate data from multiple applications (systems), typically developed and supported by different vendors or hosted on separate computer hardware. The disparate systems containing the original data are frequently managed and operated by different employees. In our project, though, there is only one source from which data is extracted: Alfresco. 5.2 PENTAHO DATA INTEGRATION TOOL 5.2.1 Introduction Pentaho Data Integration (or Kettle) delivers powerful extraction, transformation,
  • 34. and loading (ETL) capabilities, using a metadata-driven approach. It prepares and blends data to create a complete picture of the business that drives actionable insights. The complete data integration platform delivers accurate, "analytics ready" data to end users from any source. Figure 5.1 Pentaho Data Integration Icon In particular, Pentaho Data Integration is used to extract Alfresco audit data into the Data Mart and to create the defined reports, uploading them back to Alfresco. 5.2.2 Why Pentaho? Figure 5.2 Pentaho Icon 5.2.2.1 Pentaho vs Jaspersoft vs BIRT Pentaho and Jaspersoft both provide the unique advantage of being cost effective, but they differ in features. Although Jaspersoft's report designer is comparatively better than Pentaho Report Designer, the dashboard capabilities of Pentaho are better in terms of functionality. This is because dashboard functionality is present only in the Enterprise edition of Jaspersoft, whereas in Pentaho it is accessible in the Community edition too.
  • 35. When it comes to Extract, Transform and Load (ETL) tools, the Pentaho Data Integrator is comparatively better, since Jaspersoft falls short of a few functions. For OLAP analysis, Pentaho's Mondrian engine has a stronger case than Jaspersoft. Pentaho users also have a huge set of choices in a plugin marketplace similar to the app stores of iOS and Android. To sum it up, Jaspersoft's focus is more on reporting and analysis, while Pentaho's focus is on data integration, ETL and workflow automation. BIRT has also emerged as an important business intelligence tool for those who are well versed in Java. BIRT is an Eclipse-based open source reporting system for web applications, especially those based on Java and Java EE; it consists of a report designer based on Eclipse and a runtime component that can be added to the app server. In terms of basic functionality, BIRT is on par with Pentaho and Jaspersoft, with perhaps a slight advantage as it is based on Eclipse. Apart from that, as a typical BI tool it is expected to cover the common chart types. Although BIRT covers most of them, it falls short of chart types like Ring, Waterfall, Step Area, Step, Difference, Thermometer and Survey scale, where Pentaho fills the gaps. 5.2.2.2 Conclusion Unlike the previous two tools, Pentaho is a complete BI suite covering operations from reporting to data mining. Its key component is Pentaho Reporting, which has a rich, enterprise-friendly feature set. Its BI Server, a J2EE application, also provides an infrastructure to run and view reports through a web-based user interface.
All three of these open source business intelligence and reporting tools provide a rich feature set ready for enterprise use. It is up to the end user to do a thorough comparison and select one of these tools. Major differences can be found in report presentation, with a focus on web or print, and in the availability of a report server. Pentaho distinguishes itself by being more than just a reporting tool, with a full suite of components (data mining and integration).
  • 36. Among organizations adopting Pentaho, one of the advantages reported is its low integration time and infrastructure cost compared to SAP BIA and SAS BIA, which are among the big players in Business Intelligence. The huge community support, available 24/7 through active forums, allows Pentaho users to discuss challenges and have their questions answered while using the tool. Its unlimited visualizations and data sources can handle any kind of data, coupled with a good tool set with wide applicability beyond just the base product. 5.2.3 COMPONENTS OF PENTAHO Kettle is a set of tools and applications which allows data manipulation across multiple sources. The main components of Pentaho Data Integration are: Spoon – a graphical tool that makes ETL transformations easy to design. It performs the typical data flow functions like reading, validating, refining, transforming, and writing data to a variety of different data sources and destinations. Transformations designed in Spoon can be run with Pan and Kitchen. Pan – an application dedicated to running data transformations designed in Spoon. Chef – a tool to create jobs which automate the database update process in a complex way. Kitchen – an application which executes jobs in batch mode, usually on a schedule, which makes it easy to start and control the ETL processing. Carte – a web server which allows remote monitoring of running Pentaho Data Integration ETL processes through a web browser. 5.3 ALFRESCO AUDIT ANALYSIS AND REPORTING TOOL 5.3.1 Introduction
  • 37. Alfresco is one of the most widely used open source content management systems, and though it is not part of its core, it is crucial to get metrics out of an Alfresco system. Figure 5.3 A.A.A.R. Icon To that end, a full-fledged audit layer was built on top of Alfresco using Pentaho. The principle is to build a data mart properly optimized for the information being extracted from the system, and to do all the analytics and discovery on top of that. For this one needs an ETL tool, and once the data mart is built, Pentaho is used for reporting and exploration on top of that data warehouse. This in-between tool is called A.A.A.R. – Alfresco Audit Analysis and Reporting. 5.3.2 Working and Installation of A.A.A.R. The Alfresco Content Management System can be seen as a primary source that generates only raw data. Pentaho, on the other hand, is a pure BI environment consisting of suitable integration and reporting tools. Thus, A.A.A.R. extracts audit data from the Alfresco E.C.M., stores the data in the Data Mart, creates reports in well-known formats and publishes them back into the Alfresco E.C.M.
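The extract/transform/load flow just described can be sketched with plain Python structures standing in for Alfresco (source) and the Data Mart (target). The function names are illustrative, not A.A.A.R. internals.

```python
# Minimal ETL sketch: extract raw audit rows, transform them into the shape
# the fact table needs, load them into the target.

def extract(audit_entries):
    """Extract: pull raw audit rows from the source system."""
    return list(audit_entries)

def transform(rows):
    """Transform: keep only the fields the fact table needs and split the
    timestamp into date/time parts for the date and time dimensions."""
    out = []
    for r in rows:
        date, time = r["time"].split("T")
        out.append({"user": r["user"], "action": r["action"],
                    "date": date, "time": time})
    return out

def load(rows, fact_table):
    """Load: append the transformed rows to the target fact table."""
    fact_table.extend(rows)
    return len(rows)

# Hypothetical source data standing in for the Alfresco audit feed.
source = [{"user": "admin", "action": "login", "time": "2014-03-01T09:30:00"}]
fact = []
loaded = load(transform(extract(source)), fact)
```

In the real pipeline each of these steps is a Kettle step or job rather than a Python function, but the date/time split shown here mirrors how the audit timestamp feeds the date and time dimensions of the Data Mart.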
  • 38. Figure 5.4 Working of A.A.A.R. Alfresco E.C.M. is, at the same time, the source and the target of the flow. As the source of the flow, Alfresco E.C.M. is enabled with the audit service to track all activities with detailed information about who did what on the system, and when. Logins (failed or succeeded), creation of content, creation of folders, and adding or removing of properties or aspects are only some examples of what is tracked by the audit service. 5.3.2.1 Prerequisites 1. Alfresco E.C.M. 2. PostgreSQL/MySQL 3. Pentaho Data Integration Tool 4. Pentaho Report Designer Tool 5.3.2.2 Enabling the Alfresco Audit Service The very first task is to activate the audit service in Alfresco by performing the following actions: 1. Stop Alfresco. 2. In '<Alfresco>/tomcat/shared/classes/alfresco-global.properties' append: # Alfresco Audit service audit.enabled=true audit.alfresco-access.enabled=true # Alfresco FTP service ## ATTENTION: Don't do it if already enabled! ftp.enabled=true ftp.port=8082
  • 39. 3. Start Alfresco. 4. Log in to Alfresco to generate the very first audit data. 5.3.2.3 Data Mart Creation and Configuration 1. Open a terminal. 2. For the PostgreSQL platform use: cd <PostgreSQL bin> psql –U postgres –f "<AAAR folder>/AAAR_DataMart.sql" (use 'psql.exe' on Windows and './psql' on Linux based platforms) 3. Exit. 4. Extract 'reports.zip' in the 'data-integration' folder. 'reports.zip' contains 5 files with the 'prpt' extension, each one containing one Pentaho Report Designer report. To keep report production simple, they are saved by default in the 'data-integration' folder. 5. Update the 'dm_dim_alfresco' table with the proper environment settings. Each row of the table represents one Alfresco installation; for that reason the table is defined with a single row by default, as described below: desc with value 'Alfresco'; login with value 'admin'; password with value 'admin'; url with value 'http://localhost:8080/alfresco/service/api/audit/query/alfresco-access?verbose=true&limit=100000'; is_active with value 'Y'. 6. Update the 'dm_reports' table with your target settings. 5.3.2.4 PDI Repository Settings The third task is to set up the Pentaho Data Integration jobs properly. 1. Open a terminal. 2. For the PostgreSQL platform use: cd <PostgreSQL bin> psql –U postgres –f "<AAAR folder>/AAAR_Kettle.sql"
  • 40. Transformation Phase DDU (Faculty of Tech., Dept. of IT) 31 (use ‘psql.exe’ on Windows platform and ‘./psql’ on Linux based platforms) 3.Exit 4. To set the Pentaho Data Integration repository: i. Open a new terminal. cd <data-integration> ii. Launch ‘Spoon.bat’ if you are on Windows platform or ‘./Spoon.sh’ if you are on Linux based platforms. iii. Click on the green plus to add a new repository and define a new repository connection in the database. Figure 5.5 Step 1 iv. Add a new database connection to the repository.
  • 41. Transformation Phase DDU (Faculty of Tech., Dept. of IT) 32 Figure 5.6 Step 2 v. If you choose a PostgreSQL platform set the parameters described below in the image. At the end push the test button to check the database connection. Figure 5.7 Step 3
  • 42. vi. Set the ID and Name fields and press the 'OK' button. Take care not to push the 'create or upgrade' button, otherwise the E.T.L. repository will be damaged. Figure 5.8 Step 4 vii. Connect with the login 'admin' and password 'admin' to test the connection. Figure 5.9 Step 5 viii. If everything succeeds, you will see the Pentaho Data Integration (Kettle) panel. From this panel, click on Tools -> Repository -> Explore.
  • 43. Figure 5.10 Step 6 ix. Click on the 'Connections' tab and edit (the pencil on the top right) the AAAR_DataMart connection. The image below shows the PostgreSQL case, but with MySQL it is exactly the same. Figure 5.11 Step 7
  • 44. x. Modify the parameters and click on the test button to check. If everything succeeds you can close all the dialogs. The image below shows the PostgreSQL case, but with MySQL it is exactly the same. Figure 5.12 Step 8 5.3.2.5 First Import Now you are ready to get the audit data into the Data Mart, create the reports and publish them to Alfresco. Open a terminal: cd <data-integration> kitchen.bat /rep:"AAAR_Kettle" /job:"Get all" /dir:/Alfresco /user:admin /pass:admin /level:Basic kitchen.bat /rep:"AAAR_Kettle" /job:"Report all" /dir:/Alfresco /user:admin /pass:admin /level:Basic Finally you can access Alfresco and look in the repository root, where the reports are uploaded by default. 5.3.3 Audit Data Mart On the other side of the represented flow, there is a database storing the
  • 45. extracted audit data organized in a specific Audit Data Mart. A Data Mart is a structure that is usually oriented to a specific business line or team and, in this case, represents the audited actions in the Alfresco E.C.M. Figure 5.13 Audit Data Mart 5.3.4 Dimension Tables The implemented Data Mart develops a single star schema having only one measure (the number of audited actions) and the dimensions listed below: 1. Alfresco instances, to manage multiple sources of auditing data. 2. Alfresco users, with a complete name. 3. Alfresco contents, complete with the repository path. 4. Alfresco actions (login, failedLogin, read, addAspect, etc.). 5. Date of the action, groupable by day, month and year. 6. Time of the action, groupable by minute and hour. Figure 5.14 Dimension Tables 5.4 TRANSFORMATIONS USING SPOON Spoon is the DI design tool component. The DI Server is a core component that executes data integration jobs and transformations using the Pentaho Data Integration engine. It also provides the services that allow you to schedule and monitor scheduled activities. Drag elements onto the Spoon canvas, or choose from a rich library of more than 200 pre-built steps, to create a series of data integration processing instructions. 5.5 EXAMPLE TRANSFORMATIONS A few of the transformations we have done using Spoon are listed below: 1. Document Information 2. Document Permission 3. Folder Information
  • 46. Transformation Phase DDU (Faculty of Tech., Dept. of IT) 37 6. Time of the action. Groupable in minute and hour. Figure 5.14 Dimension Tables 5.4 TRANSFORMATIONS USING SPOON The Spoon is the only DI design tool component. The DI Server is a core component that executes data integration jobs and transformations using the Pentaho Data Integration Engine. It also provides the services allowing you to schedule and monitor scheduled activities. Drag elements onto the Spoon canvas, or choose from a rich library of more than 200 pre-built steps to create a series of data integration processing instructions. 5.5 EXAMPLE TRANSFORMATION Few of the transformations we have done using Spoon are listed below:- 1. Document Information 2. Document Permission 3. Folder Information
  • 47. Transformation Phase DDU (Faculty of Tech., Dept. of IT) 38 4. Folder Permission 5. User Information Figure 5.15 Document Information Transformation Figure 5.16 Document Permission Transformation
  • 48. Transformation Phase DDU (Faculty of Tech., Dept. of IT) 39 Figure 5.17 Folder Information Transformation Figure 5.18 Folder Permission Transformation
  • 49. Transformation Phase DDU (Faculty of Tech., Dept. of IT) 40 Figure 5.19 User Information Transformation
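The single star schema described in section 5.3.4 (one measure, six dimensions) is what makes queries like "most active user" cheap. A simplified sketch using an in-memory SQLite database; the table and column names are stand-ins for the A.A.A.R. Data Mart tables, not their real names.

```python
import sqlite3

# Simplified star schema: one fact table (audited actions) joined to user and
# date dimension tables.

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_user (id INTEGER PRIMARY KEY, full_name TEXT);
CREATE TABLE dim_date (id INTEGER PRIMARY KEY, day TEXT, month TEXT, year TEXT);
CREATE TABLE fact_audit (user_id INTEGER, date_id INTEGER, action TEXT);
INSERT INTO dim_user VALUES (1, 'admin'), (2, 'swati');
INSERT INTO dim_date VALUES (1, '01', '03', '2014');
INSERT INTO fact_audit VALUES (1, 1, 'login'), (1, 1, 'read'), (2, 1, 'login');
""")

# "Most active user": count fact rows grouped by the user dimension.
rows = cur.execute("""
    SELECT u.full_name, COUNT(*) AS actions
    FROM fact_audit f
    JOIN dim_user u ON u.id = f.user_id
    GROUP BY u.full_name
    ORDER BY actions DESC
""").fetchall()
```

Swapping `dim_user` for `dim_date` in the join gives "most active day", which is why a star schema serves all the audit analyses listed in the introduction from one fact table.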
  • 50. REPORTING PHASE 6.1 WHAT IS A REPORT? In its most basic form, a report is a document that contains information for the reader. When speaking of computer generated reports, these documents refine data from various sources into a human readable form. Report documents make it easy to distribute specific, fact-based information throughout the company. Reports are also used by management in decision making. 6.2 PENTAHO REPORT DESIGNER TOOL 6.2.1 Introduction Pentaho Reporting is a suite of tools for creating pixel-perfect reports. With Pentaho Reporting, we are able to transform data into meaningful information. You can create HTML, Excel, PDF, text or printed reports; if you are a developer, you can also produce CSV and XML reports to feed other systems. Figure 6.1 Pentaho Reporting Tool Icon It helps transform all the data into meaningful information tailored to your audience, with a suite of open source tools that allows you to create pixel-perfect reports in PDF, Excel, HTML, Text, Rich Text File, XML and CSV formats.
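Of the output formats listed above, CSV is the simplest to produce, which is a sketch worth seeing: a report is just query rows serialized with a header. This uses only the standard library; a real deployment would use Pentaho Reporting's own CSV export instead.

```python
import csv
import io

# Minimal CSV "report": a header row followed by the query result rows.

def rows_to_csv(header, rows):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical rows standing in for a Data Mart query result.
report = rows_to_csv(["document", "creator"],
                     [["report.pdf", "admin"], ["notes.txt", "swati"]])
```

Formats like PDF or Excel carry layout on top of this same row data, which is why the report designer works from the SQL result set rather than from any one output format.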
6.2.2 Working of the Pentaho Report Designer Tool
Once the transformations are completed using Kettle, we can read the resulting data from the data mart into the Pentaho Report Designer tool with the help of SQL. The tool has a large selection of elements (text fields, labels etc.) and various GUI representation techniques such as pie charts, tables and graphs with which we can create our reports.

6.3 EXAMPLE REPORTS
Based on the transformations done using Spoon, we created reports for the following requirements using Pentaho Report Designer:
1. Document Information
2. Document Permission
3. Folder Information
4. Folder Permission
5. User Information

Figure 6.2 Document Information Report
Figure 6.3 Document Permission Report
Figure 6.4 Folder Information Report
Figure 6.5 Folder Permission Report
Figure 6.6 User Information Report
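Behind each such report, the Report Designer runs a SQL query against the data mart, with user-selected filter values substituted into the WHERE clause. The sketch below imitates that with Python and an in-memory SQLite database; the mart table, column names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical data mart table feeding a Document Permission report.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dm_doc_permission (doc_name TEXT, authority TEXT, permission TEXT);
    INSERT INTO dm_doc_permission VALUES
        ('budget.xls',  'farida', 'Read'),
        ('budget.xls',  'swati',  'Write'),
        ('minutes.doc', 'farida', 'Read');
""")

# The kind of query a report might run when the user picks 'Read' in a
# permissions filter (in Report Designer the placeholder would be a
# report parameter rather than a '?' marker).
query = """
    SELECT doc_name, authority FROM dm_doc_permission
    WHERE permission = ? ORDER BY doc_name
"""
rows = con.execute(query, ("Read",)).fetchall()
print(rows)  # [('budget.xls', 'farida'), ('minutes.doc', 'farida')]
```

The report engine then renders the returned rows into the chosen output type (HTML, PDF, Excel and so on).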
PUBLISHING PHASE

7.1 INTRODUCTION
After the reports are designed, we need to publish them on the server. The Pentaho BI Server, or BA Platform, allows you to access business data in the form of dashboards, reports or OLAP cubes via a convenient web interface. It also provides an interface to administer your BI setup and to schedule processes, and it supports different output types such as PDF, HTML and CSV.

7.2 PENTAHO BI SERVER
7.2.1 Introduction
The server is commonly referred to as the BI Platform, and was recently renamed the Business Analytics Platform (BA Platform). It is the core software piece that hosts content created both in the server itself, through plug-ins, and in files published to the server from the desktop applications. Out of the box it includes features for managing security, running reports, displaying dashboards, report bursting, scripted business rules, OLAP analysis and scheduling.

Figure 7.1 Pentaho BI Server Icon

Commercial plug-ins from Pentaho expand the out-of-the-box features, and a few open-source plug-in projects also extend the server's capabilities. The Pentaho BA Platform runs in the Apache Tomcat Java application server and can be embedded into other Java application servers.

7.2.2 Example Published Reports
Based on the reports we have created, the following reports can be deployed on the web:
1. Document Information
2. Document Permission
3. Folder Information
4. Folder Permission
5. User Information

Figure 7.2 Document Information Published Report
Figure 7.3 Document Permission Published Report
Figure 7.4 Folder Information Published Report
Figure 7.5 Folder Permission Published Report
Figure 7.6 User Information Published Report
7.3 SCHEDULING OF TRANSFORMATIONS
For real-time usage after the project is completed, the data warehouse needs to be refreshed at a regular interval. For that purpose we scheduled the project so that it is updated every day, reflecting the changes made in the last 24 hours. There are three ways to perform the scheduling:
1. Using the schedule option from the Action menu in Spoon.
2. Using the Start entry in a job, i.e. in .kjb (Kettle job) files.
3. Using the operating system's task scheduler.
The first method is usually preferred in industry, but the Community Edition we are working with does not provide the scheduling option. The second method applies only to jobs and does not update transformations, so it was not suitable either. We therefore scheduled the project using the task scheduler. All the transformations have been scheduled to run daily at 11:00 am. The project has been deployed on the web and submitted to our external guide; it will be used further by IPR on a web server for real-time usage.

Figure 7.7 Scheduling of Transformations
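The task-scheduler approach can be wired up with a small batch file that calls Pan, Kettle's command-line transformation runner, once per transformation. This is only an illustrative configuration sketch: the installation path and the .ktr file names below are placeholders, not the actual project paths.

```bat
@echo off
rem Illustrative wrapper registered with Windows Task Scheduler (daily, 11:00 am).
rem Paths and transformation file names are placeholders for the real ones.
cd /d C:\pentaho\data-integration
call Pan.bat /file:C:\etl\document_information.ktr /level:Basic
call Pan.bat /file:C:\etl\document_permission.ktr /level:Basic
call Pan.bat /file:C:\etl\folder_information.ktr /level:Basic
call Pan.bat /file:C:\etl\folder_permission.ktr /level:Basic
call Pan.bat /file:C:\etl\user_information.ktr /level:Basic
```

Such a wrapper can then be registered with the Task Scheduler, for example via `schtasks /create /tn "ETL-Refresh" /tr C:\etl\refresh.bat /sc daily /st 11:00` (again with placeholder names).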
TESTING

8.1 TESTING STRATEGY
Data completeness: ensures that all expected data is loaded into the target table.
1. Compare record counts between source and target, and check for any rejected records.
2. Check that data is not truncated in the columns of the target table.
3. Check that only unique values are loaded into the target; no duplicate records should exist.
4. Perform boundary value analysis.
Data quality: ensures that the ETL application correctly rejects invalid data, substitutes default values for it, corrects or ignores it, and reports it.
Data cleanness: unnecessary content should be cleaned up before loading into the staging area.
1. Example: if a name column carries extra spaces, we trim them with an expression transformation before loading into the staging area.
2. Example: if the telephone number and the STD code arrive in different columns but the requirement says they should be in one column, we concatenate the two values into a single column with an expression transformation.
Data transformation: all the business logic implemented in the ETL transformations should be reflected in the target.
Integration testing: ensures that the ETL process functions well with other upstream and downstream processes.
User-acceptance testing: ensures the solution meets users' current expectations and anticipates their future expectations.
Regression testing: ensures existing functionality remains intact each time a new release of code is completed.

8.2 TESTING METHODS
• Functional test: verifies that the item complies with its specified business requirements.
• Usability test: evaluates the item by letting users interact with it, to verify that it is easy to use and comprehensible.
• Performance test: checks that the item's performance is satisfactory under typical workload conditions.
• Stress test: shows how well the item performs with peak loads of data and very heavy workloads.
• Recovery test: checks how well the item recovers from crashes, hardware failures and other similar problems.
• Security test: checks that the item protects data and maintains functionality as intended.
• Regression test: checks that the item still functions correctly after a change has occurred.
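Two of the checks described above, data cleanness (trimming and concatenation) and data completeness (matching record counts), can be sketched as follows; the sketch uses an in-memory SQLite database as a stand-in, and all table and column names are illustrative.

```python
import sqlite3

# Stand-in source and staging tables; names and sample data are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_user (name TEXT, std_code TEXT, phone TEXT);
    CREATE TABLE stg_user (name TEXT, full_phone TEXT);
    INSERT INTO src_user VALUES ('  Farida ', '02692', '123456'),
                                ('Swati',    '079',   '654321');
""")

# Data cleanness: trim the name and concatenate STD code with the phone
# number while loading into staging (what an expression step would do).
con.execute("""
    INSERT INTO stg_user
    SELECT TRIM(name), std_code || '-' || phone FROM src_user
""")

# Data completeness: record counts must match between source and staging.
(src_count,) = con.execute("SELECT COUNT(*) FROM src_user").fetchone()
(stg_count,) = con.execute("SELECT COUNT(*) FROM stg_user").fetchone()
assert src_count == stg_count

rows = dict(con.execute("SELECT name, full_phone FROM stg_user"))
print(rows["Farida"])  # 02692-123456
```

In practice these assertions would run against the actual source database and staging area after each ETL load.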
8.3 TEST CASES
8.3.1 USER LOGIN AND USING THE FUNCTIONALITY OF REPORTS
Description: This test validates the user name and password and checks that the user can view reports in the desired output format with the desired selection options.

Table 8.1 Test Case 1
Sr. No | Test Case | Expected Output | Actual Output | Status
1 | User logs in to his/her page | The BA server page should open | The BA server page opens | Pass
2 | User views a report | The report should be displayed | User is able to view the report | Pass
3 | User selects an output format while viewing | The report must appear in the desired format | The report is displayed in the desired format | Pass
4 | User filters the report view | The filtered report should be shown | User is able to view the desired report | Pass
8.3.2 VIEWING DOCUMENTS, FOLDERS, PERMISSIONS, AUDITS
Description: This test case checks whether the user is able to view the data of folders and documents, their permissions, and the audit data.

Table 8.2 Test Case 2
Sr. No | Test Case | Expected Output | Actual Output | Status
1 | User views the documents | Document details should be displayed | Document details are seen | Pass
2 | User views the folders | Folders should be displayed | Folders are seen | Pass
3 | User views the permissions of folders and documents | Permissions must be seen by the user | Permissions are displayed | Pass
4 | User views the auditing data | Audit data must be displayed | Audit data is seen by the user | Pass
USER MANUAL

9.1 DESCRIPTION
This manual describes the working and use of the project, to help end users get familiar with its features. The project is divided into three levels:
1. Source Level
2. DWH Level
3. View Level
The source level is the back end of the project, i.e. the Alfresco database. The DWH level is PostgreSQL, used to create our data mart. The view level consists of the Pentaho tools; users see this level, specifically the Pentaho Business Analytics tool where the published reports are deployed. Once in the BA dashboard, the user can access the following functionality:
1. Login Page
2. View Reports
3. Scheduling
4. Administration

9.2 LOGIN PAGE
Before using the BA server, a user has to log in with an assigned user name and password so that the system knows which user accessed the server and at what time; this helps with security. To log in, follow the steps below:
1. Go to the BI server folder using the command prompt. After changing the directory to the BI server folder, start Pentaho.

Figure 9.1 Login Step 1

2. On startup, the system automatically runs Apache Tomcat.

Figure 9.2 Login Step 2
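Assuming the default Community Edition layout (the folder name varies with the installed version, and the path here is a placeholder), the startup sequence from the command prompt looks roughly like this:

```bat
rem Illustrative; the installation folder is a placeholder.
cd C:\pentaho\biserver-ce
start-pentaho.bat
rem Tomcat then serves the user console, by default at http://localhost:8080/pentaho
```

On Linux the equivalent script is start-pentaho.sh in the same folder.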
3. If Tomcat starts without errors, it opens the user console of the Pentaho BA server. The user can now log in to the server with their own user name and password.

Figure 9.3 Login Step 3

9.3 VIEW REPORTS
The main requirement of the user is to view the reports in the web browser, for decision making among other uses. To do that, the user follows these steps:
1. Once we log in, the Home screen opens up as shown in the given figure. To view reports, select 'Browse Files' (1) from the drop-down list.
Figure 9.4 View Reports Step 1

2. Once we select the 'Browse Files' option, the console shows the 'Folders' (2) in Home and, in the file box, the 'Files' (3) associated with the selected folder. There is also a 'Folder Actions' (4) option in the console that provides functions such as creating a new folder or deleting a folder.
3. To view a report, select it from the file box. For example, to see the Document Permissions report, click the docpermission-rep (5) file in the file box. It opens the Document Permissions report (6) in the browser.
Figure 9.5 View Reports Step 2

4. We can apply filters in the report. For example, in this report we can filter and list the documents according to their permissions by selecting the appropriate permission (7); here we have selected the 'Read' permission from the 'select permissions' filter. We can also view reports in different styles by selecting the appropriate style from 'Output Type' (8); here we have selected the HTML (Single Page) type.

9.4 SCHEDULING
You can schedule reports to run automatically. All of your active scheduled reports appear in the list of schedules, which you reach by clicking the Home drop-down menu and then the Schedules link in the upper-left corner of the User Console page. You can also access the list of schedules from the Browse Files page if you have a report selected.
The list of schedules shows which reports are scheduled to run, the recurrence pattern for each schedule, when it was last run, when it is set to run again, and the current state of the schedule.

Figure 9.6 Scheduling Page

Table 9.1 Scheduling Options
Item Name | Function
Schedules indicator | Indicates the current User Console perspective that you are using. Schedules displays a list of the schedules that you create, a toolbar to work with your schedules, and a list of times during which your schedules are blocked from running.
Schedule Name | Lists your schedules by the name you assign to them. Click the arrow next to Schedule Name to sort schedules alphabetically in ascending or descending order.
Repeats | Describes how often the schedule is set to run.
Source File | Displays the name of the file associated with the schedule.
Output Location | Shows the location where the scheduled report is saved.
Last Run | Shows the last time and date when the schedule was run.
Next Run | Shows the next time and date when the schedule will run.
Status | Indicates the current status of the schedule; the state can be either Normal or Paused.
Blockout Times | Lists the times during which all schedules are blocked from running.

You can edit and maintain each of your schedules by using the controls above the schedules list, on the right end of the toolbar.

Table 9.2 Scheduling Controls
Name | Function
Refresh | Refreshes the list of schedules.
Run Now | Runs the selected schedule(s) at will.
Stop Scheduled Task | Pauses a specified schedule; use Start Scheduled Task to resume paused jobs.
Start Scheduled Task | Resumes a previously stopped schedule.
Edit Scheduled Task | Edits the details of an existing schedule.
Remove Scheduled Task | Deletes a specified schedule. If the schedule is currently running, it continues to run, but it will not run again.

9.5 ADMINISTRATION
The User Console has one unified place, called the Administration page, where users logged in with a role that has permission to administer security can perform system configuration and maintenance tasks. If you see Administration in the left drop-down menu on the User Console Home page, you can click it to reveal the menu items for administering the BA Server. If you do not have administration privileges, Administration does not appear on the Home page.

Figure 9.7 Administration Page
Table 9.3 Administration Options
Item | Control Name | Function
1 | Administration | Opens the Administration perspective of the User Console, which enables you to set up users, configure the mail server, change authentication settings on the BA Server, and install software licenses for Pentaho.
2 | Users & Roles | Manages the Pentaho users or roles for the BA Server.
3 | Authentication | Sets the security provider for the BA Server to either the default Pentaho Security or LDAP/Active Directory.
4 | Mail Server | Sets up the outgoing email server and the account used to send reports through email.
5 | Licenses | Manages Pentaho software licenses.
6 | Settings | Manages settings for deleting older generated files, either manually or by creating a schedule for deletion.
LIMITATIONS AND FUTURE ENHANCEMENTS

10.1 LIMITATIONS
• All the data is stored in a single repository in Alfresco; if data backups are not managed properly, there is a risk of data loss.
• Since the Community Edition of the Pentaho Data Integration tool has a limited set of features, scheduling had to be set up manually.

10.2 FUTURE ENHANCEMENTS
• We were able to compress the 97 tables of Alfresco into 29 tables in the data warehouse; this could be reduced further in the future to increase efficiency.
• More sophisticated requirements, such as hyperlink functions and ticket generation for employees, could be implemented.
CONCLUSION AND DISCUSSION

11.1 SELF ANALYSIS OF PROJECT VIABILITY
11.1.1 Self Analysis
We have created an information repository, i.e. a data warehouse, for an existing database system, Alfresco. We have successfully installed the application, tested its performance on several fronts, and completed validation testing. The project has been accomplished in such a way that it incorporates several features demanded by present report generation and decision making requirements.

11.1.2 Project Viability
The project is complete and viable for use at the Institute for Plasma Research as a tool for generating reports from the data stored in its database, Alfresco. The reports are user-friendly, with strong GUI support through a host of graphical options such as pie charts, line graphs and bar charts, and they make decision making easier for the management.

11.2 PROBLEMS ENCOUNTERED AND POSSIBLE SOLUTIONS
• Alfresco was a new system that we had never used before; it took three to four weeks to understand its functionality and working in full.
• The Alfresco GUI was not accessible on either of our computers, so we had to install the PostgreSQL and SQuirreL database tools.
• It took time to finalize the ETL and reporting tools; we finally narrowed the choice down to Pentaho over JasperSoft and BIRT.
• Pentaho is essentially a collection of tools, with each stage of our project handled by a particular tool, so we had to familiarize ourselves with a host of Pentaho tools.
• Alfresco Audit Analysis and Reporting (A.A.A.R.) had not converted many of our tables while transforming them into the data warehouse, so we had to convert them manually.

11.3 SUMMARY OF PROJECT WORK
• PROJECT TITLE: DATA AND BUSINESS PROCESS INTELLIGENCE
A project based on data mining: a data warehouse is created, and its data is used to build user-friendly reports.
• PROJECT PLATFORM: PENTAHO
An open-source provider of reporting, analysis, dashboard, data mining and workflow capabilities.
• SOFTWARE USED:
Windows/Linux based system
PostgreSQL database
SQuirreL SQL client
Alfresco ECM
Pentaho Community Edition 5.0 (PDI, Reporting Tool, BI Server)
Alfresco Audit Analysis and Reporting tool
Notepad++
• DOCUMENTATION TOOLS:
VISIO 2013
WORD 2007
EXCEL 2007
• INTERNAL PROJECT GUIDE: PROF. R.S. CHHAJED
• EXTERNAL PROJECT GUIDE: MR. VIJAY PATEL
• COMPANY: INSTITUTE FOR PLASMA RESEARCH
• SUBMITTED BY: BHAGAT FARIDA H. and SINGH SWATI
• SUBMITTED TO: DHARMSINH DESAI UNIVERSITY
• PROJECT DURATION: 8TH DEC 2014 TO 28TH MARCH 2015
REFERENCES
http://wiki.pentaho.com/display/Reporting/01.+Creating+Your+First+Report
http://infocenter.pentaho.com/help/index.jsp?topic=%2Freport_designer_user_guide%2Ftask_adding_hyperlinks.html
http://www.robertomarchetto.com/pentaho_report_parameter_example
http://docs.alfresco.com/4.2/concepts/alfresco-arch-about.html
http://fcorti.com/alfresco-audit-analysis-reporting/aaar-description-of-the-solution/aaar-pentaho-data-integration/
http://en.wikipedia.org/wiki/Pentaho
http://www.joyofdata.de/blog/getting-started-with-pentaho-bi-server-5-mondrian-and-saiku/
https://technet.microsoft.com/en-us/library/aa933151(v=sql.80).aspx
http://datawarehouse4u.info/OLTP-vs-OLAP.html