Rando Veizi: Data warehouse and Pentaho suite
Published in: Technology, Business

Transcript

  • 1. Innovation and New Technologies. Professor: Carlo Vaccari. Student: Rando Veizi. 2/4/2014
  • 2. Contents
    Data Warehouses
      - History
      - Introduction
      - Why DW?
      - DW environment
      - Bottom-up Design
      - Top-down Design
      - Top-down vs bottom-up
      - The hybrid design
      - DW vs OS
    Pentaho Suite
      - Introduction
      - Installing Pentaho Suite
          - Starting the BI Platform
          - How to Log Into the Pentaho User Console
      - Trying some tools…
          - Community Dashboard Editor (CDE)
          - Saiku
      - Data warehouses and Pentaho Suite
  • 3. Data Warehouses
    History
    The DW notion dates to the late 1980s, when IBM researchers developed the “business data warehouse”. The original idea was to provide an architectural model for the flow of data from operational systems to decision support environments. The concept was meant to address the problems associated with this flow, mainly its high cost. Without a DW, a good amount of redundancy was needed to support multiple decision support environments. In bigger companies it was normal for multiple decision support environments to operate independently. Even though each environment served different users, they usually needed much of the same stored data. The process of managing data from different sources, in most cases long-established operational systems (OS, also called legacy systems), was partially replicated for each environment. Moreover, the operational systems were frequently re-examined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from data marts (DM) that were tailored for ready access by users.
  • 4. Introduction
    Figure 1: All data warehouse processes in one picture
    A data warehouse (DW, DWH), or enterprise data warehouse (EDW), is a database that is used for:
      - Reporting
      - Data analysis
    A DW is a central repository of data created by integrating data from disparate sources. It stores historical data and can be used to create trending reports for senior management, such as annual and quarterly comparisons. The data stored in the DW is uploaded from operational systems such as sales or marketing. The data may (but does not always) pass through an operational data store for certain operations before it is used in the DW for reporting.
    An ETL-based DW uses staging, data integration, and access layers to house its key functions:
      - The staging database stores data that has been extracted from each of the source data systems.
      - The integration layer integrates the data sets by transforming the data from the staging layer and storing it in an ODS (operational data store) database.
  • 5.
      - The integrated data is then moved to another database, the data warehouse database, where it is arranged into hierarchical groups called dimensions, and into facts and aggregate facts. The combination of facts and dimensions is often called a star schema.
      - The access layer retrieves the data for users.
    If a DW is built from integrated data source systems, it requires neither ETL, nor staging databases, nor ODS databases. The integrated source systems can be considered part of a distributed operational data store layer. The integrated data source systems and the DW are all integrated, since no transformation of dimensional or reference data takes place, unlike in an ETL-based DW. This integrated DW architecture supports drill-down from the aggregate data of the DW to the transactional data of the integrated source data systems.
    A data mart is a DW in miniature, focused on a specific area of interest. A DW can be subdivided into data marts for better performance and ease of use within that area. An organization can create one or more data marts as first steps towards a larger and more complex enterprise DW.
    This definition of DW focuses on data storage. The main source data is:
      - cleaned
      - transformed
      - catalogued
      - made available for use by managers and other business professionals for data mining or analytical processing
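    The staging, integration and access layers described above can be sketched in a few lines. This is only a minimal illustration using Python's built-in sqlite3 module, not a description of any particular product. The table names (staging_sales, dim_date, dim_product, fact_sales), the sample rows and the helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical rows, standing in for data extracted from a sales system.
SOURCE_ROWS = [
    ("2014-01-15", " widget ", "10.50"),
    ("2014-01-20", "Widget", "4.50"),
    ("2014-02-03", "gadget", "20.00"),
]


def build_warehouse(rows):
    """Load rows through a staging table into a small star schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE staging_sales (sale_date TEXT, product TEXT, amount TEXT);
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER,
            UNIQUE (year, month));
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date,
            product_key INTEGER REFERENCES dim_product,
            amount REAL);
    """)
    # Staging layer: store the extracted rows unchanged.
    db.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", rows)

    # Integration layer: clean and transform, then fill dimensions and facts.
    staged = db.execute("SELECT * FROM staging_sales").fetchall()
    for sale_date, product, amount in staged:
        year, month = int(sale_date[:4]), int(sale_date[5:7])
        name = product.strip().lower()
        db.execute("INSERT OR IGNORE INTO dim_date (year, month) VALUES (?, ?)",
                   (year, month))
        db.execute("INSERT OR IGNORE INTO dim_product (name) VALUES (?)", (name,))
        date_key = db.execute(
            "SELECT date_key FROM dim_date WHERE year = ? AND month = ?",
            (year, month)).fetchone()[0]
        product_key = db.execute(
            "SELECT product_key FROM dim_product WHERE name = ?",
            (name,)).fetchone()[0]
        db.execute("INSERT INTO fact_sales VALUES (?, ?, ?)",
                   (date_key, product_key, float(amount)))
    return db


def monthly_sales(db):
    """Access layer: aggregate the facts along the date and product dimensions."""
    return db.execute("""
        SELECT d.year, d.month, p.name, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d ON d.date_key = f.date_key
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY d.year, d.month, p.name
        ORDER BY d.year, d.month, p.name
    """).fetchall()


warehouse = build_warehouse(SOURCE_ROWS)
report = monthly_sales(warehouse)
```

    In this sketch the two differently spelled "widget" rows end up under one dimension row, which is the kind of cleaning the integration layer is meant to do. A real ETL tool would do the same work at a much larger scale.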
  • 6. Why DW?
    A DW keeps a copy of the data from the source transaction systems. This architecture makes it possible to:
      1. Group data from different sources into a single database, so that only one query engine is needed to present the data.
      2. Reduce lock contention in the transaction processing systems, which is caused by running large analysis queries in transaction processing databases.
      3. Save and keep the data history, even if the source transaction systems do not.
      4. Integrate data from many source systems, creating a central view across the enterprise.
      5. Provide consistent codes and descriptions, which improves data quality.
      6. Restructure the data so that it is more user-friendly for business users.
      7. Structure the data for very good query performance without affecting the operational systems (OS).
      8. Make decision-support queries easier to write.
    DW environment
    The environment for DW and DM comprises the following:
      - Source systems that provide data to the DW or DM.
      - Technologies and processes that prepare the data for use.
      - Architectures for storing data in an organization’s DW or DM.
      - Tools and applications for a wide range of users.
      - Metadata, data quality and governance processes, which must be in place to ensure that the DW or DM meets its purposes.
    These days the most successful companies are those that can respond quickly and flexibly to market changes and new opportunities. A key to this responsiveness is the effective and efficient use of data and information by analysts and managers.
  • 7. Bottom-up Design
    Figure 2: Bottom-up
    By building a series of data marts to an agreed architecture, the enterprise data warehouse can be assembled slice by slice, until it is complete enough to regard the data marts as subsets of the now much greater whole. Architecture is key to success, as the data marts must not be built in isolation. Data marts must therefore be designed in the knowledge that each will eventually form part of a larger enterprise data warehouse.
    Such an approach can prove attractive to businesses. Each data mart can be implemented within six to nine months. Each can tackle an identifiable business problem, making it possible to calculate the return on investment (ROI). The approach also offers a valuable learning curve for the build team, who can test out products and processes until they get it right.
    The bottom-up approach to data warehouse design was proposed by Ralph Kimball. In this approach DM are created first, to provide reporting and analytical capabilities for specific business processes. A DM primarily contains dimensions and facts. Facts can contain atomic data and, if necessary, summarized data. A data mart often models a specific business area, such as sales or production. These DM can eventually be integrated to create a comprehensive DW. The DW bus architecture is primarily an implementation of “the bus”, a collection of conformed dimensions and conformed facts. Conformed dimensions are dimensions that are shared between facts in two or more DM.
    The integration of the DM in the DW is centered on the conformed dimensions, which define the possible integration points between DM. The process that integrates two or more DM is called drill-across (DA). A DA first summarizes the data of each participating fact along the keys of the shared conformed dimensions, and then joins the summarized facts on those keys.
    The most important management task is to make sure that the dimensions are consistent among the data marts. Business value can be returned as soon as the first data marts are created, and the method lends itself well to an exploratory and iterative approach to building data warehouses.
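    The drill-across process described above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module and two hypothetical data marts (sales and production) that share one conformed product dimension. All table names, column names and figures are invented for the illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Conformed dimension shared by both data marts.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
    -- Fact table of the sales data mart.
    CREATE TABLE fact_sales (product_key INTEGER, units_sold INTEGER);
    -- Fact table of the production data mart.
    CREATE TABLE fact_production (product_key INTEGER, units_made INTEGER);

    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO fact_sales VALUES (1, 5), (1, 7), (2, 3);
    INSERT INTO fact_production VALUES (1, 20), (2, 4), (2, 6);
""")

# Step 1: summarize each fact along the key of the conformed dimension.
# Step 2: join the summarized facts on that key.
DRILL_ACROSS = """
    WITH sales AS (
        SELECT product_key, SUM(units_sold) AS units_sold
        FROM fact_sales GROUP BY product_key
    ),
    production AS (
        SELECT product_key, SUM(units_made) AS units_made
        FROM fact_production GROUP BY product_key
    )
    SELECT p.name, s.units_sold, m.units_made
    FROM dim_product p
    JOIN sales s ON s.product_key = p.product_key
    JOIN production m ON m.product_key = p.product_key
    ORDER BY p.name
"""

rows = db.execute(DRILL_ACROSS).fetchall()
for name, sold, made in rows:
    print(name, sold, made)
```

    Summarizing before joining matters here. If the two fact tables were joined row by row first, each sales row would be repeated for every matching production row and the totals would be inflated.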
  • 8. Example: a DW effort can start in the sales department by building a sales DM. After this DM is completed, the effort can be extended with another DM, for example a production one. For the DM to be integrable with each other, they need to share the same bus. If the integration succeeds, then through these two DM the DW can deliver integrated information about sales and production, which is usually very valuable to the business.
    Top-down Design
    Figure 3: Top-down
    The opposite of starting with individual business issues and expanding up the organisation hierarchy is to start at the top. A top-down strategy of an enterprise data warehouse with subset data marts is "the most elegant design approach", says Doug Hackney of the Enterprise Group, a business intelligence systems specialist. He says that such an approach would vastly ease maintenance, summarisation, metadata management and extraction, transformation and loading (ETL) of data.
    The top-down approach to data warehouse design was proposed by Bill Inmon. It is designed around a normalized enterprise data model. “Atomic” data, that is, data at the lowest level of detail, is stored in the DW. Dimensional DM containing the data needed for specific business processes or departments are then created from the DW. According to Inmon, the DW is the center of the CIF (Corporate Information Factory), which provides a logical framework for delivering business intelligence and business management capabilities.
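    As a rough sketch of the top-down idea, the snippet below keeps atomic data in normalized tables and derives a small dimensional data mart for one department from them. It uses Python's built-in sqlite3 module. The tables, the "toys" department and all figures are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Normalized warehouse tables holding atomic data (lowest level of detail).
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY, name TEXT,
        dept_id INTEGER REFERENCES department);
    CREATE TABLE sale (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES product,
        sale_date TEXT, amount REAL);

    INSERT INTO department VALUES (1, 'toys'), (2, 'tools');
    INSERT INTO product VALUES (10, 'robot', 1), (11, 'hammer', 2), (12, 'kite', 1);
    INSERT INTO sale VALUES
        (100, 10, '2014-01-05', 30.0),
        (101, 12, '2014-01-09', 12.0),
        (102, 11, '2014-02-01', 8.0),
        (103, 10, '2014-02-11', 30.0);

    -- Data mart for the toys department, created from the warehouse tables.
    CREATE TABLE mart_toys_sales AS
        SELECT substr(s.sale_date, 1, 7) AS sale_month,
               p.name AS product_name,
               SUM(s.amount) AS revenue
        FROM sale s
        JOIN product p ON p.product_id = s.product_id
        JOIN department d ON d.dept_id = p.dept_id
        WHERE d.name = 'toys'
        GROUP BY substr(s.sale_date, 1, 7), p.name;
""")

mart = db.execute("""
    SELECT sale_month, product_name, revenue
    FROM mart_toys_sales
    ORDER BY sale_month, product_name
""").fetchall()
```

    The mart only holds what the department needs, while the atomic rows stay available in the warehouse for any other mart built later.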
  • 9. Top-down vs bottom-up
    All in one picture:
    Figure 4: Top-down vs bottom-up
  • 10. The hybrid design
    The Hybrid Data Warehouse (Hybrid) is uniquely suited to support both EDW and data mart applications in one database. It can accommodate the large volumes of historical data typically found in the EDW, while also performing well for the OLAP queries typically run against data marts.
    The Hybrid database structure contains both normalized snowflakes and de-normalized star schemas. The controlled redundancy inherent in this design provides good response times for a variety of queries.
    The Hybrid architecture can also be used to implement the ODS in the same database, as long as sub-second response times are not a requirement. Because the ODS can be used by operational systems, the response time of the database can become an issue.
    Because there is only one database schema, the Hybrid model significantly reduces the cost of developing the ETL processes. Real-time (or near real-time) updates can be supported by pushing data updates immediately to the Hybrid warehouse directly from the operational system, or by connecting the ETL engine to an Enterprise Service Bus (ESB).
    The Hybrid model was used to develop one of the largest databases in Canada. It includes 34 dimensional roles with multiple hierarchies, has over 1500 attributes, and handles 40 million transactions per day in near real time, which translates into more than one billion rows per month.
    The Hybrid model may not be able to fully replace an ODS where sub-second response time is required. But it can offer a one-stop solution for organizations that have very large data volumes and are looking for a cost-effective way to support a variety of BI requirements across the organization.
  • 11. DW vs OS
    The fundamental difference between operational systems (OS) and DW systems is that operational systems are designed to support transaction processing, whereas data warehousing systems are designed to support online analytical processing (OLAP). Based on this fundamental difference, the data usage patterns associated with operational systems are significantly different from those associated with data warehousing systems. As a result, data warehousing systems are designed and optimized using methodologies that differ drastically from those of operational systems. The table below summarizes many of the differences between operational systems and data warehousing systems.
    Table 1: DW vs OS
      1. Operational systems: generally designed to support high-volume transaction processing with minimal back-end reporting.
         Data warehousing systems: generally designed to support high-volume analytical processing (i.e. OLAP) and subsequent, often elaborate report generation.
      2. Operational systems: generally process-oriented or process-driven, meaning that they are focused on specific business processes or tasks. Example tasks include billing, registration, etc.
         Data warehousing systems: generally subject-oriented, organized around business areas that the organization needs information about. Such subject areas are usually populated with data from one or more operational systems. As an example, revenue may be a subject area of a data warehouse that incorporates data from operational systems that contain student tuition data, alumni gift data, financial aid data, etc.
      3. Operational systems: generally concerned with current data.
         Data warehousing systems: generally concerned with historical data.
      4. Operational systems: data is generally updated regularly according to need.
         Data warehousing systems: data is generally non-volatile, meaning that new data may be added regularly, but once loaded, the data is rarely changed, thus preserving an ever-growing history of information. In short, data within a data warehouse is generally read-only.
      5. Operational systems: generally optimized to perform fast inserts and updates of relatively small volumes of data.
         Data warehousing systems: generally optimized to perform fast retrievals of relatively large volumes of data.
      6. Operational systems: generally application-specific, resulting in a multitude of partially or non-integrated systems and redundant data (e.g. billing data is not integrated with payroll data).
         Data warehousing systems: generally integrated at a layer above the application layer, avoiding data redundancy problems.
      7. Operational systems: generally require a non-trivial level of computing skills amongst the end-user community.
         Data warehousing systems: generally appeal to an end-user community with a wide range of computing skills, from novice to expert users.
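    The contrast between current, regularly updated operational data and non-volatile warehouse history can be sketched as follows. This is a simplified illustration with Python's built-in sqlite3 module. It keeps both tables in one in-memory database only for brevity, and the account and fact_balance_history tables and the post_payment helper are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Operational table: one row per account, holding only the current balance.
    CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance REAL);
    -- Warehouse table: append-only history of balances, never updated in place.
    CREATE TABLE fact_balance_history (
        account_id INTEGER, load_date TEXT, balance REAL);

    INSERT INTO account VALUES (1, 100.0);
    INSERT INTO fact_balance_history VALUES (1, '2014-01-31', 100.0);
""")


def post_payment(account_id, amount, load_date):
    """Apply a payment operationally, then append the new state to the history."""
    with db:
        # Operational workload: a fast in-place update of one small row.
        db.execute(
            "UPDATE account SET balance = balance - ? WHERE account_id = ?",
            (amount, account_id))
        # Warehouse load: a new row is added, existing rows are left untouched.
        db.execute(
            "INSERT INTO fact_balance_history "
            "SELECT account_id, ?, balance FROM account WHERE account_id = ?",
            (load_date, account_id))


post_payment(1, 25.0, "2014-02-28")
post_payment(1, 15.0, "2014-03-31")

# The operational system only knows the current value.
current = db.execute(
    "SELECT balance FROM account WHERE account_id = 1").fetchone()[0]
# The warehouse preserves the whole history for analysis.
history = db.execute(
    "SELECT load_date, balance FROM fact_balance_history ORDER BY load_date"
).fetchall()
```

    After the two payments the operational table holds a single balance, while the history table has grown by one row per load, which is what makes trend reports possible.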
  • 12. Pentaho Suite
    Introduction
    Pentaho was founded in 2004 and is headquartered in Orlando, FL, USA. One of its main advantages is that it offers a suite of open source business intelligence (BI) products. These products, called Pentaho Business Analytics, provide data integration, OLAP (online analytical processing) services, reporting, dashboarding, data mining and ETL capabilities.
    Pentaho is an open source business intelligence development platform with different components integrated into it. Both open source and commercial versions are available to support BI needs. This report aims to help open source business intelligence developers integrate CTools on CDF to fulfil their dashboard development needs.
    Figure 5: Pentaho community edition vs Pentaho enterprise edition
  • 13. Installing Pentaho Suite
    Now I will show how to install the Pentaho Suite community edition (CE) along with some tools, and explain their purpose.
      a) Download the Pentaho Server from http://community.pentaho.com/. Choose the zip or tar.gz archive according to your preference.
      b) Install Tomcat.
      c) Set up MySQL.
      d) Configure the BI Server.
    Starting the BI Platform:
    In order to use and configure the Pentaho BI Platform, you must start the BI Server, then the Pentaho Administration Console.
      1. To start the BI Server, run the start-pentaho script in the /biserver-ce/ directory.
      2. To start the Pentaho Administration Console, run the start script (on Windows) or the startup script (on Linux) in the /biserver-ce/administration-console/ directory.
    How to Log Into the Pentaho User Console
      1. Open a Web browser and type in the Web or IP address of the Pentaho server, which is http://localhost:8080/pentaho/ by default. You'll see an introductory screen with some Pentaho-related information and a Login button in the center of the screen.
      2. Click Login. The login dialog will appear.
      3. For the locally installed version of the BI Suite, select Joe from the user drop-down box, type "password" into the password field, then click Login. For hosted demo users, select Guest and type "guest" as the password instead.
    You are now logged into the Pentaho User Console and ready to start creating and running reports.
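    Before opening the browser it can be handy to check that the BI Server has finished starting. The following is a minimal sketch using only Python's standard library. The console_is_up helper is hypothetical and not part of Pentaho, and the only Pentaho-specific detail it assumes is the default address given in the steps above.

```python
import urllib.error
import urllib.request

# Default address of the Pentaho User Console, taken from the steps above.
DEFAULT_URL = "http://localhost:8080/pentaho/"


def console_is_up(url=DEFAULT_URL, opener=urllib.request.urlopen, timeout=5):
    """Return True if the address answers with a successful HTTP status.

    `opener` is injectable so the check can be exercised without a server.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, and similar failures.
        return False


if __name__ == "__main__":
    if console_is_up():
        print("Pentaho User Console is reachable at", DEFAULT_URL)
    else:
        print("No answer yet from", DEFAULT_URL)
```

    The server can take a while to come up after start-pentaho is run, so a check like this would typically be repeated every few seconds until it succeeds.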
  • 14. Figure 6: Pentaho’s login interface
    Trying some tools…
    Community Dashboard Editor (CDE) is one of the plugins designed for the Pentaho BI Server, contributed and maintained by the Pentaho partner Webdetails.
      - The purpose of this tool is to create dashboards.
      - CDE was born to simplify the creation, editing and rendering of CTools dashboards.
      - CDE is a very powerful and complete tool, combining the front end with data sources and custom components in a seamless way.
    To create a dashboard I followed some examples. After CDE is installed, the Pentaho interface changes and a new icon is added:
  • 15. By experimenting and following guides I was able to create something (screenshots below). It is a dashboard showing how many exams I took each year of my bachelor's degree.
    Saiku
    Another tool that I studied is Saiku. Saiku is a modular open-source analysis suite offering lightweight OLAP which remains easily embeddable, extendable and configurable. It is similar in form and function to the Pentaho Analyzer plugin. It allows a user to create queries visually by dragging parts of a previously defined OLAP schema onto a canvas, where other activities can take place, such as filtering, sorting, creating calculated members from other measures, exporting the result table to PDF or MS Excel, and optionally graphing the data. A RESTful server connects to existing OLAP systems and powers user-friendly, intuitive analytics via a lightweight jQuery-based front end.
  • 16. Turning data into information shouldn't be hard; it should be easy and fun. The Saiku project is all about creating tools that are easy to use for anyone who wants to crunch numbers, visualize information, gain insight from data and act on it.
    If you want to understand more about how Saiku works, you can go to these web addresses: http://pedroalvesbi.blogspot.it/2011/06 or http://codeissue.com/articles/a04e87158bb8552/pentaho-bi-ctools-cdf-cdacde-saiku-analytics-etc-using-cygwin
    Data warehouses and Pentaho Suite:
    Open-source Pentaho provides business intelligence (BI) and data warehousing solutions at a fraction of the cost of proprietary solutions. To learn more about combining data warehouses with the Pentaho suite, you might like to buy (or download) and take a look at Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL.
