What is a Data Warehouse and How Do I Test It?

  1. © 2015 Real-Time Technology Solutions, Inc. New York • Philadelphia • Atlanta www.rtts.com What is a Data Warehouse and How Do I Test It? A primer for Testers on Data Warehouses, the ETL process and Business Intelligence – and how to test them
  2. © 2011 Real-Time Technology Solutions, Inc. New York • Philadelphia • Atlanta www.rtts.com About RTTS – Facts Founded: 1996 – consulting firm Locations: New York (HQ), Atlanta, Philadelphia, Phoenix Strategic Partners: IBM, Microsoft, HP, Oracle, Teradata, Hortonworks, Cloudera, Amazon Software: QuerySurge™ RTTS is the leading provider of software & data quality solutions for critical business systems
  3. Overview • What is Big Data? • What is a Data Warehouse? o About the ETL Process o The Data Warehouse marketplace • What is Business Intelligence? o The architecture o The BI marketplace • Testing the DW Architecture o Entry points o The Mapping document o Functional test implementation o Test Tools • Testing BI o Functional test implementation o Performance Testing • Data Warehouse Test Tool demo • Q&A
  4. The Executive Office and Critical Data: potential problem areas Data Architecture → ETL → Business Intelligence (BI) software CxOs are using Business Intelligence & Analytics to make critical business decisions – with the assumption that the underlying data is fine. “The average organization loses $8.2 million annually through poor Data Quality.” - Gartner
  5. What is Big Data?
  6. What is Big Data? Big data – data with too much volume, velocity and variability to be handled by normal database architectures. “The market for big data is $70 billion and growing by 15% a year.” - EMC COO Pat Gelsinger Size: defined as 5 petabytes or more. 1 petabyte = 1,000 terabytes; 1,000 terabytes = 1,000,000 gigabytes; 1,000,000 gigabytes = 1,000,000,000 megabytes
  7. Big Data Impact One major retailer handles more than 1 million customer transactions every hour: • data is imported into databases that contain > 2.5 petabytes of data • the equivalent of 167 times the information contained in all the books in the US Library of Congress. Facebook handles 40 billion photos from its user base. Google processes 1 terabyte per hour. Twitter processes 85 million tweets per day. eBay processes 80 terabytes per day. ...and others
  8. Big Data Solutions Requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Technologies include: • massively parallel processing (MPP) databases • data warehouses • data mining grids • distributed file systems • distributed databases • cloud computing platforms • the Internet, and • scalable storage systems
  9. What is a Data Warehouse?
  10. What is a Data Warehouse? Data Warehouse • typically a relational database that is designed for query and analysis rather than for transaction processing • a place where historical data is stored for archival, analysis and security purposes • contains either raw data or formatted data • combines data from multiple sources: • sales • salaries • operational data • human resource data • inventory data • web logs • social networks • Internet text and docs • other Legacy DB CRM/ERP DB Finance DB
  11. Data Warehouse: Business Case Why build a Data Warehouse? • Data stored in operational systems (OLTP) not easily accessible • OLTP systems are not designed for end-user analysis • The data in OLTP is constantly changing • May be deficient in historical data • Diverse forms of data stored in different platforms and/or dissimilar formats
  12. Data Warehouse: Business Case The Data Warehouse Business Solution • Collects data from different sources (other databases, files, web services, etc) • Integrates data into logical business areas • Provides direct access to data with powerful reporting tools (BI)
  13. Data Warehouse: About the data The Data Warehouse data • Subject-oriented • Integrated • Non-volatile • Time-variant
  14. Data Warehouse: the ETL process ETL = Extract, Transform, Load Why ETL? The data warehouse must be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis. Extract – data is pulled from one or more OLTP systems and copied into the warehouse. Transform – inconsistencies are removed, data is assembled into a common format, missing fields are added, detailed data is summarized, and new fields are derived to store calculated data. Load – the data is mapped and loaded into the DW.
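For illustration, a single ETL step can be expressed in SQL. This is a minimal, hypothetical sketch: the target table dw.fact_order, the unitPrice column and the one-day load window are assumptions, and the source tables are loosely modeled on the Sales schema shown on slide 31.

-- Extract recent orders from the OLTP source, transform them by deriving
-- a calculated field, and load them into an assumed warehouse fact table.
INSERT INTO dw.fact_order (order_id, customer_id, order_date, line_total)
SELECT o.idOrder,
       o.order_idCustomer,
       o.orderDate,
       op.quantity * p.unitPrice AS line_total   -- Transform: derived field
FROM Sales.Orders o                              -- Extract: OLTP source copy
JOIN Sales.OrderProduct op ON op.orderProduct_idOrder = o.idOrder
JOIN Sales.Product p ON op.orderProduct_idProduct = p.idProduct
WHERE o.orderDate >= CURRENT_DATE - 1;           -- Load only the latest day's data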
  15. Data Warehouse: the ETL process Extract Legacy DB CRM/ERP DB Finance DB Source Data ETL Process Target Data Warehouse Transform Load
  16. Data Warehouse: the Marketplace “The data warehousing market will see a compound annual growth rate of 11.5% through 2013 to reach a total of $13.2 billion in revenue.” - consulting specialist The 451 Group Data Warehouse size Small data warehouses: < 5 TB Midsize data warehouses: 5 TB - 20 TB Large data warehouses: > 20 TB - Analyst firm Gartner Leaders in Data Warehouse Database Management Systems (vendor logos) - Analyst firm Gartner’s ‘Magic Quadrant for Data Warehouse Database Management Systems’
  17. Data Warehouse: the Marketplace Delivery Models • stand-alone DBMS software • Cloud offerings • data warehouse appliances Leading Appliance Makers
  18. Business Intelligence (BI)
  19. Business Intelligence (BI) B.I. – What is it? • Software applications used in spotting, digging out, and analyzing business data • provides simple access to data which can be used in day-to-day operations; integrates data into logical business areas • provides historical, current and predictive views of business operations • made up of several related activities, including data mining, online analytical processing, querying and reporting.
  20. Business Intelligence (BI): Who uses it? Wal-Mart uses vast amounts of data and category analysis to dominate the industry. Amazon and Yahoo follow a "test and learn" approach to business changes. Hardee’s, Wendy’s, and T.G.I. Friday’s use BI to make strategic decisions.
  21. Business Intelligence (BI) & Data Marts Data Mart A database that has the same characteristics as a data warehouse, but is usually smaller and is focused on the data for one division or one workgroup within an enterprise. It typically holds aggregated data and some granular data. It is a subset of the DW and makes Business Intelligence reporting more efficient. Legacy DB CRM/ERP DB Finance DB ETL ETL Source Data ETL Process Target DW ETL Process Data Mart
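As a hedged illustration of how a data mart is derived from the warehouse, the sketch below aggregates the dw tables that appear later in the deck (slide 31) into an assumed mart.monthly_sales table. The mart schema name, and the 'MM-DD…YY' string format of purchaseDate implied by slide 31, are assumptions.

-- Build an aggregated sales table from warehouse detail rows,
-- so BI reports can avoid scanning granular purchase data.
CREATE TABLE mart.monthly_sales AS
SELECT SUBSTR(p.purchaseDate, 1, 2) AS sales_month,  -- 'MM' of the assumed format
       SUBSTR(p.purchaseDate, -2)   AS sales_year,   -- 'YY' of the assumed format
       i.item_idCategory            AS category_id,
       SUM(oi.quantity)             AS units_sold,
       COUNT(DISTINCT p.idPurchase) AS purchase_count
FROM dw.Purchase p
JOIN dw.OrderItem oi ON oi.orderItem_idPurchase = p.idPurchase
JOIN dw.Item i ON oi.orderItem_idItem = i.idItem
GROUP BY SUBSTR(p.purchaseDate, 1, 2), SUBSTR(p.purchaseDate, -2), i.item_idCategory;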
  22. Business Intelligence (BI) (Diagram: Legacy DB, CRM/ERP DB and Finance DB source data flow through the ETL process into the target DW, then through a further ETL process into the Data Mart.)
  23. BI: the Marketplace “Worldwide business intelligence (BI) platform, analytic applications and performance management (PM) software revenue reached $10.5 billion in 2010, a 13.4 percent increase from 2009 revenue of $9.3 billion.” “The four large "stack" vendors (SAP, Oracle, IBM and Microsoft) continue to consolidate the market, owning 59 percent of the market share.” - Analyst firm Gartner Leaders in BI (vendor logos) - Analyst firm Forrester Research’s ‘Forrester Wave’
  24. Testing a Data Warehouse (DWH)
  25. Data Warehouse Testing The Challenge Comprehensive testing of data at every point throughout the data process is becoming increasingly important as more data is being used in strategic decision-making. Yet current strategies are time-consuming, resource-intensive and inefficient. What's Involved in Data Testing? According to authors Doug Vucevic and Wayne Yaddow in the book "Testing the Data Warehouse Practicum: Assuring Data Content, Data Structures and Quality", some of the main challenges of data testing are: Data Completeness Verifying that all data has been loaded from the sources to the target (a minimal row-count sketch follows this slide). Data Transformation Ensuring that all data has been transformed correctly during the extract-transform-load (ETL) process. Data Quality Ensuring that the ETL process correctly rejects, substitutes default values, corrects or ignores and reports invalid data. Regression Testing Ensuring existing functionality remains intact each time a new release of code is completed.
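For example, the simplest data completeness check is a source-to-target row count comparison. A minimal sketch, assuming the source and target table names from slide 31 and a one-to-one order-to-purchase mapping:

-- The two counts should reconcile after the load; any difference should be
-- explained by intentionally rejected rows logged by the ETL process.
SELECT COUNT(*) AS source_rows FROM Sales.Orders;
SELECT COUNT(*) AS target_rows FROM dw.Purchase;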
  26. Resources involved • Business Analysts create requirements • QA Testers develop and execute test plans and test cases. ***Skill Set required: Strong SQL!!! • Architects set up test environments • Developers perform unit tests • DBAs test for performance and stress • Business Users perform functional User Acceptance Tests Testing the DWH: Resources Involved For the purposes of this presentation, we will focus on a strategy for Testers.
  27. An effective data warehouse testing strategy focuses on the main structures within the data warehouse architecture: 1) The Sources 2) The ETL layer 3) The data warehouse itself 4) The front-end (BI) data warehouse applications Testing the Data Warehouse: the Strategy
  28. Testing the Data Warehouse: Entry Points Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions). The goal: provide rapid localization of data issues between points. (Diagram: Legacy DB, CRM/ERP DB and Finance DB source data flow through the ETL process into the target DW, then through a further ETL process into the Data Mart and on to the Business Intelligence software, with test entry points at each stage.)
  29. Testing the Data Warehouse: Entry Points (Diagram of a possible architecture: Legacy DB, CRM/ERP DB, Finance DB and flat files feed a staging DB through ETL processes; further ETL processes load the target DW and then the Data Marts, which feed the Business Intelligence software. Test entry points sit between each stage.)
  30. Testing the DWH: the Mapping Document a.k.a. Source to Target Map It’s the critical element required to efficiently plan the ETL process. Intention: • capture business rules • data flow mapping • data movement requirements Mapping Doc specifies: • Source input definition • Target/output details • Business & data transformation rules • Data quality requirements
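For illustration, a few rows of a hypothetical source-to-target map (the rules shown are assumptions, kept consistent with the sample queries on the next slide):

Source field                    Target field               Transformation rule
Sales.Orders.idOrder            dw.Purchase.idPurchase     direct move; unique, not null
Sales.Orders.orderDate (DATE)   dw.Purchase.purchaseDate   reformat as 'MM-DD…YY' string
Sales.Customer.idCustomer       dw.user_.idUser            direct move; FK to dw.user_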
  31. Testing the DWH: the Mapping Document

Source:

SELECT c.idCustomer "Customer ID",
       c.lastName "Customer Last Name",
       c.firstName "Customer First Name",
       o.idOrder "Order Number",
       p.name "Product Name",
       op.quantity "Quantity Ordered",
       CASE
         WHEN os.idOrderStatus = 5 AND o.refundDate IS NOT NULL THEN 'Returned'
         WHEN (os.idOrderStatus = 3 OR os.idOrderStatus = 4) AND o.shipDate IS NOT NULL THEN 'Delivered'
         ELSE 'Processing'
       END "Order Status"
FROM Sales.Orders o, Sales.OrderStatus os, Sales.OrderProduct op,
     Sales.Product p, Sales.Category cat, Sales.Customer c
WHERE o.order_idOrderStatus = os.idOrderStatus
  AND op.orderProduct_idOrder = o.idOrder
  AND op.orderProduct_idProduct = p.idProduct
  AND p.product_idCategory = cat.idCategory
  AND cat.name = 'Electronics'
  AND o.order_idCustomer = c.idCustomer
  AND o.orderDate BETWEEN '01-SEP-10' AND '07-SEP-10'
ORDER BY c.idCustomer, c.lastName, c.firstName, o.idOrder

Target:

SELECT u.idUser "Customer ID",
       u.lastName "Customer Last Name",
       u.firstName "Customer First Name",
       p.idPurchase "Purchase Number",
       i.name "Item Name",
       oi.quantity "Quantity Ordered",
       ps.status "Purchase Status"
FROM dw.Purchase p, dw.PurchaseStatus ps, dw.OrderItem oi,
     dw.Item i, dw.user_ u, dw.category cat
WHERE p.purchase_idPurchaseStatus = ps.idPurchaseStatus
  AND oi.orderItem_idPurchase = p.idPurchase
  AND oi.orderItem_idItem = i.idItem
  AND p.purchase_idUser = u.idUser
  AND i.item_idCategory = cat.idCategory
  AND cat.name = 'Electronics'
  AND SUBSTR(p.purchaseDate, 1, 5) BETWEEN '09-01' AND '09-07'
  AND SUBSTR(p.purchaseDate, -2) = '10'
ORDER BY u.idUser, u.lastName, u.firstName, p.idPurchase
  32. Testing the DWH: Implementation Implementation of Functional Test What is going on in the marketplace? 1. Manual execution 2. Automated execution with standard test tools 3. Bulk automation with a Data Warehouse Testing Tool (e.g. QuerySurge)
  33. © 2015 Real-Time Technology Solutions, Inc. Testing the DWH: Manual Testing Flow Tasks, in timeline order: Review mapping docs → Write SQL in favorite editor → Run tests → Dump results to a file → Compare results manually or with a compare tool → Report defects and issues
  34. Testing the DWH: Manual Testing Flow Manual ETL Testing Flow Comments • Check points across each leg so that each transformation is checked. • If a file compare tool is used, care must be taken to ensure that the result rows for each query are in the same order (the db is under no obligation to return rows in a specified order, unless the SQL specifies one). • This process can quickly result in 100’s or 1,000’s of source and target query pairs. • The process is labor intensive; even with multiple people, only a VERY small sampling can be performed.
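One hedged way around the row-ordering pitfall noted above is to compare result sets inside the database with set operators, rather than diffing dumped files. A minimal sketch using the customer mapping from slide 31 (run it in both directions to catch rows missing on either side):

-- Rows present in the source but absent from the target indicate a defect.
SELECT idCustomer, lastName, firstName FROM Sales.Customer
EXCEPT                                     -- MINUS on Oracle
SELECT idUser, lastName, firstName FROM dw.user_;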
  35. Testing the DWH: Typical Functional Automation Testing Flow Functional Automation ETL Testing flow 1. Similar to the previous flow – extract mappings from the mapping document 2. Write pairs of queries that test between any two points in the architecture. 3. Issue the queries via a functional automation tool 4. Have the functional scripts dump the query result-sets to files 5. Compare the files, either by writing automation code or by using a file compare tool. This process is dependent on the speed of the automation tool; normally, only a fraction of the data can be covered per ETL per build.
  36. © 2015 Real-Time Technology Solutions, Inc. Testing the Data Warehouse: Specialized Data Warehouse Test Tool – QuerySurge™ (Diagram: pairs of SQL (source) and SQL (target) queries run against the Legacy DB, CRM/ERP DB and Finance DB sources and the target warehouse.)
  37. QuerySurge™ the collaborative Data Warehouse Testing solution that finds bad data & provides a holistic view of your data’s health built by
  38. with QuerySurge™ you can: • Reduce your costs & risks • Improve your data quality • Accelerate your testing cycles • Share information with your team • Achieve huge ROI (e.g. 1,300%)* *based on a client’s calculation of Return on Investment built by QuerySurge™
  39. the QuerySurge advantage built by QuerySurge™ Automate the entire testing cycle • Automate kickoff, tests, comparison, auto-emailed results Create Tests easily with no SQL programming • ensures minimal time & effort to create tests / obtain results Test across different platforms • data warehouse, Hadoop, NoSQL, database, flat file, XML Collaborate with team • Data Health dashboard, shared tests & auto-emailed reports Verify more data & do it quickly • verifies up to 100% of all data up to 1,000 x faster Integrate for Continuous Delivery • Integrates with most Build, ETL & QA management software
  40. QuerySurge™ Architecture Web-based… Installs on… Windows & Linux Connects to… flat files or any other JDBC-compliant data source Components: QuerySurge Controller, QuerySurge Server, QuerySurge Agents built by QuerySurge™
  41. Collaboration Share information on the Data Health dashboard. Testers - functional testing - regression testing - result analysis Developers / DBAs - unit testing - result analysis Data Analysts - review, analyze data - verify mapping failures Operations teams - monitoring - result analysis Managers - oversight - result analysis built by QuerySurge™
  42. Testing the Data Warehouse: Functional Test of Business Intelligence software Strategy • Execute business user reports and verify results from report to Data Mart » Logical Calculations − Verify logical calculations against the back-end Data Mart by creating SQL queries that incorporate and return the calculations from the Data Mart. Compare to the report. (Example: total sales for the month of January – see the sketch after this slide) » Data Validation − Verify data validations against the back-end Data Mart by creating SQL queries that return the equivalent data from the Data Mart. Compare to the report. (Example: list of all customers that spent more than $100) » Parameter Validation − For reports that have parameters, create multiple tests that incorporate a reasonable amount of test coverage.
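A hedged sketch of the "logical calculation" check above: recompute January's total sales directly against the data mart and compare the figure to the BI report. The unitPrice column is an assumption; the SUBSTR date handling follows the string format implied on slide 31.

-- Recompute the report's January total from the data mart.
SELECT SUM(oi.quantity * i.unitPrice) AS january_total_sales
FROM dw.Purchase p
JOIN dw.OrderItem oi ON oi.orderItem_idPurchase = p.idPurchase
JOIN dw.Item i ON oi.orderItem_idItem = i.idItem
WHERE SUBSTR(p.purchaseDate, 1, 2) = '01'   -- month = January
  AND SUBSTR(p.purchaseDate, -2) = '10';    -- year = 2010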
  43. Testing the DWH: Functional Test of BI Functional Testing of BI 1. BI Developer creates reports based on business user requirements 2. Testers verify reports by: • Running reports using a range of parameter permutations. • Verifying that data is correct o Compare record counts on the report to the back-end data mart o Verify field data elements o Verify field lengths and field-level data o Verify logical dependencies Automation tools can and should be used for regression purposes.
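For instance, the field-level checks above can be backed by small data quality queries against the mart. A sketch with assumed constraints – the 50-character name limit and the positive-quantity rule are illustrative, not from the deck:

-- Flag rows that would render incorrectly on a report:
-- missing keys, over-length names, non-positive quantities.
SELECT idUser, lastName
FROM dw.user_
WHERE idUser IS NULL
   OR LENGTH(lastName) > 50;

SELECT orderItem_idPurchase, quantity
FROM dw.OrderItem
WHERE quantity <= 0;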
  44. Testing the Data Warehouse: Performance Test of BI Common Challenges • BI systems often have reports that require complex SQL queries across dozens of tables encompassing hundreds of thousands of records, from multiple databases. • Challenge: determining the performance characteristics under differing conditions and workloads. • Need to know the ability of the system to scale to the number of concurrent users. • Must test how long it takes a user to receive a report after requesting it with the parameters he/she specifies.
  45. Testing the DWH: Performance Test of BI Strategy • Determine a typical workload for the business intelligence system. • Identify different user roles, what kinds of work they do on the system, and how often they do this work. • Determine how many users of each role there are. • Choose a performance tool that can record the protocol activity of the system and allow the performance tester to modify data parameters. • Create scripts by recording the protocol traffic emitted by the BI system as the targeted reports are opened and refreshed. • Prepare and execute a series of concurrent multi-user tests. • Make sure each virtual user emulates the activity of real users accessing business intelligence reports based on separate concerns. • Monitor response times, throughput, network activity, and system activity for issues. • Review results and provide recommendations. Using this approach, the workload activity of the entire population of business intelligence users can be reproduced in controlled conditions.
  46. Summary What is a Data Warehouse and How Do I Test It? • Big Data is a growing technical concern and has reached $70 billion in scope. • The Data Warehouse and Business Intelligence software marketplace is a $22 billion market and growing. • Functional testing of a data warehouse implementation is a complex undertaking and requires strong SQL skills from the Tester. • Manual testing and automated testing using standard tools provide a very small percentage of coverage. • Business Intelligence software must be properly tested for both functionality and performance.
  47. © 2015 Real-Time Technology Solutions, Inc. What is a Data Warehouse and How Do I Test It? To see the video of this Webinar please visit: http://www.querysurge.com/solutions/data-warehouse-testing

Editor's Notes

  1. Volume -- of data is getting higher/bigger than ever. Velocity -- of data is increasing e.g. Complex Event Processing of real time data.  Variety -- of data is spiraling e.g. unstructured video and voice. Variability -- of data types is also increasing
  2. Corporate data in an organization is generated and stored in a variety of operational systems.  Operational systems are systems like order-entry and invoicing, that are tuned to handle day-to-day transactions. OLTP - On-line Transaction Processing systems.
  3. Access Data Directly With data warehousing, as a decision-maker, you do not need to rely on IS personnel to fulfill your querying needs.  You can access data directly, when and how you want. You can execute queries and build reports on your workstation, freeing the IS department to focus on tasks such as building applications.
  4. Subject-Oriented OLTP = application-oriented & current— designed to support application processing. DW = subject-oriented & historical— designed to aid decision-making. Integrated Data in a warehouse is integrated by consolidating data from different operational systems. Non-Volatile Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred. Where an operational system replaces existing data with new data, a data warehouse continually absorbs new data, integrating it with existing data. Time Variant In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to OLTP systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
  5. Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project. Many data warehousing projects use ETL tools to manage this process. Other data warehouse builders create their own ETL tools and processes, either inside or outside the database. Besides the support of extraction, transformation, and loading, there are some other tasks that are important for a successful ETL implementation as part of the daily operations of the data warehouse and its support for further enhancements.
  6. Informatica’s software is the premier tool used for ETL, but was not mentioned in Gartner’s report because they don’t have DW software.
  7. Companies use BI to improve decision making, cut costs and identify new business opportunities. BI is more than just corporate reporting and more than a set of tools to coax data out of enterprise systems. CIOs use BI to identify inefficient business processes that are ripe for re-engineering. With today’s BI tools, business folks can jump in and start analyzing data themselves, rather than wait for IT to run complex reports.
  8. Restaurant chains such as Hardee’s, Wendy’s, Ruby Tuesday and T.G.I. Friday’s are heavy users of BI software. They use BI to make strategic decisions, such as what new products to add to their menus, which dishes to remove and which underperforming stores to close. They also use BI for tactical matters such as renegotiating contracts with food suppliers and identifying opportunities to improve inefficient processes. Because restaurant chains are so operations-driven, and because BI is so central to helping them run their businesses, they are among the elite group of companies across all industries that are actually getting real value from these systems.
  9. Each of these units must be treated separately and in combination, and since there may be multiple components in each (multiple feeds to ETL, multiple databases or data repositories that constitute the warehouse, and multiple front-end applications), each of these subsystems must be individually validated.
  10. 1. (comment) Usually, the points are across each ETL “leg”, so that each transformation is checked stepwise. 4. If a file compare tool is used, care must be taken to ensure that the result rows for each query are in the same order (the db is under no obligation to return rows in a specified order, unless the SQL indicates an order). Output is not fancy from file compare tools (usually); reporting will be ad hoc using Excel or similar. This process can quickly result in 100’s or 1,000’s of pairs of queries – since if you write several testing queries for each mapping across each leg, the multipliers raise the numbers quickly (avg # of testing queries per mapping × # of mappings per leg × # of legs). Clearly, this process is labor intensive, and even with several people executing, only a tiny fraction of the data can be covered per ETL per build.
  11. Functional Automation ETL Testing flow As above - Extract mappings from mapping document Write pairs of queries that test between any two points in the architecture. Usually, the points are across each ETL “Leg”, so that each transformation is checked stepwise. Issue the queries via a Functional Automation tool Have the functional Scripts dump the query result-sets to files Compare the files, either by writing automation code or by using a file compare tool. The tool logs serve as reporting output. Clearly, this process is dependent on the speed of the automation tool; even with several tool instances executing, typically only a fraction of the data can be covered per ETL per build.
  12. QuerySurge provides insight into the health of your data throughout your organization through BI dashboards and reporting at your fingertips. It is a collaborative tool that allows for distributed use throughout your organization and provides a sharable, holistic view of your data’s health and of the maturity of your organization’s data management.
  13. QuerySurge helps your team coordinate your data quality initiatives while speeding up your development and testing cycles and finding your bad data. Why risk having your team identify trends and develop strategic initiatives when the underlying data is incorrect? QuerySurge reduces this risk.
  14. Your distributed team from around the world can use any of these web browsers: Internet Explorer, Chrome, Firefox and Safari. Installs on operating systems: Windows & Linux. QS connects to any JDBC-compliant data source, even if it is not listed here.
  15. QuerySurge can be utilized by active practitioners such as testers & developers to create and launch tests, or by managers, analysts and operations to view data test results and the overall health of the data. QuerySurge facilitates this by providing 2 types of licenses: (1) full user & (2) participant user. (1) Full User – This type of user has unlimited access to create QueryPairs, Suites, and Scenarios. This user can also schedule and run tests, see results, run and export reports, and export data. Perfect for anyone creating and/or running data tests while performing analysis of results. (2) Participant User – This user cannot create or run tests, but has access to all other information - including viewing all query pairs, results, and reports, receiving email notifications, and exporting test results and reports. Perfect for managers, analysts, architects, DBAs, developers, and operations users who need to know the health of their data.
  16. Business Intelligence systems often have reports that require the use of complex SQL queries across dozens of tables encompassing hundreds of thousands of records, originating from several individual databases. Determining the performance characteristics of these systems under differing conditions and workloads has always been a challenge. In most cases, we wish to test how long it would take a user to receive a report after requesting it with the parameters he or she specifies. In order to begin, we must determine a typical workload for the business intelligence system. We determine how many user roles there are, what kinds of work they do on the system, and how often they do this work. We determine how many users of each role there are. It is important to choose a performance tool that can record the protocol activity of the system and allow the performance tester to modify it, so that scripts are generalized and applicable to the different data parameters needed to emulate the variable data that business intelligence systems often require. We create scripts by recording the protocol traffic emitted by the business intelligence system as the targeted reports are opened and refreshed. These scripts are modified to accept data parameters from a file or prepared information, determined by a subject matter expert. These parameters are put into data files, in a form suitable for use with the test scripts.