Agile Business Intelligence


Applying Agile Delivery to Business Intelligence, focusing on ETL development



  1. Applying Agile Delivery to Business Intelligence. Topic: Data Integration & ETL. David Portnoy, 312.970.9740. © Copyright 2012-2014 Datalytx, Inc.
  2. This group extends the TDWI community online and is designed to foster peer networking and discussion of key issues relevant to business intelligence and data warehousing managers. TDWI (The Data Warehousing Institute™) provides education, training, certification, news, and research for executives and information technology (IT) professionals worldwide. Founded in 1995, TDWI is the premier educational institute for business intelligence and data warehousing. Our Web site is
     Why this topic?
     - There's a lot of confusion and misconception about the meaning of Agile, especially as it applies to BI
     - Many in corporate IT still believe that Agile cannot easily be applied to BI
     - Posts on this topic in the TDWI forum on LinkedIn would benefit from being organized and summarized
  3. What we'll cover:
     - Misconceptions about Agile BI
     - Core techniques of Agile BI
     - Review of the ETL tool landscape and its benefits
     - Decision factors for choosing the ETL environment
     - Mitigating aspects of ETL tools that make Agile harder
     - How to implement an Agile BI development environment
     Due to the prevailing confusion and misconceptions, it's easier to start with what Agile BI is not.
  4. Misconceptions about Agile in the BI community
     There's a common misconception that Agile BI applies to practically any methodology or tool that helps develop BI projects faster or in a more flexible way. Some examples of misconceptions:
     - Agile is primarily adding iterations to typical projects
     - Agile implies starting to code without planning or design
     - Agile involves particular data models, such as Data Vault
     - Agile involves rapid prototyping techniques, as can be achieved by certain metadata-driven tools
     - Agile involves self-serve reporting, such as Tableau
     - Agile involves moving ETL from a separate code base into the reporting layer, as made possible by in-memory processing, such as with QlikView
     - Agile involves building a real-time or low-latency DW, rather than a traditional batch one
     - Agile operates in a hosted cloud environment, especially PaaS (Platform as a Service)
  5. The culprits for the myths and misconceptions
     #1 Vendors claim that their products are agile.
     #2 The BI community as a whole does not have a long history or substantial practice with agile development, and is therefore more likely to be swayed by vendor pitches.
  6. The culprits for the myths and misconceptions (cont.)
     Example source of misconceptions: Forrester Research article "Agile Out of the Box", 2010.
     What's being said:
     "...Agile BI methodology differs from [agile software development] in that it requires new and different technologies and architectures for support. Metadata-generated BI applications are one such example..."
     The article goes on to claim that these particular tools are needed in order to achieve "development done faster", "react[ing] more quickly to ... requirements", incremental product delivery, "rapid prototypes versus specifications", "reacting versus planning", "personal interactions ... versus documentation", etc.
     What's wrong with it:
     - This list is just buzzwords associated with agile, without substantial evidence of why other tools are insufficient.
     - In the software development world, that's equivalent to saying that new frameworks, such as Ruby on Rails, are needed for Agile development. (Few credible publications or developers would make such a claim.)
     - The implication that other BI tools can't be used to achieve Agile BI is simply not true. (Even general-purpose development platforms can be applied to BI.)
     - Rapid prototyping is confused with the role of end-to-end working software.
     - In reality, team composition, proficiency with existing technologies, and management's acceptance of agile have a bigger impact than any specific type of BI tool.
     - On the contrary, arguments can be made for why the tools identified could be detrimental to agile teams. (See the TDWI LinkedIn group discussion "The Role of ETL tools in Agile BI".)
  7. The reality
     Yes, many of the items misclassified as necessary for Agile still help projects ramp up and complete faster. Yes, many improve the flexibility of dealing with changes in source data, business logic, and reporting. Yes, many provide additional visibility into complex logic and functional changes across team members and stakeholders.
     - Data Vault model
     - Rapid prototyping tools
     - Metadata-driven BI tools
     - Self-serve reporting
     - In-memory processing
     - Hosted cloud (PaaS) environment
     But none of them are required for successful Agile BI projects.
  8. So what are the requirements for implementing Agile BI?
     Productive Agile BI teams operate almost identically to Agile teams in software development, with just minimal tweaks to accommodate:
     1. Integration of available ETL and reporting tools into the development environment
     2. Changes to regression testing due to the fact that databases have state
     3. Challenges of managing large data sets in the deployment process
  9. Techniques for implementing Agile in BI
     - Timebox deliverables, of course.
     - Measure completion with working software! (Prototypes using non-production tools are OK, but get the end-to-end data flow working ASAP.)
     - Hold a highly efficient, daily team synchronization in which the entire team participates.
     - Monitor completion of features (stories), not time spent. Calculate team velocity to improve planning.
     - Hold sprint retrospectives to learn from mistakes.
     - Leverage techniques of Agile app dev:
       - Manage everything in version control, including the data model and test data sets
       - Assume refactoring of working code can occur later to improve performance and maintainability
       - Use Test-Driven Development (TDD) to ensure understanding of requirements and reduce rework
       - Implement Continuous Integration to automate builds, tests, and deployment
     - Measure project success by delivery of business value, not delivery of predefined requirements on time and on budget.
     - Accept that it's OK to fail, but fail early and adapt. (Non-agile projects don't recognize failure until time or budget runs out.)
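As a concrete illustration of the TDD technique above applied to BI (not part of the original deck): a minimal sketch in Python using an in-memory SQLite database as a stand-in for the warehouse. The table and column names and the transformation itself are invented for the example; the point is that a small, version-controlled test data set drives the transformation's development.

```python
import sqlite3

def transform_daily_revenue(conn):
    """Hypothetical ETL step: aggregate raw orders into a daily revenue table."""
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute("""
        CREATE TABLE daily_revenue AS
        SELECT order_date, SUM(amount) AS revenue
        FROM orders
        GROUP BY order_date
    """)

def test_transform_daily_revenue():
    # Small, curated test data set; in a real project this lives in version control
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("2014-01-01", 100.0), ("2014-01-01", 50.0), ("2014-01-02", 25.0)])
    transform_daily_revenue(conn)
    rows = dict(conn.execute("SELECT order_date, revenue FROM daily_revenue"))
    assert rows == {"2014-01-01": 150.0, "2014-01-02": 25.0}

test_transform_daily_revenue()
print("test passed")
```

The same pattern works with DbUnit or SQLUnit against a real RDBMS; SQLite just keeps the sketch self-contained.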
  10. What's the reason for low adoption of Agile in BI?
      Agile is widely adopted in application development, but not in BI. Potential reasons might stem from differences between the two worlds:
      - Development environment: App dev uses custom development in standard, general-purpose languages well suited for automation. BI uses proprietary vendor architectures and DSLs (domain-specific languages) not well suited for automation.
      - Team skills: App dev teams have the skills to write automation for continuous integration. BI teams rely on vendors to provide these features.
      - Costs: App dev has a low up-front investment by leveraging open source platforms. BI has a high up-front investment in vendor-specific tools: DW appliance, data modeling, ETL, OLAP, reporting, etc.
      - Releases: Application software is stateless and therefore easier to test and deploy with each build. Databases have state, with each build needing to start with a certain data set; high data volumes may take hours to load a changed data model or roll back changes.
  11. Now let's get into the specifics of ETL in Agile BI
  12. ETL tools have evolved over the years
      Custom code:
      - One-time solutions
      - Built with a focus on short-term delivery and minimal up-front cost
      Frameworks:
      - Origin: reusable code compiled from a few similar projects
      - Just change parameters to reuse for specific loading, logging, change data capture, database connections, etc.
      Code generators:
      - Intuitive development UI enabling developers to manipulate ETL metadata
      - From metadata, generate code in a general-purpose language (such as C or Java) or a domain-specific one (such as SQL or MDX)
      - Types: one-shot generators (that require switching to a native dev environment) vs. full development environments with managed version deployments
      Engines:
      - Graphical development accomplishing ETL through parameterization and configuration, rather than code generation
      - Avoids complexities with code management and deployment
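The code-generator category above can be sketched in a few lines (this example is not from the deck): ETL metadata, here an invented column mapping, is turned into executable SQL. Real tools keep far richer metadata, but the principle of generating code from metadata is the same.

```python
# Hypothetical metadata describing one load: column mappings from a staging
# table to a target dimension table (all names invented for illustration).
mapping = {
    "source": "stg_customers",
    "target": "dim_customer",
    "columns": {"cust_id": "customer_key", "cust_name": "customer_name"},
}

def generate_insert_sql(m):
    """Generate a plain INSERT ... SELECT statement from the mapping metadata."""
    tgt_cols = ", ".join(m["columns"].values())
    src_cols = ", ".join(m["columns"].keys())
    return (f"INSERT INTO {m['target']} ({tgt_cols}) "
            f"SELECT {src_cols} FROM {m['source']}")

print(generate_insert_sql(mapping))
# INSERT INTO dim_customer (customer_key, customer_name) SELECT cust_id, cust_name FROM stg_customers
```

Because the output is plain SQL text, it diffs and version-controls cleanly, which matters for the Agile practices discussed later.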
  13. We can categorize the major ETL players
      The vendors:
      - Traditional vendors: Informatica, SSIS, DataStage
      - Open source: Talend, Pentaho Kettle
      - Metadata-driven, automated discovery, federated integration: Kalido, BI Ready, Wherescape, Composite Software
      The most common alternative:
      - SQL + shell scripts
      - Native DB load utilities
  14. ETL tools have lots of value
      - Built-in, commonly used features for transformation and job control:
        - Master Data Management (MDM)
        - Data cleansing
        - Change Data Capture (CDC)
        - Data lineage and data dependency functionality
        - Processing of SCD (Slowly Changing Dimensions)
        - Parallelization of tasks that can be run concurrently
        - Advanced merging functionality
      - Without ETL tools, we're reinventing the wheel on many BI design patterns that have been implemented countless times before
      - Abstracts complex logic into graphical components or a domain-specific language that leverages best practices and is often more maintainable over the potentially long project life span
      - Graphical representation of the data model, data flow, and job flow provides visibility into business logic, especially useful for less technical team members
      - Provides a degree of self-documentation, without the need to update the graphical representation of logic separately from source code
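To make one of these built-in features concrete, here is a sketch (not from the deck) of SCD Type 2 processing, the kind of logic ETL tools ship ready-made, in Python against an in-memory SQLite database. Table and column names are invented; a real implementation would also handle inserts of brand-new keys, batching, and late-arriving data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_id INTEGER, city TEXT,
    valid_from TEXT, valid_to TEXT, is_current INTEGER)""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Chicago', '2014-01-01', '9999-12-31', 1)")

def apply_scd2(conn, customer_id, new_city, effective_date):
    """SCD Type 2: if the attribute changed, expire the current row and insert a new version."""
    cur = conn.execute(
        "SELECT city FROM dim_customer WHERE customer_id=? AND is_current=1",
        (customer_id,)).fetchone()
    if cur and cur[0] != new_city:
        conn.execute(
            "UPDATE dim_customer SET valid_to=?, is_current=0 "
            "WHERE customer_id=? AND is_current=1",
            (effective_date, customer_id))
        conn.execute("INSERT INTO dim_customer VALUES (?,?,?,'9999-12-31',1)",
                     (customer_id, new_city, effective_date))

apply_scd2(conn, 1, "Denver", "2014-06-01")
rows = conn.execute(
    "SELECT city, is_current FROM dim_customer ORDER BY valid_from").fetchall()
print(rows)  # [('Chicago', 0), ('Denver', 1)]
```

Writing even this simplified version by hand shows why a tool's tested SCD component can be worth having.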
  15. But many ETL tools are not well suited to an Agile BI environment
  16. First, these tools may not be ideal for Agile in general...
      Some ETL tools are...
      - Not well suited for code refactoring, branching, and merging, because the code is not in text files that can be used with modern version control, such as Git
      - Not well suited for use with automation in Continuous Integration, because they're often standalone environments with no provisions for external automation
      - Not well suited for TDD (Test-Driven Development), unless the vendors explicitly made provisions for unit test automation
      - Proprietary, with "black box" features that might make testing more challenging or decrease the portability of test cases
      - Expensive, with high up-front license costs also putting more capital at risk (unless open source ETL, of course)
  17. Second, they may negatively impact the productivity of Agile teams
      ETL tools may...
      - Require a proprietary, vendor-specific skill set not present in the organization
      - Cause work priority to be stove-piped and limited by skill set, rather than driven by overall business value
      - Prevent leveraging the full dev team, since they fall under a development environment separate from the rest of the apps
      - Result in a productivity hit, since some professional developers are more productive writing code in native languages than using GUI tools, even after training
      - Not provide compelling enough reasons for developers to learn any one ETL tool, since the lack of industry standards decreases skill portability
  18. Third, there are other challenges and considerations
      There are challenges and limitations with ETL tools even outside of Agile:
      - They require allocating additional resources to manage version upgrades of the ETL tool, even if the code base hasn't been changing
      - When the type of processing needed is outside of core ETL tool features, complexity can grow quickly
      - The usefulness of visual representations of data models, data flows, and job flows is reduced as complexity increases
      - Some find GUI development less efficient than traditional coding, especially for complex or unique types of processing
      - Often the sophisticated features are underutilized, resulting in expensive tools being used just for job scheduling
  19. Fourth, BI increasingly involves Big Data
      Big Data implementations often make ETL tools less compelling:
      - Large volumes make it more efficient to:
        - Manipulate data in place using ELT, rather than have multiple staging areas
        - Use native methods (MapReduce/Java, SQL, Hive, etc.) that allow for more control and performance optimization
      - High velocity of data makes it harder to use ETL tools that have traditionally been designed around batch-oriented processing
      - High variability of data makes ETL tools less attractive, since they expect a fixed schema and don't gracefully accommodate changes. Common examples include unstructured web log data in flat files and logical objects from apps stored in key-value pair format
      - MPP vendors, such as Teradata and Netezza, make a case for doing ELT (rather than ETL) processing natively, and provide built-in features to do so
      - Currently ETL tools are rarely used with the Hadoop ecosystem, for many of the reasons stated as well as licensing cost
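The ELT pattern mentioned above can be sketched in miniature (this example is not from the deck; SQLite stands in for an MPP database, and the pipe-delimited log format is invented): raw data is landed untransformed, then reshaped in place with SQL instead of passing through an external ETL engine's staging areas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw records in the target database first, untransformed
conn.execute("CREATE TABLE raw_events (line TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?)",
                 [("2014-01-01|click",), ("2014-01-01|view",), ("2014-01-02|click",)])

# Transform in place with SQL: the database engine does the work, close to the data
conn.execute("""
    CREATE TABLE events AS
    SELECT substr(line, 1, 10) AS event_date,
           substr(line, 12)    AS event_type
    FROM raw_events
""")
print(conn.execute("SELECT COUNT(*) FROM events WHERE event_type='click'").fetchone()[0])  # 2
```

On an MPP platform the same "load raw, transform with SQL" shape lets the transformation parallelize with the data, which is the core of the ELT argument.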
  20. That said, how do we implement an Agile BI environment?
  21. First, use ETL tools when it makes sense
      Pick the right ETL tool for the job...
      - We covered the potential benefits and problems of using ETL tools for Agile BI. Look for situations where the benefits outweigh the problems. For example, a good situation to employ ETL tools might be a use case requiring sophisticated data cleansing transformations, complex job control logic, and data volumes easily handled by traditional SMP database architectures.
      - Outside of such situations, consider using SQL, DB-specific native code, or general-purpose languages already in use elsewhere in the organization.
      Is it OK to start with using an ETL tool as a job scheduler?
      - Yes, assuming it's an efficient way to handle much-needed job control logic, including failures, event triggers, and dependencies.
      - Plus, you get the option to adopt other capabilities of the tool over time with low project risk.
      While traditional ETL tools can simplify a complex task, they can also overcomplicate a simple task.
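To show what the "job control logic, including failures, event triggers, and dependencies" amounts to, here is a toy dependency-aware job runner (not from the deck; job names and the dependency graph are invented). It runs jobs in dependency order and skips downstream jobs when an upstream one fails, which is exactly the behavior an ETL tool's scheduler provides out of the box.

```python
def run_pipeline(jobs, deps):
    """Run callables in dependency order; downstream jobs of a failure are skipped."""
    done, failed = set(), set()
    pending = list(jobs)
    while pending:
        before = len(pending)
        for name in list(pending):
            job_deps = deps.get(name, [])
            if any(d in failed for d in job_deps):
                failed.add(name)           # skip: an upstream job failed
                pending.remove(name)
            elif all(d in done for d in job_deps):
                try:
                    jobs[name]()
                    done.add(name)
                except Exception:
                    failed.add(name)       # the job itself failed
                pending.remove(name)
        if len(pending) == before:
            raise RuntimeError("cycle or missing dependency")
    return done, failed

jobs = {"extract": lambda: None,
        "load":    lambda: None,
        "report":  lambda: None}
deps = {"load": ["extract"], "report": ["load"]}
done, failed = run_pipeline(jobs, deps)
print(sorted(done), sorted(failed))  # ['extract', 'load', 'report'] []
```

Retries, event triggers, and notifications layer on top of this same skeleton, which is why even using an ETL tool "just as a scheduler" buys real functionality.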
  22. Second, when you do use ETL tools, look for ways to mitigate the issues identified
      So what's the solution? Issue by issue:
      - High up-front license cost: Use open source tools or less expensive licenses, as with SQL Server. Negotiate aggressively with vendors, in light of lower-cost alternatives.
      - Use with Continuous Integration: See the following slides. Some vendors, like Microsoft, may make provisions for automated builds within their environment. Otherwise look for opportunities to simplify, partially automate, and notify the team of build state.
      - Use with version control: Where possible, save ETL logic to XML, create dumps of the repository, and generate code from metadata. Then manage it in a common version control tool.
      - Decreased portability: Move code to general-purpose development languages, including SQL and MDX. Consider tools that generate generic code from a GUI or metadata.
      - Vendor-specific skill set: Build a cross-functional team by training existing developers and by hiring well-rounded developers willing to learn ETL tools.
      - Risk of introducing another development environment: Start using ETL tools now and "grow" into using the functionality:
        - Continue coding in what you know: native RDBMS code or even general app dev languages
        - Start using ETL as a glorified job scheduler to wrap native code
        - When refactoring code, take the opportunity to push more logic into the ETL tool
        - Gradually start using other features such as MDM, data quality, notifications, enterprise service bus, etc.
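The "save ETL logic to XML" mitigation can be illustrated with a small sketch (not from the deck; the mapping structure and element names are invented). The idea is to serialize whatever metadata the tool exposes into stable, diff-friendly text that can be committed to Git alongside the rest of the code base.

```python
import xml.etree.ElementTree as ET

# Hypothetical ETL mapping metadata, exported from a tool's repository
mapping = {"name": "load_dim_customer",
           "source": "stg_customers",
           "target": "dim_customer"}

# Serialize to XML with a deterministic element order, so diffs stay readable
root = ET.Element("etl_mapping", name=mapping["name"])
ET.SubElement(root, "source").text = mapping["source"]
ET.SubElement(root, "target").text = mapping["target"]
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Many ETL tools already store jobs as XML internally; the win comes from exporting it in a stable order so that version control diffs reflect real changes rather than serialization noise.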
  23. Continuous Integration: Methodology
      - Each developer should have a sandbox: a 1-to-1 app instance to DB instance (see "Continuous Integration" by Martin Fowler)
      - Automate: table deployment, usage stats, schema verification, data migration verification, DB testing, and migration to prod
      - Version control all DB assets, ideally using a distributed tool like Git
      - Use a tool like dbDeploy to link the app build, DB version, and forward/reverse DDL & DML scripts
      - Generate a test data set with a dimension annotating what each record is testing; it becomes a company asset that enables TDD of BI
      For cases where an application consumes data from the data warehouse:
      - BI developers should learn software coding practices; application developers should learn data modeling, SQL, and DB tuning
      - Consuming apps use 2-phased builds:
        - Build 1: the DB is stubbed out, and the build runs within minutes
        - Build 2: includes the real DB for end-to-end testing, but might run for a while
      - Bugs found in Build 2 trigger additions to the test data set; the next time, the same bug is caught in Build 1
      (Diagram: in a typical BI dev environment, Dev 1, Dev 2, and Dev 3 contend for a shared developer schema during development; in a sandboxed dev environment appropriate for agile development, each developer has their own schema.)
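The dbDeploy-style "link app build to DB version" idea can be sketched as follows (not from the deck; the migration scripts and changelog layout are invented, and SQLite stands in for the target database): numbered migration scripts are applied exactly once, each recorded in a changelog table, so any environment can be brought forward to a particular build.

```python
import sqlite3

# Hypothetical numbered migration scripts, as they would live in version control
MIGRATIONS = {
    1: "CREATE TABLE customer (id INTEGER, name TEXT)",
    2: "ALTER TABLE customer ADD COLUMN city TEXT",
}

def migrate(conn):
    """Apply any not-yet-applied migrations in order, recording each in a changelog."""
    conn.execute("CREATE TABLE IF NOT EXISTS changelog (version INTEGER PRIMARY KEY)")
    applied = {v for (v,) in conn.execute("SELECT version FROM changelog")}
    for version in sorted(MIGRATIONS):
        if version not in applied:
            conn.execute(MIGRATIONS[version])
            conn.execute("INSERT INTO changelog VALUES (?)", (version,))

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: already-applied versions are skipped
print(conn.execute("SELECT MAX(version) FROM changelog").fetchone()[0])  # 2
```

Because the changelog records exactly which scripts have run, the same `migrate` call works against a fresh sandbox, a QA database, or production, which is what makes per-developer sandboxes practical.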
  24. Continuous Integration: Tools & Configuration
      Tools, by type and purpose:
      - Ant (build tool): Automates the steps to build & deploy software.
      - Jenkins (continuous integration): Monitors the source code repository (Git) for check-ins, automatically launching build-test cycles and publishing results.
      - Git (source control / repository): Source code repository optimized for branching and merging, making it efficient for each developer to have their own sandbox environment. It triggers CI build-test cycles.
      - dbDeploy, dbMaintain, etc. (database refactoring manager): Automates the process of establishing which database refactorings need to be run against a specific database in order to migrate it to a particular build.
      - DbUnit, DbFit, SQLUnit (unit test automation): Common tools to aid TDDD (test-driven DB development). They manage DB state between test runs, import/export test data sets, run unit tests, and log exceptions. Regression testing of DDL, DML, and stored procedures.
      How dbDeploy works. dbDeploy is treated as a custom Ant task:
      1. Logs & assigns version numbers to changes in SQL files
      2. Saves the changelog table since the prior version
      3. Generates DDL & DML scripts to apply to the DB in other environments
      (Diagram: developers check project code in and out of the Git repository; the CI environment checks out the code, the build tool deploys and tests against a test server, reports success/fail, tags the build, and promotes it to the prod server.)
  25. Continuous Refactoring & Releases of Databases
      (Based on a presentation by Pramod Sadalage.)
      Environments along the pipeline, from highly iterative development to operations & support:
      - Dev sandbox: project-level testing; frequent deployment; low impact of a bug; uses the test data set (used for TDD)
      - Project integration sandbox: project-level testing; frequent deployment; uses the test data set
      - Test / QA sandbox: system integration testing; infrequent deployment; medium impact of a bug; uses benchmark data
      - Production: operations & support; controlled deployment; high impact of a bug; uses production data
  26. Continuous Integration: Possible Configuration for the Microsoft BI Stack
      Team Foundation Server:
      - Microsoft's application lifecycle management (ALM) solution; a collaboration platform that supports agile delivery practices
      - The build machine is configured for continuous integration, so the latest working version is refreshed and available to the entire distributed team
      PowerDelivery:
      - Addresses TFS's weakness in coordinating the promotion of builds through the multiple environments of the delivery pipeline: triggering a build on commit, promoting the commit build to test, promoting the test build to prod
      Windows PowerShell:
      - Task-based command-line shell & scripting language (built on .NET) for task automation
      SQL Server Data Tools:
      - Develop, debug, and execute database unit tests interactively in Visual Studio
      - Puts database testing on an equal footing with application testing
      - Tests can then be run from the command line or from a build machine
      - Integrated with testing, bug tracking, and project management using TFS