
DI&A Slides: Data-Centric Development



Efforts to improve computer software have led to the widespread use of methodologies, such as the Agile System Development Lifecycle, that focus heavily on software coding. Even common technologies used for Big Data analytics, such as Hadoop and commodity disk storage, require additional programmer attention to implement capabilities that used to be handled by relational database management systems and smart disks. How are organizations that are successfully “data driven” changing to focus on data-centric development?

Published in: Technology


  1. 1. Data-Centric Development, June 1, 2017. MONTHLY SERIES: The First Step in Information Management.
  2. 2. Welcome, Malcolm Chisholm (© 2017 First San Francisco Partners)
      • First San Francisco Partners’ Chief Innovation Officer
      • More than 25 years of experience in data management
      • Areas of expertise: data-centric development methodology, data governance, master/reference data management, metadata engineering, business rules management/execution, data architecture and design
  3. 3. Polling Questions
      • Do you have key data-centric projects (e.g., a Data Lake) that you are implementing or have implemented this year?
        − Yes
        − No
        − Not sure
      • Do you employ (or have you employed) Waterfall or Agile methods to manage your data-centric projects?
        − Yes
        − No
        − Not sure
  4. 4. Topics for Today’s Webinar
      • Data-Centric Development Defined
      • The Focus on Programming in Agile Development
      • How to Include a Data Focus in Agile Development
      • The Focus on Programming in Big Data
      • The Data-Centric Development Life Cycle
      • Using Conceptual Data Modeling to Make Development Data-Centric
      • Data-Centric Case Study
      • Closing Remarks, Resources and Q&A
  5. 5. Data-Centric Development Defined
  6. 6. Data-Centric vs. Process-Centric Projects
      Diagram: Types of Computerized Systems (labels: Computerized Systems, Control Systems, Information Systems, Process-Centric Systems, Data-Centric Systems, Process-centric Projects, Data-centric Projects)
      • Distinction between data-centric and process-centric projects
      • Process-centric: traditional projects for computerized systems that automate a process, where data is a by-product
      • Data-centric: focused on building a system purely to manage data, not to automate a business process
      • Some overlap: data-centric projects will involve automation (e.g., ETL) and process-centric projects will involve data (e.g., data used for measuring process efficiency)
  7. 7. What is a Data-Centric Development Project
      • Data-Centric Development Project: derives value from pre-existing data, or curates data so that completely separate areas of the enterprise can derive value from it.
        − Example: a Data Warehouse, which starts with pre-existing production data
        − Example: a Customer Master Data Management (MDM) hub, which other applications use to get “golden records” for customers
        − Example: Big Data projects for analytics
      • Process-Centric Development Project: automates some aspect of the enterprise; often a manual process that is automated, or an existing automated process that is upgraded (no focus on the data).
        − Example: Point-of-sale system
        − Example: Payroll system
        − Example: Medical billing system
      • There is always overlap: process comes into data-centric projects, and data always exists in process-centric projects. It is the overall focus that is different.
  8. 8. Fundamental Classes of Data-Centric Projects
      • Project types that are fundamentally data-centric:
        − Data Warehouses
        − Data Marts
        − Operational Data Stores (ODSs)
        − Data Lakes
        − Reference Data Management (RDM)
        − Master Data Management (MDM)
  9. 9. Projects Have Been Run the Same Way for Many Years
      Timeline diagram (1960s to 2010s) with rows for Technology and Business Use Cases (labels: Mainframes, PCs, Package Implementation, Distributed Computing, Internet, Cloud, Manual Process Automation, Data Warehouses / BI, MDM, Big Data)
      • Over more than 50 years, different technologies have answered different use cases
      • General move from process-centricity to data-centricity
      • The Systems Development Life Cycle (SDLC) is a 1960s-era methodology
      • The SDLC is still almost universally used, including in Agile
  10. 10. The Focus on Programming in Agile Development
  11. 11. Waterfall SDLC vs. Data-Centric Projects
      Diagram: Waterfall Systems Development Life Cycle (SDLC), with phase labels Requirements Analysis Design Development Quality Assurance Production Post-Production
      1. Waterfall presumes there is a process to be automated. But in a data-centric project, the starting point is existing production data.
      2. Business Analysts expect users to state requirements. But users never understand the data at the outset.
      3. Waterfall is linear. But with data-centric projects, there are true cycles of iteration as understanding of the source data evolves.
      4. The Waterfall QA phase only tests functionality, not data. But with data-centric projects, data quality needs to be tested.
      • And there are many other mismatches. Waterfall SDLC is a development project methodology created in the mid-1960s for process-centric projects. It is highly embedded in IT, but not well aligned to data-centric projects.
  12. 12. Agile
      Diagram: three Sprints, each repeating the SDLC phase labels (Requirements Analysis Design Development Quality Assurance Production Post-Production), together with Epics, User Stories, Backlog Management and Project Increments
      • Agile is more popular today, but retains aspects of Waterfall and has no particular data-centric aspects.
  13. 13. What Can Go Wrong with Data-Centric Projects?
      Illustration: a Data Warehouse Development Project on a timeline from January to July, followed by a Data Extract from the Data Warehouse 3 Years Later
      • Business Analyst: “Tell me your requirements!”
      • Business User: “They’re kind of like this…”
      • Business User: “Hey, this report doesn’t make any sense to me!”
      • Business Analyst: “It’s your problem because your requirements were bad.”
      • Data Scientist 1: “I just don’t understand this data.”
      • Data Scientist 2: “I have no clue who to even ask.”
      • These problems reflect the way that development projects are managed: the traditional SDLC.
  14. 14. How to Include a Data Focus in Agile Development
  15. 15. Mismatch of Traditional Project Methodology to Data-Centric
      1. Agile / Waterfall: Oriented to automating an unautomated or partially automated process.
         What Data-Centric Projects Need: Oriented to getting value out of existing production data or curating data for other processes to use.
      2. Agile / Waterfall: Users are able to articulate processes reasonably well for requirements.
         What Data-Centric Projects Need: Users typically do not know the details of the source data and may not even know where it is, or if it exists.
      3. Agile / Waterfall: Testing focuses on whether the functionality matches requirements.
         What Data-Centric Projects Need: Testing focuses on data quality of source data and data produced by transformations, calculations and derivations.
      4. Agile / Waterfall: No testing artifacts are carried over into Production.
         What Data-Centric Projects Need: Data quality rules developed in testing are put into Production for continuous data quality monitoring.
      5. Agile / Waterfall: Knowledge gained during the project is used only for development activities within the project.
         What Data-Centric Projects Need: Knowledge gained during the project is part of what is delivered and is used later for developing reports after Production implementation.
      6. Agile / Waterfall: Legal questions about processes are rare.
         What Data-Centric Projects Need: Legal, privacy and compliance concerns exist both for the curation and permitted business use of data.
      7. Agile / Waterfall: Stakeholders are predominantly the business users who will benefit from the functionality.
         What Data-Centric Projects Need: Stakeholders also include representatives from business areas who will use the data outside of the application, or may do so in the future, e.g. data scientists.
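The idea on this slide that data quality rules written for testing are carried into Production for continuous monitoring can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `QualityRule` type, the `count_failures` helper, and the field names `customer_id` and `birth_date` are hypothetical and do not come from the webinar.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class QualityRule:
    """A named data quality check (hypothetical type, for illustration only)."""
    name: str
    check: Callable[[dict], bool]


def is_iso_date(value: object) -> bool:
    """Return True if value is a string that parses as an ISO calendar date."""
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


# Rules defined once; the field names are assumptions, not from the slides.
RULES: List[QualityRule] = [
    QualityRule("customer_id_present", lambda r: bool(r.get("customer_id"))),
    QualityRule("birth_date_is_iso", lambda r: is_iso_date(r.get("birth_date"))),
]


def count_failures(records: Iterable[dict],
                   rules: Iterable[QualityRule]) -> Dict[str, int]:
    """Count, per rule, how many records fail it.

    The same function can be called from a test suite during development
    and from a scheduled monitoring job after go-live.
    """
    rule_list = list(rules)
    failures = {rule.name: 0 for rule in rule_list}
    for record in records:
        for rule in rule_list:
            if not rule.check(record):
                failures[rule.name] += 1
    return failures
```

Because the rules are plain data, the list built during testing can be handed unchanged to a production monitoring job, which is the kind of carry-over the slide describes.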
  16. 16. The Focus on Programming in Big Data
  17. 17. Data Problems in Big Data Environments
      Diagram: Sources A to E feed, through INGESTION, into the Big Data Environment (Data Lake); source types shown: Emails, Documents, Web Pages, XML, Relational, Flat Files, Audio, Image, Video
      • The technology and processes to get data into a Big Data Environment are relatively simple
      • But there are huge challenges with understanding the source data
  18. 18. Columnar Databases in Big Data Environments
      • Columnar Databases are used a lot in Big Data
      • They have to be designed around the queries that will be run against them, and they have to house the data that comes into them from the sources
      • Thus Target design and Source data analysis are huge issues
      Diagram: Structure of a record in HBase: rowID, Column Family, Column Qualifier, “Timestamp”, Payload
        − Example rowID: Doe|1968-11-04|John
        − Examples of Column Family: “CUSTOMER”, “EMPLOYEE”, “PURCHASER” …and hundreds more…
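The record structure named on this slide (rowID, Column Family, Column Qualifier, Timestamp, Payload) can be modeled as a small in-memory sketch. It does not use any HBase client library; the `Cell` type, the helper names, and the qualifier names used in the usage note are assumptions made for illustration. Only the row key format `Doe|1968-11-04|John` and the column family names come from the slide.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Cell(NamedTuple):
    """One versioned value, addressed the way the slide describes."""
    row_id: str
    column_family: str
    column_qualifier: str
    timestamp: int
    payload: bytes


def make_row_id(last_name: str, birth_date: str, first_name: str) -> str:
    """Build a composite row key in the slide's 'Doe|1968-11-04|John' style."""
    return "|".join([last_name, birth_date, first_name])


ColumnKey = Tuple[str, str, str]


def latest_by_column(cells: Iterable[Cell]) -> Dict[ColumnKey, Cell]:
    """Keep only the newest cell for each (row, family, qualifier) address."""
    latest: Dict[ColumnKey, Cell] = {}
    for cell in cells:
        key = (cell.row_id, cell.column_family, cell.column_qualifier)
        if key not in latest or cell.timestamp > latest[key].timestamp:
            latest[key] = cell
    return latest
```

For example, two cells written to `("Doe|1968-11-04|John", "CUSTOMER", "email")` with timestamps 1 and 2 collapse to the cell with timestamp 2, while a cell under the `"EMPLOYEE"` family stays separate. The sketch shows why target design matters: the row key and column families must be chosen up front to match how the data will be read.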
  19. 19. The Data-Centric Development Life Cycle
  20. 20. Introducing the Data-Centric Development Life Cycle (DCLC)
      First San Francisco Partners’ DCLC:
      • Recognizes specific activities needed for a data-centric project instead of abstracting them into over-generalizations like “analysis.”
      • Provides for real iterations that lead to refinement of information requirements, instead of a single requirements activity.
      • Understands some activities can be carried out in parallel, instead of the SDLC and Agile’s linear flow.
  21. 21. Some Elements of DCLC are Needed for All Projects
      Diagram: a spectrum from 100% Process-centric to 100% Data-centric
      • The full DCLC is appropriate for projects that are heavily data-centric.
      • However, even projects that are overwhelmingly process-centric can benefit from some elements of the DCLC.
      • This is because process-centric projects will be creating data that may be used in the future in some analytics environment (that may not even exist yet).
  22. 22. Questions to Ask About Process-Centric Projects
      • Will the data be used for analysis outside of the system that is being built?
        − e.g., Will it be fed into a Data Warehouse or Data Lake?
        − e.g., Will it be sold?
      • Are there stakeholders in the data that the system will produce who are not the business sponsors or controlled by the business sponsors?
        − e.g., Data Scientists or Marketing
      • Are any data feeds needed as inputs?
        − Versus only data entry
      • Does data quality matter to the business sponsors of the project?
        − Are they bringing this concern to the project, rather than mildly agreeing with outside suggestions?
      Any significant “yes” responses mean you should consider deploying elements of the DCLC on the project.
  23. 23. Using Conceptual Data Modeling to Make Development Data-Centric
  24. 24. What is Semantics
      SEMANTICS
      • Terms: Terminology, i.e. what words are used in what contexts (subject fields) by what communities. Needs some level of Ontology for contexts. Not done much in USA.
      • Concepts: Definitions and allied metadata. Requires some level of Ontology to make distinctions clear. Often not done in finance for Reference Data (Codes). Often done poorly.
      • Classifications, Taxonomies: Groupings of concepts based on common characters, or a particular management need. The problem of how to actually do classification is often not addressed.
      • Hierarchies: Systems of relations between individuals, not concepts. Problem of mixing different relation types within the hierarchy, e.g. Legal vs Risk vs Sales.
      • Rules: Calculations, Derivations, Constraints among concepts. These are not Definitions, but are often confused with them.
      • Ontologies: A particular view of financial reality, composed of all the other items listed here plus more relations. A model of business information without any thought as to how it will be stored as data.
      All this plays a role in the early stages of the Data-Centric Development Life Cycle.
  25. 25. Subject-Area Models: The highest level of conceptual model, but very useful in Data Discovery
  26. 26. Data-Centric Case Study
  27. 27. High-Level Overview
      • Shared with enterprise leadership how data management and FSFP’s Data-Centric Development Life Cycle methodology could positively impact a major Data Warehouse project and fill a critical project gap without causing extra work
      • Used momentum and resources of project to advance maturity of Data Governance practices
  28. 28. How We Got Traction
      • DCLC tied directly into project deliverables
      • Momentum coming from project deadlines
      • Integration of clear governance goals into tactical deliverables
      • Active participation of Data Governance manager
      • Inclusion of data analysis in requirements gathering
  29. 29. Results
      • Clear project data requirements that will enable re-use of data in new reporting environment
      • Cross-functional agreement on data definitions and concepts
      • Business glossary ready for go-live
      • Clear business ownership of data
      • Data Governance team positioned for success
  30. 30. Closing Remarks, Resources and Q&A
  31. 31. Webinar Takeaways and Resources
      • Takeaways
        − Identify Data-centric projects vs Process-centric ones
        − Consider taking a Data-centric approach for Data-centric projects.
      • Suggested resources:
        − DCLC articles on the FSFP blog
        − DCLC two-page overview
  32. 32. Questions?
  33. 33. Thank you! Please join us Thursday, July 6 for the “Governing Quality Analytics” webinar. Malcolm Chisholm @MDChisholm John Ladley @jladley