Democratizing Data

We present our solution for building an AI Architecture that provides engineering teams the ability to leverage data to drive insight and help our customers solve their problems. We started with siloed data, entities that were described differently by each product, different formats, complicated security and access schemes, data spread over numerous locations and systems.

  1. Democratizing Data: Architecting Terabyte Common Data Models and Configuration Driven Pipelines for AI Platforms. Cindy Mottershead, AI Architect, Blackbaud; Shiran Algai, Senior Manager of Software Development, Blackbaud
  2. Agenda. Shiran Algai: ▪ Problem Statement ▪ Architecture Journey. Cindy Mottershead: ▪ Architecture decisions ▪ Common Data Model ▪ Configuration Driven Pipeline ▪ Transformation building blocks ▪ AI Feedback Loop
  3. We are the world’s leading cloud software company powering social good. Millions of users in over 100 countries. The world’s 18th largest SaaS applications provider*. Fortune 56 Companies Changing the World* (*2017)
  4. Problem ▪ Data is very siloed ▪ Similar entities are described entirely differently by every product ▪ Bringing on new sources continues to compound the issue ▪ Data is frequently entered slightly differently for the same entity ▪ Engineering teams are unable to leverage data to drive insight and help our customers solve their problems ▪ AI ETL cycle far too long ▪ AI ability to explore data extremely limited
  5. First Steps ▪ Had the beginnings of a few data lake projects, but scattered a bit throughout the organization ▪ Built consensus and momentum toward a common delta lake ▪ Started on MS tooling in Azure (Data Factory, U-SQL run by Data Lake Analytics jobs, etc.) ▪ Leveraged as many Azure PaaS tools as possible ▪ Batch only ▪ Picked a small "bore hole through the mountain" approach
  6. First Steps
  7. Pivoting ▪ Painful adding new readers for different sources not natively supported (Avro, Parquet) ▪ Gaps in Azure data tooling for our specific use cases ▪ Desire for batch AND streaming through a similar path ▪ Need the ability to compact records, recreating legacy datasets in the platform ▪ Ability to hire data engineers in the market easily
  8. Solution
  9. Data Platform Ecosystem ▪ Delta Lake ▪ Azure Data Lake Store ▪ Data Catalog Service ▪ Lake Authorization Service ▪ Ingestion Service ▪ Output Service ▪ Async Messaging Contract Broker Service
  10. (Architecture diagram) Service A, Service B, plus 82 more services publish to Service Bus topics. The Async Contract Broker Service stores message schemas and prevents breaking schema changes; the Data Catalog uses the ACB as a source for new catalog entries. The Ingestion Service automatically subscribes to new and existing topics and lands data in the lake: Staging Zone, Raw Zone (compacted daily), and Trusted Zone (CDM tables).
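The ingestion path lends itself to a short illustration. Below is a minimal sketch, not the actual Ingestion Service, of a consumer that reads one batch of messages from a Service Bus topic subscription and appends the raw payloads to a staging-zone Delta table. The connection string, topic, subscription, and lake path are all placeholders, and a Spark session with Delta Lake support is assumed.

```python
import json

from azure.servicebus import ServiceBusClient
from pyspark.sql import SparkSession

# Placeholder names -- the real service discovers topics via the contract broker.
CONN_STR = "<service-bus-connection-string>"
TOPIC, SUBSCRIPTION = "constituent-updated", "lake-ingestion"
STAGING_PATH = "abfss://lake@<account>.dfs.core.windows.net/staging/constituent-updated"

spark = SparkSession.builder.getOrCreate()

with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_subscription_receiver(topic_name=TOPIC,
                                          subscription_name=SUBSCRIPTION) as receiver:
        # Pull one small batch; the production service runs continuously.
        messages = receiver.receive_messages(max_message_count=100, max_wait_time=30)
        rows = [json.loads(str(msg)) for msg in messages]
        if rows:
            # Land the raw payloads in the staging zone as a Delta append.
            (spark.createDataFrame(rows)
                  .write.format("delta").mode("append").save(STAGING_PATH))
        for msg in messages:
            receiver.complete_message(msg)
```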
  11. Common Data Models ▪ Downstream services and Data Scientists all leverage the same common models, accelerating development ▪ Common defined structure ▪ Consistent naming of tables, structures, and fields ▪ Consistent across all applications and application types ▪ Manage multiple data sources ▪ Remove complexities and specifics of source systems ▪ Shows the data “as is” (natural values) ▪ Provides common groupings and coding of data values (derived values) ▪ Integrated with value-added services
  12. CDM_Person
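The deck does not list the columns of CDM_Person, so the following is only an illustrative sketch of how a common-model entity can be declared as a Delta table with consistent naming, natural ("as is") values, and derived groupings. Every field name here is hypothetical, not the actual Blackbaud model.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS cdm")

# Illustrative CDM_Person definition -- field names are hypothetical.
spark.sql("""
    CREATE TABLE IF NOT EXISTS cdm.cdm_person (
        common_id        STRING    COMMENT 'common id shared across source systems',
        source_system    STRING    COMMENT 'which product the record came from',
        first_name       STRING,
        last_name        STRING,
        birth_date       DATE      COMMENT 'natural value',
        age_band         STRING    COMMENT 'derived grouping, e.g. 35-44',
        last_updated_utc TIMESTAMP
    )
    USING DELTA
""")
```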
  13. Common Data Model Input ▪ Thousands of relational tables ▪ CSV, JSON, Parquet, Avro, etc. formatted input files ▪ Normalized and denormalized input ▪ Nested objects ▪ SQL Server, MariaDB, Oracle, flat files ▪ Change events
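A rough sketch of how these heterogeneous inputs can all be brought into Spark DataFrames before being mapped onto the common model. Paths, hosts, and table names are placeholders; the Avro reader additionally assumes the spark-avro package is available.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each source lands in its native format; everything becomes a DataFrame
# before being mapped onto the common data model.
csv_df     = spark.read.option("header", "true").csv("/landing/product_a/*.csv")
json_df    = spark.read.json("/landing/product_b/*.json")            # handles nested objects
parquet_df = spark.read.parquet("/landing/product_c/")
avro_df    = spark.read.format("avro").load("/landing/product_d/")   # needs spark-avro
sql_df     = (spark.read.format("jdbc")                              # SQL Server, MariaDB, Oracle
              .option("url", "jdbc:sqlserver://<host>;databaseName=<db>")
              .option("dbtable", "dbo.constituents")
              .load())
```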
  14. Configuration Driven Pipeline ▪ Common Id ▪ Metadata Map ▪ Pipeline ▪ Transformations
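The deck does not show the actual configuration format, so as a sketch only: a metadata map might name the source, the common id strategy, the target CDM table, the output frequency, and an ordered list of transformation steps that a generic runner interprets. Keys, step names, and columns below are all hypothetical.

```python
# Hypothetical metadata map -- the pipeline is entirely described by configuration.
pipeline_config = {
    "source": "product_a.constituents",
    "common_id": {"column": "constituent_id", "strategy": "hash"},
    "target": "cdm.cdm_person",
    "frequency": "batch",                      # or "streaming"
    "transformations": [
        {"type": "filter",     "expr": "is_deleted = false"},
        {"type": "one_to_one", "mapping": {"first_name": "fname",
                                           "last_name":  "lname"}},
    ],
}

def apply_step(df, step):
    """Dispatch one configured step; lookups, unpivots, etc. plug in the same way."""
    if step["type"] == "filter":
        return df.filter(step["expr"])
    if step["type"] == "one_to_one":
        for target, source in step["mapping"].items():
            df = df.withColumnRenamed(source, target)
        return df
    raise ValueError(f"unknown step type: {step['type']}")

def run_pipeline(config, spark):
    """Generic runner: read the source, apply configured steps, append to the CDM table."""
    df = spark.table(config["source"])
    for step in config["transformations"]:
        df = apply_step(df, step)
    df.write.format("delta").mode("append").saveAsTable(config["target"])
```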
  15. Transformation building blocks: a) Filters b) View c) One to One (with SQL transform, with Lookup) d) One row to Many Rows (unpivot) e) Many rows to array in one column f) Aggregations
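Illustrative PySpark equivalents for a few of these building blocks; the table and column names are hypothetical, not from the deck.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
gifts = spark.table("cdm.cdm_gift")        # hypothetical CDM table

# a) Filter
active = gifts.filter(F.col("status") == "active")

# d) One row to many rows (unpivot): spread per-channel amount columns into rows
by_channel = gifts.select(
    "common_id",
    F.expr("stack(3, 'mail', mail_amt, 'web', web_amt, 'event', event_amt) "
           "AS (channel, amount)"),
)

# e) Many rows to an array in one column
gift_ids = gifts.groupBy("common_id").agg(F.collect_list("gift_id").alias("gift_ids"))

# f) Aggregation
totals = gifts.groupBy("common_id").agg(F.sum("amount").alias("total_given"))
```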
  16. ML Feedback Loops: full cycle of model deployment, tying actions taken back into the model ▪ Provide full cycle of data from presentation, user interaction, and results ▪ Allows monitoring and tuning of ML models ▪ Provides metrics for roadmap prioritization ▪ Provides metrics for A/B testing
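As a sketch of what such a feedback record might carry (the field names are hypothetical, not from the deck), each served prediction can be logged together with the user interaction and eventual result, then appended to the lake so models can be monitored, tuned, prioritized on the roadmap, and compared in A/B tests.

```python
import datetime
import json
import uuid

def feedback_event(model_id, model_version, prediction, user_action, outcome, variant):
    """Hypothetical feedback record tying a served prediction to what happened next."""
    return {
        "event_id": str(uuid.uuid4()),
        "logged_at_utc": datetime.datetime.utcnow().isoformat(),
        "model_id": model_id,
        "model_version": model_version,   # which deployment produced the prediction
        "prediction": prediction,         # what was presented to the user
        "user_action": user_action,       # how the user interacted with it
        "outcome": outcome,               # eventual result, once known
        "ab_variant": variant,            # supports A/B testing metrics
    }

# Records like this would be appended to a Delta table that training jobs read back.
print(json.dumps(feedback_event("donor-propensity", "1.4.2",
                                {"score": 0.87}, "emailed", "gift_received", "B")))
```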
  17. Tying It All Together ▪ Data flows from various products ▪ Ingested ▪ Transformed via Configuration Driven Pipelines ▪ One Common Data Model ▪ Data flows out of common data models back into the ecosystem ▪ Baked-in feedback loops
  18. Democratized Data ▪ Data Scientists can access data directly from the CDM ▪ The CDM is a Delta table ▪ Views are created for security access (no access to PII) ▪ Access is controlled at the view level ▪ Data is projected (using schema on read) to any destination location (blob, SQL Server, Cosmos, etc.) ▪ Data Scientists and Engineers can request any dataset they need by specifying metadata ▪ Requested data is transformed based on the metadata description ▪ Data is streamed or batched out to the destination based on metadata frequency info
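A minimal sketch of the view-level access pattern, assuming Databricks-style table access control and hypothetical names: the view exposes only non-PII columns of the CDM Delta table, SELECT is granted on the view rather than the underlying table, and a schema-on-read projection writes the requested dataset out to a destination.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Expose only non-PII columns of the CDM table through a view.
spark.sql("""
    CREATE OR REPLACE VIEW cdm.cdm_person_no_pii AS
    SELECT common_id, source_system, age_band, last_updated_utc
    FROM cdm.cdm_person
""")

# With table access control enabled, read access is granted at the view level only.
spark.sql("GRANT SELECT ON VIEW cdm.cdm_person_no_pii TO `data-scientists`")

# Schema-on-read projection to a requested destination (here, Parquet in blob storage).
(spark.table("cdm.cdm_person_no_pii")
      .write.mode("overwrite")
      .parquet("abfss://exports@<account>.dfs.core.windows.net/person_no_pii/"))
```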
  19. Feedback. Your feedback is important to us. Don’t forget to rate and review the sessions.
