Azure BI Cloud Architectural Guidelines

This document provides guidelines for building cloud BI project architectures. It discusses considerations for architectural design such as data sources, volumes, model complexity and sharing needs. It then presents four common architecture templates - Hulk, Iron Man, Thor and Hawkeye - tailored to different needs around reporting demand, data volume and complexity. Key aspects of architectures like sources, transportation, processing, storage, live calculation, data access and orchestration are examined. Finally, it compares features of technologies that can fulfill different functional roles.
2. Executive summary
This document is intended to provide guidelines for building architectures on cloud BI projects.
Considerations
To define an architecture for your project, we suggest you look at these criteria:
- Source: where your data is located
- ETL complexity: what kind of business rules and transformations you need to support
- Data volumes: the sheer size of the data
- Model complexity: the business problem you are representing and the kind of KPIs you'll need to support
- Sharing needs: whether the data is only used for this project or if it needs to integrate with core data assets
- Reporting demand: the expected rendering speed, and for how many users
Templates
We’ve defined four templates based on common needs patterns which can be reused as-is or slightly modified to suit your particular case.
- Hulk: when you need pure muscle-power (ex: CDB Reporting, Datalens)
- Iron Man: for simple reporting over large data (ex: Radarly, Digital Dashboard)
- Thor: for complex reporting over light data (ex: Budgit)
- Hawkeye: quick and agile projects for POCs or short-lived needs (ex: Weezevent)
3. Architectural considerations
The criteria you should consider when planning an architecture

Sources
Cloud data sources are simple to capture. On-premises data sources can imply a form of gateway (or IR), a push from the local infrastructure, or VPN access linking cloud resources to local networks.

Data volumes
Small data volumes can generally be processed in memory all at once and fit within the 1GB data limitation in Power BI. Medium data volumes can be processed with a single machine, whereas large data volumes require cluster-based, parallelized processing.

Data interests
Local data interests can be managed in a fully autonomous way, isolated from other projects and stakeholders. Global data interests intend to have their results reused by other projects and teams. As such, these projects have more complex integration phases and more advanced security features to manage them.
4. Architectural considerations
ETL complexity
Simple ETL involves only light transformations and data type casting. Medium ETL transforms an incoming landing model into a fully-fledged star schema. Complex ETL involves proactive data quality management, advanced dimensional models and/or intricate business rules.

Model complexity
Simple models use additive measures over a single star schema or a flat dataset. Medium models include advanced DAX with semi-additive measures and/or calculations over multiple star schemas. Complex models require performance-hindering features such as row-level security, bi-directional cross-filtering or very advanced DAX calculations.

Reporting demand
Low demand means it is acceptable to have longer response times (5-15s). Medium demand requires snappy response times (<100ms) for a small number of concurrent users. High demand involves snappy response times for a large number of concurrent users.
5. Functional phases
An architecture is divided into functional workloads. A single technology can support multiple workloads, and a single workload can sometimes be shared between different technologies.
- Sources: where the original data lives
- Transporting: what moves the data from the source to the platform
- Orchestrating: what coordinates the different services
- Processing: what cleans and transforms the data from its raw state to its usable form
- Storage: where data lives in its cold form
- Live calculation: where reporting calculations are made for the end users
- Data access: how the end users access the data
6. Sources
Sources come in two main categories: cloud sources and on-premises sources. Generally speaking, cloud sources are relatively simple to manage, whereas on-premises sources have to deal with the added complexity of networking.

Cloud: in this category we find object storage like AWS S3 or Azure Blob, API calls, and user documents (ex: Excel files) stored on SharePoint Online.
- Object storage is straightforward and is handled with an ID/secret mechanism (see the sketch after this list).
- API calls can be a bit more complex, especially depending on the authentication mechanism, but often offer a good amount of flexibility in what is returned. They are often capped in terms of data size per call and require more custom logic to handle.
- Documents stored on SPO allow users to give direct input into the solution but come with the perils of poorly formed Excel files. Whenever possible, we recommend capturing user inputs through a small web application or a PowerApp.

On-premises: these sources are highly valuable (they often form the core of information systems) but can be tricky to access from a cloud service. A few options are available to handle this situation:
- Joining the cloud resource to the internal network through a VPN
- Exposing part of the source (or extracts) in a DMZ; this may not be possible if the data is sensitive
- Having an on-premises ETL push the data to the cloud rather than having cloud services fetch the data
- Using a gateway like Azure Data Factory's Integration Runtime to act as a bridge between the on-premises resources and the cloud service. This tends to be the easiest scenario.
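To make the ID/secret mechanism concrete, here is a minimal Python sketch of reading a file from Azure Blob Storage with a service principal. It assumes the azure-identity and azure-storage-blob packages; the tenant, application, storage account, container and blob names are placeholders, not values from this document.

# Minimal sketch: read one blob using an ID/secret (service principal).
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<app-id>",          # the "ID" of the ID/secret pair
    client_secret="<app-secret>",  # the "secret"
)

service = BlobServiceClient(
    account_url="https://<storageaccount>.blob.core.windows.net",
    credential=credential,
)

container = service.get_container_client("landing")
raw_bytes = container.download_blob("sales/2019/01/extract.csv").readall()
print(f"Downloaded {len(raw_bytes)} bytes")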
7. Transportation
Transportation refers to the Extract-and-Load workloads (without the transformations). Throughput, connectivity, parametrization and monitoring are the key aspects in choosing the right transportation solution.

Power BI Dataflows: when data volumes are small (less than 1GB), transformations are simple and the end destination is solely meant to be used in Power BI, its Dataflow/Power Query engine can be used. It supports a large array of connectors and decent parametrization possibilities.

Python: ideally run in a serverless environment (AWS SageMaker, Airflow or Azure Functions), managed code can adapt to a wide variety of data sources, shapes and destinations, provided you have the skill and time to code the E-L solution (a sketch follows at the end of this slide). This is more adapted to small-to-medium data volumes since the code is usually limited to a single machine (and often to a single CPU thread). It is fully DevOps compatible.

Azure Data Factory: the go-to solution for E-L workloads in Azure, ADF is capable of handling large workloads with excellent throughput (especially if landing to ADLS) and is fully DevOps compatible. ADF's main downside is that while it can read from many different sources, it typically writes only to Microsoft destinations (with a few exceptions).
Our verdict: Power BI Dataflows for self-service projects; Azure Data Factory as the go-to solution, even more so with on-premises sources; Python as your can opener for complex files and API calls.
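As an illustration of such a managed-code E-L job, here is a minimal Python sketch that pulls a paged JSON API and lands the raw result in Azure Blob Storage. The endpoint, paging parameters and container/blob names are hypothetical, and the requests and azure-storage-blob packages are assumed.

# Minimal E-L sketch: extract all pages from a (hypothetical) API, land as JSON.
import json
import requests
from azure.storage.blob import BlobServiceClient

API_URL = "https://api.example.com/v1/orders"   # hypothetical source
PAGE_SIZE = 500                                  # many APIs cap rows per call

def extract_all_pages():
    """Pull every page of the API until an empty page is returned."""
    page, rows = 1, []
    while True:
        resp = requests.get(API_URL, params={"page": page, "per_page": PAGE_SIZE}, timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return rows
        rows.extend(batch)
        page += 1

def load_to_blob(rows, conn_str, container="landing", blob_name="orders/extract.json"):
    """Land the raw extract as a single JSON blob (pure E-L, no transformation)."""
    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container=container, blob=blob_name)
    blob.upload_blob(json.dumps(rows), overwrite=True)

if __name__ == "__main__":
    data = extract_all_pages()
    load_to_blob(data, conn_str="<storage-connection-string>")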
8. Processing
The processing layer is where data quality and business logic are applied. Some projects require very light transformations, whereas others completely change the model and apply complex business logic.

PySpark: a managed code framework that can scale very well to big data scenarios (a sketch follows after these descriptions). Spark-based solutions can be easily implemented in a PaaS format through Databricks, with fully managed notebooks, simple industrialization of code and good monitoring capabilities. For maintainability and support, we recommend using PySpark as the main Spark language.

SQL procedures: transformations can take place within the database itself, limiting data movement. Azure SQL DB supports a DevOps-compatible, fully-fledged language (T-SQL) and is suitable for small-to-medium data volumes. Snowflake's SQL programmatic objects are less developed, but the platform can handle very large data volumes. Azure SQL DWH offers big-data levels of volume using SQL stored procedures but needs to be managed more actively (manual cluster starting/scaling/stopping).

Azure Data Factory: ADF offers a GUI-based dataflow engine powered by Spark. It can handle very large data volumes but may be somewhat more limited for very complex transformations. Simple-to-medium ETL complexity can be handled without problems.

Python: Python's processing profile is similar to its transportation workload: very flexible but potentially longer to develop, and best used for smaller data volumes.

PBI Dataflow: Power BI offers a simple GUI-based (codeless) interface for light ETL workloads. It can be used to develop simple transformations over low data volumes very quickly and has a large array of source connectors. Its output is limited to Power BI and the Common Data Model in ADLS.
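To make this concrete, here is a minimal PySpark sketch of a light processing step as it could run in a Databricks notebook, assuming a landing zone of Parquet files; the paths, column names and business rule are illustrative only.

# Minimal PySpark sketch: cast types, apply a simple rule, write a curated table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("curate-sales").getOrCreate()

raw = spark.read.parquet("/mnt/landing/sales/")   # hypothetical landing zone

curated = (
    raw
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))  # type casting
    .withColumn("net_sales", F.col("gross_sales") - F.col("tax"))     # simple business rule
    .dropDuplicates(["order_id"])                                     # light data quality
    .filter(F.col("order_id").isNotNull())
)

# Write the usable form to the storage layer for the model to consume.
curated.write.mode("overwrite").parquet("/mnt/curated/sales/")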
Our verdict: PBI Dataflow for self-service projects; ADF for point-and-click ETL in the cloud; Python for complex flows and ML; SQL procedures for SQL professionals; PySpark for big data projects.
9. ETL key feature comparisons
Here are the key differentiators for typical ETL stacks. The options compared are PBI, PySpark (Databricks), ADF + Dataflow, ADF + SQL Procs (Azure SQL DB) and Python (Airflow), rated across the following criteria:
- Cloud sources
- On-premises sources
- Handling of semi-structured data
- Handling of Excel files
- Destinations
- Data volume
- Transformation capabilities
- Machine learning capabilities
- CI/CD capabilities
- Alert and monitoring
- Ease of development
10. Storage
The storage layer is where cold data rests. Its main concerns are data throughput (in and out), data management
capabilities (RLS, data masking, Active Directory authentication, etc.) and DevOps compatibility.
Azure Data Lake Storage: ADLS is an object-based storage system with an HDFS-compatible interface. It has excellent throughput but is
limited to file-level security (no row-level or column-level security). As such, it is best used for massive import/export or
where 100% of a file needs to be read.
Snowflake: soon to be available globally on Azure, Snowflake is a true cloud data warehouse. As a pure storage layer,
it doesn't have the same data management or DevOps capabilities as Azure SQL DB, but it supports an impressive
per-project compute costing model over the same data. Its throughput is similar to Azure SQL DWH.
Azure SQL DB: Azure's main fully-featured DBMS, SQL DB has excellent data management capabilities and Active
Directory support. It supports a declarative development approach, which offers many DevOps opportunities.
However, its throughput is not great and massive data volumes should be loaded using an incremental approach
for best performance. When hosting medium data volumes or more, consider an S3 service tier or above to
access advanced features like columnstore storage.
Azure SQL DWH: a massively parallel processing (MPP) version of Azure SQL DB. By default, data is sharded across
60 distributions, themselves spread between 1 and 60 nodes, which offers very high throughput when required. Azure SQL
DWH supports programmatic objects such as stored procedures, but with a slightly different writing style to Azure SQL
DB in order to take advantage of its MPP capabilities.
BEST FOR: ADLS for landing native files; Azure SQL DB for small structured data; Azure SQL DWH for big data ETL; Snowflake for big data ad hoc use.
11. Database key feature comparisons
The cloud now offers multiple solutions for hosting relational data. While these have more or less feature parity on all core functionalities (they can all perform relatively well), key differentiating features do exist between them. The comparison matrix rates Snowflake, Azure SQL DB and Azure SQL DWH on: scale time, compute/storage isolation, semi-structured data, PBI integration, Azure Active Directory integration, DevOps & CI/CD support, temporal tables, data cloning, cost when used as an ETL engine, cost when used as a reporting engine, ease of budget forecasting, and DB programming.
12. Live calculation
Reporting calculation engines perform the real-time data crunching when users consume reports. This tends to be
a high-demand, 24/7 service due to the group's international nature. Model complexity and reporting demands
are key drivers when choosing an appropriate technology for this layer.
Power BI models: if data volumes are small (less than 1 GB of compressed data), Power BI native models,
especially when used on a Premium capacity, give the most fully-featured capabilities for this workload on
Power BI. Their only real drawback is the lack of a real developer experience, such as source control or CI/CD
capabilities. Premium workload data size limits will soon be increased to 10 GB… for a fee.
Composite Models: Composite Models allow Power BI to natively keep some of its data (like dimensions or
aggregated tables) in-memory and the rest in DirectQuery mode. This reduces the need for computation at the
reporting layer. It may not be adequate for complex models, since the DAX used by DirectQuery still has limitations
and performance may be uneven depending on whether a query can hit an in-memory PBI table or needs to fall
back to DirectQuery data. Composite models are best suited for dashboarding scenarios (limited interactivity) over large data.
Azure Analysis Services: AAS is the scale-out version of Power BI native models. It can perform the same
complex calculations over very large datasets at a cost-efficient level (when compared to Power BI Premium). AAS
has a slower release cycle than Power BI, and thus tends to lack the latest features supported by PBI. It does,
however, offer a real developer experience with source control and development environments.
DirectQuery: often misunderstood, this mode makes every visualization on a report send one (or many) SQL queries
to the underlying source for every change on the page (e.g. a filter being applied). Performance will be slower than in-
memory (although it may still be acceptable depending on data volume and back-end compute power). Other
issues include DAX limitations and some hard limits on the data volume being returned. For these reasons, limit
DirectQuery to exploratory scenarios, near-time scenarios or dashboarding scenarios with limited complexity and
interactivity. Depending on the source, there may also be significant pressure on data gateways.
OUR VERDICT: Power BI models when the data fits; AAS when it doesn't; DirectQuery for near-time scenarios, dashboards and exploration; composite models for simple reports over big data.
13. Live calculation feature comparison
This workload comparison is somewhat more complicated because of the relationship between data volumes and
calculation complexity. Regardless, some general trends can be observed: the matrix rates DirectQuery (on Snowflake), Composite models (on Snowflake), AAS and PBI models (non-Premium) on cost, volume, model and KPI complexity, refreshing, CI/CD, row-level security, data mashing and calculation speed.
14. Data Access
Data access refers to how users are able to get to the data. This can be done through several channels:
Power BI Embedded: preconfigured reports and dashboards for a topic are accessible key-in-hand from the data
portal.
Direct access – data models: if there is a need to create a local version of the report and enhance the model with
additional KPIs, it is possible to connect directly to the data model through Excel or Power BI. This allows the report
maker to start from pre-validated dimensional data and KPIs and focus on his/her own additional KPIs and visuals.
This method, however, doesn't allow the report maker to mash additional data into the current model.
Direct access – curated data: if a BI developer wants to create his/her own model, it is possible to access the
curated dimensional data directly from the Core DB. This significantly lowers the cost of a project by diminishing ETL
costs and tapping into pre-validated dimensional data. KPIs and any additional data will still need to be
developed and tested before use.
Direct access – data lake: a BI developer or a data scientist may wish to have access to raw data files for their
own project. This may or may not be possible depending on security needs, on a per-dataset basis.
USE FOR: Power BI Embedded for key-in-hand reports; direct access to data models to build your own reports; direct access to curated data to import into your own data warehouse; direct access to the data lake for data science datasets.
15. Where to connect
What you get depends on where you connect: the data lake gives you raw data; the Core DB adds curated dimensional data, business rules applied and tested, ETL already done, and one location for multiple data sources; Analysis Services adds KPIs built and tested and a calculation engine for self-service; Power BI adds data visualization based on common needs.
16. Orchestrating
Orchestrating is a key feature of cloud architectures. Not only does the orchestrator need to launch and manage jobs, it should
also be able to interact with the cloud fabric (resource scaling, creation/deletion, etc.). Key features include native
connectivity to various APIs, programmatic flows (ifs, loops, branches, error handling, etc.), trigger-based launching,
DevOps compatibility and GUIs for development and monitoring.
Logic Apps: the de facto orchestration tool in the Azure stack, it includes all the key features required in such a
tool, with a GUI-based experience that can be scaled to a full DevOps pipeline.
Azure Data Factory: ADF includes its own simple scheduling and orchestrating tool. While not as developed as
Logic Apps, it can be a valid choice when the orchestration is limited to simple scheduling, core data activities (ADF
pipelines, Databricks jobs, SQL procs, etc.) and basic REST calls.
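When the overall scheduler sits outside ADF, pipeline runs can also be triggered and monitored programmatically. The sketch below assumes the azure-identity and azure-mgmt-datafactory Python SDKs; the subscription, factory and pipeline names are placeholders.

```python
# Sketch: triggering and polling an ADF pipeline run from code, for cases where
# the scheduler lives outside ADF. Resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-bi"
FACTORY_NAME = "adf-bi"

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Kick off the landing pipeline with a parameter, then read its status.
run = adf.pipelines.create_run(
    RESOURCE_GROUP, FACTORY_NAME, "pl_land_orders",
    parameters={"as_of_date": "2020-01-31"},
)
status = adf.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id).status
print(status)
```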
Airflow: Airflow is a code-oriented scheduler that allows people to create complex workflows with many tasks. It is
mainly used to execute Python or Spark tasks. It is seamlessly integrated in the Data Portal, so it is a good choice if you
keep your data in the data lake and want a single entry point to monitor both your data and your processes.
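As a minimal sketch of what such a code-defined workflow looks like, the DAG below chains two Python tasks; the DAG id, schedule and callables are illustrative, not taken from an actual Data Portal deployment.

```python
# Minimal Airflow sketch: a code-defined workflow chaining an extract task
# and a transform task. DAG id, schedule and callables are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source and land it in the lake")

def transform():
    print("apply business rules and publish to the data mart")

with DAG(
    dag_id="bi_daily_load",
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 4 * * *",   # every day at 04:00
    catchup=False,
) as dag:
    land = PythonOperator(task_id="land_raw_data", python_callable=extract)
    publish = PythonOperator(task_id="publish_mart", python_callable=transform)

    land >> publish                  # explicit task dependency
```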
OUR VERDICT: Logic Apps is the go-to solution in Azure; Airflow for data science projects in AWS; ADF's own scheduler for simple scheduling needs around ADF.
17. Feature comparisons for orchestration
The key differentiators for orchestration solutions are compared across ADF, Airflow and Logic Apps: CI/CD, scheduling and triggering, native connectors, debugging, ease of development, alerting & monitoring, parameterization, control flow, interfacing with on-prem assets, and secrets management.
18. Architecture templates
Whilst not the only considerations to take into account, architectures can be broadly segmented by the volume of data they handle and the complexity of the ETL and model they must support. Based on this, we have defined template architectures to guide you through the design process:
Hulk – when you need pure muscle-power (ex: CDB Reporting, Datalens): any sources, large data volumes, global interest; complex ETL, complex model, high demand.
Thor – simple reporting over large data (ex: Radarly, Digital Dashboard): any sources, large data volumes, global interest; complex ETL, simple model, medium demand.
Iron Man – complex reporting over light data (ex: Budgit): any sources, small data volumes, local interest; complex ETL, complex model, medium demand.
Hawkeye – quick and agile project for POCs or short-lived needs (ex: Weezevent): any sources, small data volumes, local interest; simple ETL, medium model, medium demand.
19. Hulk
When you need pure muscle-power.
Best used for: any sources, large data volumes, global interest; complex ETL, complex model, high demand.
Architecture: sources are landed with ADF (transporting), processed with Databricks (pySpark) or ADF dataflows, stored in ADLS, the shared Core DB and a project data mart, served live by an AAS cube and exposed through Power BI Embedded, with Logic Apps orchestrating the whole pipeline.
20. Hulk – Step 1
Step 1
Data is landed from S3 to ADLS Gen2 via an ADF pipeline. This ensures a fast, bottleneck-free landing phase. Due to
the volume, an incremental loading approach is highly recommended to limit the impact on an on-prem IR gateway
and the throughput to the SQL DB. We could have used ADF scheduling in simple scenarios; however, for
uniformity's sake, and to benefit from extra alerting capabilities, Logic Apps is preferred as the overall scheduler.
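The incremental pattern the ADF pipeline would implement can be illustrated with a simple watermark check; the Python sketch below (using boto3 against the S3 source) only illustrates the logic, with bucket, prefix and watermark storage as placeholders.

```python
# Illustration of the watermark-based incremental pattern: only objects
# modified since the last successful run are selected for landing.
# Bucket, prefix and watermark storage are placeholders.
from datetime import datetime, timezone

import boto3

BUCKET = "source-bucket"
PREFIX = "exports/orders/"

def list_new_objects(last_watermark: datetime) -> list[str]:
    s3 = boto3.client("s3")
    new_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > last_watermark:   # incremental filter
                new_keys.append(obj["Key"])
    return new_keys

if __name__ == "__main__":
    # In practice the watermark is persisted (e.g. in a control table) and
    # advanced only after a successful load.
    watermark = datetime(2020, 1, 30, tzinfo=timezone.utc)
    print(list_new_objects(watermark))
```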
21. Hulk – Step 2
Step 2
Databricks or ADF is used to perform complex ETL over a large dataset by leveraging its Spark SQL engine in Python.
It fetches the landed data from ADLS and enriches it with curated data from the Core DB to perform the complex ETL.
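A sketch of what this enrichment step could look like in PySpark, assuming landed parquet files in ADLS and a dimension table read from the Core DB over JDBC; paths, credentials and table names are illustrative.

```python
# Sketch of the enrichment step: combine landed files from ADLS with curated
# dimensions read from the Core DB, then write the result back to storage.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

landed = spark.read.parquet("abfss://landing@account.dfs.core.windows.net/orders/")

products = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://coredb.database.windows.net;database=core")
    .option("dbtable", "dim.product")
    .option("user", "<user>").option("password", "<password>")
    .load()
)

enriched = (
    landed.join(products, on="product_id", how="left")
          .withColumn("net_amount", F.col("amount") - F.col("discount"))
)

enriched.write.mode("overwrite").parquet(
    "abfss://curated@account.dfs.core.windows.net/enriched_orders/"
)
```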
22. Hulk – Step 3
Step 3
Any group data that can be used in an overall curated dimensional model useful for other projects is pushed back to the
Core DB. This database is thus enriched, project by project, with easy-to-use, vouched-for datasets.
Data that is purely report-specific is pushed to a project data mart.
23. Hulk – Step 4
Step 4
Due to the size of the reporting dataset, the complexity of its model and KPIs, and the expected reporting
performance, an AAS cube is used as the main reporting engine.
The final report is built on top of this cube and exposed in the Data Portal through Power BI Embedded.
24. Hulk – Step 5
Step 5
Subsidiaries that wish to do so can access the AAS cube to build their own custom reports and/or fetch the
dimensional data from the Core DB through custom views, used for security purposes.
26. Thor – Step 1
(Same diagram as Hulk, with composite models replacing the AAS cube for live calculation.)
Step 1
The overall architecture resembles the Hulk scenario due to the volume of data.
Transformations are performed in Databricks or ADF even though the ETL is rather simple, because the data
volumes can overwhelm a single-machine architecture and/or the throughput of the target database.
27. Thor – Step 2
Step 2
Two options are available for the reporting calculations. The simplest is to use an AAS cube; this is potentially more
expensive in terms of software, but simpler in development. The alternative is to use PBI models with aggregations,
if the KPIs are simple enough to hit the aggregations on a regular basis. However, this complicates the data model
(and thus development time), can hurt performance when queries are passed through to the source, and incurs
costs on the data mart layer (using Snowflake because of its per-query costing model).
29. Iron Man
Complex reporting over light data.
Best used for: any sources, medium data volumes, local interest; complex ETL, complex model, medium demand.
Architecture: ADF (dataflow) lands the data, SQL procedures apply the transformations inside the database, a project data mart stores the result, a PBI model handles live calculation and PBI Embedded provides data access, with Logic Apps orchestrating.
30. Iron Man – Step 1
Step 1
This architecture is designed for "tactical projects" where data sharing is not paramount and data volumes are low,
but which may still require a fair amount of business rules and data cleansing.
The low volume means we can write raw data directly to the database, without a file-based landing in a data lake.
31. Iron Man – Step 2
Step 2
The complex ETL can then be implemented in SQL stored procedures within the database itself.
For simplicity's sake, the orchestration in Logic Apps launches ADF, and ADF launches the procedures after the
landing. This reduces the complexity of the Logic Apps code needed to handle long-running procedures.
33. Iron Man – AWS variant
Complex reporting over light data.
Best used for: small data volumes, local interest; complex ETL, complex model, medium demand.
Architecture: Airflow orchestrates Python operators and/or Databricks (pySpark) jobs that move data from cloud sources through Amazon S3 into Snowflake; Power BI Embedded (import mode) provides live calculation and data access.
34. Hawkeye
Quick project for POCs or short-lived needs.
Best used for: any sources, small data volumes, local interest; simple ETL, medium model, medium demand.
Architecture: a fully Power BI stack: PBI dataflows for transporting and processing, a PBI model for storage and live calculation, PBI Embedded for data access, and PBI scheduling for orchestration.
35. Hawkeye – Step 1
Step 1
A fully self-service Power BI stack is possible with small data volumes and low ETL complexity. This architecture
should be kept to proofs of concept or temporary projects where time-to-market is paramount and maintainability
is not required.
While PBI dataflows are able to handle larger and larger data volumes with Premium capacities, the current
price-performance ratio is highly suboptimal on larger data volumes.
36. Hawkeye – Step 2
Step 2
Power Query and the M language are capable of handling low-to-medium levels of ETL complexity.
However, they currently lack the life-cycle tooling (version control, automated deployments, etc.) required for
professional development.
37. Hawkeye – Step 3
Step 3
A major roadblock preventing this architecture from being deployed beyond POCs and temporary projects is the
current lack of integration with external storage systems.
The only current possibilities are tightly integrated with the Common Data Model initiative in ADLS, which has yet to
prove its viability beyond Dynamics 365.
38. Hawkeye – Step 4
Step 4
The calculation and data access are, of course, handled directly in Power BI.
Here again, keep data volumes to a minimum: while Premium capacities can accommodate larger and larger
volumes, the price-performance ratio is downright disastrous compared to the alternatives (AAS cubes and
composite models).
39. Hawkeye – AWS variant
Quick project for POCs or short-lived needs.
Best used for: cloud sources, small data volumes, local interest; simple ETL, simple model, medium demand.
Architecture: Airflow orchestrates Python operators that move data from cloud sources through Amazon S3 into Snowflake; Superset provides live calculation and data access.
40. Where?
The Core DB is present in the Hulk and Thor templates. It is a single database used by several projects.
41. What is Core DB?
Core DB contains global information that can be used by all affiliates and across several projects. It is a central
repository that is gradually built from widely used business data (e-commerce, Prisma, websites, consumer activities, etc.).
Pattern: sources → data lake → Core DB (ref/MDM data, data quality, 3rd normal form, conformed dimensions: an EDW) → project data marts (star schemas).
This widely used pattern allows:
- Consistency across projects
- Better quality and overall data management
- Lower project costs through reuse of validated assets
42. What is Core DB?
Core DB contains global information that can be used by all affiliates and across several projects:
• Common dimensions and reference data: products, entities, contacts, geography, …
• Widely used business events: e-commerce orders, Prisma data, website views, consumer activities, …
43. What is Core DB?
The model is business oriented, not source oriented. It has its own IDs to allow cross-source identification:
for the same business item (for example a contact) it can ingest data from several sources (CDB, client database,
employee database…). Therefore, the model and schemas do not depend on the sources.
(Diagram: a Core DB contact record with its own Contact ID, plus source attributes such as CDB_id, email, city and score, fed by several sources.)
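To illustrate the idea of source-independent IDs, the sketch below consolidates contact records from two hypothetical sources onto a Core DB surrogate key; the column names and the matching rule (email) are illustrative.

```python
# Sketch of cross-source identification: several source systems feed the same
# business entity (a contact), which gets its own Core DB ID.
# Column names and the matching rule (email) are illustrative.
import pandas as pd

cdb = pd.DataFrame({"cdb_id": [101, 102], "email": ["a@x.com", "b@x.com"]})
crm = pd.DataFrame({"crm_id": ["C-9", "C-7"], "email": ["b@x.com", "c@x.com"],
                    "city": ["Paris", "Lyon"]})

# Union the source records on a common business key, then assign a Core DB
# surrogate key that is independent of any source system.
contacts = (
    pd.concat([cdb[["email"]], crm[["email"]]])
      .drop_duplicates()
      .reset_index(drop=True)
      .assign(contact_id=lambda d: d.index + 1)     # Core DB's own ID
)

# Source identifiers are kept as attributes, not as the model's key.
contacts = contacts.merge(cdb, on="email", how="left") \
                   .merge(crm, on="email", how="left")
print(contacts)
```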
44. Core DB requirements
Because of its multi-project nature, the Core DB has special requirements in terms of data management and
development practices:
- The data model is managed by a central data architect
- Changes must be handled by pull request on a central repository
- Permissions have to be managed granularly, per user and asset
- Access must be granted through objects (views, procs) which can be part of an automated testing pipeline (see the test sketch below)
Development workflow: project-based DB development on branches, pull requests with automated testing, on-demand CI/CD builds, DEV / TEST / PROD environments, and a daily rebuild & sanitizing step.
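Since the access objects (views, procs) are part of the CI/CD pipeline, they can be covered by automated tests such as the pytest sketch below; the view name, columns and connection string are hypothetical.

```python
# Minimal sketch of an automated test over a Core DB access view, the kind of
# check that can run in the CI/CD pipeline after each pull request.
# View name, columns and connection string are hypothetical.
import pyodbc
import pytest

CONN_STR = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:coredb-test.database.windows.net,1433;"
    "Database=core;Uid=<ci-user>;Pwd=<ci-password>;"
)

@pytest.fixture(scope="module")
def cursor():
    with pyodbc.connect(CONN_STR) as conn:
        yield conn.cursor()

def test_contact_view_exposes_expected_columns(cursor):
    cursor.execute("SELECT TOP 1 contact_id, email, city FROM access.v_contact")
    assert cursor.description is not None  # the view exists and is queryable

def test_contact_ids_are_unique(cursor):
    cursor.execute(
        "SELECT COUNT(*) - COUNT(DISTINCT contact_id) FROM access.v_contact"
    )
    assert cursor.fetchone()[0] == 0
```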
45. Datalake vs Core DB
Core DB approach:
• High input cost (ETL)
• Low output cost
• Structured data
• Business-event oriented
Recommendation: only common data
Data lake approach:
• Low input cost
• High output cost (data prep)
• Miscellaneous data
• "Find the signal in the noise"
Recommendation: all data