Agile Data Engineering: Introduction to Data Vault 2.0 (2018) - Kent Graziano
(updated slides used for the North Texas DAMA meetup, Oct 2018) As more and more of us move toward Agile Data Warehousing, we need a data modeling method that can be agile with us. Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for over 15 years and is now growing in popularity. The purpose of this presentation is to give attendees an introduction to the components of the Data Vault Data Model, what they are for, and how to build them. The examples will give attendees the basics:
• What the basic components of a DV model are
• How to build and design structures incrementally, without constant refactoring
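For readers new to the notation, here is a minimal sketch of the three core Data Vault structures (hubs, links, and satellites). The customer/order tables, column names, and data types are hypothetical and the DDL is generic, not taken from the slides.

```python
# Minimal sketch of the three core Data Vault 2.0 structures, expressed as
# generic DDL strings. Table and column names (customer example) are hypothetical.

HUB_CUSTOMER = """
CREATE TABLE hub_customer (
    hub_customer_hk   CHAR(32)     NOT NULL,  -- hash of the business key
    customer_id       VARCHAR(50)  NOT NULL,  -- business key from the source
    load_dts          TIMESTAMP    NOT NULL,
    record_source     VARCHAR(100) NOT NULL,
    PRIMARY KEY (hub_customer_hk)
);
"""

LINK_CUSTOMER_ORDER = """
CREATE TABLE lnk_customer_order (
    lnk_customer_order_hk CHAR(32)     NOT NULL,  -- hash of the combined keys
    hub_customer_hk       CHAR(32)     NOT NULL,  -- FK to hub_customer
    hub_order_hk          CHAR(32)     NOT NULL,  -- FK to hub_order
    load_dts              TIMESTAMP    NOT NULL,
    record_source         VARCHAR(100) NOT NULL,
    PRIMARY KEY (lnk_customer_order_hk)
);
"""

SAT_CUSTOMER_DETAILS = """
CREATE TABLE sat_customer_details (
    hub_customer_hk   CHAR(32)     NOT NULL,  -- parent hub key
    load_dts          TIMESTAMP    NOT NULL,  -- one row per detected change (history)
    customer_name     VARCHAR(200),
    customer_email    VARCHAR(200),
    hash_diff         CHAR(32),               -- hash of the descriptive columns
    record_source     VARCHAR(100) NOT NULL,
    PRIMARY KEY (hub_customer_hk, load_dts)
);
"""

if __name__ == "__main__":
    for ddl in (HUB_CUSTOMER, LINK_CUSTOMER_ORDER, SAT_CUSTOMER_DETAILS):
        print(ddl.strip(), "\n")
```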
Many organizations focus on the licensing cost of Hadoop when considering migrating to a cloud platform. But other costs should be considered as well, along with the biggest impact: the benefit of having a modern analytics platform that can handle all of your use cases. This session will cover lessons learned in assisting hundreds of companies to migrate from Hadoop to Databricks.
OLAP on the Cloud with Azure Databricks and Azure Synapse - AtScale
This presentation was part of the 2020 Global Summer Azure Data Fest. It explains how Cloud OLAP helps you analyze large amounts of data on Azure Databricks, Azure Synapse and other data platforms without moving it, and shows how to leverage AtScale’s Cloud OLAP to perform multidimensional analysis – and derive business insights – on data sets from multiple providers, with no data prep or data engineering required.
Data Privacy with Apache Spark: Defensive and Offensive Approaches - Databricks
In this talk, we’ll compare different data privacy techniques for protecting personally identifiable information and their effects on statistical usefulness, re-identification risk, data schema, format preservation, and read & write performance.
We’ll cover different offense and defense techniques. You’ll learn what k-anonymity and quasi-identifiers are, and discover the world of suppression, perturbation, obfuscation, encryption, tokenization, and watermarking, with elementary code examples for cases where no third-party products can be used. We’ll also see what approaches might be adopted to minimize the risks of data exfiltration.
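As a rough illustration of what a few of those techniques look like in practice (not code from the talk), here is a small Python sketch of suppression, perturbation, and tokenization applied to a made-up record; the field names and salt are placeholders.

```python
import hashlib
import random

# Hypothetical record containing a direct identifier (ssn) and quasi-identifiers.
record = {"name": "Jane Doe", "zip": "94105", "age": 37, "ssn": "123-45-6789"}

def suppress(rec, fields):
    """Suppression: drop identifying fields entirely."""
    return {k: v for k, v in rec.items() if k not in fields}

def perturb_age(rec, jitter=3):
    """Perturbation: add bounded random noise to a quasi-identifier."""
    out = dict(rec)
    out["age"] = rec["age"] + random.randint(-jitter, jitter)
    return out

def tokenize(rec, field, salt="s3cret"):
    """Tokenization: replace a direct identifier with a stable surrogate value."""
    out = dict(rec)
    out[field] = hashlib.sha256((salt + rec[field]).encode()).hexdigest()[:16]
    return out

if __name__ == "__main__":
    print(suppress(record, {"ssn"}))
    print(perturb_age(record))
    print(tokenize(record, "ssn"))
```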
Building a Cross Cloud Data Protection Engine - Databricks
Data protection is still at the forefront of many companies’ minds, with potential GDPR fines of up to 4% of global annual turnover (creating a current theoretical maximum fine of $20bn). GDPR affects companies across the world, not just those in Europe, leaving many still playing catch-up. Additional acts and legislation, such as the CCPA, are coming into force, meaning data protection is a constantly evolving landscape, with fines that can decimate some businesses. In this session we will go through how we have worked with our customers to create an Azure and AWS implementation of a Data Protection Engine covering Protection, Detection, Re-Identification and Erasure of PII data.
Agile Data Warehousing: Using SDDM to Build a Virtualized ODS - Kent Graziano
(This is the talk I gave at Houston DAMA and Agile Denver BI meetups)
At a past client, in order to meet timelines to fulfill urgent, unmet reporting needs, I found it necessary to build a virtualized Operational Data Store as the first phase of a new Data Vault 2.0 project. This allowed me to deliver new objects quickly and incrementally to the report developer so we could quickly show the business users their data. In order to limit the need for refactoring in later stages of the data warehouse development, I chose to build this virtualization layer on top of a Type 2 persistent staging layer. All of this was done using Oracle SQL Developer Data Modeler (SDDM) against (gasp!) an MS SQL Server database. In this talk I will show you the architecture for this approach, the rationale, and then the tricks I used in SDDM to build all the stage tables and views very quickly. In the end you will see actual SQL code for a virtual ODS that can easily be translated to an Oracle database.
Delta Lake: Open Source Reliability w/ Apache Spark - George Chow
As presented: Sajith Appukuttan, Solution Architect, Databricks
Sept 12, 2019 at Vancouver Spark Meetup
Abstract: Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
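As a minimal sketch of that compatibility, assuming the pyspark and delta-spark packages are installed and a local path is acceptable, the following writes and reads a Delta table with ordinary Spark APIs; the path and column names are made up.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Configure a local Spark session with the Delta Lake extensions enabled.
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Batch write with ACID guarantees: the commit either fully succeeds or not at all.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# The same table can be read back through the familiar Spark read API.
spark.read.format("delta").load("/tmp/events_delta").show()
```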
Data Lineage with Apache Airflow using Marquez - Willy Lulciuc
The term data quality is used to describe the dependability, reliability, and usability of datasets. Data scientists and business analysts often determine the quality of a dataset by its trustworthiness and completeness. But what information might be needed to differentiate between useful vs noisy data? How quickly can data quality issues be identified and explored? More importantly, how can metadata enable data scientists to make better sense of the high volume of data within their organization from a variety of data sources?
With Airflow now ubiquitous for DAG orchestration, organizations increasingly depend on Airflow to manage complex inter-DAG dependencies and provide up-to-date runtime visibility into DAG execution. At WeWork, Airflow has quickly become an important component of our Data Platform, powering billing, space inventory, and more. But what effects (if any) would upstream DAGs have on downstream DAGs if dataset consumption was delayed? What alerting rules should be in place to notify downstream DAGs of possible upstream processing issues or failures?
At WeWork, we feel it’s critical that DAG metadata is collected, maintained, and shared across the organization. This investment in metadata enables:
● Data lineage
● Data governance
● Data discovery
In this talk, we introduce Marquez: an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. We will demonstrate how metadata management with Marquez helps maintain inter-DAG dependencies, catalog historical runs of DAGs, and minimize data quality issues.
Data Vault 2.0: Using MD5 Hashes for Change Data Capture - Kent Graziano
This presentation was given at OakTable World 2014 (#OTW14) in San Francisco as a short TED-style 10-minute talk. In it I introduce Data Vault 2.0 and its innovative approach to doing change data capture in a data warehouse by using MD5 hash columns.
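A minimal sketch of the idea, not code from the talk: compute an MD5 hash over a record's descriptive attributes and compare it with the hash stored on the most recent warehouse row, so only changed records are loaded. The column names and stored hash value below are hypothetical.

```python
import hashlib

def hash_diff(record, columns, delimiter="||"):
    """Concatenate the listed columns in a fixed order and MD5-hash the result."""
    concatenated = delimiter.join(str(record.get(c, "")).strip().upper() for c in columns)
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

# Descriptive (non-key) attributes that participate in change detection.
descriptive_cols = ["customer_name", "customer_email", "customer_city"]

incoming = {"customer_id": 42, "customer_name": "Jane Doe",
            "customer_email": "jane@example.com", "customer_city": "Houston"}

# Placeholder for the hash stored on the latest row already in the warehouse.
current_hash_in_dw = "0f6f1dd1c2f8a7f3b9f0f2f8e7a6b5c4"

if hash_diff(incoming, descriptive_cols) != current_hash_in_dw:
    print("Change detected: insert a new row")
else:
    print("No change: skip the record")
```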
Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-bas... - Willy Lulciuc
At WeWork, it's critical that we understand the complete context for all datasets. We also want to be able to explore dependencies between jobs and the datasets they produce and consume. To do this, WeWork needs metadata. In this talk I will focus on Marquez, a core service for the collection, aggregation and visualization of a data ecosystem's metadata. Marquez maintains the provenance of how datasets are consumed and produced while providing global visibility into job runtime.
Achieving Lakehouse Models with Spark 3.0 - Databricks
It’s very easy to be distracted by the latest and greatest approaches with technology, but sometimes there’s a reason old approaches stand the test of time. Star schemas and Kimball modelling are among those things that aren’t going anywhere, but as we move towards the “Data Lakehouse” paradigm – how appropriate is this modelling technique, and how can we harness the Delta Engine & Spark 3.0 to maximise its performance?
Why data warehouses cannot support hot analytics - Imply
Check out the full webinar: https://imply.io/videos/why-data-warehouses-cannot-support-hot-analytics
Today’s data warehouses - whether traditional, specialized or cloud-based - are good at supporting cold analytics, such as reporting, where query times can take minutes. But they cannot cost-effectively support hot analytics—interactive ad hoc analytics usually performed by larger groups of users against batch or streaming data. Examples of hot analytics include clickstream analytics; service, network and application performance monitoring; and risk analytics.
Data warehouses struggle with hot analytics use cases because they are too slow, unable to scale, or too expensive. Learn how a new class of real-time data platforms overcomes these limitations, and how companies implement a “temperature-based” approach to analytics.
Scaling Databricks to Run Data and ML Workloads on Millions of VMs - Matei Zaharia
Keynote at Scale By The Bay 2020.
Cloud service developers need to handle massive scale workloads from thousands of customers with no downtime or regressions. In this talk, I’ll present our experience building a very large-scale cloud service at Databricks, which provides a data and ML platform service used by many of the largest enterprises in the world. Databricks manages millions of cloud VMs that process exabytes of data per day for interactive, streaming and batch production applications. This means that our control plane has to handle a wide range of workload patterns and cloud issues such as outages. We will describe how we built our control plane for Databricks using Scala services and open source infrastructure such as Kubernetes, Envoy, and Prometheus, and various design patterns and engineering processes that we learned along the way. In addition, I’ll describe how we have adapted data analytics systems themselves to improve reliability and manageability in the cloud, such as creating an ACID storage system that is as reliable as the underlying cloud object store (Delta Lake) and adding autoscaling and auto-shutdown features for Apache Spark.
Extreme BI: Creating Virtualized Hybrid Type 1+2 Dimensions - Kent Graziano
From a talk I gave at WWDVC and ECO in 2015 about how we built virtual dimensions (views) on a data vault-style data warehouse (see Data Warehousing in the Real World for full details on that architecture)
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng... - StampedeCon
At the StampedeCon 2015 Big Data Conference: This talk will examine the benefits of using multiple persistence strategies to build an end-to-end predictive engine. Utilizing Spark Streaming backed by a Cassandra persistence layer allows rapid lookups and inserts to be made in order to perform real-time model scoring. Spark backed by Parquet files, stored in HDFS, allows for high-throughput model training and tuning utilizing Spark MLlib. Both of these persistence layers also provide ad-hoc queries via Spark SQL in order to easily analyze model sensitivity and accuracy. Storing the data in this way also provides extensibility: existing tools like CQL can perform operational queries on the data stored in Cassandra, and Impala can perform larger analytical queries on the data stored in HDFS, further maximizing the benefits of the flexible architecture.
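A rough PySpark sketch of the two persistence layers follows, assuming a local Cassandra instance, an existing keyspace and table, and the spark-cassandra-connector package on the classpath; the keyspace, table, and column names are placeholders.

```python
from pyspark.sql import SparkSession

# Spark session with the Cassandra connector pulled in as a package (placeholder version).
spark = (
    SparkSession.builder.appName("multi-persistence-demo")
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.4.0")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

scores = spark.createDataFrame([(1, 0.87), (2, 0.42)], ["user_id", "score"])

# Low-latency serving layer: write model scores to Cassandra for fast lookups/inserts.
(scores.write.format("org.apache.spark.sql.cassandra")
       .options(keyspace="ml", table="scores")
       .mode("append")
       .save())

# High-throughput training layer: persist the same data as Parquet (HDFS or local disk).
scores.write.mode("overwrite").parquet("/tmp/scores_parquet")

# Either layer can then be queried ad hoc through Spark SQL.
spark.read.parquet("/tmp/scores_parquet").createOrReplaceTempView("scores")
spark.sql("SELECT avg(score) AS avg_score FROM scores").show()
```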
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
This session will detail best practices for architecting, building, operating and managing an Analytics Data Lake platform. Key topics will include:
1) Defining next-generation Data Lake architectures. The de facto standard has been commodity DAS servers with HDFS, but there are now multiple solutions aimed at separating compute and storage, virtualizing or containerizing Hadoop applications, and utilizing Hadoop-compatible or embedded HDFS filesystems. This portion will explore the options available, and the pros and cons of each.
2) Data Ingest. There are many ways to load data into a Data Lake, including standardized Apache tools (Sqoop, Flume, Kafka, Storm, Spark, NiFi), standard file and object protocols (SFTP, NFS, REST, WebHDFS), and proprietary tools (e.g., Zaloni Bedrock, DataTorrent). This section will explore these options in the context of best fit to workflows; it will also look at key gaps and challenges, particularly in the areas of data formats and integration with metadata/cataloging tools.
3) Metadata & Cataloguing. One of the biggest inhibitors of successful Data Lake deployments is Data Governance, particularly in the areas of indexing, cataloguing and metadata management. It is nearly impossible to run analytics on top of a Data Lake and get meaningful & timely results without solving these problems. This portion will explore both emerging open standards (Apache Atlas, HCatalog) and proprietary tools (Cloudera Navigator, Zaloni Bedrock/Mica, Informatica Metadata Manager), and balance the pros, cons and gaps of each.
4) Security & Access Controls. Solving these challenges is key for adoption in regulatory-driven industries like Healthcare & Financial Services. There are multiple Apache projects and proprietary tools to address this, but the challenge is making security and access controls consistent across the entire application and infrastructure stack, and over the data lifecycle, and being able to audit this in the face of legal challenges. This portion will explore available options and best practices.
5) Provisioning & Workflow Management. The real promise of the Data Lake is integrating analytics workflows and tools on converged infrastructure, with shared data, and building “As A Service” architectures oriented towards self-service data exploration and analytics for end users. This is an emerging and immature area, but this session will explore some potential concepts, tools and options to achieve this.
This will be a moderately technical session, with the above topics being illustrated by real world examples. Attendees should have basic familiarity with Hadoop and the associated Apache projects.
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020 - Databricks
Jules Damji and Denny Lee from Databricks Developer Relations will recap some keynote highlights, and each will briefly present personal picks from sessions that resonated well with them. Next, Jacek Laskowski, an independent consultant, will speak about Spark 3.0 internals, and Scott Haines from Twilio, Inc. will give a talk about structured streaming microservice architectures. This live coding session and technical deep dive are not to be missed!
Making Data Timelier and More Reliable with Lakehouse Technology - Matei Zaharia
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
[Given at DAMA WI, Nov 2018] With the increasing prevalence of semi-structured data from IoT devices, web logs, and other sources, data architects and modelers have to learn how to interpret and project data from things like JSON. While the concept of loading data without upfront modeling is appealing to many, ultimately, in order to make sense of the data and use it to drive business value, we have to turn that schema-on-read data into a real schema! That means data modeling! In this session I will walk through both simple and complex JSON documents, decompose them, then turn them into a representative data model using Oracle SQL Developer Data Modeler. I will show you how they might look using both traditional 3NF and data vault styles of modeling. In this session you will:
1. See what a JSON document looks like
2. Understand how to read it
3. Learn how to convert it to a standard data model
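As an illustrative sketch (not taken from the session), the snippet below decomposes a small nested JSON document into flat, relational-style rows of the kind you would then model as 3NF tables or data vault structures; the document shape and target names are hypothetical.

```python
import json

# Hypothetical nested JSON document: an order with an embedded customer and line items.
doc = json.loads("""
{
  "order_id": 1001,
  "customer": {"id": 7, "name": "Jane Doe"},
  "lines": [
    {"sku": "A-1", "qty": 2},
    {"sku": "B-9", "qty": 1}
  ]
}
""")

# One row per entity: customer, order header, and order lines (child table).
customer_row = {"customer_id": doc["customer"]["id"],
                "customer_name": doc["customer"]["name"]}

order_row = {"order_id": doc["order_id"],
             "customer_id": doc["customer"]["id"]}

order_line_rows = [
    {"order_id": doc["order_id"], "line_no": i + 1, "sku": line["sku"], "qty": line["qty"]}
    for i, line in enumerate(doc["lines"])
]

print(customer_row)
print(order_row)
print(order_line_rows)
```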
Big Data or Data Warehousing? How to Leverage Both in the Enterprise - Dean Hallman
Before the rise of Big Data, the Enterprise Data Warehouse (EDW) reigned supreme in Business Intelligence architecture. However, modern data rates and volumes often outstripped the capacity of traditional Data Warehousing tools and modeling strategies to keep pace. Many companies turned to unstructured Data Lakes as a means of keeping up with the influx. Consequently, they often discovered that the road from Data Lake to Business Intelligence was filled with its own steep challenges. As a result, any savings in throughput and storage costs were more than offset by the high extraction and analytics costs of turning an unstructured Data Lake into an insights-yielding asset.
Enter Data Vault 2.0, the Enterprise Data Warehouse reimagined to meet today’s data rate, volume and analytics demands. Not strictly an alternative to Data Lakes, Data Vault can easily integrate with your Data Lake and Big Data ingestion pipelines and analytics toolchain. This talk will introduce the fundamental concepts and advantages of Data Vault 2.0, and explain its approach to modeling data around your business domain’s “Hubs”, “Links” and “Satellites”. Finally, the talk will examine a real case study of building a Data Vault, including some challenges and drawbacks we encountered and addressed along the way.
This is a 200-level run-through of the Microsoft Azure Big Data Analytics for the Cloud data platform, based on the Cortana Intelligence Suite offerings.
Build a simple data lake on AWS using a combination of services, including AWS Glue Data Catalog, AWS Glue Crawlers, AWS Glue Jobs, AWS Glue Studio, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon S3.
Link to the blog post and video: https://garystafford.medium.com/building-a-simple-data-lake-on-aws-df21ca092e32
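A minimal boto3 sketch of two of those steps follows, assuming AWS credentials and suitable IAM permissions are configured; the bucket, role, database, crawler, and table names are placeholders, not the ones used in the blog post.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# 1. Register raw S3 data in the Glue Data Catalog by creating and starting a crawler.
glue.create_crawler(
    Name="demo-lake-crawler",
    Role="arn:aws:iam::123456789012:role/demo-glue-role",
    DatabaseName="demo_lake",
    Targets={"S3Targets": [{"Path": "s3://demo-data-lake/raw/"}]},
)
glue.start_crawler(Name="demo-lake-crawler")

# 2. Once the crawler has catalogued the data, query the resulting table with Athena.
athena.start_query_execution(
    QueryString="SELECT count(*) FROM raw_events",
    QueryExecutionContext={"Database": "demo_lake"},
    ResultConfiguration={"OutputLocation": "s3://demo-data-lake/athena-results/"},
)
```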
Seeing Redshift: How Amazon Changed Data Warehousing Forever - Inside Analysis
The Briefing Room with Claudia Imhoff and Birst
Live Webcast April 9, 2013
What a difference a day can make! When Amazon announced its new Redshift offering – a data warehouse in the cloud – the entire industry of information management changed. The most notable disruption? Price. At a whopping $1,000 per year for a terabyte, Redshift achieved a price-point improvement of at least two orders of magnitude, if not three, compared to its top-tier competitors. But pricing is just one change; there's also the entire process by which data warehousing is done.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Claudia Imhoff explain why a new cloud-based reality for data warehousing significantly changes the game for business intelligence and analytics. She'll be briefed by Brad Peters of Birst who will tout his company's BI solution, which has been specifically architected for cloud-based hosting. Peters will discuss several key intricacies of doing BI in the cloud, including the unique provisioning, loading and modeling requirements. Founded in 2004, Birst has nearly a decade of doing cloud-based BI and Analytics.
Visit: http://www.insideanalysis.com
Data Vault 2.0 is a data modeling methodology designed for developing enterprise data warehouses. It was developed by Dan Linstedt in response to the shortcomings of previous data modeling methodologies, such as the Kimball methodology and Inmon methodology, for managing large volumes of data from disparate sources.
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio... - Denodo
Watch full webinar here: https://bit.ly/32TT2Uu
Data virtualization is not just for self-service; it’s also a first-class citizen when it comes to modern data platform architectures. Technology has forced many businesses to rethink their delivery models. Startups such as Amazon and Lyft emerged, leveraging the internet and mobile technology to better meet customer needs, disrupting entire categories of business and growing to dominate their categories.
Schedule a complimentary Data Virtualization Discovery Session with g2o.
Traditional companies are still struggling to meet rising customer expectations. During this webinar with the experts from g2o and Denodo we covered the following:
- How modern data platforms enable businesses to address these new customer expectations
- How you can drive value from your investment in a data platform now
- How you can use data virtualization to enable multi-cloud strategies
Leveraging the strategy insights of g2o and the power of the Denodo platform, companies do not need to undergo the costly removal and replacement of legacy systems to modernize their systems. g2o and Denodo can provide a strategy to create a modern data architecture within a company’s existing infrastructure.
Webinar: The Future of Data Integration - Data Mesh and GoldenGate/Kafka - Jeffrey T. Pollock
The Future of Data Integration: Data Mesh, and a Special Deep Dive into Stream Processing with GoldenGate, Apache Kafka and Apache Spark. This video is a replay of a Live Webinar hosted on 03/19/2020.
Join us for a timely 45-minute webinar to see our take on the future of Data Integration. As the global industry shift towards the “Fourth Industrial Revolution” continues, outmoded styles of centralized batch processing and ETL tooling continue to be replaced by realtime, streaming, microservices and distributed data architecture patterns.
This webinar will start with a brief look at the macro-trends happening around distributed data management and how that affects Data Integration. Next, we’ll discuss the event-driven integrations provided by GoldenGate Big Data, and continue with a deep-dive into some essential patterns we see when replicating Database change events into Apache Kafka. In this deep-dive we will explain how to effectively deal with issues like Transaction Consistency, Table/Topic Mappings, managing the DB Change Stream, and various Deployment Topologies to consider. Finally, we’ll wrap up with a brief look into how Stream Processing will help to empower modern Data Integration by supplying realtime data transformations, time-series analytics, and embedded Machine Learning from within data pipelines.
GoldenGate: https://www.oracle.com/middleware/tec...
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
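As a rough sketch of the consuming side of such a replication pipeline, the snippet below reads change events from a Kafka topic with the kafka-python client; the topic name, and the assumption that records arrive as JSON documents with op_type and table fields, are placeholders rather than GoldenGate specifics.

```python
import json
from kafka import KafkaConsumer

# Consume change records from a hypothetical topic carrying database change events.
consumer = KafkaConsumer(
    "orders.changes",                      # placeholder topic name
    bootstrap_servers="localhost:9092",
    group_id="demo-cdc-consumer",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    change = message.value
    # Route or apply the change downstream; here we just print the operation and table.
    print(change.get("op_type"), change.get("table"), change)
```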
Oracle OpenWorld London - session on stream analysis, time-series analytics, streaming ETL, streaming pipelines, big data, Kafka, Apache Spark, and complex event processing
This is Part 3 of the series on Data Mesh, looking at the intersection of microservices architecture concepts, data integration/replication technologies and log-based stream integration techniques. This webinar was mostly a demonstration, but several slides used to set up the demo are included here as a PDF for viewers.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Are you facing the challenge to meet growing IT requirements while operating on a limited budget?
Learn more about why you should transform your database management system (DBMS) and make open source part of your strategic business and IT choices. An open source DBMS offers you various benefits, including cost reduction, liberation from vendor lock-in, and a large development community. Paired with enterprise-class services, 24x7 support and reliable management tools, open source is a first class alternative to traditional proprietary DBMSs.
Big data ingest frameworks ship with an array of connectors for common data origins and destinations, such as flat files, S3, HDFS, Kafka, etc., but sometimes you need to send data to, or receive data from, a system that's not on the list. StreamSets includes template code for building your own connectors and processors; we'll walk through the process of building a simple destination that sends data to a REST web service, and show how it can be extended to target more sophisticated systems such as Salesforce Wave Analytics.
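StreamSets connectors are actually written in Java against the SDK template code mentioned above, so the Python sketch below only illustrates the core logic such a destination performs, batching records and POSTing them to a REST endpoint; the URL and batch size are placeholders.

```python
import json
import requests

ENDPOINT = "https://example.com/api/records"   # placeholder REST endpoint
BATCH_SIZE = 100                               # placeholder batch size

def write_batch(records):
    """Send one batch of records to the web service and fail loudly on HTTP errors."""
    response = requests.post(
        ENDPOINT,
        data=json.dumps(records),
        headers={"Content-Type": "application/json"},
        timeout=10,
    )
    response.raise_for_status()

def write(records):
    """Split incoming records into batches, mirroring what a destination's write step does."""
    for start in range(0, len(records), BATCH_SIZE):
        write_batch(records[start:start + BATCH_SIZE])

if __name__ == "__main__":
    write([{"id": i, "value": f"record-{i}"} for i in range(250)])
```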
Enabling Next Gen Analytics with Azure Data Lake and StreamSets - StreamSets Inc.
Big data and the cloud are perfect partners for companies who want to unlock maximum value from all of their unstructured, semi-structured, and structured data. The challenge has been how to create and manage a reliable end-to-end solution that spans data ingestion, storage and analysis in the face of the volume, velocity and variety of big data sources.
In this webinar, we will show you how to achieve big data bliss by combining StreamSets Data Collector, which specializes in creating and running complex any-to-any dataflows, with Microsoft's Azure Data Lake and Azure analytic solutions.
We will walk through an example of how a major bank is using StreamSets to transport their on-premise data to the Azure Cloud Computing Platform and Azure Data Lake to take advantage of analytics tools with unprecedented scale and performance.
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
A Key to Real-time Insights in a Post-COVID World (ASEAN) - Denodo
Watch full webinar here: https://bit.ly/2EpHGyd
Presented at Data Champions, Online Asia 2020
Businesses and individuals around the world are experiencing the impact of a global pandemic. With many workers and potential shoppers still sequestered, COVID-19 is proving to have a momentous impact on the global economy. Regardless of the current situation and post-pandemic era, real-time data becomes even more critical to healthcare practitioners, business owners, government officials, and the public at large where holistic and timely information are important to make quick decisions. It enables doctors to make quick decisions about where to focus the care, business owners to alter production schedules to meet the demand, government agencies to contain the epidemic, and the public to be informed about prevention.
In this on-demand session, you will learn about the capabilities of data virtualization as a modern data integration technique and how organisations can:
- Rapidly unify information from disparate data sources to make accurate decisions and analyse data in real-time
- Build a single engine for security that provides audit and control by geographies
- Accelerate delivery of insights from your advanced analytics project
The Great Lakes: How to Approach a Big Data Implementation - Inside Analysis
The Briefing Room with Dr. Robin Bloor and Think Big, a Teradata Company
Live Webcast April 7, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=4114b87441ab7b2b4c52f6b24776e5a1
The more things change in Big Data, the more they stay the same. Indeed, there are many similarities between a Hadoop-based Data Lake and today’s modern Data Warehouse. Regardless of platform, information workers must still be able to turn their assets into action quickly, without taking a hit on governance or downstream performance.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains the challenges facing organizations who endeavor on Big Data projects. He’ll be briefed by Rick Stellwagen of Think Big, a Teradata Company, who will outline his company’s approach to handling Big Data implementations. Rick will discuss the role of the data lake, and how timely response of queries is critical for reporting and analysis.
Visit InsideAnalysis.com for more information.
Horses for Courses: Database Roundtable - Eric Kavanagh
The blessing and curse of today's database market? So many choices! While relational databases still dominate the day-to-day business, a host of alternatives has evolved around very specific use cases: graph, document, NoSQL, hybrid (HTAP), column store, the list goes on. And the database tools market is teeming with activity as well. Register for this special Research Webcast to hear Dr. Robin Bloor share his early findings about the evolving database market. He'll be joined by Steve Sarsfield of HPE Vertica, and Robert Reeves of Datical in a roundtable discussion with Bloor Group CEO Eric Kavanagh. Send any questions to info@insideanalysis.com, or tweet with #DBSurvival.
John Hammink's talk at Great Wide Open 2016. We discuss: 1) the need for data analytics infrastructure that can scale exponentially, 2) what such an infrastructure must contain, and finally 3) the need for an infrastructure to be able to handle un- and semi-structured data.
Data APIs as a Foundation for Systems of Engagement - Victor Olex
APIs have finally crossed over to the world of enterprise software, data analytics and application integration. Spearheaded by Amazon, propagated by internet startups and now adopted by the largest of businesses, including Wall Street top firm Goldman Sachs, APIs are here to stay. In this presentation we link all the facts and examine the opportunities stemming from Resource Oriented Architecture, a holistic approach to API implementation in large organizations.
Similar to Data Vault 2.0: Big Data Meets Data Warehousing
Building Reliability - The Realities of Observability - All Things Open
Presented at the ATO RTP Meetup
Presented by Jeremy Proffit, Director of DevSecOps & SRE for Customer Care and Communications, Ally
Title: Building Reliability - The Realities of Observability
Abstract: Join me as we discuss true observability and learn what works and what doesn't. We'll not only discuss dashboards, monitoring and alerting, but also how these can be built by automation or included in your IaC modules. We'll talk about how to properly alert staff based on priority to keep your staff and yourself sane. And we'll even discuss architecture, how it impacts reliability, and why serverless isn't always the best at being reliable.
Presented at the ATO RTP Meetup
Presented by Peter Zaitsev, Founder of Percona
Title: Modern Database Best Practices
Abstract: There are now more database choices available to developers than ever before - general-purpose databases and specialized databases, single-node and distributed databases, open source and proprietary databases, and databases available exclusively in the cloud. In this presentation we will cover best practices for choosing database(s) for your applications, best practices for application development, and how to manage those databases to achieve the best possible performance, security and availability at the lowest cost.
All Things Open 2023
Presented at All Things Open 2023
Presented by Deb Bryant - Open Source Initiative, Patrick Masson - Apereo Foundation, Stephen Jacobs - Rochester Institute of Technology, Ruth Suehle - SAS, & Greg Wallace - FreeBSD Foundation
Title: Open Source and Public Policy
Abstract: New regulations in the software industry and adjacent areas such as AI, open science, open data, and open education are on the rise around the world. Cyber Security, societal impact of AI, data and privacy are paramount issues for legislators globally. At the same time, the COVID-19 pandemic drove collaborative development to unprecedented levels and took Open Source software, open research, open content and data from mainstream to main stage, creating tension between public benefit and citizen safety and security as legislators struggle to find a balance between open collaboration and protecting citizens.
Historically, the open source software community and foundations supporting its work have not engaged in policy discussions. Moving forward, thoughtful development of these important public policies whilst not harming our complex ecosystems requires an understanding of how our ecosystem operates. Ensuring stakeholders without historic benefit of representation in those discussions becomes paramount to that end.
Please join our open discussion with open policy stakeholders working constructively on current open policy topics. Our panelists will provide a view into how OSS foundations and other open domain allies are now rising to this new challenge as well as seizing the opportunity to influence positive changes to the public’s benefit.
Topics: Public Policy, Open Science, Open Education, current legislation in the US and EU, US interest in OSS sustainability, intro to the Open Policy Alliance
Find more info about All Things Open:
On the web: https://www.allthingsopen.org/
Twitter: https://twitter.com/AllThingsOpen
LinkedIn: https://www.linkedin.com/company/all-things-open/
Instagram: https://www.instagram.com/allthingsopen/
Facebook: https://www.facebook.com/AllThingsOpen
Mastodon: https://mastodon.social/@allthingsopen
Threads: https://www.threads.net/@allthingsopen
2023 conference: https://2023.allthingsopen.org/
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...All Things Open
Presented at All Things Open 2023
Presented by Ashpak Shaikh & Lucy Shen - Intuit
Title: Weaving Microservices into a Unified GraphQL Schema with graph-quilt
Abstract: The magic of GraphQL is that it provides data access through a single endpoint—clean and easy. But as the number of GraphQL microservices your tech stack depends on starts to grow, that single-endpoint promise becomes a new multi-endpoint problem. Ideally, we would have an orchestrator that could aggregate schemas from multiple microservices into a unified GraphQL schema and route the requests to the appropriate microservice.
Enter graph-quilt, an open source Java library that provides recursive schema stitching and Apollo Federation style schema composition. In this talk, we’ll walk through our GraphQL journey and show you how to use graph-quilt to simplify your data orchestration needs. We will also share our open sourced reference implementation of a highly performant graph-quilt gateway currently being used in production here at Intuit, where we’ve had incredible success in scaling the gateway with 50+ microservices and 150+ clients.
The State of Passwordless Auth on the Web - Phil NashAll Things Open
Presented at All Things Open 2023
Presented by Phil Nash - Sonar
Title: The State of Passwordless Auth on the Web
Abstract: Can we get rid of passwords yet? They make for a poor user experience and users are notoriously bad with them. The advent of WebAuthn has brought a passwordless world closer, but where do we really stand?
In this talk we'll explore the current user experience of WebAuthn and the requirements a user has to fulfil to authenticate without a password. We'll also explore the fallbacks and safeguards we can use to make the password experience better and more secure. By the end of the session you'll have a vision of how authentication could look in the future and a blueprint for how to build the best auth experience today.
Total ReDoS: The dangers of regex in JavaScriptAll Things Open
Presented at All Things Open 2023
Presented by Phil Nash - Sonar
Title: Total ReDoS: The dangers of regex in JavaScript
Abstract: Regular expressions are complicated and can be hard to learn. On top of that, they can also be a security risk; writing the wrong pattern can open your application up to denial of service attacks. One token out of place and you invite in the dreaded ReDoS.
But how can a regular expression cause this? In this talk we’ll track down the patterns that can cause this trouble, explain why they are an issue and propose ways to fix them now and avoid them in the future. Together we’ll demystify these powerful search patterns and keep your application safe from expressions that behave in a way that is anything but regular.
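The talk's examples are in JavaScript, but the backtracking behaviour it warns about exists in any backtracking regex engine. As a rough illustration (not from the talk), the Python sketch below times a classic nested-quantifier pattern against inputs that can never match and compares it with a single-quantifier rewrite; the vulnerable pattern's runtime roughly doubles with each extra character.

```python
import re
import time

# A classic ReDoS-prone pattern: nested quantifiers force the backtracking
# engine to retry exponentially many ways of splitting the run of 'a's.
EVIL = re.compile(r"^(a+)+$")

# A linear-time rewrite of the same match: one quantifier, no nesting.
SAFE = re.compile(r"^a+$")

for n in (16, 18, 20, 22):
    attack = "a" * n + "!"          # the trailing '!' guarantees the match fails
    start = time.perf_counter()
    EVIL.match(attack)
    slow = time.perf_counter() - start
    start = time.perf_counter()
    SAFE.match(attack)
    fast = time.perf_counter() - start
    print(f"n={n}: nested quantifier {slow:.4f}s, rewritten {fast:.6f}s")
```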
What Does Real World Mass Adoption of Decentralized Tech Look Like?All Things Open
Presented at All Things Open 2023
Presented by Karl Mozurkewich - Storj
Title: What Does Real World Mass Adoption of Decentralized Tech Look Like?
Abstract: We delve into the transformative potential of decentralized technology. Beginning with a brief overview of the rise of centralization with the advent of the internet and the counter-shift marked by blockchain we explore the intrinsic characteristics of decentralized and distributed systems, such as trustless operations, peer-to-peer networks, and enterprise application scalability. Various sectors, including finance, supply chains, media and entertainment, data science and cloud infrastructure are on the brink of disruption. The societal implications are vast, with the potential for greater individual empowerment, a greener planet and more viable resource utilization, but concerns about data security persist.
Presented at All Things Open 2023
Presented by Anastasia Lalamentik - Kaleido
Title: How to Write & Deploy a Smart Contract
Abstract: In this talk, Anastasia Lalamentik, Full Stack Engineer at Kaleido, will walk through how Ethereum smart contracts work and go over related concepts like gas fees, the Ethereum Virtual Machine (EVM), the block explorer, and the Solidity programming language. This is vital to anyone who wants to build a blockchain app and is a great introduction to blockchain technology for newcomers to the space.
By the end of the talk, attendees will better understand how to:
- Write a simple smart contract
- Deploy their smart contract to an Ethereum test network through the latest tools like Hardhat and the MetaMask wallet
- Test interactions with their deployed smart contract and ensure that everything is working properly
Additionally, participants will get to interact with Anastasia's deployed smart contract at the end of the talk. Anastasia's past talks have attracted a diverse group of participants with a range of experience in the space.
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlowAll Things Open
Presented at All Things Open 2023
Presented by Paul Brebner - Instaclustr (by Spot by NetApp)
Title: Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
Abstract: In this talk we’ll build a Drone delivery application, and then use it to do some Machine Learning “on the fly”.
In the 1st part of the talk, we'll build a real-time Drone Delivery demonstration application using a combination of two open-source technologies: Uber’s Cadence (for stateful, scheduled, long-running workflows), and Apache Kafka (for fast streaming data).
With up to 2,000 (simulated) drones and deliveries in progress at once this application generates a vast flow of spatio-temporal data.
In the 2nd part of the talk, we'll use this platform to explore Machine Learning (ML) over streaming and drifting Kafka data with TensorFlow to try and predict which shops will be busy in advance.
Presented at the All Things Open 2023 Inclusion and Diversity in Open Source Event
Presented by Efraim Marquez-Arreaza - Red Hat
Title: DEI Challenges and Success
Abstract: In today's world, many companies and organizations have Diversity, Equity and Inclusion (DEI) communities. Red Hat Unidos is a DEI community focused on advocating for the Hispanic/Latine community. In this talk, we would like to share our challenges and successes over the past 4 years, and our plans for the future.
Presented at All Things Open 2023
Presented by Lydia Cupery - HubSpot
Title: Scaling Web Applications with Background Jobs: Takeaways from Generating a Huge PDF
Abstract: Do you need to perform time-consuming or CPU-intensive processes in your web application but are concerned about performance? That’s where background jobs come in. By offloading resource-intensive tasks to separate worker processes, you can improve the scalability of your web application.
In this talk, I'll share my experience of using background jobs to scale our web application. I'll discuss the challenges my team faced that led us to adopt background jobs. Then, I'll share practical tips on how to design background jobs for CPU-intensive or time-consuming processes, such as generating huge PDFs and batch emailing. I'll wrap up by going over the performance and cost tradeoffs of background jobs.
I'll use Typescript, Express, and Heroku as examples in this talk, but the concepts and best practices that I'll share are applicable to other languages and tools.
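The session's examples use TypeScript, Express, and Heroku; as the abstract notes, the pattern itself is language-agnostic. The standard-library Python sketch below (names like generate_report and handle_request are illustrative, not from the talk) shows only the core shape: the request handler enqueues work and returns immediately, a worker process does the slow task, and the caller polls for the result.

```python
# Minimal background-job sketch using only the standard library.
import multiprocessing as mp
import time
import uuid

def generate_report(job_id: str) -> str:
    time.sleep(2)                      # stand-in for CPU-heavy PDF generation
    return f"report-{job_id}.pdf"

def worker(jobs: mp.Queue, results) -> None:
    while True:
        job_id = jobs.get()
        if job_id is None:             # sentinel: shut the worker down
            break
        results[job_id] = generate_report(job_id)

def handle_request(jobs: mp.Queue) -> str:
    """What the web endpoint would do: enqueue and respond right away."""
    job_id = uuid.uuid4().hex
    jobs.put(job_id)
    return job_id                      # the client polls for the result later

if __name__ == "__main__":
    manager = mp.Manager()
    jobs, results = mp.Queue(), manager.dict()
    proc = mp.Process(target=worker, args=(jobs, results), daemon=True)
    proc.start()

    job_id = handle_request(jobs)      # returns immediately
    print("accepted job", job_id)
    while job_id not in results:       # client-side polling loop
        time.sleep(0.2)
    print("finished:", results[job_id])
    jobs.put(None)
    proc.join()
```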
Presented at All Things Open 2023
Presented by Robert Aboukhalil - CZI
Title: Supercharging tutorials with WebAssembly
Abstract: sandbox.bio is a free platform that features interactive command-line tutorials for bioinformatics. This talk is a deep-dive into how sandbox.bio was built, with a focus on how WebAssembly enabled bringing command-line tools like awk and grep to the web. Although these tools were originally written in C/C++, they all run directly in the browser, thanks to WebAssembly! And since the computations run on each user's computer, this makes the application highly scalable and cost-effective.
Along the way, I'll discuss how WebAssembly works and how to get started using it in your own applications. The talk will also cover more advanced WebAssembly features such as threads and SIMD, and will end with a discussion of WebAssembly's benefits and pitfalls (it's a powerful technology, but it's not always the right tool!).
Presented at All Things Open 2023
Presented by K.S. Bhaskar - YottaDB LLC
Title: Using SQL to Find Needles in Haystacks
Abstract: Database journal files capture every update to a database. A database of a few hundred GB can generate GBs worth of journal files every minute at busy times. Troubleshooting and forensics, especially of rare and intermittent problems, such as which process made what update and when, is an exercise in finding needles in haystacks. A similar problem exists with syslogs. A solution is to load the journal files and syslogs into a database, and use SQL to query that database. Bhaskar will present and demonstrate this with a 100% FOSS stack.
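Bhaskar's demo uses his own 100% FOSS stack; purely as an illustration of the idea, the sketch below loads a few syslog-style lines into an in-memory SQLite table and uses SQL to pull out the one event of interest.

```python
# Load log lines into a database once, then use SQL to find the needle
# instead of grepping gigabytes of text. The sample rows are made up.
import sqlite3

SYSLOG_SAMPLE = [
    ("2024-05-28T10:15:02", "db01", "app[4412]", "update order=991 status=paid"),
    ("2024-05-28T10:15:03", "db01", "app[4412]", "update order=992 status=failed"),
    ("2024-05-28T10:16:47", "db02", "cron[201]", "nightly vacuum started"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE syslog (ts TEXT, host TEXT, process TEXT, message TEXT)")
conn.executemany("INSERT INTO syslog VALUES (?, ?, ?, ?)", SYSLOG_SAMPLE)

# Which process touched order 992, and when?
for row in conn.execute(
    "SELECT ts, host, process FROM syslog WHERE message LIKE '%order=992%'"
):
    print(row)
```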
Configuration Security as a Game of Pursuit InterceptAll Things Open
Presented at All Things Open 2023
Presented by Wes Widner - Automox
Title: Configuration Security as a Game of Pursuit Intercept
Abstract: In this session we will take a look at the emerging field of cloud security posture management and how we can approach the problem space using a class of board games known as pursuit/intercept. Using the game Scotland Yard as a visual illustration, we'll explore the cognitive and technical limitations that all CSPM systems face and what you should look for when evaluating the strengths and weaknesses of CSPM vendors and approaches.
Presented at All Things Open 2023
Presented by Carol Huang & Mike Fix - Stripe
Title: Scaling an Open Source Sponsorship Program
Abstract: We already know this: the open-source ecosystem needs further monetary investment from the companies that benefit most from it. Likewise, companies say they want to participate in these initiatives, but find it hard to dedicate resources to open source funding when there isn’t a clear ROI.
This talk discusses how the Open Source Program Office at Stripe built a scalable, sustainable open source sponsorship model that aligns internal company incentives with those of open source maintainers and the community at large. We go over the unique “platformization” of our OSPO that allowed us to create multiple funding models, such as BYOB (Bring Your Own Budget), and share lessons learned from this experience as well as other OSPOs.
Build Developer Experience Teams for Open SourceAll Things Open
Presented at All Things Open 2023
Presented by Arundeep Nagaraj - Amazon Web Services (AWS)
Title: Build Developer Experience Teams for Open Source
Abstract: Open Source has become the default strategy for many IT organizations and enterprises. However, the constant challenge for Open Source leaders of these organizations has been:
How is my product's developer experience?
Is this the right metric to track?
How can I scale my team to support our products better?
How can I add automation to scale redundant workflows?
If my product involves working with developers, how can I scale to the complexity of the requests and reduce Engineering bandwidth?
The challenge of supporting open source products continues to grow depending on the end-user persona, whether they are consumers of or contributors to your product. Consumers use your product, SDKs, and APIs and may get blocked or run into issues, whereas contributors are advanced users of your software who understand the codebase well enough to provide meaningful contributions back to the product.
The answer is to treat Open Source support as a first-class citizen of your corporate support strategy. Employing the right level of developer-focused support, as opposed to traditional infrastructure-based support, is key to scaling to the number of developers using your product. Supporting customers in the open involves more than pure support: it means building customer and developer experiences (DX) in the open, across platforms and communities, so that your product's users and developers can focus on the end-to-end value add. This helps with active developer growth and user retention.
Key Takeaways:
- IT leaders of Open Source will learn to employ strategies to build a DX team that engages on multiple platforms
- Work on identifying accurate metrics for product and organization
- Innovate on platforms such as Discord to build a bot and a dashboard
- Ability to leverage customer feedback and iterate over the customer success flywheel
- Distinguish between DX and Developer Advocacy (DA)
Presented at All Things Open 2023
Presented by Danny McCormick - Google
Title: Deploying Models at Scale with Apache Beam
Abstract: Apache Beam is an open source tool for building distributed scalable data pipelines. This talk will explore how Beam can be used to perform common machine learning tasks, with a heavy focus on running inference at scale. The talk will include a demo component showing how Beam can be used to deploy and update models efficiently on both CPUs and GPUs for inference workloads.
An attendee can expect to leave this talk with a high level understanding of Beam, the challenges of deploying models at scale, and the ability to use Beam to easily parallelize their inference workloads.
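For readers who want a feel for the API before the talk, here is a minimal sketch of Beam's RunInference transform with a scikit-learn model handler (not taken from the talk; the model path and feature values are placeholders, and exact handler options can vary by Beam release).

```python
# Minimal Beam RunInference sketch with a scikit-learn model handler.
import apache_beam as beam
import numpy as np
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

model_handler = SklearnModelHandlerNumpy(
    model_uri="gs://my-bucket/models/model.pkl"  # hypothetical pickled model
)

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "CreateExamples" >> beam.Create([np.array([1.0, 2.0]),
                                           np.array([3.0, 4.0])])
        | "RunInference" >> RunInference(model_handler)  # batches inputs, calls predict()
        | "Print" >> beam.Map(print)                     # one PredictionResult per example
    )
```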
Sudo – Giving access while staying in controlAll Things Open
Presented at All Things Open 2023
Presented by Peter Czanik - One Identity
Title: Sudo – Giving access while staying in control
Abstract: Sudo is used by millions to control and log administrator access to systems, but with only the default configuration there are plenty of blind spots. The latest features in sudo let you watch some of those previously blind spots and control access to them. Here are four major new features that have arrived since the 1.9.0 release, allowing you to see your blind spots:
- configuring a working directory or chroot within sudo often makes full shell access redundant
- JSON-formatted logs give you more details on events and are easier to act on
- relays in sudo_logsrvd make session recording collection more secure and reliable
- you can log and control sub-commands executed by the command run through sudo
Let us take a closer look at each of these.
Previously, there were quite a few situations where you had to give users full shell access through sudo. Typical examples include needing to run a command from a given directory, or running commands in a chroot environment. You can now configure the working directory or the chroot directory and give access only to the command the user really needs.
Logging is central to sudo: it shows who did what on the system. Using JSON-formatted log messages gives you even more information about events. Even better, structured logs are easier to act on. Setting up alerting for suspicious events is much easier when you have a single parser to configure for any kind of sudo logs. You can collect sudo logs not only by local syslog, but also by using sudo_logsrvd, the same application used to collect session recordings.
Speaking of session recordings: instead of using a single central server, you can now have multiple levels of sudo_logsrvd relays between the client and the final destination. This allows session collection even if the central server is unavailable, providing you with additional security. It also makes your network configuration simpler.
Finally, you can log sub-commands executed from the command started through sudo. You can see commands started from a shell. No more unnoticed shell access from text editors. Best of all: you can also intercept sub-commands.
These are just a few of the most prominent features helping you to watch and control previous blind spots on your systems. See these and other possibilities in action in some live demos during our presentation.
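As a small illustration of why the JSON-formatted logs are easier to act on, the sketch below filters structured sudo events with a single parser instead of per-format regexes. The field names used here (submituser, runuser, command) are assumptions for the example; check the JSON your sudo/sudo_logsrvd version actually emits.

```python
# Parse JSON log lines once and filter on fields; field names are illustrative.
import json

ALERT_COMMANDS = {"/bin/sh", "/bin/bash", "/usr/bin/vi"}

def suspicious(event: dict) -> bool:
    return event.get("command") in ALERT_COMMANDS

log_lines = [
    '{"submituser": "alice", "command": "/usr/bin/systemctl", "runuser": "root"}',
    '{"submituser": "bob", "command": "/bin/bash", "runuser": "root"}',
]

for line in log_lines:
    event = json.loads(line)
    if suspicious(event):
        print(f"ALERT: {event['submituser']} obtained {event['command']} "
              f"as {event['runuser']}")
```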
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsAll Things Open
Presented at All Things Open 2023
Presented by Christine Abernathy - F5, Inc.
Title: Fortifying the Future: Tackling Security Challenges in AI/ML Applications
Abstract: As Artificial Intelligence (AI) and Machine Learning (ML) applications continue to surge, it is crucial to be aware of and address the security risks associated with these technologies. In this talk, Christine will explore AI/ML failure modes, threats, and mitigation strategies. She will guide you through the fundamentals of ML models then introduce you to key security challenges such as adversarial attacks, data poisoning, model inversion, model stealing, and membership inference attacks, using real-world examples to demonstrate their potential impact.
Christine will also discuss privacy and ethical considerations in ML, touching upon techniques like federated learning and shedding light on the current regulatory landscape surrounding security risks. If you are developing AI/ML applications or incorporating AI/ML components into your technology stack, check out this talk. You will walk away with a deeper understanding of the current AI/ML security landscape and a toolkit to help you address these risks, enabling you to build safer, more secure, and privacy-aware applications.
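To make the adversarial-attack threat concrete, here is a toy, NumPy-only sketch (not from the talk) of the fast gradient sign method against a hand-made logistic classifier: a small, bounded perturbation in the direction of the loss gradient flips a confident prediction.

```python
# Toy fast gradient sign method (FGSM) against a linear classifier.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.5, 0.5])      # hypothetical trained weights
b = -0.2
x = np.array([0.4, -0.3, 0.8])      # a benign input, confidently classified
y = 1.0                             # its true label

p = sigmoid(w @ x + b)
grad_x = (p - y) * w                # d(cross-entropy)/dx for logistic regression
x_adv = x + 0.5 * np.sign(grad_x)   # FGSM step with epsilon = 0.5

print("clean prediction:       %.3f" % p)                      # ~0.81 (class 1)
print("adversarial prediction: %.3f" % sigmoid(w @ x_adv + b)) # ~0.37 (flipped)
```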
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...All Things Open
Presented at All Things Open 2023
Presented by Carlos Santana - AWS
Title: Securing Cloud Resources Deployed with Control Planes on Kubernetes using Governance and Policy as Code
Abstract: Are you concerned about the security of your cloud resources deployed on Kubernetes? Are you struggling to ensure compliance with regulatory requirements while managing your cloud infrastructure? If yes, then this talk is for you!
We will discuss how to secure cloud resources deployed with Crossplane on Kubernetes using Governance and Policy as Code. We will explore how to leverage Governance and Policy as Code tools like Rego, Kyverno, and OPA to ensure security and compliance.
By the end of this talk, you will have a better understanding of the challenges associated with securing cloud resources deployed with Crossplane or ACK on Kubernetes, the importance of Governance and Policy as Code in ensuring security and compliance, and why it is critical to use open source and open standards in these technologies.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. I have also seen, many times, developers implement front-end features by just following the standard rules of a framework and thinking that this is enough to launch the project successfully - and then the project fails. How can you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
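As a taste of the Python binding mentioned above, the short sketch below loads a bundled test network and runs an AC power flow with pypowsybl; function names reflect recent releases and may differ slightly by version.

```python
# Minimal pypowsybl sketch: load a test network and run an AC power flow.
import pypowsybl as pp

network = pp.network.create_ieee14()      # bundled IEEE 14-bus test network
results = pp.loadflow.run_ac(network)     # run an AC power flow
print(results[0].status)                  # convergence status of the main component
print(network.get_buses().head())         # bus voltages/angles as a pandas DataFrame
```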
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
2. DATA WAREHOUSING VS BIG DATA
• Does Big Data replace Data Warehousing? Or do I need both?
• What’s the difference:
• Between the data flowing into a data warehouse vs big data tools?
• Between the ingestion processes and infrastructure?
• Data Lakes arrived with Big Data, so are they useful in Data Warehousing?
• How should I model my data in EDW?
• 3NF, Star Schema, same as my operational data stores?
• Data Vault 2.0
• Graph Databases
• What is an architecture that allows both to co-exist effectively?
5. DATA VAULT 2.0
COMMON FOUNDATIONAL WAREHOUSE ARCHITECTURE
• “The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise” -- Dan Linstedt, Creator of Data Vault
• Data loaded as-is from sources, no edits or cleanup
• Append-only to afford highest performance
• Agile & agnostic to changes in the operational store’s data model
• Essentially, a prescription for Layered Graph to Relational Mapping
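As a rough illustration of the loading rules in the bullets above (data as-is, append-only, every row stamped with its record source and load date), here is a minimal Python/SQLite sketch of a satellite load; the table and column names are illustrative, not prescribed by the method.

```python
# Append-only satellite load: rows are inserted as-is from the source,
# never updated, with record source and load date carried on every row.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_flight_forecast (
        flight_key    TEXT NOT NULL,   -- key of the parent hub row
        load_date     TEXT NOT NULL,   -- when the warehouse first saw this row
        record_source TEXT NOT NULL,   -- which system supplied it
        depart        TEXT,
        gate          TEXT,
        PRIMARY KEY (flight_key, load_date)
    )
""")

def load_as_is(rows, record_source):
    """Append-only insert: no edits, no cleanup of the source values."""
    load_date = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO sat_flight_forecast VALUES (?, ?, ?, ?, ?)",
        [(r["flight_key"], load_date, record_source, r["depart"], r["gate"])
         for r in rows],
    )

load_as_is([{"flight_key": "UA123|2018-10-11", "depart": "1:25PM", "gate": "B27"}],
           record_source="LGA")
print(conn.execute("SELECT * FROM sat_flight_forecast").fetchall())
```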
6. DATA WAREHOUSING & DATA VAULT 2.0
• 60’s, 70’s, 80’s
• E.F. Codd => 3NF
• Bill Inmon invents Data Warehousing concept
• Dr. Ralph Kimball popularizes Star Schema design
• 90’s, 00’s:
• Dan Linstedt creates Data Vault Model @ DOD
• 2014:
• Dan Introduces Data Vault 2.0
7.
8. Source: “What are Graph Databases and Why should I care?”, by Dave Bechberger of Expero
14. [Slide diagram: a worked Data Vault example built from Hubs (Flight, Aircraft), Links (SERVICED_BY), and Satellites. Recoverable sample data:]
Flight satellites (Base / Dest / Forecast):
Record Source | Load Date  | Depart | Gate
LGA           | 2018-10-11 | 1:25PM | B27
CAE           | 2018-10-24 | 3:30PM | A14
SFO           | 2018-09-06 | 8:55PM | G19
RDU           | 2018-08-12 | 4:45PM | C22
SERVICED_BY link: Record Source = Airport CAE, Load Date = 2018-11-17, Source Id = 20181117-32-983
Aircraft satellites (Base / Service / FAA / NTSB):
Record Source | Load Date  | Model | Tail No
United        | 2017-02-11 | 767   | 1477
Delta         | 2015-11-04 | A6    | 2381
Alaska        | 2013-08-28 | 747   | 8312
Frontier      | 2016-07-19 | 182   | 1438
SERVICED_BY link: Record Source = United Airlines, Load Date = 2018-01-17, Source Id = 2412c
Satellites (Base / Dest / Manifest):
Record Source | Load Date  | Begin      | End
United        | 2017-02-11 | 2017-04-23 | 2017-09-23
Delta         | 2015-11-04 | 2015-12-01 | 2017-04-22
Alaska        | 2013-08-28 | 2013-09-14 | 2016-05-04
Frontier      | 2016-07-19 | 2016-08-02 | 2018-04-11
Record Source = United Airlines, Load Date = 2018-09-17
[Legend: Hubs, Links, Satellites]
15. • Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations - Mel Conway
16. [Slide diagram: the same sample data remodeled around more explicit business concepts - FLIGHT, Aircraft, Airport, and Airline structures - with the satellite rows (Record Source, Load Date, and the Depart/Gate, Model/Tail No, Begin/End attributes) carried over unchanged from slide 14. Legend: Hubs, Links, Satellites.]
18. • Modeled after self-organizing networks
• A Business Key identifies a key concept in business.
• They have a business meaning
• They are unique and have a very low propensity to change
• Business keys change only when the business changes
• Enables (forces) cross-source modeling
Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
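A common Data Vault 2.0 practice is to derive the hub key by hashing the normalized business key, which is what makes the cross-source modeling described above work: the same business key lands on the same hub row regardless of which system delivered it. A minimal sketch (illustrative names; MD5 chosen only for brevity):

```python
# Derive a deterministic hub key from the business key alone.
import hashlib

def hub_hash_key(*business_key_parts: str) -> str:
    """Surrogate key computed only from the (normalized) business key."""
    normalized = "||".join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same flight arriving from two different source systems lands on the
# same hub row, because the business key (not a source-generated id) drives it.
print(hub_hash_key("UA123", "2018-10-11"))
print(hub_hash_key("ua123 ", "2018-10-11"))   # normalization makes these equal
```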
30. DATA WAREHOUSING
• Deep Topic
• 60’s, 70’s, 80’s
• E.F. Codd => 3NF
• Bill Inmon invents Data Warehousing concept
• Dr. Ralph Kimball popularizes Star Schema design
• 90’s, 00’s:
• Dan Linstedt creates Data Vault Model @ DOD
• 2014:
• Dan Introduces Data Vault 2.0
• Data Warehouse vs Operational Data Stores
• Data Warehouse as Version Control System
BIG DATA
• MapReduce, 2004, Google, by Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, GFS
• Nutch 2005, Hadoop 2006, 2007 - Doug Cutting
• What exactly is “Big Data”?
33. ETL OR SERDE?
[Slide diagram: components include Client and User, a Serializer and Deserializer maintained as a single, version-locked source, event logs (Eventlog.e / Eventlog.d), Lambda functions, Kafka/Kinesis, the internet, S3/Hadoop, and time-series event-record analysis.]
No single answer, but convention over configuration won the day
Data Warehousing
---
60’s, 70’s, 80’s
E.F. Codd => 3NF
Bill Inmon invents Data Warehousing concept
Dr. Ralph Kimball popularizes Star Schema design
90’s, 00’s:
Dan Linstedt creates Data Vault Model @ DOD
2014:
Dan Introduces Data Vault 2.0
Data Warehouse vs Operational Data Stores
Data Warehouse as Version Control System
Big Data
-----
MapReduce, 2004, Google, by Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, GFS
Nutch 2005, Hadoop 2006, 2007 - Doug Cutting
What exactly is “Big Data”?
Too close to the forest, forget to see the trees
Is the business intelligence scattered out in the field
Or centralized in the back office?
Actors in the system are intelligent?
Learn language, conjugate verbs, form new sentences
Serializer/Deserializer: Reusable package to be imported into a Lambda
Test suite that ensures Serializer / Deserializer agree on before/after result
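A minimal sketch of the last two notes: one shared module owns both sides of the serde contract, and a round-trip test keeps the serializer and deserializer agreeing on the before/after result before the package is imported into any Lambda. All names here are illustrative.

```python
# Shared serializer/deserializer with a round-trip test.
import json
from dataclasses import dataclass, asdict

@dataclass
class Event:
    flight: str
    gate: str
    depart: str

def serialize(event: Event) -> bytes:
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")

def deserialize(payload: bytes) -> Event:
    return Event(**json.loads(payload.decode("utf-8")))

def test_round_trip():
    before = Event(flight="UA123", gate="B27", depart="1:25PM")
    assert deserialize(serialize(before)) == before   # before/after must agree

if __name__ == "__main__":
    test_round_trip()
    print("serializer and deserializer agree")
```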