Karen Lopez's (@datachick, InfoAdvisors) 90-minute presentation on data security, data privacy, and compliance, and how data modelers should discover, assess, and monitor these important data management responsibilities.
Top 10 Mistakes When Migrating From Oracle to PostgreSQL by Jim Mlodgenski
As more and more people move to PostgreSQL from Oracle, a pattern of mistakes is emerging. They can be caused by the tools being used, or simply by not understanding how PostgreSQL differs from Oracle. In this talk we will discuss the top mistakes people make when moving to PostgreSQL from Oracle and what the correct course of action is.
Think of big data as all data, no matter the volume, velocity, or variety. The simple truth is that a traditional on-prem data warehouse will not handle big data. So what is Microsoft’s strategy for building a big data solution, and why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it. My goal is to help you understand not only each product but how they all fit together, so you can be the hero who builds your company’s big data solution.
The Intuit Data Ecosystem supports unique consumer and small-business assets at scale and handles petabytes of customer data. We have 8M active small-business customers and 16M paid workers who use Intuit QuickBooks and QuickBooks Payroll products. A huge customer base and large volumes of data constantly challenge the data teams on the freshness and correctness of data. This presentation covers the problems we faced at Intuit, along with the data observability model we follow to detect, cure, and prevent data issues. We would like to provide deep insights into the implementations and the impact of some of the great work done by Intuit in this direction.
This is the presentation for Dimitar Mitov's lecture "Data Analytics with Dremio" (in Bulgarian), part of OpenFest 2022: https://www.openfest.org/2022/bg/full-schedule-bg/
Comprehensive Identity and Access Governance for Rapid, Actionable Compliance
The industry’s most comprehensive identity governance solution delivers user administration, privileged account management, and identity intelligence, powered by rich analytics and actionable insight.
Accelerate Your ML Pipeline with AutoML and MLflow by Databricks
Building ML models is a time-consuming endeavor that requires a thorough understanding of feature engineering, selecting useful features, choosing an appropriate algorithm, and performing hyper-parameter tuning. Extensive experimentation is required to arrive at a robust and performant model. Additionally, keeping track of the models that have been developed and deployed may be complex. Solving these challenges is key to successfully implementing end-to-end ML pipelines at scale.
In this talk, we will present a seamless integration of automated machine learning within a Databricks notebook, thus providing a truly unified analytics lifecycle for data scientists and business users with improved speed and efficiency. Specifically, we will show an app that generates and executes a Databricks notebook to train an ML model with H2O’s Driverless AI automatically. The resulting model will be automatically tracked and managed with MLflow. Furthermore, we will show several deployment options to score new data on a Databricks cluster or with an external REST server, all within the app.
Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data.
That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads.
Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache Spark™, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos.
Here’s what you’ll learn in this 2-hour session:
How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security
How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse by Databricks
The Delta Architecture pattern has made the lives of data engineers much simpler, but what about improving query performance for data analysts? What are some common places to look for tuning query performance? In this session we will cover some common techniques to apply to our Delta tables to make them perform better for data analysts’ queries. We will look at a few examples of how you can analyze a query and determine what to focus on to deliver better performance results.
DataOps for the Modern Data Warehouse on Microsoft Azure @ NDC Oslo 2020 by Lace Lofranco
Talk Description:
The Modern Data Warehouse architecture is a response to the emergence of Big Data, Machine Learning and Advanced Analytics. DevOps is a key aspect of successfully operationalising a multi-source Modern Data Warehouse.
While there are many examples of how to build CI/CD pipelines for traditional applications, applying these concepts to big data analytical pipelines is a relatively new and emerging area. In this demo-heavy session, we will see how to apply DevOps principles to an end-to-end data pipeline built on the Microsoft Azure Data Platform with technologies such as Data Factory, Databricks, Data Lake Gen2, Azure Synapse, and Azure DevOps.
Resources: https://aka.ms/mdw-dataops
Introduction to DataOps and AIOps (or MLOps) by Adrien Blind
This presentation introduces the audience to the DataOps and AIOps practices. It deals with organizational & tech aspects, and provides hints to start your data journey.
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ram Dhakne (HostedbyConfluent)
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ram Dhakne | Current 2022
A well-architected data lakehouse provides an open data platform that combines streaming with data warehousing, data engineering, data science and ML. This opens a world beyond streaming to solving business problems in real-time with analytics and AI. See how companies like Albertsons have used Databricks and Confluent together to combine Kafka streaming with Databricks for their digital transformation.
In this talk, you will learn:
- The built-in streaming capabilities of a lakehouse
- Best practices for integrating Kafka with Spark Structured Streaming
- How Albertsons architected their data platform for real-time data processing and real-time analytics
What is data literacy? Which organizations, and which workers in those organizations, need to be data-literate? There are seemingly hundreds of definitions of data literacy, along with almost as many opinions about how to achieve it.
In a broader perspective, companies must consider whether data literacy is an isolated goal or one component of a broader learning strategy to address skill deficits. How does data literacy compare to other types of skills or “literacy” such as business acumen?
This session will position data literacy in the context of other worker skills as a framework for understanding how and where it fits and how to advocate for its importance.
DataOps: An Agile Method for Data-Driven Organizations by Ellen Friedman
DataOps expands the DevOps philosophy to include data-heavy roles (data engineering & data science). DataOps uses better cross-functional collaboration to deliver flexibility, fast time to value, and an agile workflow for data-intensive applications, including machine learning pipelines. (Strata Data San Jose, March 2018)
Delivering Data Democratization in the Cloud with Snowflake by Kent Graziano
This is a brief introduction to the Snowflake Cloud Data Platform and our revolutionary architecture. It includes a discussion of some of our unique features, along with some real-world metrics from our global customer base.
The financial industry operates on a variety of different data and computing platforms. Integrating these different sources into a centralized data lake is crucial to support reporting and analytics tools.
Apache Spark is becoming the tool of choice for big data integration and analytics due to its scalable nature and because it supports processing data from a variety of data sources and formats such as JSON, Parquet, Kafka, etc. However, one of the most common platforms in the financial industry is the mainframe, which does not provide easy interoperability with other platforms.
COBOL is the most used language in the mainframe environment. It was designed in 1959 and evolved in parallel with other programming languages, thus having its own constructs and primitives. Furthermore, data produced by COBOL is EBCDIC-encoded and has a different binary representation of numeric data types.
We have developed Cobrix, a library that extends the Spark SQL API to allow direct reading of binary files generated by mainframes.
While projects like Sqoop focus on transferring relational data by providing direct connectors to a mainframe, Cobrix can be used to parse and load hierarchical data (from IMS, for instance) after it is transferred from a mainframe by dumping records to a binary file. The schema should be provided as a COBOL copybook, which can contain nested structures and arrays. We present how the schema mapping between COBOL and Spark was done and how it was used in the implementation of the Spark COBOL data source. We also present use cases of simple and multi-segment files to illustrate how we use the library to load data from mainframes into our Hadoop data lake.
Designing and Building Next Generation Data Pipelines at Scale with Structured Streaming by Databricks
Lambda architectures, data warehouses, data lakes, on-premise Hadoop deployments, elastic Cloud architecture… We’ve had to deal with most of these at one point or another in our lives when working with data. At Databricks, we have built data pipelines, which leverage these architectures. We work with hundreds of customers who also build similar pipelines. We observed some common pain points along the way: the HiveMetaStore can easily become a bottleneck, S3’s eventual consistency is annoying, file listing anywhere becomes a bottleneck once tables exceed a certain scale, there’s not an easy way to guarantee atomicity – garbage data can make it into the system along the way. The list goes on and on.
Fueled with the knowledge of all these pain points, we set out to make Structured Streaming the engine to ETL and analyze data. In this talk, we will discuss how we built robust, scalable, and performant multi-cloud data pipelines leveraging Structured Streaming, Databricks Delta, and other specialized features available in Databricks Runtime such as file notification based streaming sources and optimizations around Databricks Delta leveraging data skipping and Z-Order clustering.
You will walk away with the essence of what to consider when designing scalable data pipelines with the recent innovations in Structured Streaming and Databricks Runtime.
The catalyst for the success of automobiles came not through the invention of the car but rather through the establishment of an innovative assembly line. History shows us that the ability to mass produce and distribute a product is the key to driving adoption of any innovation, and machine learning is no different. MLOps is the assembly line of Machine Learning and in this presentation we will discuss the core capabilities your organization should be focused on to implement a successful MLOps system.
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) that is a tool for curating and processing massive amounts of data and developing, training and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and the Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
Data Security and Protection in DevOps by Karen Lopez
Presentation to the London #WinOps event, Sept 2019, focusing on data security, privacy, and protection in DevOps efforts. Includes data masking, dev and test data, Always Encrypted, and more.
Bridging the Gap: Analyzing Data in and Below the Cloud by Inside Analysis
The Briefing Room with Dean Abbott and Tableau Software
Live Webcast July 23, 2013
http://www.insideanalysis.com
Today’s desire for analytics extends well beyond the traditional domain of Business Intelligence. That’s partly because business users are realizing the value of mixing and matching all kinds of data, from all kinds of sources. One emerging market driver is Cloud-based data, and the desire companies have to analyze this data cohesively with their on-premise data sets.
Register for this episode of The Briefing Room to learn from Analyst Dean Abbott, who will explain how the ability to access data in the cloud can play a critical role for generating business value from analytics. He’ll be briefed by Ellie Fields of Tableau Software who will tout Tableau’s latest release, which includes native connectors to cloud-based applications like Salesforce.com, Amazon Redshift, Google Analytics and BigQuery. She’ll also demonstrate how Tableau can combine cloud data with other data sources, including spreadsheets, databases, cubes and even Big Data.
Regulatory compliance mandates have historically focused on IT & endpoint security as the primary means to protect data. However, as our digital economy has increasingly become software dependent, standards bodies have dutifully added requirements as they relate to development and deployment practices. Enterprise applications and cloud-based services constantly store and transmit data; yet, they are often difficult to understand and assess for compliance.
This webcast will present a practical approach towards mapping application security practices to common compliance frameworks. It will discuss how to define and enact a secure, repeatable software development lifecycle (SDLC) and highlight activities that can be leveraged across multiple compliance controls. Topics include:
* Consolidating security and compliance controls
* Creating application security standards for development and operations teams
* Identifying and remediating gaps between current practices and industry-accepted "best practices"
Your database holds your company's most sensitive and important assets: your data. All those customers' personal details, credit card numbers, and social security numbers; you can't afford to leave them vulnerable to any breach, outside or inside.
Applying Auto-Data Classification Techniques for Large Data Sets by Priyanka Aash
In the current data security landscape, large volumes of data are being created across the enterprise. Manual techniques to inventory and classify data make it a tedious and expensive activity. To create a time- and cost-effective implementation of security and access controls, it becomes key to automate the data classification process.
(Source: RSA USA 2016-San Francisco)
Designing for Data Security by Karen Lopez
As security and compliance become more important for organizations, especially in the age of GDPR and other data-breach legislation, Karen covers the types of features data architects and designers should consider when building modern, protected, and defensive systems.
Building a Data Driven Culture and AI Revolution With Gregory Little | Current 2022 (HostedbyConfluent)
Building a Data Driven Culture and AI Revolution With Gregory Little | Current 2022
Transforming business or mission through AI/ML doesn't start with technology but with culture…and an audit. At least as much is true for the US Department of Defense (DoD), which presents significant modernization challenges because of its mission scope, expansive global footprint, and massive size: with over 2.8 million people, it is the largest employer in the world. Greg Little discusses how establishing the DoD’s annual audit became a surprising accelerator for the department’s data and analytics journey. It revealed the foundational needs for data management to run an enterprise with $3 trillion in assets, and its successful implementation required breaking through deeply entrenched cultural and organizational resistance across DoD.
In this session, Greg will discuss what it will take to guide the evolution of technology and culture in parallel: leadership, technology that enables rapid scale and a complete & reliable data flow, and a data-driven culture.
Automating Machine Learning, Artificial Intelligence, and Data Science Processes by Ali Alkan
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi | Automating Machine Learning, Artificial Intelligence, and Data Science | Guided Analytics
Data Profiling: The First Step to Big Data Quality by Precisely
Big data offers the promise of a data-driven business model generating new revenue and competitive advantage fueled by new business insights, AI, and machine learning. Yet without high quality data that provides trust, confidence, and understanding, business leaders continue to rely on gut instinct to drive business decisions.
The critical foundation and first step to deliver high quality data in support of a data-driven view that truly leverages the value of big data is data profiling - a proven capability to analyze the actual data content and help you understand what's really there.
View this webinar on-demand to learn five core concepts to effectively apply data profiling to your big data, assess and communicate the quality issues, and take the first step to big data quality and a data-driven business.
Challenges of Operationalising Data Science in Production by iguazio
The presentation topic for this meet-up was covered in two sections, without any breaks in between.
Section 1: Business Aspects (20 mins)
Speaker: Rasmi Mohapatra, Product Owner, Experian
https://www.linkedin.com/in/rasmi-m-428b3a46/
Once your data science application is in production, there are many typical data science operational challenges experienced today across business domains; we will cover a few of these challenges with example scenarios.
Section 2: Tech Aspects (40 mins, slides & demo, Q&A)
Speaker: Santanu Dey, Solution Architect, Iguazio
https://www.linkedin.com/in/santanu/
In this part of the talk, we will cover how these operational challenges can be overcome, e.g. automating data collection & preparation, making ML models portable & deploying them in production, monitoring and scaling, etc., with relevant demos.
Machine learning is increasingly being used by companies as a disruptor or to provide a USP. This means that machine learning models need to cope with being a critical part of solutions, and if those solutions handle PCI-DSS or PII data then the models must be highly secure.
In addition, if a machine learning model is part of your USP then you will want to protect it. Also, the EU AI Regulation and the UK AI Strategy mean that AI is becoming increasingly regulated. This means you need to be able to prove which model made a prediction, and why it made it, by providing auditability and explainability.
In this talk we go over these issues and how to address them, including using AWS and how to implement development best practices.
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the Enterprise by DATAVERSITY
Many data scientists are well grounded in delivering results in the enterprise, but many come from outside it: from academia, from PhD programs, and from research. They have the necessary technical skills, but those skills don’t count until their product gets to production and into use. The speaker recently helped a struggling data scientist understand his organization and how to create success in it. That turned into this presentation, because many new data scientists struggle with the complexities of an enterprise.
The data services marketplace is enabled by a data abstraction layer that supports rapid development of operational applications and single-data-view portals. In this presentation you will learn about:
- Reference architecture for enterprise data services marketplace
- Modality and latency of data access
- Customer use cases and demo
This presentation is part of the Denodo Educational Seminar, and you can watch the video here: goo.gl/vycYmZ.
Modernizing, Migrating & Mitigating - Moving to Modern Cloud & API Web Apps W... by Security Innovation
This talk will help you, as a decision maker or architect, to understand the risks of migrating a thick client or traditional web application to the modern web. In this talk I’ll give you tools and techniques to make the migration to the modern web painless and secure so you can mitigate common pitfalls without having to make the mistakes first. I’ll be doing demos, and telling lots of stories throughout.
Making some good architectural decisions up front can help you:
- Minimize the risk of data breach
- Protect your user’s privacy
- Make secure choices the easy default for your developers
- Understand the cloud security model
- Create defaults, policies, wrappers, and guidance for developers
- Detect when developers have bypassed security controls
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for... by Precisely
The advanced analytics and AI that run today’s businesses rely on a larger volume, and greater variety, of data. This data needs to be of the highest quality to ensure the best possible outcomes, but traditional data quality tools weren’t designed for today’s modern data environments.
That’s why we’ve developed Trillium DQ for Big Data -- an integrated product that delivers industry-leading data profiling and data quality at scale, in the cloud or on premises.
In this on-demand webcast, you will learn how Trillium DQ:
• Empowers data analysts to easily profile large, diverse data sources to discover new insights, uncover issues, and report on their findings – all without involving IT.
• Delivers best-in-class entity resolution to support mission-critical applications such as Customer 360, fraud detection, AML, and predictive analytics.
• Supports Cloud and hybrid architectures by providing consistent high-performance processing within critical time windows on all platforms.
• Keeps enterprise data lakes validated, clean, and trusted with the highest quality data – without technical expertise in big data or distributed architectures.
• Enables data quality monitoring based on targeted business rules for data governance and business insight.
Qiagram is a collaborative visual data exploration environment that enables investigator-initiated, hypothesis-driven data exploration, allowing business users as well as IT professionals to easily ask complex questions against complex data sets.
Similar to Data Modeling for Security, Privacy and Data Protection
Slide deck for the DGIQ SIG on AI Ethics.
Are you concerned about data and AI ethics? Do you worry about how to make sure the algorithms and systems that affect our lives are fair, honest, responsible, and respectful of our rights and values? Do you have opinions about how to build an organizational culture that cares about these topics?
Join us for what will surely be a lively and interesting session where you are the speakers.
Special interest group (SIG) discussions are group conversations on topics that are new, or specific to an audience segment. The format is casual and without any formal presentation. The objective is to engage all participants in an exchange of ideas, questions, and advice, so please come with a willingness to participate in the conversation.
A Designer's Favourite Security and Privacy Features in SQL Server and Azure... by Karen Lopez
SQL Server includes multiple features that focus on data security, privacy, and developer productivity. In this session, we will review the best features from a database designer’s and developer’s point of view.
– Always Encrypted
– Dynamic Data Masking
– Row Level Security
– Data Classification
– Assessments
– Defender for SQL Server
– Ledger Tables
…and more
We’ll look at new and older features, why you should consider them, where they work, where they don’t, who needs to be involved in using them, and what changes, if any, need to be made to applications or tools that you use with SQL Server.
You will learn:
– The pros and cons of implementing each feature
– How implementing these new features may impact existing applications
– 10 tips for enhancing SQL Server security and privacy protections
Designer's Favorite New Features in SQL Server by Karen Lopez
A database designer's favourite features in SQL Server... with a bit of Azure SQL DB, too.
Always Encrypted
Row Level Security
Microsoft Purview
Azure Enabled SQL Server
Azure Defender for SQL
Azure Defender for Cloud
Dynamic Data Masking
Ledger Database and Tables
Data Privacy
Data Governance
Karen's Presentation to DAMA Chicago and other DAMA Chapters on 15 February 2023.
This presentation is less about data lakes than it is about data quality and how data professionals should think about designing and architecting systems that best meet the needs of how data works in the real world.
Expert Cloud Data Backup and Recovery Best Practices by Karen Lopez
We’ve been deploying backup solutions since the beginning of computing, and the foundations of backup and recovery have stayed the same: make sure backups run consistently and set recovery objectives. Yet systems in 2022 don’t work or act the same way they did decades ago. Cloud data backups have helped us meet the need for offsite backups, as well as changed how we budget for them. Ransomware has impacted how we store them. The laws of physics might be more of an issue than when we had tapes stored in a safe down the hall. Cost models have changed, too.
In this session, Karen Lopez covers best practices for modern data recovery…and she will share stories of worst practices just to keep it real.
Manage Your Time So It Doesn't Manage You by Karen Lopez
NASA Space Apps NYC Pre-Hackathon Symposium presentation by Karen Lopez, InfoAdvisors and NASA Datanaut. Karen presents on how to successfully manage your time and deliverables in the NASA Space Apps Challenge no matter where you are participating.
This one-hour presentation covers the tools and techniques for migrating SQL Server databases and data to Azure SQL DB or SQL Server on VM. Includes SSMA, DMA, DMS, and more.
Blockchain for the DBA and Data Professional by Karen Lopez
An overview of blockchain fundamentals, including examples of Oracle 20c Blockchain Tables. Includes concepts of trust, immutability, hashes, distributed nodes, and cryptography.
Blockchain for the DBA and Data Professional by Karen Lopez
With all the hype around blockchain, why should a DBA or other data professional care? In this session, we will cover the basics of blockchain as it applies to data and database processes:
Immutability
Verification
Distribution
Cryptography
Transactions
Trust
We will look at current offerings for blockchain features in Azure and in database and data stores. Finally, we'll help you identify the types of business requirements that need blockchain technologies.
You will learn:
The valid uses of blockchain approaches in databases
How current technologies support blockchain approaches
The costs, benefits, and risks of blockchain
There are many data modeling and database design terms and jargon that use the word "key." Do you know the difference between a surrogate key and a primary key? A super key and a candidate key? Could you explain them to a technical audience? To a business user or an auditor?
In this presentation, Karen Lopez covers the concepts of primary keys, foreign keys, candidate key, surrogate keys, and more.
How to Survive as a Data Architect in a Polyglot Database World by Karen Lopez
Karen Lopez talks to data architects and data modelers about how they can best deliver value on modern data-driven projects beyond relational database technologies. She covers NoSQL databases and datastores, which scenarios they best fit, and which ones they don't. She ends with 10 tips for adding more value to polyglot database solutions.
Karen's Favourite Features of SQL Server 2016 by Karen Lopez
Slides from a one-hour webinar on Karen Lopez's favorite features from a database designer's point of view. Topics include Always Encrypted, Data Masking, Row Level Security, Foreign Keys, JSON, and more.
Notice an error? Let me know. I welcome this sort of feedback.
In the spirit of the book Seven Databases in Seven Weeks, Lara Rubbelke and Karen Lopez cover ~seven databases and datastores in the SQL and NoSQL world, when to use them, and how they are SQL-like.
From SQLBits XV
Notice an error? Let me know. I welcome this sort of feedback.
10 Physical Data Modeling Blunders by Karen Lopez
Karen Lopez's presentation about 10 Physical Data Modeling/Database Design blunders, based on her work in helping organizations get the most value out of their models and data.
Notice an error? Let me know. I welcome this sort of feedback.
NoSQL and Data Modeling for Data Modelers by Karen Lopez
Karen Lopez's presentation for data modelers and data architects. Why data modeling is still relevant for big data and NoSQL projects.
Plus 10 tips for data modelers working on NoSQL projects.
Opendatabay - Open Data Marketplace by Opendatabay
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... by John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT... by Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Data Modeling for Security, Privacy and Data Protection
1. Data Modeling for Security and Privacy
Karen Lopez, Data Evangelist, InfoAdvisors
www.datamodel.com
2. Abstract
Modern database systems have introduced more support for security, privacy, and compliance over the last few years. We expect this to increase as compliance issues such as GDPR and other data compliance challenges arise. In this session, Karen will be discussing the newer features from a data modeler's/database designer's point of view, including:
Data Masking
End-to-End Encryption
Row Level Security
New Data Types
Data Categorization and Classification
We'll look at the new features, why you should consider them, where they work, and where they don't. We will also discuss how to negotiate on behalf of data protection in a world of Agile, MVP, Lean and DevOps. This session is hands-on with demos and labs, so bring your own laptop to participate.
3. Karen Lopez
• Karen has 20+ years of data and information architecture experience on large, multi-project programs.
• She is a frequent speaker on data modeling, data-driven methodologies and pattern data models.
• She wants you to love your data.
8. About this session
• Mostly transactional discussions
• Variety of skills & experience in teams
• Time limits
• Inspire you to learn
• Our style
• “At another company”
• Giving you tools & approaches
• Some checklist items
• Mostly analytical and practical learning
• Tools are for examples
10. Ready for 25 May?
Callers asked me:
• How can we get started?
• Can you help us get certified?
• Do you have software for this?
• Do you have a couple of weeks to help us get this done?
11. Karen’s Governance Position
Security at the data level
Models capture security & privacy requirements
Management reports of reviews
Measurement
In other words, Governance
12. Data Models
• Karen’s Preference
• Track all kinds of metadata
• Advanced Compare features
• Support DevOps and iterative development
• Support Conceptual, Logical and Physical design
20. Data Curation
Related to Data Stewardship
Covers more than Data Categorization
Important part of Data Governance
New-ish term going into GDPR and other protection concepts
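One way to make categorization actionable in the database itself is SQL Server's sensitivity classification DDL. A minimal sketch, assuming SQL Server 2019+ or Azure SQL Database; the table, column, and labels here are hypothetical examples:

```sql
-- Sketch: tagging a column with sensitivity metadata (SQL Server 2019+ /
-- Azure SQL Database). The table name and labels are hypothetical.
ADD SENSITIVITY CLASSIFICATION TO dbo.Customer.Email
    WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Contact Info');

-- Review everything classified so far in this database.
SELECT * FROM sys.sensitivity_classifications;
```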
21. One more time…
Every Design Decision must be based on Cost, Benefit and Risk
www.datamodel.com
23. Catalog Data Assets
Every compliance effort starts with inventory
Capture the hard work of every project
Build incrementally
Start with what exists physically
24. Azure Data Catalog
Azure Data Catalog is a fully managed cloud service whose users can discover the data sources they need and understand the data sources they find. At the same time, Data Catalog helps organizations get more value from their existing investments.
29. Data Objects/Assets
• A metadata representation in Data Catalog of a real-world data object. Examples include tables, views, files, reports, and so on.
37. Issues
• Data Scientists spend 80% of their time sourcing, prepping and cleansing data
• Likely everyone else has these issues
• We are lousy at documenting data and metadata
• This makes Karen sad
38. Lab 1 Discussion
• When would you be “done” discovering?
• How would you know you were done?
• Would you be able to do all the datasets?
• How would you prioritize the work?
• What skills would you need?
• What went right? Wrong?
• What would make this easier?
45. Dynamic Data Masking
• Column level
• Data in the database, at rest, is not masked
• Meant to complement other methods
• Performed at the end of a database query, right before data is returned
• Performance impact is small
47. DDM Functions
Function | Mask | Example
Default | Based on datatype: String – xxxx; Numbers – 0; Date & Times – 01.01.2000 00:00:00.0000000; Binary – a single byte 0 | xxxx / 0 / 01.01.2000 00:00:00.0000000 / 0
Email | First character of the email, then Xs, then .com (always .com) | Kxxx@xxxx.com
Custom | First and last values, with Xs in the middle | kxxxn
Random | For numeric types, with a range | 12
48. Dynamic Data Masking
• Data in the database is not changed
• Ad-hoc queries *can* expose data
• Does not aim to prevent users from exposing pieces of sensitive data
49. Dynamic Data Masking
• Cannot mask an encrypted column (Always Encrypted)
• Cannot be configured on a computed column
• But if a computed column depends on a masked column, the mask is returned
• Using SELECT INTO or INSERT INTO results in masked data being inserted into the target (also for import/export)
50. Why would a DB Designer love it?
• Allows central, reusable design for standard masking
• Offers more reliable and more usable masking
• Applies across applications
• Removes whining about “we can do that later”
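To make this concrete, here is a minimal T-SQL sketch of Dynamic Data Masking using the built-in mask functions from the table above; the dbo.Customer table and the AnalystRole role are hypothetical examples:

```sql
-- Sketch: the four built-in Dynamic Data Masking functions on a
-- hypothetical table. Data at rest is unchanged; only query results
-- are masked for principals without the UNMASK permission.
CREATE TABLE dbo.Customer
(
    CustomerID  int IDENTITY PRIMARY KEY,
    FullName    nvarchar(100) MASKED WITH (FUNCTION = 'partial(1,"xxx",1)'),
    Email       nvarchar(200) MASKED WITH (FUNCTION = 'email()'),
    CreditLimit money         MASKED WITH (FUNCTION = 'default()'),
    LoyaltyCode smallint      MASKED WITH (FUNCTION = 'random(1, 100)')
);

-- Grant UNMASK only to principals who genuinely need the real values.
GRANT UNMASK TO AnalystRole;
```

Because the mask lives in the column definition, every application that queries the table gets the same protection, which is exactly the central, reusable design point above.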
52. Security – Row Level Security
• Filtering result sets (predicate-based access)
• Predicates applied when reading data
• Can be used to block write access
• User-defined policies tied to inline table functions
53. Row Level Security
• No indication that results have been filtered
• If all rows are filtered, then a NULL set is returned
• For block predicates, an error is returned
• Works even if you are dbo or in the db_owner role
54. Why would a DB Designer love it?
• Allows a designer to do this sort of data protection IN THE DATABASE, not just rely on code.
• Many, many pieces of code
• Applies across applications
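A minimal T-SQL sketch of the inline-table-function pattern described above, assuming a hypothetical dbo.Orders table whose SalesRep column holds database user names (the schema, function, and policy names are illustrative):

```sql
-- Sketch: predicate-based Row Level Security on a hypothetical table.
CREATE SCHEMA Security;
GO
-- Inline table-valued function: returns a row only when the predicate holds.
CREATE FUNCTION Security.fn_OrderFilter (@SalesRep sysname)
    RETURNS TABLE
    WITH SCHEMABINDING
AS
    RETURN SELECT 1 AS fn_result WHERE @SalesRep = USER_NAME();
GO
-- FILTER hides rows on read; BLOCK prevents inserts that violate the predicate.
CREATE SECURITY POLICY Security.OrderPolicy
    ADD FILTER PREDICATE Security.fn_OrderFilter(SalesRep) ON dbo.Orders,
    ADD BLOCK PREDICATE Security.fn_OrderFilter(SalesRep) ON dbo.Orders AFTER INSERT
    WITH (STATE = ON);
```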
60. Why would a DB Designer love it?
• Always Encrypted, yeah.
• Allows designers to not only specify which columns need to be protected, but how.
• Parameters are encrypted as well
• Built in to the engine, easier for Devs
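As a hedged illustration of specifying not only which columns are protected but how, here is a sketch of Always Encrypted column DDL, assuming a column encryption key named CEK_Auto1 has already been provisioned; the table and columns are hypothetical:

```sql
-- Sketch: Always Encrypted column definitions. Deterministic encryption
-- supports equality lookups and requires a BIN2 collation on strings;
-- randomized encryption is stronger but not searchable.
CREATE TABLE dbo.Patient
(
    PatientID int IDENTITY PRIMARY KEY,
    SSN char(11) COLLATE Latin1_General_BIN2
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK_Auto1,
                        ENCRYPTION_TYPE = DETERMINISTIC,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'),
    BirthDate date
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK_Auto1,
                        ENCRYPTION_TYPE = RANDOMIZED,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256')
);
```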
61. What should we STOP doing?
Nobody ever talks about this….
62. SQL Injection
• WE ARE STILL DOING THIS!
• IT’S STILL THE #1 (but unsecured storage is getting more popular)
• TEST. TEST SOME MORE
• Automated Testing
• Governance is important
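The core defense remains parameterization. A minimal sketch follows (the table and column names are hypothetical); building the same statement by string concatenation would be the injectable anti-pattern:

```sql
-- Sketch: parameterized dynamic SQL. The user-supplied value travels as a
-- typed parameter, never as part of the SQL text, so it cannot inject.
DECLARE @Search nvarchar(100) = N'O''Brien; DROP TABLE dbo.Customer; --';

DECLARE @sql nvarchar(max) =
    N'SELECT CustomerID, FullName FROM dbo.Customer WHERE FullName = @Name;';

EXEC sp_executesql @sql, N'@Name nvarchar(100)', @Name = @Search;
```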
66. Test Data
• Restoring Production to Development
• Restoring Production, with Masking
• Restoring Production, with Randomizing
• Restoring Production…anywhere
• Design Test Data
• Lorem Ipsum for Data
• Really, Design Test Data
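If production data must be restored to development at all, the randomizing step might look like this hedged sketch (table, columns, and value patterns are hypothetical):

```sql
-- Sketch: scrambling sensitive columns in a restored dev copy so test data
-- keeps a realistic shape without exposing real customer values.
UPDATE dbo.Customer
SET FullName  = CONCAT('Test Customer ', CustomerID),
    Email     = CONCAT('user', CustomerID, '@example.com'),
    BirthDate = DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 20000, '1950-01-01');
```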
67. What Skills Do Data Professionals Need for Data Protection?
No one ever talks about this….
68. Big Data and Analytics
Level: Literacy and Hands On
Why: These new technologies and techniques are becoming mainstream in most shops, whether they are installed or software as a service. Plus, we need to use them on our own data.
Who: All IT roles, especially data stewarding ones.
69. Literacy with Deep Learning, AI, Machine Learning
Level: Literacy +++
• How are they used?
• What are the real-life uses today?
• Future uses
• Privacy and Security requirements
• Compliance trade-offs
• Employee Monitoring
70. Data Quality & Reliability
Level: Active Skills
• Is the data right?
• Is it current?
• Should it be there at all?
• Do we know where it came from?
• Do we know it was calculated correctly?
• Are there any known anomalies?
71. How can we do all this?
• Cloud services are a fantastic way to learn and get hands-on skills.
• Online tutorials are often free and self-guided
• Learn from experts & case studies
• Deprioritize tasks that are really just being done for tradition
• Hire help
• Automate away some tasks to make more time