Privacy has become one of the most critical topics in data today. It is about more than how we ingest and consume data; it is about how you protect your customers' rights while balancing the business need. In this session, we bring Privacera CTO Don Bosco Durai together with Northwestern Mutual to detail an important use case in privacy and then show how to scale privacy with a focus on the business needs, making scaling effortless.
Migrate and Modernize Hadoop-Based Security Policies for Databricks (Databricks)
Data teams face a variety of tasks when migrating Hadoop-based platforms to Databricks. A common pitfall arises during migration, when often-overlooked access control policies block adoption. This session will focus on best practices for migrating and modernizing Hadoop-based policies that govern data access (such as those in Apache Ranger or Apache Sentry). Data architects must consider new, fine-grained access control requirements when migrating from Hadoop architectures to Databricks in order to deliver secure access to as many data sets and data consumers as possible. This session will provide guidance across open source, AWS, Azure and partner tools, such as Immuta, on how to scale existing Hadoop-based policies to dynamically support more classes of users, implement fine-grained access control and leverage automation to protect sensitive data while maximizing utility — without manual effort.
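To make the modernization step concrete, here is a minimal sketch, assuming hypothetical schema, table, and group names, of expressing a Ranger-style column-masking rule as a dynamic view in Spark SQL on Databricks (is_member() is a built-in Databricks SQL function; partner tools like Immuta or Privacera would manage such policies centrally rather than hand-writing one view per table):

# Minimal sketch: a Ranger-style column-masking policy as a dynamic view.
# Schema, table, and group names are hypothetical.
spark.sql("""
  CREATE OR REPLACE VIEW secure.customers AS
  SELECT
    id,
    name,
    -- Members of the 'hr' group see the raw SSN; everyone else sees a mask.
    CASE WHEN is_member('hr') THEN ssn
         ELSE concat('XXX-XX-', substr(ssn, -4))
    END AS ssn
  FROM raw.customers
""")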
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ... (Data Con LA)
Learn how to benefit from IoT (internet of things) to reduce costs and spur transformation for your company and clients. Attendees will learn about building blocks to create an IoT solution, and walk through real life architectural decisions in building a solution.
Organizations grapple with manually classifying and inventorying distributed, heterogeneous data assets in order to deliver value. However, Azure's new enterprise service, Azure Synapse Analytics, is poised to help organizations fill the gap between data warehouses and data lakes.
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign... (Michael Rys)
Presentation by James Baker and me on running cost-effective big data workloads with Azure Synapse and Azure Data Lake Storage (ADLS) at Microsoft Ignite 2020. Covers the modern data warehouse architecture supported by Azure Synapse, integration benefits with ADLS, and features that reduce cost such as Query Acceleration, integration of Spark and SQL processing with integrated metadata, and .NET for Apache Spark support.
Databricks: A Tool That Empowers You To Do More With Data (Databricks)
In this talk we will present how Databricks has enabled the author to achieve more with data, enabling one person to build a coherent data project with data engineering, analysis and science components, with better collaboration, better productionization methods, larger datasets, and faster results.
The talk will include a demo illustrating how the multiple functionalities of Databricks help to build a coherent data project: Databricks Jobs, Delta Lake and Auto Loader for data engineering, SQL Analytics for data analysis, Spark ML and MLflow for data science, and Projects for collaboration.
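For flavor, here is a minimal Auto Loader sketch with placeholder paths (the cloudFiles source and Delta sink are Databricks features; this is an illustration, not the speaker's actual pipeline):

# Minimal Auto Loader sketch; all paths are placeholders.
df = (spark.readStream
      .format("cloudFiles")                               # Databricks Auto Loader source
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/meta/schemas/events")
      .load("/mnt/raw/events"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/meta/checkpoints/events")
   .trigger(once=True)                                    # run as a scheduled Databricks job
   .start("/mnt/bronze/events"))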
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
Cloud and Analytics - From Platforms to an Ecosystem (Databricks)
Zurich North America is one of the largest providers of insurance solutions and services in the world with customers representing a wide range of industries from agriculture to construction and more than 90 percent of the Fortune 500.
Azure Synapse is Microsoft's new cloud analytics service offering that combines enterprise data warehouse and Big Data analytics capabilities. It offers a powerful and streamlined platform to facilitate the process of consolidating, storing, curating and analysing your data to generate reliable and actionable business insights.
Modernizing to a Cloud Data Architecture (Databricks)
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how the benefits of elastic compute models helped one customer scale their analytics and AI workloads, along with best practices from their successful migration of data and workloads to the cloud.
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De... (Databricks)
Columbia is a data-driven enterprise, integrating data from all line-of-business systems to manage its wholesale and retail businesses. This includes integrating real-time and batch data to better manage purchase orders and generate accurate consumer demand forecasts.
Analytics-Enabled Experiences: The New Secret Weapon (Databricks)
Tracking and analyzing how our individual products come together has always been an elusive problem for Steelcase. Our problem can be thought of in the following way: “we know how many Lego pieces we sell, yet we don’t know what Lego set our customers buy.” The Data Science team took over this initiative, which resulted in an evolution of our analytics journey. It is a story of innovation, resilience, agility and grit.
The effects of the COVID-19 pandemic on corporate America shone a spotlight on office furniture manufacturers to find ways in which the office can be made safe again. The team would have never imagined how relevant our work on product application analytics would become. Product application analytics became an industry priority overnight.
The proposal presented this year is the story of how data science is helping corporations bring people back to the office and set the path to lead the reinvention of the office space.
After groundbreaking milestones to overcome technical challenges, the most important question is: What do we do with this? How do we scale this? How do we turn this opportunity into a true competitive advantage? The response: stop thinking about this work as a data science project and start to think about this as an analytics-enabled experience.
During our session we will cover the technical challenges that we overcame as a team to set up a pipeline that ingests semi-structured and unstructured data at scale, performs analytics, and produces digital experiences for multiple users.
This presentation will be particularly insightful for Data Scientists, Data Engineers and analytics leaders who are seeking to better understand how to augment the value of data for their organization.
Using Redash for SQL Analytics on Databricks (Databricks)
This talk gives a brief overview, with a demo, of performing SQL analytics with Redash on Databricks. We will introduce some of the new features coming as part of our integration with Databricks following the acquisition earlier this year, along with a demo of the other Redash features that enable a productive SQL experience on top of Delta Lake.
From Events to Networks: Time Series Analysis on Scale (Dr. Mirko Kämpf)
Event processing, time series aggregation and analysis, and finally analysis of structural patterns between those data snippets can all be done on Hadoop clusters over huge data volumes.
In order to find hidden relations and invisible structures, one has to combine three disciplines using a variety of tools. Luckily, the Hadoop ecosystem offers many such tools. In this session you can see practical examples and a demonstration of the "Hadoop-Oscilloscope". Generic analysis patterns and recommendations on selecting appropriate algorithms will also provide additional background.
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ... (Michael Rys)
SQLBits 2020 presentation on how you can build solutions based on the modern data warehouse pattern with Azure Synapse Spark and SQL including demos of Azure Synapse.
Big data requires a service that can orchestrate and operationalize processes to refine the enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
How to Build Continuous Ingestion for the Internet of Things (Cloudera, Inc.)
The Internet of Things is moving into the mainstream and this new world of data-driven products is transforming a vast number of industry sectors and technologies.
However, IoT creates a new challenge: how to build and operationalize continual data ingestion from such a wide and ever-changing array of endpoints so that the data arrives consumption-ready and can drive analysis and action within the business.
In this webinar, Sean Anderson from Cloudera and Kirit Busu, Director of Product Management at StreamSets, will discuss Hadoop's ecosystem and IoT capabilities and provide advice about common patterns and best practices. Using specific examples, they will demonstrate how to build and run end-to-end IoT data flows using StreamSets and Cloudera infrastructure.
Part 3 - Modern Data Warehouse with Azure Synapse (Nilesh Gule)
Slide deck of the third part of building a Modern Data Warehouse using Azure. This session covered Azure Synapse, formerly SQL Data Warehouse. We look at the Azure Synapse architecture, external files, and integration with Azure Data Factory.
The recording of the session is available on YouTube
https://www.youtube.com/watch?v=LZlu6_rFzm8&WT.mc_id=DP-MVP-5003170
This presentation starts off by discussing powerful examples of The Power of Data and the benefits of data-driven architectures. A Data Governance program is important for the success of data-driven architectures. We then discuss the challenges of implementing a Data Governance framework on a Big Data data lake with open source software including DataPlane, Apache Atlas and Apache Ranger. And finally, we discuss the importance of the democratization of data and switching to a speed-of-thought framework with Hive LLAP.
Privacera and Northwestern Mutual - Scaling Privacy in a Spark Ecosystem (Privacera)
Privacera and Customer Northwestern Mutual Present "How to Scale Privacy in a Spark Ecosystem" at Data + AI Summit 2021
Privacera customer Aaron Colcord, Sr. Director of Data Engineering at Northwestern Mutual, and Don Bosco Durai, CTO and co-founder of Privacera, detail an important use case in privacy and demonstrate how the financial security leader scales privacy with a focus on the business needs. Privacy has become one of the most critical topics in data today: it is about more than how to ingest and consume data; it is about how to protect customers' rights while balancing business needs.
Comprehensive Security for the Enterprise IV: Visibility Through a Single End... (Cloudera, Inc.)
To provide visibility and transparency into your data and usage, Cloudera Enterprise has Navigator, the only native end-to-end governance solution for Apache Hadoop. In this webinar we discuss why Navigator is a key part of comprehensive security and discuss its key features including: auditing, access control, data discovery and exploration, lineage, and lifecycle management. Live demo also included.
Microsoft Cloud GDPR Compliance Options (SUGUK) (Andy Talbot)
Recently, Microsoft introduced Microsoft 365, which brings together Office 365, Windows 10, and Enterprise Mobility + Security. We'll explore what this combination of products means for an organisation looking to ensure GDPR compliance, and the additional Office 365 products you can layer on to help meet your obligations.
GDPR - Why it matters and how to make it Easy (Paul McQuillan)
Looking at the rationale for the new #GDPR Data Regulations, the principles behind the regulation, how this impacts #CRM, and how to make compliance easier.
Data Governance, Compliance and Security in Hadoop with Cloudera (Caserta)
In our recent Big Data Warehousing Meetup, we discussed Data Governance, Compliance and Security in Hadoop.
As the Big Data paradigm becomes more commonplace, we must apply enterprise-grade governance capabilities to critical data that is highly regulated and must adhere to stringent compliance requirements. Caserta and Cloudera shared techniques and tools that enable data governance, compliance and security on Big Data.
For more information, visit www.casertaconcepts.com
CRMCS GDPR - Why it matters and how to make it Easy (Paul McQuillan)
CRM has focused on User Adoption and Business Alignment; however, technology is rewriting the rules.
This brings new opportunities but also new responsibilities for conduct in the Data Economy – notably the introduction of GDPR.
Paul will illustrate why the ethos behind GDPR will sit at the heart of the new relationship we will have with the customer, and how to realise the opportunity in having a customer-centric approach to our business.
September 14, 2016 - Austin SharePoint User Group
What does governance mean in SharePoint? How do you get to good governance? Do you really need governance? What happens if you don’t have governance, or do it poorly?
Jim brings his experience building SharePoint governance in multiple organizations. The session covers governance basics to help get you going in the right direction.
(Unlike the "Group Therapy" session, this is a straight-up presentation, though the Q&A at the end can be used by the audience to ask their specific questions)
This slide deck is from the presentation on September 14, 2016 at the Austin O365 & SharePoint User Group.
My presentation to the Oklahoma City SharePoint User Group, September 7, 2016
The basics of SharePoint Governance - what you need to consider when implementing governance, how to create a plan, and how to make governance work in the long term.
Webinar - Compliance with the Microsoft Cloud - 2017-04-19 (TechSoup)
Everyone throws around the word compliance but how do you actually achieve that? In this free, 60-minute webinar Sam Chenkin from Tech Impact discusses achievable goals for the nonprofit community to keep their data safe with the Microsoft Cloud. We explore account security like two-factor authentication, data security like encryption, and how to make sure only compliant devices can access your data.
Cybersecurity and Data Protection Executive Briefing (RFA)
On 23rd February at the Italian Chamber of Commerce and Industry, RFA and 3 Lines of Defence Consulting presented an executive briefing on cybersecurity and data protection. Topics covered included cyber threats, principles of data processing, and GDPR.
Joe Caserta's 2016 Data Summit Workshop "Introduction to Data Science with Hadoop" on May 9 expanded on his Intro to Data Science Workshop held at last year's Summit. Again, Joe presented to a standing-room-only audience with a focus on the data lake, governance and the role of the data scientist.
For more information on Caserta Concepts, visit our website: http://casertaconcepts.com/
IDERA Live | Understanding SQL Server Compliance both in the Cloud and On Pre... (IDERA Software)
You can watch the replay for this IDERA Live webcast, Understanding SQL Server Compliance both in the Cloud and On Premises, on the IDERA Resource Center, http://ow.ly/tJ3V50A4rPD.
Every industry has its own regulatory compliance guidelines. On top of that, if you want to collect credit card information you must be PCI compliant. If you are trading on a Stock Exchange you must be SOX compliant. If you gather information on EU Members, you must be GDPR compliant. The list of regulations can be lengthy for an organization and some of those regulations may conflict with each other. With more companies moving to the cloud, it is even more important to review your compliance processes. With this session, we will explore the complex world of regulations and how that applies to how you collect and maintain your data.
Speaker: Kim Brushaber is the Senior Product Manager for SQL Compliance Manager at IDERA. Kim has over 20 years of experience as a Business Analyst, Software Developer, Product Manager and IT Executive. Kim enjoys working as the translator between the business and the technical teams in an organization.
The General Data Protection Regulation (GDPR) went into effect on May 25, 2018, and this has immediate implications for handling data in your big data, machine learning, and analytics environments. Traditional architectural approaches will need to be adjusted to be compliant with several of the provisions. The good news is that Cloudera can help you!
Webinar presented on Oct 21st (US) and Oct 23rd (EMEA), 2014 by Christian Buckley, Managing Director at GTconsult and Steve Marsh, Director of Product Marketing at Metalogix.
Similar to Scaling Privacy in a Spark Ecosystem
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark (see the sketch after this list)
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
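Zillow's platform itself is internal, but a minimal sketch of the kind of single-pass Spark validation it describes could look like this (the DataFrame `df`, its column names, and the checks are illustrative assumptions):

from pyspark.sql import functions as F

# Illustrative expectations a producer might register for a dataset 'df'.
expectations = {
    "null_user_ids": F.sum(F.col("user_id").isNull().cast("int")),
    "negative_prices": F.sum((F.col("price") < 0).cast("int")),
}

# Compute every check in a single pass over the data.
row = df.agg(*[expr.alias(name) for name, expr in expectations.items()]).first()
failures = {name: row[name] for name in expectations if row[name] > 0}

if failures:
    # Flag bad data at the earliest stage, before downstream consumers use it.
    raise ValueError(f"Data quality checks failed: {failures}")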
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
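Stitch Fix's API is internal and not public; purely as a hypothetical sketch of the "write a simple function, call an intuitive API" pattern the talk describes:

# Hypothetical sketch only: the registry and decorator below are invented
# for illustration and are not Stitch Fix's real interface.
_REGISTRY = {}

def model(name, version):
    """Register a plain Python function as a deployable model; the platform
    would handle online deployment, batch execution on Spark, and metrics
    tracking behind this interface."""
    def wrap(fn):
        _REGISTRY[(name, version)] = fn
        return fn
    return wrap

@model(name="size_recommender", version="1.2.0")
def predict(features: dict) -> float:
    return 0.42 * features["height_cm"] + 0.1 * features["weight_kg"]

print(_REGISTRY[("size_recommender", "1.2.0")]({"height_cm": 170.0, "weight_kg": 65.0}))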
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
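A minimal sketch of the PySpark stage-level scheduling API (Spark 3.1+). It applies to RDD operations and assumes a cluster with dynamic allocation and GPU discovery already configured; `raw_records`, `prepare`, and `train_partition` are placeholders:

from pyspark.resource import (ExecutorResourceRequests,
                              TaskResourceRequests,
                              ResourceProfileBuilder)

# The ETL stage runs with default resources; later stages request GPUs.
ereq = ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
treq = TaskResourceRequests().cpus(1).resource("gpu", 1)
profile = ResourceProfileBuilder().require(ereq).require(treq).build

etl_rdd = spark.sparkContext.parallelize(raw_records).map(prepare)
# Stages created from here on are scheduled with the GPU profile.
result = etl_rdd.withResources(profile).mapPartitions(train_partition).collect()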
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GB, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduce data scientists' productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
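A minimal sketch of the converter API described above, assuming a SparkSession `spark`, a preprocessed DataFrame `df` with 'features' and 'label' columns, and a compiled Keras `model`:

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Where Petastorm materializes its intermediate cache (placeholder URL).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

converter = make_spark_converter(df)        # caches the DataFrame once

with converter.make_tf_dataset(batch_size=64) as dataset:
    # Batches arrive as namedtuples; map them to (features, label) tensors.
    dataset = dataset.map(lambda batch: (batch.features, batch.label))
    model.fit(dataset, steps_per_epoch=100, epochs=1)  # illustrative step count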
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
– Understanding key traits of Apache Spark on Kubernetes
– Things to know when running Apache Spark on Kubernetes, such as autoscaling
– Demonstrating analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster
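A minimal sketch of pointing a PySpark application at a Kubernetes cluster; the API server, namespace, and image are placeholders, and in practice these settings are usually passed via spark-submit:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("k8s://https://kubernetes.default.svc:443")  # placeholder API server
         .appName("pipeline-on-k8s")
         .config("spark.kubernetes.container.image",
                 "gcr.io/my-project/spark-py:3.1.1")           # placeholder image
         .config("spark.kubernetes.namespace", "data-pipelines")
         .config("spark.executor.instances", "4")
         # Autoscaling: dynamic allocation with shuffle tracking, since
         # Kubernetes has no external shuffle service.
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .getOrCreate())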
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
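A minimal sketch, not the talk's actual abstraction, of scaling the shared fit/transform contract over data partitions with Ray tasks:

import numpy as np
import ray
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

ray.init()

@ray.remote
def run_pipeline(partition):
    # The same fit/transform contract shared by Scikit-Learn and Spark ML.
    pipe = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=2))])
    return pipe.fit_transform(partition)

# Scale out: one Ray task per data partition. (Fitting per partition is a
# simplification; the talk's abstraction keeps fit/transform semantics
# consistent across partitions.)
partitions = [np.random.rand(10_000, 20) for _ in range(8)]
results = ray.get([run_pipeline.remote(p) for p in partitions])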
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not "abelian groups" and that operate over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark (a sketch of both niches follows the outline below).
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
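Hedged sketches of the two niches above, using redis-py against a placeholder host; these are illustrations, not Adobe's production code:

import json
import redis

r = redis.Redis(host="redis.internal", port=6379)  # placeholder host

# Niche 1: the driver of a long-running batch job polls a Redis list for
# new queries against data that was loaded once.
def serve_queries(spark, timeout=60):
    while True:
        item = r.blpop("spark:query-queue", timeout=timeout)
        if item is None:          # queue drained; let the job exit
            return
        _, payload = item
        job = json.loads(payload)
        spark.sql(job["sql"]).write.mode("overwrite").parquet(job["output"])

# Niche 2: executors update a Redis hash as a distributed counter.
def count_partition(rows):
    conn = redis.Redis(host="redis.internal", port=6379)  # one connection per partition
    n = sum(1 for _ in rows)
    # Caveat from the outline: task retries and speculative execution can
    # double-count; production code needs idempotency (e.g. keyed by task id).
    conn.hincrby("spark:counters", "rows_processed", n)

# Usage: df.rdd.foreachPartition(count_partition)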
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
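For flavor, a minimal whylogs sketch on a small pandas frame; the Spark integration applies the same profiling idea at scale:

import pandas as pd
import whylogs as why

df = pd.DataFrame({"age": [34, 45, None, 23],
                   "country": ["US", "DE", "US", "IN"]})

results = why.log(df)               # a lightweight statistical profile, not raw data
print(results.view().to_pandas())   # per-column metrics (counts, types, ...)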
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
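Raven itself is a research system, but the operator transformation in (ii) can be illustrated with a small standalone sketch that compiles a trained scikit-learn decision tree into a SQL CASE expression:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def tree_to_sql(clf, feature_names):
    """Compile a fitted decision tree into a nested SQL CASE expression."""
    t = clf.tree_

    def recurse(node):
        if t.children_left[node] == -1:        # leaf: emit the predicted class
            return str(int(t.value[node].argmax()))
        name = feature_names[t.feature[node]]
        threshold = t.threshold[node]
        return (f"CASE WHEN {name} <= {threshold:.6f} "
                f"THEN {recurse(t.children_left[node])} "
                f"ELSE {recurse(t.children_right[node])} END")

    return recurse(0)

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree_to_sql(clf, ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))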
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences (a sketch of the staging-table pattern follows the outline below).
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
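As a hedged sketch of the staging-table idea above, a Delta Lake MERGE from a staged batch into the serving table (paths and the key column are placeholders):

from delta.tables import DeltaTable

# Ingest lands in a staging location first, isolating writer worries
# (schema drift, bad batches) from the serving table.
staged = spark.read.format("delta").load("/delta/staging/profiles")

target = DeltaTable.forPath(spark, "/delta/profiles")
(target.alias("t")
 .merge(staged.alias("s"), "t.profile_id = s.profile_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())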
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation that is...
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
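To make one of these techniques concrete, here is a minimal sketch of skipping already-converged vertices, assuming a dead-end-free graph given as in-neighbor lists and out-degrees:

def pagerank_skip_converged(in_nbrs, out_deg, d=0.85, eps=1e-10, max_iter=100):
    """Power iteration that stops updating vertices once their rank settles."""
    n = len(out_deg)
    rank = [1.0 / n] * n
    active = set(range(n))                # vertices still being updated
    for _ in range(max_iter):
        new_rank = list(rank)
        for v in list(active):
            r = (1 - d) / n + d * sum(rank[u] / out_deg[u] for u in in_nbrs[v])
            if abs(r - rank[v]) < eps:
                active.discard(v)         # converged: skip in later iterations
            new_rank[v] = r
        rank = new_rank
        if not active:
            break
    return rank

# Tiny 3-vertex cycle: each vertex links to the next, so all ranks are equal.
print(pagerank_skip_converged(in_nbrs={0: [2], 1: [0], 2: [1]}, out_deg=[1, 1, 1]))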
Scaling Privacy in a Spark Ecosystem
1. Scaling Privacy with Apache Spark
Aaron Colcord, Sr. Director Engineering, Northwestern Mutual
Don Bosco Durai, CTO and Co-Founder, Privacera
2. Agenda
▪ Our background
▪ Why privacy, security, compliance?
▪ Approaches
▪ The ideal solution
▪ Real life meets ideal life
3. Backgrounds
Northwestern Mutual:
▪ Building an Enterprise Scale Unified Framework
▪ Very Long, Respected History ~ 160 Years
▪ Compliance is extremely important to us
▪ Agile Data vs Compliant Data
Privacera:
▪ Founded in 2016 by the creators of Apache Ranger & Apache Atlas
▪ Extends Ranger's capabilities beyond traditional Big Data environments to the cloud (Databricks, AWS, Azure, GCP, and more)
▪ Specializes in democratizing data for analytics, while ensuring compliance with privacy regulations (GDPR, CCPA, LGPD, HIPAA, & more)
4. Why do we suddenly care about privacy?
• You care if you are regulated in any form
  • Simply, you need to show you can pass an audit
• You care if you store any information about your users
  • Simply because governments have woken up with GDPR and CCPA
• You care if you want to democratize your data
  • Simply because the use of your data can be scrutinized
We always did, but technology got ahead of privacy. Privacy is often an assumed competency, and technology really showed how important it was.
5. Have you ever...
Gone to a website and read their privacy policy, clicked accept cookies, accepted terms of service, or a EULA? Clicked 'accept all' on a website, used a digital assistant...?
• Collecting information about your customers can
  • Improve the experience
  • Allow the company to understand their business better
• At the core, privacy is a policy and legal obligation
  • You have the data; it used to be your business to just secure it.
• Do you want your information monetized? Sold? Traded?
  • Most companies don't do this. But the privacy policy is there for you.
6. And it's only going to pick up speed.
• More regulations are arriving around privacy
• Increasing your ability to execute against data means respecting your users' rights
• A part of maturity is being able to manage governance
7. More importantly, why do we care so much?
• Technology like Apache Spark opens the capability to democratize your data.
• Most every company wants the marketplace to enrich and share their data.
• Who inside that company can view it? Do we have the controls to protect your information? Can we verify that the information is used for the right purposes?
8. What is the difference between these?
Security:
▪ Preventing unauthorized usage of systems
▪ Ensuring users don't see the incorrect information
▪ Creating boundaries to enforce right action of the system
Compliance:
• The process of making sure your company and employees follow all laws, regulations, standards, and ethical practices that apply to your organization
Privacy:
• "Data privacy may be defined as the authorized, fair, and legitimate processing of personal information"
• Consent rights
• Do not share
• Slippery space
9. Examine strategies to scale agile data with privacy
• Build a metadata layer that defines PII in its schema
  • Users and developers can and will change where PII is stored
  • You can literally chase people to do the 'right thing' forever
• You could build views with permissions for certain users
  • Not very scalable
  • Plus you need to always show who accessed and why
• Are these security scenarios?
10. Challenges to that strategy
• Is the metadata layer flexible enough or should we think in policies?
• Privacy is inherently your organization’s position which may evolve based on regulation
• Can your development keep up with views?
• When you discover the extra 10,000 fields, can you keep up?
• Implement a framework that scales
• Security is not Privacy.
• Security has a different domain and set of principles.
• Remember we are protecting the usage of your data.
12. Ideal scalable system
User Rights:
▪ Revocation of Consent
▪ Portability
▪ Erasure
▪ Rectification
▪ How is data used?
▪ Rights follow Data Reuse
▪ Flexible to change
Classification:
▪ Should align with a Data Governance program
▪ Should adapt to changing data
▪ Proactive
▪ Reclassification
Audit/Governance:
▪ How was it used?
▪ How was it accessed?
▪ How was it protected?
▪ Did it cross borders?
Access:
▪ Authorization of User may change
▪ Supports Agile Access
▪ Business Use is preserved
▪ Automated Systems obey Privacy
13. User Rights at Scale
▪ Revocation of Consent / Right To Be Forgotten
▪ Portability
▪ Erasure
▪ Rectification
▪ How is data used?
▪ Rights follow Data Reuse
▪ Flexible to change
14. Privacy Challenges in Open Data Ecosystem
(Architecture diagram: the open data stack and its users; recoverable labels:)
Storage: S3, ADLS, Redshift, Snowflake, Synapse
SQL Engines: Athena, Databricks, HDInsight, EMR
Data Virtualization: Dremio, Trino, PrestoDB
BI Tools: PowerBI, Tableau
Users: Marketing, Data Analyst, Data Scientist/Architect
17. Automated Data Discovery
● Automatically detect and catalog sensitive data
● Detailed classification, e.g. EMAIL, SSN, GENDER, CC, PHONE_NUMBER, etc.
● Eliminate manual processes
● Catalog data as it is ingested
● Track data movement and propagate tags
● Catalog data across multiple cloud services
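A hedged sketch of the detection idea on slide 17 (not Privacera's implementation): regex-based tagging of sampled column values as data is scanned:

import re

# Illustrative patterns; real discovery combines many more signals than regexes.
PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "PHONE_NUMBER": re.compile(r"^\+?\d[\d\- ]{7,14}$"),
}

def classify_column(sample_values, threshold=0.8):
    """Tag a column when most sampled values match a sensitive-data pattern."""
    tags = []
    for tag, pattern in PATTERNS.items():
        hits = sum(bool(pattern.match(str(v))) for v in sample_values)
        if sample_values and hits / len(sample_values) >= threshold:
            tags.append(tag)
    return tags

print(classify_column(["a@b.com", "c@d.org", "e@f.io"]))  # -> ['EMAIL']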
18. Centralized Access Control
● Global Tag/Classification-based policies
● Purpose- and Persona-based policies
● Dynamic row filters vs. views
● Dynamic masking or decryption
● Approval workflows with time and purpose constraints
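To contrast the "dynamic row filters vs. views" point on slide 18, a hedged Spark SQL sketch on Databricks where a single dynamic view filters rows by group membership at query time (schema, table, and group names are hypothetical):

# One dynamic view replaces a proliferation of per-region static views.
spark.sql("""
  CREATE OR REPLACE VIEW secure.claims AS
  SELECT * FROM raw.claims
  WHERE is_member('compliance_admins')             -- admins see every row
     OR is_member(concat('region_', region))       -- others see only their region
""")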
19. Centralized Auditing and Reporting
● Centralize auditing
● Monitoring data access by classification
● Track usage by Purpose
● Generate attestation reports