Zurich North America is one of the largest providers of insurance solutions and services in the world, with customers representing a wide range of industries, from agriculture to construction, and including more than 90 percent of the Fortune 500.
Delta Lake brings reliability, performance, and security to data lakes. It provides ACID transactions, schema enforcement, and unified handling of batch and streaming data to make data lakes more reliable. Delta Lake also features lightning fast query performance through its optimized Delta Engine. It enables security and compliance at scale through access controls and versioning of data. Delta Lake further offers an open approach and avoids vendor lock-in by using open formats like Parquet that can integrate with various ecosystems.
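A minimal PySpark sketch of those behaviors, assuming a Spark session configured with the delta-spark package; the path, schema, and sample rows are hypothetical:

```python
# Minimal Delta Lake sketch; assumes a Spark session configured with the
# delta-spark package. Path, schema, and rows are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# ACID write: the commit is atomic; it either fully succeeds or is never visible.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Schema enforcement: an append with a mismatched schema is rejected
# instead of silently corrupting the table.
bad = spark.createDataFrame([("oops",)], ["wrong_column"])
try:
    bad.write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as e:
    print("append rejected:", type(e).__name__)

# Versioning ("time travel"): read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
v0.show()
```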
Operationalizing Machine Learning at Scale at Starbucks (Databricks)
As ML-driven innovations are propelled by the self-service capabilities of the enterprise data and analytics platform, teams face a significant entry barrier and productivity issues in moving from POCs to operating ML-powered apps at scale in production.
What’s New with Databricks Machine Learning (Databricks)
In this session, the Databricks product team provides a deeper dive into the machine learning announcements. Join us for a detailed demo that gives you insights into the latest innovations that simplify the ML lifecycle — from preparing data, discovering features, and training and managing models in production.
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De... (Databricks)
Columbia is a data-driven enterprise, integrating data from all line-of-business systems to manage its wholesale and retail businesses. This includes integrating real-time and batch data to better manage purchase orders and generate accurate consumer demand forecasts.
Improving Power Grid Reliability Using IoT Analytics (Databricks)
Society depends on reliable utility services to ensure the health and safety of our communities. Electrical grid failures have impacts and consequences that can range from daily inconveniences to catastrophic events. Ensuring grid reliability means that data is fully leveraged to understand and forecast demand, predict and mitigate unplanned interruptions to power supply, and efficiently restore power when needed. Neudesic, a systems integrator, and DTE Energy, a large electric and natural gas utility serving 2.2 million customers in southeast Michigan, partnered to use large IoT datasets to identify the sources and causes of reliability issues across DTE’s power distribution network. In this session, we will demonstrate how we ingest hundreds of millions of quality measures each day from DTE’s network of smart electric meters. This data is then further processed in Databricks to detect anomalies, apply graph analytics and spatially cluster these anomalies into “hot spots”. Engineers and work management experts use a dashboard to explore, plan and prioritize diverse actions to remediate the hot spots. This allows DTE to prioritize work orders and dispatch crews based on impact to grid reliability. Because of this and other efforts, DTE has improved reliability by 25% year over year. We will demonstrate our notebooks and machine learning models along with our dashboard. We will also discuss Spark Streaming, Pandas UDFs, anomaly detection and DBSCAN clustering. By the end of our presentation, you should understand our approach to inferring hidden insights from IoT data, and be able to apply similar techniques to your own data.
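As a rough illustration of the “hot spot” step named above (not DTE’s actual pipeline), scikit-learn’s DBSCAN can spatially cluster anomaly locations; the coordinates, eps, and min_samples below are invented for the sketch:

```python
# Sketch of clustering anomaly locations into "hot spots" with DBSCAN.
# Not DTE's actual code; coordinates, eps, and min_samples are invented.
import numpy as np
from sklearn.cluster import DBSCAN

# Each row: (latitude, longitude) of a meter that reported an anomaly.
anomalies = np.array([
    [42.33, -83.04], [42.34, -83.05], [42.33, -83.05],  # dense area
    [42.90, -82.50],                                    # isolated point
])

# eps is in raw degrees purely for illustration; a real pipeline would
# project to meters or use a haversine metric.
labels = DBSCAN(eps=0.02, min_samples=2).fit_predict(anomalies)
print(labels)  # [ 0  0  0 -1]; -1 marks noise, 0 a "hot spot" cluster
```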
Learn to Use Databricks for the Full ML Lifecycle (Databricks)
Machine learning development brings many new complexities beyond the traditional software development lifecycle. Unlike traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. In this talk, learn how to operationalize ML across the full lifecycle with Databricks Machine Learning.
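The tracking problem described here is what MLflow addresses; below is a minimal tracking sketch, with a placeholder dataset and model rather than anything from the talk:

```python
# Minimal MLflow tracking sketch: log the parameters tried, the result,
# and the model artifact so a run can be reproduced later.
# The toy dataset and model are placeholders, not the talk's example.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

with mlflow.start_run():
    C = 0.5
    model = LogisticRegression(C=C).fit(X, y)
    mlflow.log_param("C", C)                            # parameter tried
    mlflow.log_metric("train_acc", model.score(X, y))   # result obtained
    mlflow.sklearn.log_model(model, "model")            # artifact itself
```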
How R Developers Can Build and Share Data and AI Applications that Scale with... (Databricks)
This document discusses how R developers can build and share scalable data and AI applications using RStudio and Databricks. It outlines how RStudio and Databricks can be used together to overcome challenges of processing large amounts of data in R, including limited server memory and performance issues. Developers can use hosted RStudio servers on Databricks clusters, connect to Spark from RStudio using Databricks Connect, and share scalable Shiny apps deployed with RStudio Connect. The ODBC toolchain provides a performant way to connect R to Spark without issues encountered when using sparklyr directly.
Migrate and Modernize Hadoop-Based Security Policies for Databricks (Databricks)
Data teams are faced with a variety of tasks when migrating Hadoop-based platforms to Databricks. A common pitfall happens during the migration step, where often-overlooked access control policies can block adoption. This session will focus on the best practices to migrate and modernize Hadoop-based policies to govern data access (such as those in Apache Ranger or Apache Sentry). Data architects must consider new, fine-grained access control requirements when migrating from Hadoop architectures to Databricks in order to deliver secure access to as many data sets and data consumers as possible. This session will provide guidance across open source, AWS, Azure and partner tools, such as Immuta, on how to scale existing Hadoop-based policies to dynamically support more classes of users, implement fine-grained access control and leverage automation to protect sensitive data while maximizing utility — without manual effort.
Using Redash for SQL Analytics on Databricks (Databricks)
This talk gives a brief overview, with a demo, of performing SQL analytics with Redash on Databricks. We will introduce some of the new features coming as part of our integration with Databricks following the acquisition earlier this year, along with a demo of the other Redash features that enable a productive SQL experience on top of Delta Lake.
Analytics-Enabled Experiences: The New Secret Weapon (Databricks)
Tracking and analyzing how our individual products come together has always been an elusive problem for Steelcase. Our problem can be thought of in the following way: “we know how many Lego pieces we sell, yet we don’t know what Lego set our customers buy.” The Data Science team took over this initiative, which resulted in an evolution of our analytics journey. It is a story of innovation, resilience, agility and grit.
The effects of the COVID-19 pandemic on corporate America shone a spotlight on office furniture manufacturers to find ways to make the office safe again. The team would never have imagined how relevant our work on product application analytics would become. Product application analytics became an industry priority overnight.
The proposal presented this year is the story of how data science is helping corporations bring people back to the office and set the path to lead the reinvention of the office space.
After groundbreaking milestones to overcome technical challenges, the most important question is: What do we do with this? How do we scale this? How do we turn this opportunity into a true competitive advantage? The response: stop thinking about this work as a data science project and start to think about this as an analytics-enabled experience.
During our session we will cover the technical challenges that we overcame as a team to set up a pipeline that ingests semi-structured and unstructured data at scale, performs analytics and produces digital experiences for multiple users.
This presentation will be particularly insightful for data scientists, data engineers and analytics leaders who are seeking to better understand how to augment the value of data for their organization.
Modernizing to a Cloud Data Architecture (Databricks)
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how the benefits of elastic compute models helped one customer scale their analytics and AI workloads, along with best practices from their successful migration of data and workloads to the cloud.
Privacy has become one of the most critical topics in data today. It is about more than how we ingest and consume data; it is about how you protect your customers’ rights while balancing the business need. In our session, Privacera CTO Don Bosco Durai joins Northwestern Mutual to detail an important privacy use case and then show how to scale privacy with a focus on business needs, making the ability to scale effortless.
Comcast is the largest cable and internet provider in the US, reaching more than 30 million customers, and continues to grow its presence in the EU with the acquisition of Sky. Over the last couple years, Comcast has shifted focus to the customer experience.
Eugene Polonichko, "Architecture of modern data warehouse" (Lviv Startup Club)
The document discusses the architecture of a modern data warehouse using Microsoft technologies. It describes traditional data warehousing approaches and outlines ten characteristics of a modern data warehouse. It then details Microsoft's approach using Azure Data Factory to ingest diverse data types into Azure Blob Storage, Azure Databricks for analytics and data transformation, and Azure SQL Data Warehouse for combined structured data. It also discusses technologies for storage and visualization, and provides links for further information.
Building a MLOps Platform Around MLflow to Enable Model Productionalization i... (Databricks)
Getting machine learning models to production is notoriously difficult: it involves multiple teams (data scientists, data and machine learning engineers, operations, …) who often do not communicate with each other very well; the model can be trained in one environment but then productionalized in a completely different environment; it is not just about the code, but also about the data (features) and the model itself… At DataSentics, as a machine learning and cloud engineering studio, we see this struggle firsthand – on our internal projects and clients’ projects as well.
Databricks: A Tool That Empowers You To Do More With Data (Databricks)
In this talk we will present how Databricks has enabled the author to achieve more with data, enabling one person to build a coherent data project with data engineering, analysis and science components, with better collaboration, better productionization methods, larger datasets and faster results.
The talk will include a demo illustrating how the multiple functionalities of Databricks help to build a coherent data project: Databricks Jobs, Delta Lake and Auto Loader for data engineering, SQL Analytics for data analysis, Spark ML and MLflow for data science, and Projects for collaboration.
Building Lakehouses on Delta Lake with SQL Analytics Primer (Databricks)
You’ve heard the marketing buzz, maybe you have been to a workshop and worked with some Spark, Delta, SQL, Python, or R, but you still need some help putting all the pieces together? Join us as we review some common techniques to build a lakehouse using Delta Lake, use SQL Analytics to perform exploratory analysis, and build connectivity for BI applications.
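As a rough sketch of putting those pieces together (the database and table names are hypothetical, and a Databricks/Spark session with Delta Lake is assumed):

```python
# Sketch of "putting the pieces together": define a Delta table in SQL,
# then run an exploratory aggregate over it. Database and table names are
# hypothetical; assumes a Databricks/Spark session with Delta Lake.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS lakehouse")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales (
        sale_id BIGINT, region STRING, amount DOUBLE
    ) USING DELTA
""")

# The kind of exploratory query a SQL Analytics endpoint would serve.
spark.sql("""
    SELECT region, COUNT(*) AS n, SUM(amount) AS revenue
    FROM lakehouse.sales
    GROUP BY region
    ORDER BY revenue DESC
""").show()
```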
The document discusses data architecture solutions for solving real-time, high-volume data problems with low latency response times. It recommends a data platform capable of capturing, ingesting, streaming, and optionally storing data for batch analytics. The solution should provide fast data ingestion, real-time analytics, fast action, and quick time to value. Multiple data sources like logs, social media, and internal systems would be ingested using Apache Flume and Kafka and analyzed with Spark/Storm streaming. The processed data would be stored in HDFS, Cassandra, S3, or Hive. Kafka, Spark, and Cassandra are identified as key technologies for real-time data pipelines, stream analytics, and high availability persistent storage.
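A minimal sketch of the Kafka-to-Spark leg of such a pipeline, using Spark Structured Streaming rather than the older Storm/Flume components; the broker address, topic, and paths are hypothetical, and the spark-sql-kafka connector is assumed to be available:

```python
# Sketch of the Kafka -> Spark -> storage leg using Structured Streaming.
# Broker, topic, and paths are hypothetical; assumes the spark-sql-kafka
# connector is on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "logs")
       .load())

# Kafka values arrive as bytes; cast to string for downstream parsing.
events = raw.selectExpr("CAST(value AS STRING) AS body", "timestamp")

# Land the stream as Parquet for batch analytics (the "optionally store" leg).
(events.writeStream
 .format("parquet")
 .option("path", "/tmp/landing/logs")
 .option("checkpointLocation", "/tmp/checkpoints/logs")
 .start())
```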
Introducing MLflow for End-to-End Machine Learning on Databricks (Databricks)
Solving a data science problem is about more than making a model. It entails data cleaning, exploration, modeling and tuning, production deployment, and workflows governing each of these steps. In this simple example, we’ll take a look at how health data can be used to predict life expectancy. It will start with data engineering in Apache Spark, data exploration, model tuning and autologging with hyperopt and MLflow. It will continue with examples of how the model registry governs model promotion, and simple deployment to production with MLflow as a job or REST endpoint. This tutorial will cover the latest innovations from MLflow 1.12.
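A minimal sketch of the registry-governed promotion step mentioned above, using the MLflow client API; the model name and version number are hypothetical:

```python
# Sketch of registry-governed promotion: once a version is vetted, move it
# to Production so jobs and REST endpoints resolve it by stage.
# The model name and version number are hypothetical.
from mlflow.tracking import MlflowClient

client = MlflowClient()

# (One-time) register a run's model under a named registry entry:
# mlflow.register_model("runs:/<run_id>/model", "life-expectancy")

client.transition_model_version_stage(
    name="life-expectancy", version=1, stage="Production"
)
# Consumers can now load "models:/life-expectancy/Production".
```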
Big data ingest frameworks ship with an array of connectors for common data origins and destinations, such as flat files, S3, HDFS, Kafka, etc., but sometimes you need to send data to, or receive data from, a system that's not on the list. StreamSets includes template code for building your own connectors and processors; we'll walk through the process of building a simple destination that sends data to a REST web service, and show how it can be extended to target more sophisticated systems such as Salesforce Wave Analytics.
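The StreamSets SDK itself is Java; purely to illustrate the batching-to-REST pattern the talk describes, here is a generic Python sketch (this is not the StreamSets API, and the endpoint URL and batch size are hypothetical):

```python
# Generic sketch of a destination that sends records to a REST web service.
# This is NOT the StreamSets SDK (which is Java); the endpoint URL and
# batch size are hypothetical.
import json
import requests

ENDPOINT = "https://example.com/api/ingest"  # hypothetical service

def write_batch(records, batch_size=100):
    """POST records to the REST endpoint in fixed-size batches."""
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        resp = requests.post(
            ENDPOINT,
            data=json.dumps(batch),
            headers={"Content-Type": "application/json"},
            timeout=30,
        )
        resp.raise_for_status()  # surface failures so the caller can retry

write_batch([{"id": 1, "value": 3.2}, {"id": 2, "value": 4.7}])
```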
Building a Data Science as a Service Platform in Azure with Databricks (Databricks)
Machine learning in the enterprise is rarely delivered by a single team. In order to enable Machine Learning across an organisation you need to target a variety of different skills, processes, technologies, and maturities. To do this is incredibly hard and requires a composite of different techniques to deliver a single platform which empowers all users to build and deploy machine learning models.
In this session we discuss how Azure and Databricks enable a Data Science as a Service platform. We look at how a DSaaS platform empowers users of all abilities to build and deploy models, and enables organisations to realise a return on investment earlier.
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ... (Databricks)
In rapidly changing conditions, many companies build ETL pipelines using an ad hoc strategy. Such an approach makes automated testing for data reliability almost impossible and leads to ineffective and time-consuming manual ETL monitoring.
Automating Data Quality Processes at Reckitt (Databricks)
Reckitt is a fast-moving consumer goods company with a portfolio of famous brands and over 30k employees worldwide. At that scale, small projects can quickly grow into big datasets, and processing and cleaning all that data can become a challenge. To solve that challenge we have created a metadata-driven ETL framework for orchestrating data transformations through parametrised SQL scripts. It allows us to create various paths for our data as well as easily version control them. The approach of standardising incoming datasets and creating reusable SQL processes has proven to be a winning formula. It has helped simplify complicated landing/stage/merge processes and allowed them to be self-documenting.
But this is only half the battle; we also want to create data products: documented, quality-assured datasets that are intuitive to use. As we move to a CI/CD approach, increasing the frequency of deployments, the demand of keeping documentation and data quality assessments up to date becomes increasingly challenging. To solve this problem, we have expanded our ETL framework to include SQL processes that automate data quality activities. Using the Hive metastore as a starting point, we have leveraged this framework to automate the maintenance of a data dictionary and reduce documentation, model refinement, data quality testing and filtering out bad data to a box-filling exercise. In this talk we discuss our approach to maintaining high-quality data products and share examples of how we automate data quality processes.
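A toy illustration of the metadata-driven, parametrised-SQL idea (not Reckitt’s framework; the metadata shape and table names are invented):

```python
# Toy illustration of metadata-driven ETL through parametrised SQL.
# Not Reckitt's framework; metadata shape and table names are invented.
# Assumes a Databricks/Spark session where these tables exist.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each metadata row drives one landing -> stage hop via a SQL template.
steps = [
    {"template": ("INSERT INTO {target} "
                  "SELECT * FROM {source} WHERE load_date = '{dt}'"),
     "source": "landing.orders", "target": "stage.orders", "dt": "2021-06-01"},
]

for step in steps:
    sql = step["template"].format(**step)  # parametrise the script
    spark.sql(sql)                         # run the transformation
```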
Leveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch (Databricks)
The Bosch Center for Artificial Intelligence provides AI services to Bosch’s business units and manufacturing plants. We strive to generate value for our customers by deploying machine learning in their products, services, and processes across different domains such as Manufacturing, Engineering and Supply Chain Management, as well as Intelligent Services.
Translating Models to Medicine: An Example of Managing Visual Communications (Databricks)
The Neuron team at Seattle Children's Hospital aims to improve pediatric critical care through predictive models and decision support tools. They developed a framework to standardize model development and management, reduce knowledge silos, and enable broad participation. This involves tracking models, artifacts, and domain knowledge. They demonstrate managing visual communications by capturing specifications for ICU bed maps in MLflow and rendering versions with different encodings. This makes visualization models discoverable and deployable.
Data Quality in the Data Hub with RedPoint Global (Caserta)
At a Big Data Warehousing Meetup, George Corugedo, CTO of RedPoint Global, demonstrated how to use your big data platform for data integration, data quality and identity resolution to provide a true 360-degree view of your customer on Hadoop using the RedPoint product.
For more information or questions, please contact us at www.casertaconcepts.com.
It is a fascinating, explosive time for enterprise analytics.
It is from the position of analytics leadership that the mission will be executed and company leadership will emerge. In this information economy, the data professional sits squarely on the company’s performance and has an obligation to demonstrate the possibilities and originate the architecture, data, and projects that will deliver analytics. After all, no matter what business you’re in, you’re in the business of analytics.
The coming years will be full of big changes in enterprise analytics and Data Architecture. William will kick off the fourth year of the Advanced Analytics series with a discussion of the trends winning organizations should build into their plans, expectations, vision, and awareness now.
The document discusses optimizing a data warehouse by offloading some workloads and data to Hadoop. It identifies common challenges with data warehouses like slow transformations and queries. Hadoop can help by handling large-scale data processing, analytics, and long-term storage more cost effectively. The document provides examples of how customers benefited from offloading workloads to Hadoop. It then outlines a process for assessing an organization's data warehouse ecosystem, prioritizing workloads for migration, and developing an optimization plan.
This document discusses DataOps, an agile methodology for developing and deploying data-intensive applications. DataOps supports cross-functional collaboration and fast time to value, and expands DevOps practices to include data-related roles like data engineers and data scientists. Its key goals are to promote continuous model deployment, repeatability, productivity, agility, and self-service, and to make data central to applications. Through these principles, DataOps brings flexibility, focus, improved efficiency, and faster time to value to data-driven organizations.
This document discusses Oracle's value proposition for its big data solutions. Key points include:
- Oracle offers engineered systems that integrate hardware and software to securely manage both new and existing data types and formats for big data.
- The solutions allow customers to acquire, organize, analyze and make decisions from big data to develop predictive analytics and gain competitive advantages.
- Oracle partners with other companies through its Oracle Partner Network to increase sales and empower partners with Oracle resources and specializations.
- Oracle solutions serve many customer segments including telecommunications, energy, life sciences, healthcare, oil and gas, manufacturing, and retail.
Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ... (Amazon Web Services)
At the end of this session participants will learn how to assess their enterprise application portfolio and move thousands of instances to AWS in a quick and repeatable fashion. Migrating workloads to AWS in an enterprise environment is not easy, but with the right approach, an enterprise-sized organization can migrate thousands of instances to AWS quickly and cost-effectively to ensure a strong ROI.
Do you still know where your business data resides? Your data is (or soon will be) everywhere. In collaboration with Commvault, we show how your organisation can stay in control of, and add value to, your data, regardless of whether it resides on-premises, in the cloud or on an end-user device.
Presentation, June 9, 2016
Big Data Made Easy: A Simple, Scalable Solution for Getting Started with Hadoop (Precisely)
With so many new, evolving frameworks, tools, and languages, a new big data project can lead to confusion and unwarranted risk.
Many organizations have found Data Warehouse Optimization with Hadoop to be a good starting point on their Big Data journey. Offloading ETL workloads from the enterprise data warehouse (EDW) into Hadoop is a well-defined use case that produces tangible results for driving more insights while lowering costs. You gain significant business agility, avoid costly EDW upgrades, and free up EDW capacity for faster queries. This quick win builds credibility and generates savings to reinvest in more Big Data projects.
A proven reference architecture that includes everything you need in a turnkey solution – the Hadoop distribution, data integration software, servers, networking and services – makes it even easier to get started.
SphereEx provides enterprises with distributed data service infrastructures and products/solutions to address challenges from increasing database fragmentation. It was founded in 2021 by the team behind Apache ShardingSphere, an open-source project providing data sharding and distributed solutions. SphereEx's products include solutions for distributed databases, data security, online stress testing, and its commercial version provides enhanced capabilities over the open-source version.
CSC - Presentation at Hortonworks Booth - Strata 2014 (Hortonworks)
Come hear how companies are kick-starting their big data projects without having to find and hire good people or get IT to prioritize the work. Remove risk from your project, ensure scalability, and pay for just the nodes you use in a monthly utility pricing model. Worried about data governance or security? Want it in the cloud, or can’t have it in the cloud? Eliminate the hurdles with a fully managed service backed by CSC. Get your modern data architecture up and running in as little as 30 days with the Big Data Platform as a Service offering from CSC. Computer Sciences Corporation is a Certified Technology Partner of Hortonworks and a global systems integrator with over 80,000 employees worldwide.
Complement Your Existing Data Warehouse with Big Data & Hadoop (Datameer)
To view the full webinar, please go to: http://info.datameer.com/Slideshare-Complement-Your-Existing-EDW-with-Hadoop-OnDemand.html
With 40% yearly growth in data volumes, traditional data warehouses have become increasingly expensive and challenging.
Many of today’s new data sources are unstructured, making the structured data warehouse an unsuitable platform for analyses. As a result, organizations now look at Hadoop as a data platform to complement existing BI data warehouses, and as a scalable, flexible and cost-effective solution for data storage and analysis.
Join Datameer and Cloudera in this webinar to discuss how Hadoop and big data analytics can help to:
- Get all the data your business needs quickly into one environment
- Shorten the time to insight from months to days
- Extend the life of your existing data warehouse investments
- Enable your business analysts to ask and answer bigger questions
Gab Genai Cloudera - Going Beyond Traditional Analytic (IntelAPAC)
This document discusses Intel and Cloudera's partnership in helping organizations leverage big data analytics. It provides an overview of Cloudera's history and capabilities in supporting enterprises with Hadoop-based solutions. It then contrasts traditional analytics approaches that brought data to compute with Cloudera's approach of bringing compute to data using their Enterprise Data Hub. Several case studies are presented of organizations achieving new insights and business value through Cloudera's platform. The document emphasizes that Cloudera offers an open, scalable and cost-effective platform for various analytics workloads and enables a thriving ecosystem of partners.
(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re... (Amazon Web Services)
Migrating workloads to AWS in an enterprise environment is not easy, but with the right approach, an enterprise-sized organization can migrate thousands of instances to AWS quickly and cost effectively. You can leave this session with a good understanding of the migration framework used to assess an enterprise application portfolio and how to move thousands of instances to AWS in a quick and repeatable fashion.
In this session, we describe the components of Accenture's cloud migration framework, including tools and capabilities provided by Accenture, AWS, and third-party software solutions, and how enterprises can leverage these techniques to migrate efficiently and effectively. The migration framework covers:
- Defining an overall cloud strategy
- Assessing the business requirements, including application and data requirements
- Creating the right AWS architecture and environment
- Moving applications and data using automated migration tools
- Services to manage the migrated environment
Many large enterprises have begun using AWS to host development and test environments while also building greenfield applications in AWS. After realizing the benefits that AWS has to offer, many enterprises look for ways to accelerate their migration to the cloud. In beginning this journey they are often faced with a number of challenges, such as determining which applications should move, how they should move, and how they can be effectively managed in the cloud. Accenture, working with AWS Solution Architects and AWS Professional Services, has developed a framework, based on our experiences, to quickly, efficiently, and successfully move enterprise applications to AWS at scale. This session will review our approach, tools, and methods that can help enterprises evolve their cloud transformation programs.
Software engineering practices for the data science and machine learning life... (DataWorks Summit)
With the advent of newer frameworks and toolkits, data scientists are now more productive than ever and starting to prove indispensable to enterprises. Typical organizations have large teams of data scientists who build out key analytics assets that are used on a daily basis and are an integral part of live transactions. However, quite a lot of chaos and complexity also gets introduced because of the state of the industry. Many packages used by data scientists come from open source, and even if they are well curated, there is a growing tendency to pick out cutting-edge or unstable packages and frameworks to accelerate analytics. Different data scientists may use different versions of runtimes, different Python or R versions, or even different versions of the same packages. Data scientists predominantly work on their laptops, and it becomes difficult to reproduce their environments for use by others. Since data science is now a team sport across multiple personas, involving non-practitioners, traditional application developers, execs, and IT operators, how does an enterprise create a platform for productive cross-role collaboration?
Enterprises need a very reliable and repeatable process, especially when it results in something that affects their production environments. They also require a well-managed approach that enables the graduation of an asset from development through a testing and staging process to production. Given the pace of businesses nowadays, the process needs to be quite agile and flexible too—even enabling an easy path to reversing a change. Compliance and audit processes require clear lineage and history as well as approval chains.
In the traditional software engineering world, this lifecycle has been well understood and best practices have been followed for ages. But what does it mean when you have non-programmers, or users who are not really trained in software engineering philosophies, or who perceive all of this as "big process" roadblocks in their daily work? How do we engage them in a productive manner and yet support enterprise requirements for reliability, tracking, and a clear continuous integration and delivery practice? The presenters, in this session, will bring up interesting techniques based on their user research, real-life customer interviews, and productized best practices. The presenters also invite the audience to share their stories and best practices to make this a lively conversation.
Speaker
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ... (DataWorks Summit)
The Census Bureau is the U.S. government's largest statistical agency with a mission to provide current facts and figures about America's people, places and economy. The Bureau operates a large number of surveys to collect this data, the most well known being the decennial population census. Data is being collected in increasing volumes and the analytics solutions must be able to scale to meet ever-increasing needs while maintaining the confidentiality of the data. Past data analytics occurred in processing silos, inhibiting the sharing of information, and common reference data was replicated across multiple systems. The use of the Hortonworks Data Platform, Hortonworks DataFlow and other open-source technologies is enabling the creation of a cloud-based enterprise data lake and analytics platform. Cloud object stores are used to provide scalable data storage, and cloud compute supports permanent and transient clusters. Data governance tools are used to track data lineage and to provide access controls to sensitive data.
MSFT MAIW Data Mod - Session 1 Deck_Why Migrate your databases to Azure_Sept ... (ssuser01a66e)
Microsoft Azure Immersion Workshop focused on data modernization and migrating databases to Azure. Key reasons for migrating included enabling remote work during the pandemic, improving business resiliency, and adopting emerging technologies. Digital transformation is affecting all companies, which now need to operate like digital companies. When migrating databases to Azure, customers can choose between infrastructure as a service (IaaS) options like SQL Server VMs or platform as a service (PaaS) options like Azure SQL that are fully managed by Microsoft. Migrating databases to Azure PaaS options can significantly reduce costs compared to on-premises databases and provide benefits like automatic updates and built-in security and high availability.
High-Performance Analytics in the Cloud with Apache Impala (Cloudera, Inc.)
With more and more data being generated and stored in the cloud, you need a modern data platform that can extend to any environment so you can derive value from all your data. Cloudera Enterprise is the leading enterprise Hadoop platform for cloud deployments. It’s the easiest way to manage and secure Hadoop data across any cloud environment and includes component-level support for cloud-native object stores. This makes the platform uniquely suited to handle transient jobs like ETL and BI analytics, as well as persistent workloads like stream processing and advanced analytics.
With the recent release of Cloudera 5.8, Apache Impala (incubating) has added support for Amazon S3, enabling business analysts to get instant insights from all data through high-performance exploratory analytics and BI.
3 Things to learn:
Join David Tishgart, Director of Product Marketing, and James Curtis, Senior Analyst, Data Platforms & Analytics at 451 Research, as they discuss:
* Best practices for analytic workloads in the cloud
* A live demo and real-world use cases
* What’s next for Cloudera and the cloud
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is a traditional on-prem data warehouse will not handle big data. So what is Microsoft’s strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products, from collecting data, transforming it, and storing it, to visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company’s big data solution.
Delivering business insights and automation utilizing AWS data services (Bhuvaneshwaran R)
The document discusses building data lakes and data architectures on AWS. It begins with an introduction to why data lakes are needed and how to drive automation and insights with AWS data services. It then covers best practices for data architecture and implementation case studies. Specifically, it discusses building a data lake infrastructure on AWS using services like S3, Glue, Athena, and Redshift. It also covers streaming data solutions, data governance best practices, and the Lake Formation service. Real-world customer case studies are presented on using AWS for data lakes and analytics in industries like e-commerce, FMCG, and manufacturing.
Building a Modern Analytic Database with Cloudera 5.8 (Cloudera, Inc.)
This document discusses building a modern analytic database with Cloudera. It outlines Marketing Associates' evaluation of solutions to address challenges around managing massive and diverse data volumes. They selected Cloudera Enterprise to enable self-service BI and real-time analytics at lower costs than traditional databases. The solution has provided scalability, cost savings of over 90%, and improved security and compliance. Future roadmaps for Cloudera's analytic database include faster SQL, improved multitenancy, and deeper BI tool integration.
Similar to Cloud and Analytics - From Platforms to an Ecosystem
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark (a sketch of this kind of check follows the list)
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
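As a rough illustration of the kind of Spark-based validation described above, here is a minimal sketch; the dataset path, column names, and expectations are hypothetical, and Zillow's actual validation libraries are internal.
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

df = spark.read.parquet("/data/listings")  # hypothetical dataset

# Expectations a producer might define via a self-service onboarding portal
checks = {
    "price_non_negative": F.col("price") >= 0,
    "zip_code_present": F.col("zip_code").isNotNull(),
}

# Count violations for every expectation in a single pass over the data
row = df.agg(*[
    F.sum(F.when(~cond, 1).otherwise(0)).alias(name)
    for name, cond in checks.items()
]).collect()[0]

violations = {k: v for k, v in row.asDict().items() if v and v > 0}
if violations:
    # Flag at the earliest stage, before downstream consumers read the data
    print(f"Data quality violations: {violations}")
```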
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering, allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at New Relic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to be a poor fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
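To make the "write a simple function" idea concrete, here is a hypothetical sketch; the decorator name and registry are invented for illustration and are not Stitch Fix's actual platform API.
```python
from typing import Callable, Dict, List

MODEL_REGISTRY: Dict[str, Callable] = {}

def register_model(name: str):
    """Stand-in for a platform decorator that wires a plain function into
    online deployment, Spark batch execution, and metrics tracking."""
    def wrap(fn):
        MODEL_REGISTRY[name] = fn
        return fn
    return wrap

@register_model("style_recommender")
def predict(client_features: List[float]) -> float:
    # A data scientist writes only this function; the platform tooling
    # handles serving, batch runs, and metric visualization around it.
    return sum(client_features) / len(client_features)

print(MODEL_REGISTRY["style_recommender"]([1.0, 2.0, 3.0]))
```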
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory or GPUs. One of the most popular use cases is enabling end-to-end scalable deep learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the deep learning algorithm needs for training or inference, and then send the data into a deep learning algorithm. Using stage level scheduling combined with accelerator-aware scheduling enables users to seamlessly go from ETL to deep learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
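As a rough sketch of the stage level scheduling API discussed above, assuming Spark 3.1+ with dynamic allocation and GPU discovery configured on YARN or Kubernetes (the resource amounts here are arbitrary examples):
```python
from pyspark.sql import SparkSession
from pyspark.resource import (ExecutorResourceRequests, ResourceProfileBuilder,
                              TaskResourceRequests)

spark = SparkSession.builder.appName("stage-level-scheduling").getOrCreate()
sc = spark.sparkContext

# ETL stage: runs under the application's default, CPU-only resources
etl_rdd = sc.parallelize(range(100_000)).map(lambda x: (x % 10, float(x)))

# Build a profile requesting one GPU per executor and per task
ereqs = ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

# Stages computed from this RDD are scheduled with the GPU profile,
# so the same application flows from ETL into GPU-backed work
train_rdd = etl_rdd.withResources(gpu_profile)
print(train_rdd.count())
```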
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GB, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess it using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may face the problem: how can I convert my Spark DataFrame to a format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduce data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
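A minimal sketch of the converter workflow described above, assuming the petastorm package is installed; the cache path, data path, and batch size are illustrative:
```python
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.appName("spark-to-tf").getOrCreate()

# Where the converter materializes intermediate files (path is illustrative)
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

df = spark.read.parquet("/data/features")  # your preprocessed DataFrame

converter = make_spark_converter(df)  # materializes and caches the data

# TensorFlow side: a tf.data.Dataset ready for model.fit()
with converter.make_tf_dataset(batch_size=64) as dataset:
    for batch in dataset.take(1):
        print(batch)  # named tuples of feature tensors

converter.delete()  # remove the cached intermediate files when done
```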
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large-scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with Apache Spark’s scalable data processing, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
– Understanding key traits of Apache Spark on Kubernetes
– Things to know when running Apache Spark on Kubernetes, such as autoscaling
– Demonstrating analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster (a minimal configuration sketch follows)
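A minimal sketch of pointing a PySpark session at a Kubernetes cluster; the endpoint, image, namespace, and service account are placeholders, and managed GKE/Dataproc setups typically supply much of this for you.
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://34.123.45.67:443")   # Kubernetes API server (example)
    .appName("spark-on-gke")
    .config("spark.kubernetes.container.image", "gcr.io/my-project/spark:3.1.1")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Executors run as pods in the cluster; this job exercises them briefly
spark.range(1_000_000).selectExpr("sum(id)").show()
```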
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
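As a toy illustration of mapping pipeline stages onto Ray's compute model (this is only the underlying primitive, not the talk's actual abstraction):
```python
import ray

ray.init()

# Each pipeline stage becomes a Ray task; Ray resolves the dependency
# between "fit" and "transform" and schedules them across workers.
@ray.remote
def fit(data):
    return sum(data) / len(data)  # a stand-in "model": just the mean

@ray.remote
def transform(model, data):
    return [x - model for x in data]  # center the data with the fitted model

data = [1.0, 2.0, 3.0, 4.0]
model_ref = fit.remote(data)                     # stage 1
result_ref = transform.remote(model_ref, data)   # stage 2, awaits stage 1
print(ray.get(result_ref))                       # [-1.5, -0.5, 0.5, 1.5]
```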
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” and that operate over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – dispatch new jobs by polling a Redis queue (see the sketch after this list)
· Why? Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working solution using Redis
Niche 2: Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
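A minimal redis-py sketch of both niches; connection details, key names, and the Spark hook are illustrative, not Adobe's production code.
```python
import redis

# Niche 1: the long-running Spark driver keeps its table loaded and polls a
# Redis list for new query requests instead of restarting the batch job.
def poll_for_jobs(run_query):
    r = redis.Redis(host="localhost", port=6379)  # connection is illustrative
    while True:
        item = r.blpop("job_queue", timeout=30)   # blocking pop with timeout
        if item is None:
            continue                              # nothing queued; poll again
        _key, payload = item
        run_query(payload.decode("utf-8"))

# Niche 2: Redis hashes as distributed counters, updated from executors via
# mapPartitions; pipelining batches the HINCRBY round trips. Retries and
# speculative execution can double-count, so real code needs idempotency guards.
def count_partition(rows):
    conn = redis.Redis(host="localhost", port=6379)  # per-partition connection
    pipe = conn.pipeline()
    for row in rows:
        pipe.hincrby("metrics", row["status"], 1)
    pipe.execute()
    return iter([])

# Usage from Spark: df.rdd.mapPartitions(count_partition).count()
```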
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
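For a flavor of the logging API, here is a minimal whylogs sketch on a pandas batch; the columns are invented, and while the talk's Spark integration applies the same profiling at DataFrame scale, this sketch shows only the single-batch call.
```python
import pandas as pd
import whylogs as why

# Invented columns; the Spark integration profiles full DataFrames at scale
df = pd.DataFrame({
    "price": [120_000.0, 250_500.0, None],
    "zip": ["98101", "98102", "98103"],
})

results = why.log(df)            # lightweight statistical profile, not raw data
profile_view = results.view()
print(profile_view.to_pandas())  # per-column metrics: counts, types, nulls...
```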
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that:
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
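Raven itself is not publicly available, but the decision-tree-to-SQL transformation in rule (ii) can be sketched by hand; the training data and column names below are invented.
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, _tree

# Train a tiny tree on invented data
X = np.array([[20, 0], [45, 1], [30, 1], [60, 0]])
y = np.array([0, 1, 0, 1])
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
cols = ["age", "has_history"]

def tree_to_sql(node=0):
    """Compile the fitted tree into an equivalent SQL CASE expression."""
    t = clf.tree_
    if t.feature[node] == _tree.TREE_UNDEFINED:       # leaf: emit the class
        return str(int(np.argmax(t.value[node])))
    name = cols[t.feature[node]]
    left = tree_to_sql(t.children_left[node])
    right = tree_to_sql(t.children_right[node])
    return (f"CASE WHEN {name} <= {t.threshold[node]:.4f} "
            f"THEN {left} ELSE {right} END")

# The result can be inlined into a SELECT, pushing inference into the engine
print(tree_to_sql())
```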
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi-Source, Multi-Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade-Offs with Various Formats
Anti-Patterns Used (String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe Them Away
Staging Tables FTW (a merge sketch follows this list)
Data Lake Replication Lag Tracking
Performance Time!
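As a hedged illustration of the staging-table pattern named above (not Adobe's actual pipeline; the paths, join key, and delta-spark dependency are assumptions):
```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Requires the delta-spark package and a Delta-enabled SparkSession;
# paths and join keys below are invented for illustration.
spark = SparkSession.builder.appName("staging-merge").getOrCreate()

staging_df = spark.read.format("delta").load("/staging/profile_updates")
target = DeltaTable.forPath(spark, "/lake/profiles")

# Land new data in a staging table first, then MERGE atomically into the
# target, so slow or failed writers never leave the main table half-written.
(target.alias("t")
 .merge(staging_df.alias("s"), "t.profile_id = s.profile_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```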
Global Situational Awareness of A.I. and where it's headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Cloud and Analytics - From Platforms to an Ecosystem
1. Cloud and Analytics – from Platforms to an Ecosystem
Ming Yuan, Zurich North America
David Carlson, Databricks
2. Agenda
▪ Data and Analytics at ZNA
▪ Data and Metadata
▪ Data Exploration and ETL
▪ Containerization
▪ DevOps in Analytics
3. Zurich is a data-enabled innovative company
• Data is used in day-to-day decision making in key business domains
• A strong data science team delivers predictive models and business insights
• We are an early adopter of advanced analytics and cloud analytics
Data platforms: multiple databases, an on-premises data warehouse, a Hadoop data lake, and a cloud data lake
• Governance processes on data access and utilization are established
• Metadata is collected and stored in the repository system
4. Key capabilities support data analytics life cycle
• Data Discovery
• Data Integration
• Collaboration
• Business Impact (Operationalization)
• Scalability
• Multiple Personas
• Support multiple types of implementations
Life cycle stages: Ideation → Model Build → Model Deployment → Model Execution → Model Monitoring
5. Data foundation and processing power
▪ Support ML and advanced analysis to discover business insights and drive appropriate actions
▪ Enable cross-domain data sharing, aggregation, and integration
▪ Modernize the technical landscape to handle data sets that were previously unprocessable
▪ Optimize data processing and archiving strategies to reduce operation costs
▪ Apply data governance best practices to manage utilization
6. Data lake consists of ADLS and Databricks® clusters
[Slide diagram: the provisioning store moves data from Data Sources to Data Consumption services within an Azure subscription, through the following zones]
▪ Landing – receives Change Data Capture (CDC) records or full snapshots from source systems
▪ Staging – enriches landing-zone data with additional date-format fields and removes special characters
▪ Active – applies CDC records (I, U, D) to a copy of the previous day's data
▪ Archive – keeps rolling pointers to the previous day's Active data
▪ Curation Layer – a Universal Data Model plus curated data sets: enterprise-level curated datasets covering broad utilization, and datasets pertaining to the needs of a specific business domain
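A speculative PySpark sketch of the Landing-to-Staging step described above (date-format enrichment and special-character removal); the storage paths, column names, and formats are invented, not Zurich's code.
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("landing-to-staging").getOrCreate()

# Invented ADLS paths and columns; the real zone layout lives in the lake
landing = spark.read.parquet(
    "abfss://landing@account.dfs.core.windows.net/claims")

staging = (
    landing
    # Enrich with an additional parsed date field
    .withColumn("event_date", F.to_date("event_ts", "yyyy-MM-dd"))
    # Remove special characters from free-text fields
    .withColumn("policy_holder",
                F.regexp_replace("policy_holder", r"[^A-Za-z0-9 ]", ""))
)

staging.write.mode("overwrite").parquet(
    "abfss://staging@account.dfs.core.windows.net/claims")
```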
7. Metadata management and data discovery
▪ For metadata administrators
▪ Maintain business glossary for data domains that are owned by function or business units
▪ Import technical metadata and catalog it as data assets
▪ Curate technical metadata relating them to logical business terms
▪ Maintain data-flow mappings of transformations
▪ For data consumers
▪ Search, explore and discover data assets and data lineage
▪ Interpret data with correct meaning and context
▪ Navigate data flows to analyze processes and assess change impact
▪ Evaluate data quality reports and drive improvement actions
8. Alation® Data Catalog manages metadata ingestions
Sources: databases, the data warehouse, the cloud data lake, and JSON streams
▪ Ingest and refresh schema, table, and column definitions
▪ Build data lineage, popularity, common queries, and more
▪ Profile and store sample data sets
▪ Collect user information and usage metrics
▪ Open APIs to programmatically import business glossaries (a hypothetical import sketch follows)
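To illustrate what a programmatic glossary import might look like, here is a hypothetical sketch; the endpoint path, payload shape, and auth header are placeholders rather than Alation's documented API, so consult the vendor docs for the real contract.
```python
import requests

BASE_URL = "https://alation.example.com"   # placeholder instance URL
HEADERS = {"TOKEN": "<api-token>"}         # placeholder auth header

# Hypothetical payload shape for a business glossary term
terms = [
    {"title": "Earned Premium",
     "description": "Portion of written premium for the elapsed policy term."},
]

for term in terms:
    resp = requests.post(f"{BASE_URL}/api/glossary_terms/",  # hypothetical path
                         json=term, headers=HEADERS)
    resp.raise_for_status()  # fail loudly if the import is rejected
```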
9. Intuitive user interfaces to access metadata
▪ Users and Stewards actively curate the pages
▪ Natural-language search to easily discover unknowns
▪ Everyone collaborates and communicates
▪ Query intelligently against source systems
10. Data exploration and ETL implementations
▪ Explore, validate, and analyze existing data sets
▪ Curate new data sets for model development
▪ Construct ETL flows with embedded AI/modeling components
▪ Release ETL flows to the production environment
▪ Provide runtime environments to trigger, manage, and monitor ETL flows in production
11. Leverage technical stack and skills across Personas
[Slide diagram: a centralized Linux server on Azure Cloud facilitates access to data and fosters collaboration, offering browser-based user interfaces with user/task-specific interaction modes. It draws on centralized or ad-hoc data sources and the data lake, uses available or spun-up processing resources (leveraging the best storage and compute resources), integrates with the metadata system, and promotes work to Dataiku deployment servers for enterprise-grade operationalization on production systems.]
12. Containerization in building model API services
▪ Standardize the runtime environment using commonly used ML libraries for development and production
▪ Elastically scale the system capacity for the development environment
▪ Easily migrate system stacks from the development environment to production
▪ Build CI/CD pipelines and deployment environments based on open standards
▪ Monitor and ensure the health of model implementations in production
13. Containerize models as cloud-native applications
[Slide diagram: client applications call containerized model services managed by an orchestration layer]
We observed improved agility in development, more portability in deployment, and better elasticity in production
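A minimal sketch of the kind of model API service that gets containerized here, using Flask for illustration; the model file, route, and port are assumptions, not Zurich's implementation.
```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Illustrative model file; in a container image this is baked in at build time
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/score", methods=["POST"])
def score():
    features = request.get_json()["features"]     # e.g. [1.0, 2.0, 3.0]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # the container exposes this port
```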
14. DevOps in data & analytics
▪ For platform administrators
▪ Codify the installation and configuration of key components in the ecosystem
▪ Streamline the process of testing and upgrading systems to newer versions
▪ Automate system backup and restoration
▪ For model services developers
▪ Standardize the deployment pipelines to reduce the effort per project
▪ Increase the agility of deploying applications from development to production
▪ Reduce the time to fix bugs after production releases
16. Analytical platforms fitting into different scenarios are integrated as an ecosystem
Ideation → Model Build → Model Deployment → Model Execution → Model Monitoring
18. Zurich Insurance Group (Zurich), headquartered and founded in Switzerland, is a leading multi-line insurance group with more than 140 years’ experience serving businesses worldwide, including over 100 years in North America. We are committed to delivering broad and flexible insurance solutions to our customers and helping them understand, manage and minimize risk.
Through member companies in North America, Zurich is a leading commercial property-casualty insurance provider serving small businesses, mid-sized and large companies, including multinational corporations.
▪ Approximately 55,000 employees
▪ Managing complex risks for 7,600 international programs through our global network
▪ Achieving USD 5.3 billion in business operating profit (BOP) in 2019
▪ Providing comprehensive solutions and insights for 25 industries
▪ Insuring more than 215,500 customers
▪ Insuring more than 90 percent of the Fortune 500
The Alation Data Catalog and its logo are used with kind permission of Alation, Inc.
The Dataiku DSS and its logo are used with kind permission of Dataiku, Inc.
The Domino Data Lab and its logo are used with kind permission of Domino Data Lab, Inc.
Use of these marks does not imply endorsement of the products.