This document discusses developing analytics applications using machine learning on Azure Databricks and Apache Spark. It begins with an introduction to Richard Garris and the agenda. It then covers the data science lifecycle including data ingestion, understanding, modeling, and integrating models into applications. Finally, it demonstrates end-to-end examples of predicting power output, scoring leads, and predicting ratings from reviews.
Data Con LA 2020
Description
Data warehouses are not enough. Data lakes are the backbone of a modern data environment. Data Lakes are best built leveraging unique services of the cloud provider to reduce operations complexity. This session will explain why everyone's talking about data lakes, break down the best services in Azure to build a Data Lake, and walk through code for querying and loading with Azure Databricks and Event Hubs for Kafka. Attendees will leave the session with a firm grasp of why we build data lakes and how Azure Databricks fits in for ETL and querying.
Speaker
Dustin Vannoy, Dustin Vannoy Consulting, Principal Data Engineer
DataOps for the Modern Data Warehouse on Microsoft Azure @ NDCOslo 2020 - Lace Lofranco
Talk Description:
The Modern Data Warehouse architecture is a response to the emergence of Big Data, Machine Learning and Advanced Analytics. DevOps is a key aspect of successfully operationalising a multi-source Modern Data Warehouse.
While there are many examples of how to build CI/CD pipelines for traditional applications, applying these concepts to Big Data analytical pipelines is a relatively new and emerging area. In this demo-heavy session, we will see how to apply DevOps principles to an end-to-end data pipeline built on the Microsoft Azure Data Platform with technologies such as Data Factory, Databricks, Data Lake Gen2, Azure Synapse, and Azure DevOps.
Resources: https://aka.ms/mdw-dataops
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara... - Databricks
At Lennox International, we have thousands of IoT-connected devices streaming data into the Azure platform at a minute-level polling interval. The challenge was to use these data sets, combine them with external data sources such as weather, and predict equipment failure with high accuracy, along with the influencing patterns and parameters. Previously the team was using a combination of on-premises and desktop tools to run algorithms on a sample set of devices. The result was low accuracy (around 65%) from a process that took more than 6 hours.
The team had to work through several data orchestration challenges and identify a machine learning platform that enabled collaboration between our engineering SMEs, data engineers, and data scientists. The team decided to use Azure Databricks to build the data engineering pipelines and appropriate machine learning models, and to extract predictions using PySpark. To enhance the sophistication of the learning, the team worked with a variety of Spark ML models such as gradient-boosted trees and random forests. The team also implemented stacking and ensemble methods using H2O Driverless AI and Sparkling Water on Azure Databricks clusters, which can scale up to 1,000 cores.
Join us in this session and see how this resulted in models that run in 40 minutes with minimal tuning and predict failures with accuracy of about 90%.
Lake Database, Database Templates, and Map Data in Azure Synapse Analytics - Erwin de Kreuk
Database templates in Synapse Analytics are blueprints that organizations can use to plan, architect, and design solutions.
How can we use these database templates in day-to-day business, in order to speed up and automate this process?
The Map Data tool can help us with that.
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...) - Michael Rys
SQLBits 2020 presentation on how you can build solutions based on the modern data warehouse pattern with Azure Synapse Spark and SQL including demos of Azure Synapse.
In this session we will delve into the world of Azure Databricks and analyze why it is becoming a fundamental tool for data scientists and data engineers working in conjunction with Azure services.
Building a Modern Data Platform with Microsoft Azure - Dmitry Anoshin
This presentation covers cloud history and Microsoft Azure data analytics capabilities. It includes a real-world example of data warehouse modernization, and finishes with an alternative solution on Azure using Snowflake and Matillion ETL.
AWS Cloud Kata 2013 | Singapore - Getting to Scale on AWS - Amazon Web Services
This session will focus on how to get from 'Minimum Viable Product' (MVP) to scale. It will also explain how to deal with unpredictable demand and how to build a scalable business. Attend this session to learn how to:
Scale web servers and app services with Elastic Load Balancing and Auto Scaling on Amazon EC2
Scale your storage on Amazon S3 and S3 Reduced Redundancy Storage
Scale your database with Amazon DynamoDB, Amazon RDS, and Amazon ElastiCache
Scale your customer base by reaching customers globally in minutes with Amazon CloudFront
A talk about Azure Synapse aimed at helping people who are not data experts understand what Synapse is and how to integrate it with other technologies.
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service): a tool for curating and processing massive amounts of data, for developing, training, and deploying models on that data, and for managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark, and it is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming, and the Machine Learning Library (MLlib). It has built-in integration with many data sources, a workflow scheduler, real-time workspace collaboration, and performance improvements over traditional Apache Spark.
The Developer Data Scientist – Creating New Analytics Driven Applications usi... - Microsoft Tech Community
The developer world is changing as we create and generate new data patterns and handling processes within our applications. With the massive interest in machine learning and advanced analytics, how can we as developers build intelligence directly into our applications, integrated with the data and data paths we are creating? The answer is Azure Databricks. By attending this session you will be able to confidently develop smarter, more intelligent applications and solutions that can be continuously built upon and that scale with the growing demands of a modern application estate.
Data Quality in the Data Hub with RedPoint Global - Caserta
At a Big Data Warehousing Meetup, George Corugedo, CTO of RedPoint Global, demonstrated how to use your big data platform for data integration, data quality, and identity resolution to provide a true 360-degree view of your customer on Hadoop using the RedPoint product.
For more information or questions, please contact us at www.casertaconcepts.com.
This is a 200-level run-through of the Microsoft Azure Big Data Analytics cloud data platform, based on the Cortana Intelligence Suite offerings.
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics - Informatica
This presentation is geared toward enterprise architects and senior IT leaders looking to drive more value from their data by learning about cloud data lake management.
As businesses focus on leveraging big data to drive digital transformation, technology leaders are struggling to keep pace with the high volume of data coming in at high speed and rapidly evolving technologies. What's needed is an approach that helps you turn petabytes into profit.
Cloud data lakes and cloud data warehouses have emerged as a popular architectural pattern to support next-generation analytics. Informatica's comprehensive AI-driven cloud data lake management solution natively ingests, streams, integrates, cleanses, governs, protects and processes big data workloads in multi-cloud environments.
Please leave any questions or comments below.
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat... - Microsoft Tech Community
In this session you will learn how to develop data pipelines in Azure Data Factory and build a cloud-based analytical solution that adopts modern data warehouse approaches with Azure SQL Data Warehouse, implementing incremental ETL orchestration at scale. With the multiple sources and types of data available in an enterprise today, Azure Data Factory enables full integration of data and direct storage in Azure SQL Data Warehouse for the powerful, high-performance query workloads that drive the majority of enterprise applications and business intelligence applications.
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO... - The Hive
Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous, and the default increasingly is to capture and store any and all data, in anticipation of potential future strategic value. These differences in data heterogeneity, scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of very large datasets that are stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools like SQL and BI and newer tools, e.g., for machine learning and stream analytics. These new systems are necessarily based on scale-out architectures for both storage and computation.
Hadoop has become a key building block in the new generation of scale-out systems. On the storage side, HDFS has provided a cost-effective and scalable substrate for storing large heterogeneous datasets. However, as key customer and systems touch points are instrumented to log data, and Internet of Things applications become common, data in the enterprise is growing at a staggering pace, and the need to leverage different storage tiers (ranging from tape to main memory) is posing new challenges, leading to caching technologies, such as Spark. On the analytics side, the emergence of resource managers such as YARN has opened the door for analytics tools to bypass the Map-Reduce layer and directly exploit shared system resources while computing close to data copies. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit.
While Hadoop is widely recognized and used externally, Microsoft has long been at the forefront of Big Data analytics, with Cosmos and Scope supporting all internal customers. These internal services are a key part of our strategy going forward, and are enabling new state of the art external-facing services such as Azure Data Lake and more. I will examine these trends, and ground the talk by discussing the Microsoft Big Data stack.
The breadth and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I’ll first define exactly what AI, ML, and deep learning are, and then go over the various Microsoft AI and ML products and their use cases.
In this video from the ISC Big Data'14 Conference, Ted Willke from Intel presents: The Analytics Frontier of the Hadoop Eco-System.
"The Hadoop MapReduce framework grew out of an effort to make it easy to express and parallelize simple computations that were routinely performed at Google. It wasn’t long before libraries, like Apache Mahout, were developed to enable matrix factorization, clustering, regression, and other more complex analyses on Hadoop. Now, many of these libraries and their workloads are migrating to Apache Spark because it supports a wider class of applications than MapReduce and is more appropriate for iterative algorithms, interactive processing, and streaming applications. What’s next beyond Spark? Where is big data analytics processing headed? How will data scientists program these systems? In this talk, we will explore the current analytics frontier, the popular debates, and discuss some potentially clever additions. We will also share the emergent data science applications and collaborative university research that inform our thinking."
Learn more:
http://www.isc-events.com/bigdata14/schedule.html
and
http://www.intel.com/content/www/us/en/software/intel-graph-solutions.html
Watch the video presentation: https://www.youtube.com/watch?v=qlfx495Ekw0
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF - MLconf
Abstract: How graphs became just another big data primitive
Graph-shaped data is used in product recommendation systems, social network analysis, network threat detection, image de-noising, and many other important applications. And a growing number of these applications will benefit from parallel distributed processing for graph feature engineering, model training, and model serving. But today’s graph tools are riddled with limitations and shortcomings, such as a lack of language bindings, streaming support, and seamless integration with other popular data services. In this talk, we’ll argue that the key to doing more with graphs is doing less with specialized systems and more with systems already good at handling data of other shapes. We’ll examine some practical data science workflows to further motivate this argument, and we’ll talk about some of the things that Intel is doing with the open source community and industry to make graphs just another big data primitive.
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn... - Cambridge Semantics
Thomas Cook, director of sales, Cambridge Semantics, offers a primer on graph database technology and the rapid growth of knowledge graphs at Data Summit 2020 in his presentation titled "AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Connected World".
An introduction to streaming data: the difference between batch processing and stream processing, research issues in streaming data processing, performance evaluation metrics, and tools for stream processing.
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum - VMware Tanzu
Data is at the center of digital transformation; using data to drive action is how transformation happens. But data is messy, and it’s everywhere. It’s in the cloud and on-premises. It’s in different types and formats. By the time all this data is moved, consolidated, and cleansed, it can take weeks to build a predictive model.
Even with data lakes, efficiently integrating multi-structured data from different data sources and streams is a major challenge. Enterprises struggle with a stew of data integration tools, application integration middleware, and various data quality and master data management software. How can we simplify this complexity to accelerate and de-risk analytic projects?
The data warehouse—once seen as only for traditional business intelligence applications — has learned new tricks. Join James Curtis from 451 Research and Pivotal’s Bob Glithero for an interactive discussion about the modern analytic data warehouse. In this webinar, we’ll share insights such as:
- Why after much experimentation with other architectures such as data lakes, the data warehouse has reemerged as the platform for integrated operational analytics
- How consolidating structured and unstructured data in one environment—including text, graph, and geospatial data—makes in-database, highly parallel, analytics practical
- How bringing open-source machine learning, graph, and statistical methods to data accelerates analytical projects
- How open-source contributions from a vibrant community of Postgres developers reduces adoption risk and accelerates innovation
We thank you in advance for joining us.
Presenters: Bob Glithero, PMM, Pivotal, and James Curtis, Senior Analyst, 451 Research
How to Build Modern Data Architectures Both On Premises and in the Cloud - VMware Tanzu
Enterprises are beginning to consider the deployment of data science and data warehouse platforms on hybrid (public cloud, private cloud, and on premises) infrastructure. This delivers the flexibility and freedom of choice to deploy your analytics anywhere you need it and to create an adaptable and agile analytics platform.
But the market is conspiring against customer desire for innovation...
Leading public cloud vendors are interested in pushing their new, but proprietary, analytic stacks, locking customers into subpar Analytics as a Service (AaaS) for years to come.
In tandem, Legacy Data Warehouse vendors are trying to extend the lifecycle of their costly and aging appliances with new features of marginal value, simply imitating the same limiting models of public cloud vendors.
New vendors are coming up with interesting ideas, but these ideas are often lacking critical features that don’t provide support for hybrid solutions, limiting the immediate value to users.
It is 2017—you can, in fact, have your analytics cake and eat it too! Solve your short term costs and capabilities challenges, and establish a long term hybrid data strategy by running the same open source analytics platform on your infrastructure as it exists today.
In this webinar you will learn how Pivotal can help you build a modern analytical architecture able to run on your public, private cloud, or on-premises platform of your choice, while fully leveraging proven open source technologies and supporting the needs of diverse analytical users.
Let’s have a productive discussion about how to deploy a solid cloud analytics strategy.
Presenter: Jacque Istok, Head of Data Technical Field for Pivotal
https://content.pivotal.io/webinars/jul-20-how-to-build-modern-data-architectures-both-on-premises-and-in-the-cloud
Bitkom Cray presentation on HPC affecting big data analytics in FS - Philip Filleul
High-value analytics in financial services is being enabled by graph, machine learning, and Spark technologies. To make these real at production scale, HPC technologies are more appropriate than commodity clusters.
Independent of the source of data, the integration and analysis of event streams gets more important in a world of sensors, social media streams, and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events.
So far this is mostly a development experience, with frameworks such as Oracle Event Processing, Apache Storm, or Spark Streaming. With Oracle Stream Analytics, analytics on event streams can be put in the hands of the business analyst. It simplifies the implementation of event processing solutions so that every business analyst is able to graphically and declaratively define event stream processing pipelines, without having to write a single line of code or continuous query language (CQL). Event processing is no longer “complex”! This session presents Oracle Stream Analytics directly on some selected demo use cases.
At Data-centric Architecture Forum 2020 Thomas Cook, our Sales Director of AnzoGraph DB, gave his presentation "Knowledge Graph for Machine Learning and Data Science". These are his slides.
DataMass Summit - Machine Learning for Big Data in SQL Server - Łukasz Grala
A session showing Machine Learning Server (machine learning algorithms in R and Python) as well as the ability to use JSON data in SQL Server and to connect to data residing in HDFS, Hadoop, or Spark via PolyBase in SQL Server, so that this data can be used for analysis and prediction through models in R or Python.
From Pipelines to Refineries: Scaling Big Data Applications - Databricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data.
Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations.
This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
Azure Databricks for Data Scientists
1. The Developer Data Scientist – Creating New Analytics Driven Applications Using Azure Databricks® and Apache Spark™
2. About Richard
Richard Garris
● Principal Solutions Architect
● 14+ years in data management and advanced analytics
● Advises customers on their data science and advanced analytics projects
● Degrees from The Ohio State University and Carnegie Mellon University
3. Agenda
- Introduction to Data Science
- Data Science Lifecycle
- Data Ingestion
- Data Understanding & Exploration
- Modeling
- Integrating Machine Learning in Your Application
- End-to-End Example Use Cases
5. AI is Changing the World
What is the secret to AI?
AlphaGo, self-driving cars, Alexa
6. AI is Changing the World
What do these companies have in common?
Alphabet, Tesla, Amazon
7. Hardest Part of AI Isn’t AI, It’s Big Data
[Figure 1 from “Hidden Technical Debt in Machine Learning Systems,” Google, NIPS 2015: only a small fraction of a real-world ML system is composed of the ML code (the small box in the middle). The required surrounding infrastructure is vast and complex: configuration, data collection, data verification, feature extraction, machine resource management, analysis tools, process management tools, serving infrastructure, and monitoring.]
8. Business Value of Data Science
Present the Right Offer at the Right Time
• Businesses have to adapt faster to change
• Data-driven decisions need to be made quickly and accurately
• Customers expect faster responses
10. Agile Modeling Process
Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results
11. Data Scientists or Data Janitors?
“3 out of 5 data scientists spend 80% of their time collecting, cleaning and organizing data”
12. Data Understanding
Schema - understand the field names / data types
Metadata Management - understand descriptions and business meaning
Data Quality - data validation / profiling / checks
Exploration / Visualization - scatter plots, charts, correlations
Summary Statistics - average, min, max, range, median, standard deviation
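Most of these checks map directly onto a handful of DataFrame calls. A minimal PySpark sketch of this step; the input path and column set are hypothetical, for illustration only:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-understanding").getOrCreate()

# Hypothetical input file; any tabular source works the same way
df = spark.read.csv("/data/customers.csv", header=True, inferSchema=True)

# Schema: field names and data types
df.printSchema()

# Summary statistics: count, mean, stddev, min, max for each numeric column
df.describe().show()

# Data quality: count nulls per column as a quick profiling check
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()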
25. What is a Data Science Platform?
Gartner defines a Data Science Platform as “an end-to-end platform for developing and deploying models” using sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data mining techniques.
26. What is a Model?
A simplified and idealized representation of the real world.
27. What Does Modeling Mean?
A class is a model; a model of a building; a data model.
class Employee {
  FirstName : String
  LastName : String
  DOB : java.util.Date
  Grades : Seq[Grade]
}
29. Two Broad Categories of Models
● Supervised learning: prediction
- Classification (binary or multiclass): predict a category (label)
- Regression: predict a number (target)
● Unsupervised learning: discovery
- Clustering: find groupings based on patterns
- Density estimation: match data with a distribution pattern
- Dimensionality reduction: reduce the number of columns
- Similarity search: find similar data
- Frequent items (or association rules): find relationships between variables
30. Model Category Use Cases
● Anomaly detection
- Density estimation: “Is this observation uncommon?”
- Similarity search: “How far is it from other observations?”
- Clustering: “Are there groups of strange observations?”
● Lead scoring / recommendation
- Classification: “Will this user become a buyer?”
- Regression: “How much will he/she spend?”
- Similarity search: “What products did similar users buy?”
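As a sketch of the lead-scoring case, a binary classifier over behavioral features; the leads_df DataFrame, its feature columns, and the became_buyer label are hypothetical names:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble hypothetical behavioral features into a single vector column
assembler = VectorAssembler(
    inputCols=["page_views", "email_clicks", "days_since_signup"],
    outputCol="features")

lr = LogisticRegression(featuresCol="features", labelCol="became_buyer")

train = assembler.transform(leads_df)   # leads_df: historical leads with a known outcome
model = lr.fit(train)
scored = model.transform(train).select("probability", "prediction")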
33. Supervised Methods in MLlib
Classification:
- Logistic regression w/ elastic net
- Naive Bayes
- Streaming logistic regression
- Linear SVMs
- Decision trees
- Random forests
- Gradient-boosted trees
- Multilayer perceptron
- One-vs-rest
- DeepImagePredictor
Regression:
- Least squares w/ elastic net
- Isotonic regression
- Decision trees
- Random forests
- Gradient-boosted trees
- Streaming linear methods
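All of these expose the same Estimator interface, so swapping algorithms is typically a one-line change. A sketch, assuming train_df is a hypothetical, already-prepared DataFrame with features and label columns:

from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.regression import GBTRegressor

# Classification: either estimator fits the same (features, label) DataFrame
clf = LogisticRegression(maxIter=100)
# clf = RandomForestClassifier(numTrees=50)   # drop-in replacement
clf_model = clf.fit(train_df)

# Regression follows the same pattern, here with a hypothetical numeric "target" column
reg_model = GBTRegressor(labelCol="target").fit(train_df)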
34. But What is a Model Really?
A model is a complex pipeline of components:
- Data sources
- Joins
- Featurization logic
- Algorithm(s)
- Transformers
- Estimators
- Tuning parameters
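In Spark ML this pipeline-of-components view is literal: transformers and an estimator compose into a single Pipeline, and fitting it yields one deployable PipelineModel. A minimal sketch; the workclass column matches the StringIndexer metadata shown later in the deck, while the age column and the dataset DataFrame are hypothetical:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

indexer = StringIndexer(inputCol="workclass", outputCol="workclassIdx")       # featurization
assembler = VectorAssembler(inputCols=["workclassIdx", "age"], outputCol="features")
lr = LogisticRegression(maxIter=100, regParam=0.0)                            # estimator

lrPipeline = Pipeline(stages=[indexer, assembler, lr])
lrModel = lrPipeline.fit(dataset)   # dataset: the training DataFrame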
38. Productionizing Models Today
Data Science: develop a prototype model using Python/R.
Data Engineering: re-implement the model for production (Java).
39. Problems with Productionizing Models
Developing a prototype model in Python/R and re-implementing it for production in Java means:
- Extra work
- Different code paths
- Data science does not translate to production
- Slow to update models
40. MLlib 2.X Model Serialization
Data Science: develop a prototype model using Python/R, then persist the model or pipeline:
model.save("s3n://...")
Data Engineering: load the pipeline (Scala/Java) and deploy it in production:
Model.load("s3n://...")
41. MLlib 2.X Model Serialization Snippet
Scala:
val lrModel = lrPipeline.fit(dataset)
// Save the Model
lrModel.write.save("/models/lr")
Python:
lrModel = lrPipeline.fit(dataset)
# Save the Model
lrModel.write().save("/models/lr")
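The load side is symmetric: in production the persisted pipeline is read back and applied to new data. A Python sketch of that round trip (the slides show the production side in Scala/Java; newData is a hypothetical DataFrame with the training schema):

from pyspark.ml import PipelineModel

# Load the persisted pipeline and score new records with it
lrModel = PipelineModel.load("/models/lr")
predictions = lrModel.transform(newData)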
42. Model Serialization Output
Code:
// List Contents of the Model Dir
dbutils.fs.ls("/models/lr")
Output: remember, this is a pipeline model, and these are the stages!
43. Transformer Stage (StringIndexer)
Code:
// Cat the contents of the Metadata dir
dbutils.fs.head("/models/lr/stages/00_strIdx_bb9728f85745/metadata/part-00000")
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/lr/stages/00_strIdx_bb9728f85745/data/"))
Output (metadata and params; the data directory holds the hashmap of labels to indices):
{
  "class": "org.apache.spark.ml.feature.StringIndexerModel",
  "timestamp": 1488120411719,
  "sparkVersion": "2.1.0",
  "uid": "strIdx_bb9728f85745",
  "paramMap": {
    "outputCol": "workclassIdx",
    "inputCol": "workclass",
    "handleInvalid": "error"
  }
}
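The same stage parameters can also be inspected in memory, without reading the metadata files; a small sketch, assuming lrModel is the fitted PipelineModel from the earlier snippet:

# Walk the fitted pipeline's stages and print each stage's parameters
for stage in lrModel.stages:
    print(stage.uid)
    print(stage.explainParams())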
44. Estimator Stage (LogisticRegression)
Code:
// Cat the contents of the Metadata dir
dbutils.fs.head("/models/lr/stages/18_logreg_325fa760f925/metadata/part-00000")
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/lr/stages/18_logreg_325fa760f925/data/"))
Output (model params; the data directory holds the intercept and coefficients):
{
  "class": "org.apache.spark.ml.classification.LogisticRegressionModel",
  "timestamp": 1488120446324,
  "sparkVersion": "2.1.0",
  "uid": "logreg_325fa760f925",
  "paramMap": {
    "predictionCol": "prediction",
    "standardization": true,
    "probabilityCol": "probability",
    "maxIter": 100,
    "elasticNetParam": 0.0,
    "family": "auto",
    "regParam": 0.0,
    "threshold": 0.5,
    "fitIntercept": true,
    "labelCol": "label"
  }
}
45. Estimator Stage (DecisionTree)
Code:
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/"))
// Re-save as JSON
sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/").write.json("/models/json/dt")
Output: the decision tree splits.
47. Databricks + ML Pipelines: Ideal Modeling Tool
Data science is highly iterative and agile:
● Lots of data sources
● Lots of dirty data
● Lots and lots of data
ML Pipelines and notebooks are the ideal way to experiment with new methods, data, and features in order to minimize error.
48. Databricks Runtime
Elastic, Fully Managed, Highly Tuned Engine
FULLY MANAGED CLOUD SERVICE
• Auto-configured multi-user elastic clusters
• Reliable sharing with fault isolation and workload preemption
PERFORMANCE OPTIMIZATIONS
• Increases performance by 5x (TPC benchmark)
• Connector optimizations for cloud (Kafka, S3 and Kinesis)
COST OPTIMIZED / LINEAR SCALING
• 2x nodes: time cut in half
• 2x data, 2x nodes: time constant
• Cost of 10 nodes for 10 hours equal to 100 nodes for 1 hour
DATABRICKS UNIFIED RUNTIME: Databricks I/O, Databricks Serverless
49. Databricks Collaborative Workspace
Frictionless Collaboration Enabling Faster Innovation
Secure collaboration for fast feedback loops, with single-click access to clusters.
DATA ENGINEER - Production Jobs
FAST, RELIABLE AND SECURE JOBS
• Executes jobs 30-50% faster
• Notebooks to production jobs with one click
• Debug faster with logs and the Spark history UI
DATA SCIENTIST - Interactive Notebooks
ANALYZE DATA WITH NOTEBOOKS
• Multi-language: SQL, R, Scala, Python
• Advanced analytics (Graph, ML & DL)
• Built-in visualization, including D3 & ggplot
BUSINESS SME - Dashboards
BUILD DASHBOARDS
• Publish insights
• Real-time updates
• Interactive reports
50. Databricks’ Approach to Accelerate Innovation
INCREASE PERFORMANCE: by more than 5x, and reduce TCO by more than 70%
INCREASE PRODUCTIVITY: of data science teams by 4-5x
STREAMLINE ANALYTIC WORKFLOWS: reducing deployment time to minutes
REDUCE RISK: and enable innovation with out-of-the-box enterprise security and compliance
UNIFY ANALYTICS WITH APACHE SPARK: eliminating disparate tools
[Architecture diagram: IoT / streaming data, Hadoop storage, cloud storage, and data warehouses feed an ingest → explore → model → dashboard → data product flow shared by data engineers, data scientists/analysts, and business SMEs, replacing big data cluster setup, optimization, and break-fix. The stack combines a unified engine (SQL, Streaming, MLlib, Graph) with open APIs and libraries (ML libraries, streaming, statistics packages, ETL, SQL), plus Databricks optimizations and managed cloud service: Databricks Enterprise Security, the Databricks Collaborative Workspace (Databricks Interactive, Databricks Production), and the Databricks Runtime (Databricks I/O, Databricks Serverless).]