- The document discusses machine learning and ML.NET. It begins with an introduction of the speaker and their background in machine learning.
- Key topics that will be covered include machine learning, ML.NET, Parquet.NET, using machine learning in production, and relevant Azure tools for data and machine learning.
- Examples provided will demonstrate sentiment analysis, finding patterns in taxi fare data, image recognition, and more to illustrate machine learning algorithms and best practices.
Cosmos DB Real-time Advanced Analytics Workshop (Databricks)
The workshop implements an innovative fraud detection solution as a PoC for a bank that provides payment processing services for commerce to merchant customers all across the globe, helping them save costs by applying machine learning and advanced analytics to detect fraudulent transactions. Since their customers are around the world, the right solution should minimize any latency experienced when using the service by distributing as much of the solution as possible, as closely as possible, to the regions in which their customers use the service. The workshop designs a data pipeline solution that leverages Cosmos DB for both the scalable ingest of streaming data and the globally distributed serving of both pre-scored data and machine learning models. Cosmos DB’s major advantage when operating at a global scale is its high concurrency with low latency and predictable results.
This combination is unique to Cosmos DB and ideal for the bank's needs. The solution leverages the Cosmos DB change feed in concert with Azure Databricks Delta and Spark capabilities to enable a modern data warehouse solution that can be used to create risk-reduction solutions for scoring transactions for fraud, both in an offline, batch approach and in a near real-time, request/response approach. https://github.com/Microsoft/MCW-Cosmos-DB-Real-Time-Advanced-Analytics Takeaway: How to leverage Azure Cosmos DB + Azure Databricks along with Spark ML for building innovative advanced analytics pipelines.
Big data never stops and neither should your Spark jobs. They should not stop when they see invalid input data. They should not stop when there are bugs in your code. They should not stop because of I/O-related problems. They should not stop because the data is too big. Bulletproof jobs not only keep working but they make it easy to identify and address the common problems encountered in large-scale production Spark processing: from data quality to code quality to operational issues to rising data volumes over time.
In this session you will learn three key principles for bulletproofing your Spark jobs, together with the architecture and system patterns that enable them. The first principle is idempotence. Exemplified by Spark 2.0 Idempotent Append operations, it enables 10x easier failure management. The second principle is row-level structured logging. Exemplified by Spark Records, it enables 100x (yes, one hundred times) faster root cause analysis. The third principle is invariant query structure. It is exemplified by Resilient Partitioned Tables, which allow for flexible management of large scale data over long periods of time, including late arrival handling, reprocessing of existing data to deal with bugs or data quality issues, repartitioning already written data, etc.
These patterns have been successfully used in production at Swoop in the demanding world of petabyte-scale online advertising.
Frustration-Reduced PySpark: Data engineering with DataFrames (Ilya Ganelin)
In this talk I discuss my recent experience working with Spark DataFrames in Python. For DataFrames, the focus will be on usability. Specifically, a lot of the documentation does not cover common use cases like the intricacies of creating data frames, adding or manipulating individual columns, and doing quick-and-dirty analytics.
Lens: Data exploration with Dask and Jupyter widgets (Víctor Zabalza)
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will present Lens (https://github.com/asidatascience/lens), a Python package which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask (Víctor Zabalza)
# Talk given at PyCon UK 2017
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will detail the inner workings of a Python package that we have built which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
Machine learning techniques are powerful, but building and deploying such models for production use require a lot of care and expertise.
A lot of books, articles, and best practices have been written and discussed on machine learning techniques and feature engineering, but putting those techniques to use in a production environment is usually forgotten and underestimated. The aim of this talk is to shed some light on current machine learning deployment practices and go into detail on how to deploy sustainable machine learning pipelines.
A talk on EDHREC, a service for Magic: The Gathering deck recommendations. I discuss the algorithms used, my infrastructure, and some lessons learned about building data science applications.
10 concepts the enterprise decision maker needs to understand about Hadoop (Donald Miner)
Way too many enterprise decision makers have clouded and uninformed views of how Hadoop works and what it does. Donald Miner offers high-level observations about Hadoop technologies and explains how Hadoop can shift the paradigms inside of an organization, based on his report Hadoop: What You Need To Know—Hadoop Basics for the Enterprise Decision Maker, forthcoming from O’Reilly Media.
After a basic introduction to Hadoop and the Hadoop ecosystem, Donald outlines 10 basic concepts you need to understand to master Hadoop:
Hadoop masks being a distributed system: what it means for Hadoop to abstract away the details of distributed systems and why that’s a good thing
Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
Hadoop runs on commodity hardware: an honest definition of commodity hardware and why this is a good thing for enterprises
Hadoop handles unstructured data: why Hadoop is better for unstructured data than other data systems from a storage and computation perspective
In Hadoop, you load data first and ask questions later: the differences between schema-on-read and schema-on-write and the drawbacks this represents
Hadoop is open source: what it really means for Hadoop to be open source from a practical perspective, not just a “feel good” perspective
HDFS stores the data but has some major limitations: an overview of HDFS (replication, not being able to edit files, and the NameNode)
YARN controls everything going on and is mostly behind the scenes: an overview of YARN and the pitfalls of sharing resources in a distributed environment and the capacity scheduler
MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
The Hadoop ecosystem is constantly growing and evolving: an overview of current tools such as Spark and Kafka and a glimpse of some things on the horizon
From Pipelines to Refineries: scaling big data applications with Tim Hunter (Databricks)
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
Introduction to Analytics with Azure Notebooks and Python (Jen Stirrup)
Introduction to Analytics with Azure Notebooks and Python for Data Science and Business Intelligence. This is one part of a full day workshop on moving from BI to Analytics
Think Like Spark: Some Spark Concepts and a Use Case (Rachel Warren)
A deeper explanation of Spark's evaluation principles, including lazy evaluation, the Spark execution environment, and the anatomy of a Spark job (tasks, stages, query execution plan), with one use case to demonstrate these concepts.
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch... (sparktc)
At the sold-out Spark & Machine Learning Meetup in Brussels on October 27, 2016, Nick Pentreath of the Spark Technology Center teamed up with Jean-François Puget of IBM Analytics to deliver a talk called Creating an end-to-end Recommender System with Apache Spark and Elasticsearch.
Jean-François and Nick started with a look at the workflow for recommender systems and machine learning, then moved on to data modeling and using Spark ML for collaborative filtering. They closed with a discussion of deploying and scoring the recommender models, including a demo.
Applied Machine Learning using H2O, Python and R Workshop (Avkash Chauhan)
Note: Get all workshop content at - https://github.com/h2oai/h2o-meetups/tree/master/2017_02_22_Seattle_STC_Meetup
Basic knowledge of R/python and general ML concepts
Note: This is bring-your-own-laptop workshop. Make sure you bring your laptop in order to be able to participate in the workshop
Level: 200
Time: 2 Hours
Agenda:
- Introduction to ML, H2O and Sparkling Water
- Refresher of data manipulation in R & Python
- Supervised learning
---- Understanding linear regression model with an example
---- Understanding binomial classification with an example
---- Understanding multinomial classification with an example
- Unsupervised learning
---- Understanding k-means clustering with an example
- Using machine learning models in production
- Sparkling Water Introduction & Demo
Introduction to Mahout and Machine Learning (Varad Meru)
This presentation gives an introduction to Apache Mahout and Machine Learning. It presents some of the important Machine Learning algorithms implemented in Mahout. Machine Learning is a vast subject; this presentation is only an introductory guide to Mahout and does not go into lower-level implementation details.
Dapper: the microORM that will change your life (Davide Mauri)
ORM or stored procedures? Code First or Database First? Ad-hoc queries? Impedance mismatch? If you're a developer, or a DBA working with developers, you have heard all these terms at least once in your life, usually in the middle of a heated discussion about one or the other. Well, thanks to StackOverflow's Dapper, those fights are over. Dapper is a blazing-fast microORM that allows developers to map SQL queries to classes automatically, leaving (and encouraging) the usage of stored procedures, parameterized statements and all the good stuff that SQL Server offers (JSON and TVPs are supported too!). In this session I'll show how to use Dapper in your projects, from the very basics to some more complex usages that will help you create *really fast* applications without the burden of huge and complex ORMs. The days of impedance mismatch are finally over!
Hear Ryan Millay, IBM Cloudant software development manager, discuss what you need to consider when moving from world of relational databases to a NoSQL document store.
You'll learn about the key differences between relational databases and JSON document stores like Cloudant, as well as how to dodge the pitfalls of migrating from a relational database to NoSQL.
The breadth and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I’ll first define exactly what AI, ML, and deep learning are, and then go over the various Microsoft AI and ML products and their use cases.
In this talk we'll look at simple building-block techniques for predicting metrics over time based on past data, taking into account trend, seasonality and noise, using Python with Tensorflow.
As data science workloads grow, so does their need for infrastructure. But, is it fair to ask data scientists to also become infrastructure experts? If not the data scientists, then, who is responsible for spinning up and managing data science infrastructure? This talk will address the context in which ML infrastructure is emerging, walk through two examples of ML infrastructure tools for launching hyperparameter optimization jobs, and end with some thoughts for building better tools in the future.
Originally given as a talk at the PyData Ann Arbor meetup (https://www.meetup.com/PyData-Ann-Arbor/events/260380989/)
Automated machine learning (automated ML) automates feature engineering, algorithm and hyperparameter selection to find the best model for your data. The mission: Enable automated building of machine learning with the goal of accelerating, democratizing and scaling AI. This presentation covers some recent announcements of technologies related to Automated ML, and especially for Azure. The demonstrations focus on Python with Azure ML Service and Azure Databricks.
Java Developers, make the database work for you (NLJUG JFall 2010)Lucas Jellema
The general consensus among Java developers has evolved from a dogmatic striving for database independence to a much more pragmatic wish to leverage the power of the database. This session demonstrates some of the (hidden) powers of the database and how these can be utilized from Java applications using either straight JDBC or working through JPA. The Oracle database is used as an example: SQL for aggregation and analysis, flashback queries for historical comparison and trends, Virtual Private Database, complex validation, PL/SQL and collections for bulk data manipulation, views and instead-of triggers for data model morphing, server push of relevant data changes, and edition-based redefinition for release management.
- overview of role of database in JEE architecture (and a little history on how the database is perceived through the years)
- discussion on the development of database functionality
- demonstration of some powerful database features
- description of how we leveraged these features in our JSF (RichFaces)/JPA (Hibernate) application
- demo of web application based on these features
- discussion on how to approach the database
PostgreSQL Extension APIs are Changing the Face of Relational Databases | PGC... (Teresa Giacomini)
PostgreSQL is becoming the relational database of choice. An important factor in the rising popularity of Postgres is the extension APIs that allow developers to improve any database module’s behavior. As a result, Postgres users have access to hundreds of extensions today.
In this talk, we're going to first describe extension APIs. Then, we’re going to present four popular Postgres extensions, and demo their use.
* PostGIS turns Postgres into a spatial database through adding support for geographic objects.
* HLL & TopN add approximation algorithms to Postgres. These algorithms are used when real-time responses matter more than exact results.
* pg_partman makes managing partitions in Postgres easy. Through partitions, Postgres provides 5-10x higher performance for time-series data.
* Citus transforms Postgres into a distributed database. To do this, Citus shards data, performs distributed deadlock detection, and parallelizes queries.
Finally, we’ll conclude with why we think Postgres sets the way forward for relational databases.
PostgreSQL is becoming the relational database of choice. One important factor in the rising popularity of Postgres is its extension APIs. These APIs allow developers to extend any database sub-module’s behavior for higher performance, security, or new functionality. As a result, Postgres users have access to over a hundred extensions today, with more to come in the future.
In this talk, I’m going to first describe PostgreSQL’s extension APIs. These APIs are unique to Postgres, and have the potential to change the database landscape. Then, we’re going to present the four most popular Postgres extensions, show the use cases where they are applicable, and demo their usage.
PostGIS turns Postgres into a spatial database. It adds support for geographic objects, allowing location queries to be run in SQL.
HyperLogLog (HLL) & TopN add approximation algorithms to Postgres. These sketch algorithms are used in distributed systems when real-time responses to queries matter more than exact results.
pg_partman makes creating and managing partitions in Postgres easy. Through careful partition management with pg_partman, Postgres offers 5-10x higher write and query performance for time-series data.
Citus transforms Postgres into a distributed database. Citus transparently shards and replicates data, performs distributed deadlock detection, and parallelizes queries.
After demoing these popular extensions, we’ll conclude with why we think the monolithic relational database is dying and how Postgres sets a path for the future. We’ll end the talk with a Q&A.
Slides from my talk at Big Data Conference 2018 in Vilnius
Doing data science today is far more difficult than it will be in the next 5-10 years. Sharing and collaborating on data science workflows is painful, and pushing models into production is challenging.
Let’s explore what Azure provides to ease Data Scientists’ pains. What tools and services can we choose based on a problem definition, skillset or infrastructure requirements?
In this talk, you will learn about Azure Machine Learning Studio, Azure Databricks, Data Science Virtual Machines and Cognitive Services, with all the perks and limitations.
SpringPeople - Introduction to Cloud Computing (SpringPeople)
Cloud computing is no longer a passing fad. It is for real and is perhaps the most talked-about subject. Various players in the cloud ecosystem have provided definitions that are closely aligned to their sweet spot, be it infrastructure, platforms or applications.
This presentation will expose participants to a variety of cloud computing techniques, architectures and technology options, and will cover cloud fundamentals in a holistic manner spanning dimensions such as cost, operations and technology.
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our beloved cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
Key Trends Shaping the Future of Infrastructure.pdf (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Search and Society: Reimagining Information Access for Radical Futures (Bhaskar Mitra)
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
2. Who am I?
• Andy Cross; andy@elastacloud.com; @andyelastacloud
• Microsoft Regional Director
• Azure MVP for 7 years
• Co-Founder of Elastacloud, an international Microsoft Gold Partner specialising in data
3. What am I here to talk about?
• Machine Learning
• A new dotnet approach to data
• ML.NET
• Parquet.NET
• Big Data in production
• Relevant tools in Azure for data
4. .NET Core
[Diagram: Desktop, Web, Cloud, IoT and AI workloads sit on shared libraries and infrastructure, spanning .NET Core and .NET Core 3]
• .NET Core 3 expands supported workloads to include Windows Desktop, IoT & AI
• .NET Core is perfectly suited for the requirements of cloud-native, cross-platform workloads
5. Highly-compatible, targeted improvements, like last few releases
[Diagram: .NET Framework 4.8 (Windows-only: WPF, Windows Forms, UWP, ASP.NET, EF 6) alongside .NET Core 3 (cross-platform: ASP.NET Core, EF Core, AI/ML with ML.NET), both sitting on .NET Standard, with some features in both FXs]
Update .NET Framework Apps (highly compatible updates for existing apps):
• .NET Framework support unchanged (supported for life of Windows)
• XAML Islands - WinForms & WPF apps can host UWP controls
• Full access to Windows 10 APIs
Modernize Desktop Apps with .NET Core 3:
• Side-by-side support
• Self-contained EXEs
• Install .NET Core updates per your needs
6. What is machine learning?
And what does it mean to say “Artificial Intelligence”
7. Machine learning for predicting talk quality
Formal definition:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
E = historical quantities of beer consumed and corresponding audience ratings
T = predicting talk quality
P = accuracy of the talk quality prediction
8. Types of machine learning
Two most commonly used machine learning methods:
1. Supervised
• the computer is presented with example inputs and their desired outputs, given by a "teacher"
• the goal is to learn a general rule that maps inputs to outputs
2. Unsupervised
• no desired outputs ("labels") are given to the computer, leaving it on its own to find structure in the input
• the goal may be to discover hidden patterns or groups in the data
9. Examples tonight
• Sentiment analysis of website comments
• Finding mappings of historic taxi fares
• Predicting the most likely group of flowers a certain flower belongs to
• Time Series
• Image Recognition
• NOTE: We reuse existing algorithms; writing your own in most cases is akin to writing a new database just to build a website - a vast overreach of engineering
11. Machine learning best practice
Machine learning should be approached in a methodical manner; this way we are more likely to achieve accurate, reliable and generalisable models.
This is achieved by following best practice for machine learning.
Best practice mostly revolves around how the data is used.
12. Machine learning best practice
Data preparation
• Ensure that the features do not contain future data (aka time-travelling)
• Training, validation and testing data sets
Cross validation
• Think of this as using mini training and testing data sets to find a model that generalises to the problem
Validation and testing data sets
• Data that the model has never seen before – simulate the future
• Gives a final 'sanity' check of our model's performance
13. Model training
Provide an algorithm with the training data set.
The algorithm is 'tuned' to find the best parameters that minimise some measure of error.
14. Model testing
To ensure that our model has not overfit to the training data, it is imperative that we use it to predict values from unseen data.
This is how we ensure that our model is generalisable and will provide reliable predictions on future data.
Use a validation set and test sets.
15. Validation set
The validation set is randomly chosen data from the same data that the model is trained with - but not used for training.
Used to check that the trained model gives predictions that are representative of all data.
Used to prevent overfitting.
Gives a 'best' accuracy.
16. Test set
The test set data should be 'future' data.
We simulate this by selecting data from the end of the available time-frame, with a gap.
Use the trained model to predict for this.
More realistic of the model in production.
Gives a conservative estimate of the accuracy.
17. [Timeline diagram of two test configurations (Test 1 and Test 2) showing training, validation, testing and ignored data ranges between 01/01/15 and 13/11/17, with splits at 04/02/17, 09/06/17, 21/06/17 and 01/11/17]
18. Model training, validation and testing
Train with cross validation to find best parameters
Assess overfitting on validation set
Retrain with best parameters on training data set
Evaluate performance on test data sets
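To make this workflow concrete, here is a minimal sketch of a split-then-cross-validate loop using the current ML.NET MLContext API (not the 0.6-era types used in the later demos); the input class, file name, column names and trainer are illustrative assumptions only.
// Sketch only: current ML.NET (1.x) API with placeholder file and column names.
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class InputRow
{
    [LoadColumn(0)] public float Label { get; set; }
    [LoadColumn(1)] public float Feature1 { get; set; }
    [LoadColumn(2)] public float Feature2 { get; set; }
}

public static class TrainValidateTest
{
    public static void Run()
    {
        var mlContext = new MLContext(seed: 0);
        IDataView allData = mlContext.Data.LoadFromTextFile<InputRow>(
            "data.csv", hasHeader: true, separatorChar: ',');

        // Hold back 20% as a test set the model never sees during training.
        var split = mlContext.Data.TrainTestSplit(allData, testFraction: 0.2);

        var pipeline = mlContext.Transforms.Concatenate("Features", "Feature1", "Feature2")
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Label"));

        // Cross validation: mini train/test folds used to check the model generalises.
        var cvResults = mlContext.Regression.CrossValidate(split.TrainSet, pipeline, numberOfFolds: 5);

        // Retrain on the full training set, then evaluate once on the held-out test data.
        var model = pipeline.Fit(split.TrainSet);
        var metrics = mlContext.Regression.Evaluate(model.Transform(split.TestSet));
        Console.WriteLine($"R^2: {metrics.RSquared:F2}  RMSE: {metrics.RootMeanSquaredError:F2}");
    }
}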
20. Distributed Computing
• Break a larger problem into smaller tasks
• Distribute tasks around multiple computation hosts
• Aggregate results into an overall result
• Types of Distributed Compute
• HPC – Compute Bound
• Big Data – IO Bound
• Big Data – database-free, at-scale IO over flat files
21. Freed of historic constraints
• Algorithms such as we’ll see tonight are not new
• Many of them date from the 1960s
• The difference is the ability to show the algorithm more examples
• Removing IO bounds gives us access to more data
• Removing CPU bounds allows us to compute over larger domains
22. Azure Services for Distributed Compute
• Azure Batch
• Azure HDInsight
• Azure Databricks
• Bring your own partitioner
• Azure Functions
• Azure Service Fabric (Mesh)
• Virtual Machines
• ACS/AKS
24. Various places for ML
• Azure Databricks
• Azure HDInsight
• Azure Data Lake Analytics
• Azure ML
• V1
• V2 with Workbench
• V2 RC no Workbench
• R/Python in many hosts
• Functions
• Batch
• SQL Database
• C# and dotnet hosted in many places
• Typical Azure DevOps pipelines more mature for .NET
25. Azure Databricks
• Databricks Spark hosted as SaaS on Azure
• Focussed on collaboration between data scientists and data engineers
• Powered by Apache Spark
• Dev in:
• Scala
• Python
• Spark-SQL
• More?
27. Databricks Notebooks Visuals
Visualisations made easy
Use popular libraries – ggplot, matplotlib
Create a dashboard of pinned visuals
Use Pip, CRAN & JAR packages to add additional visual tools
28. Azure HDInsight
• Multipurpose Big Data platform
• Operates as a host and configurator for various tools
• Hadoop
• Hive
• Hbase
• Kafka
• Spark
• Storm
30. Data Landscape
[Diagram: data flows from .NET, through a Parquet-based big data estate, and back to .NET]
• .NET: utilising System.IO is unstructured; our evidence is that most users hand-roll some form of CSV/TSV or use JSON encoding as it is a safe space.
• First-stage transcoding from "something .NET can write to"; all subsequent data inspection and manipulation happens outside of .NET.
• Parquet: operationalisation mandates database rectangularisation to take advantage of ADO.NET.
31. Data Landscape
[Diagram: data flows from .NET, through Parquet.NET and Parquet, and back to .NET]
• .NET: utilising System.IO is unstructured; our evidence is that most users hand-roll some form of CSV/TSV or use JSON encoding as it is a safe space.
• Immediately serialize .NET POCOs to Parquet.
• Parquet-specific optimisation and inspection tools in .NET.
• .NET can read operational output from ML and Big Data natively using Parquet.Net.
Machine Learning for .NET
33. 0.6.0 Goodbye: ML.NET Pipelines
• Library evolving quickly
• Towards common approach in Spark/SciKit-Learn
• LearningPipeline weaknesses:
• Enumerable of actions not common elsewhere
• Logging and State not handled
• Bring in Context, Environment and lower level data tools IDataView
• Not all <0.6.0 features available, so later demos are “Legacy”
34. Sentiment Analysis
• By showing examples of content and a Toxic bit flag, learn what makes things Toxic
• The quality of the model reflects the quality of the data
• Representation
• Variety
• Breadth
35. Testing and Training Data
Sentiment | SentimentText
1 | ==You're cool== You seem like a really cool guy... *bursts out laughing at sarcasm*.
0 | I just want to point something out (and I'm in no way a supporter of the strange old git), but he is referred to as Dear Leader, and his father was referred to as Great Leader.
1 | ==RUDE== Dude, you are rude upload that carl picture back, or else.
0 | " : I know you listed your English as on the ""level 2"", but don't worry, you seem to be doing nicely otherwise, judging by the same page - so don't be taken aback. I just wanted to know if you were aware of what you wrote, and think it's an interesting case. : I would write that sentence simply as ""Theoretically I am an altruist, but only by word, not by my actions."". : PS. You can reply to me on this same page, as I have it on my watchlist. "
Tab-separated, two-column data set with headers.
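To show how this tab-separated layout maps onto .NET types, here is a minimal sketch of the corresponding input and prediction classes plus data loading, assuming the current ML.NET (1.x) API rather than the 0.6-era pipeline used in the demos; the class, property and file names are illustrative.
using Microsoft.ML;
using Microsoft.ML.Data;

// One row of the tab-separated file shown above.
public class SentimentData
{
    [LoadColumn(0)] public bool Sentiment { get; set; }        // the Toxic flag (label)
    [LoadColumn(1)] public string SentimentText { get; set; }  // the comment text
}

// The shape of a binary classifier's output for one row.
public class SentimentPrediction
{
    [ColumnName("PredictedLabel")] public bool IsToxic { get; set; }
    public float Probability { get; set; }
    public float Score { get; set; }
}

public static class SentimentDataLoader
{
    // Load the headed, tab-separated data set into an IDataView.
    public static IDataView Load(MLContext mlContext, string path) =>
        mlContext.Data.LoadFromTextFile<SentimentData>(path, hasHeader: true, separatorChar: '\t');
}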
36.
37. Simple to build classifier
var pipeline = new TextTransform(env, "Text", "Features")
    .Append(new LinearClassificationTrainer(env,
        new LinearClassificationTrainer.Arguments(),
        "Features",
        "Label"));
var model = pipeline.Fit(trainingData);
38. Evaluation of Models
IDataView testData = GetData(env, _testDataPath);
var predictions = model.Transform(testData);
var binClassificationCtx = new BinaryClassificationContext(env);
var metrics = binClassificationCtx.Evaluate(predictions, "Label");
PredictionModel quality metrics evaluation
------------------------------------------
Accuracy: 94.44%
Auc: 98.77%
F1Score: 94.74%
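For completeness, scoring a single comment with a fitted model might look like the sketch below; again this assumes the 1.x PredictionEngine API and the illustrative SentimentData/SentimentPrediction classes sketched earlier, not the legacy 0.6 types shown above.
using System;
using Microsoft.ML;

public static class SentimentScorer
{
    // Score one comment with a fitted binary classification model (ML.NET 1.x style, illustrative).
    public static void ScoreOne(MLContext mlContext, ITransformer model)
    {
        var engine = mlContext.Model.CreatePredictionEngine<SentimentData, SentimentPrediction>(model);

        var sample = new SentimentData { SentimentText = "Dude, you are rude. Upload that picture back, or else." };
        SentimentPrediction prediction = engine.Predict(sample);

        Console.WriteLine($"Toxic: {prediction.IsToxic}, probability: {prediction.Probability:P1}");
    }
}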
42. The problem with flat files
• With no database or storage engine, data is written arbitrarily to disc
• Format errors
• Caused by bug
• Caused by error
• Inefficiencies
• Compression an afterthought
• GZIP splittable problem
43. More problems with flat files
• Access errors
• Mutability
• Variability between files in a fileset
• Naivety
• Just because brute force scanning is possible doesn’t mean it’s optimal
• Predicate Push-Down
• Partitioning
44. Parquet
• Apache Parquet is a file format, primarily driven from the Hadoop ecosystem and particularly loved by Spark
• Columnar format
• Block
• File
• Row Group
• Column Chunk
• Page
45. Designed for Data(!)
• Schemas consistent through file
• Held at the end so can be quickly seeked
• By Convention: WRITE ONCE
• Parallelism:
• Partition workload by File or Row Group
• Read data by column chunk
• Compress per Page
46.
47. Parquet.NET
• Until recently, no approach was available to .NET
• This led to System.IO being used to write arbitrary data, then requiring data engineering to sort the data out
• Libraries for C++ can be used
• An implementation by G-Research called ParquetSharp uses P/Invoke
• A full .NET Core implementation is Parquet.NET
• https://github.com/elastacloud/parquet-dotnet
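For a flavour of Parquet.NET in code, below is a minimal write/read sketch against the column-oriented Parquet.Net API (roughly the 3.x surface; method names have shifted between releases, so treat it as indicative rather than exact).
using System.IO;
using Parquet;
using Parquet.Data;

public static class ParquetExample
{
    public static void WriteAndRead(string path)
    {
        // Declare the schema: two columns in one row group.
        var idField = new DataField<int>("id");
        var cityField = new DataField<string>("city");
        var schema = new Schema(idField, cityField);

        // Write once, column by column (Parquet is a columnar, write-once format).
        using (Stream file = File.Create(path))
        using (var writer = new ParquetWriter(schema, file))
        using (ParquetRowGroupWriter rowGroup = writer.CreateRowGroup())
        {
            rowGroup.WriteColumn(new DataColumn(idField, new[] { 1, 2, 3 }));
            rowGroup.WriteColumn(new DataColumn(cityField, new[] { "London", "Derby", "Vilnius" }));
        }

        // Read back: the schema lives at the end of the file, so readers can seek to it quickly.
        using (Stream file = File.OpenRead(path))
        using (var reader = new ParquetReader(file))
        {
            DataColumn[] columns = reader.ReadEntireRowGroup();
            // columns[0].Data holds the ids, columns[1].Data holds the city names.
        }
    }
}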
48.
49. Parq!
• End to end tooling for .NET devs on a platform they’re familiar with.
• Uses dotnet global tooling
53. What is regression?
• Supervised Machine Learning
• Features go to a Label
• Label is a “real value” not a categorical like in classification
• Regression algorithms generate weights for features
54.
55. Taxi Fare Data
• NYC data
• Taxi Fares, Time, Distance, Payment Method etc
• Use these as features and predict the most likely Fare ahead of time.
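A hedged sketch of what such a fare-prediction pipeline can look like in current ML.NET follows, with invented column names standing in for the real NYC schema.
using Microsoft.ML;
using Microsoft.ML.Data;

// Illustrative columns only; the real NYC schema differs.
public class TaxiTrip
{
    [LoadColumn(0)] public string PaymentType { get; set; }
    [LoadColumn(1)] public float PassengerCount { get; set; }
    [LoadColumn(2)] public float TripDistance { get; set; }
    [LoadColumn(3)] public float FareAmount { get; set; }   // the label we predict
}

public static class TaxiFareTrainer
{
    public static ITransformer Train(MLContext mlContext, IDataView trainingData)
    {
        // Categorical feature -> one-hot, then concatenate everything into "Features".
        var pipeline = mlContext.Transforms.Categorical.OneHotEncoding("PaymentTypeEncoded", "PaymentType")
            .Append(mlContext.Transforms.Concatenate("Features",
                "PaymentTypeEncoded", "PassengerCount", "TripDistance"))
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "FareAmount"));

        return pipeline.Fit(trainingData);
    }
}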
64. What does it tell us?
• In a multiple-regression model to determine solar energy production:
• The energy production is the dependent variable (Y)
• The cloud cover level is an independent variable (X1)
• The season of year is an independent variable (X2)
• Y = X1*weightX1 + X2*weightX2
• It’s a coefficient (ratio) of how good my predictions are versus the amount of variability in my dependent variable.
65. How does it do that?
• It measures the relationship between two variables:
• Y-hat : ŷ
• The estimator is based on regression variables, Intercept, X Variable 1 to X Variable n
• The distance from this estimator (prediction) and the real y value (y- ŷ)
• Squared
• Y-bar : ȳ
• The average value of all Ys
• The distance of the real y value from the mean of all y values (y-ȳ), which is how much the
data varies from average
• Squared
• These two squared quantities are summed across the data set, and the result calculated:
• 1 - (Σ(y-ŷ)^2 / Σ(y-ȳ)^2)
• i.e. 1 - (estimation error / actual distance from average)
66. Y is our actual value: energy generation. X1 is our first input value we want to predict from, and X2 our second: Irradiance Percentage and Days left until service.
The data science team works hard to understand the data and model it, which means producing weights for how much each input variable affects the actual value.
If you were to plot out these values as Intercept + X1 * X Variable 1 (weightX1) + X2 * X Variable 2 (weightX2), you would get a straight line like the red one shown on the slide.
67. To calculate a prediction (called y-hat) we add the intercept to the variable
values multiplied by their modifiers (coefficient)
68. If we take away the prediction (estimator) from the actual value we have the residual, or the size of the error. If the actual value is higher than the prediction, the residual is positive; if it is lower, the residual is negative.
You might sometimes hear data scientists talking about residuals; these are the residuals.
It is simply the actual value minus the prediction.
69. We can also measure the difference between the observed actual value and the average (mean) of the whole set of actuals.
This gives us a distance measure of variance: how far the actual varies from the average value of the actuals. This is a measure of the variance of the data.
If the number is bigger than the average, it is positive; if it is smaller than the average, it is negative.
70. Both of the measures we just created sum to 0 across the data set, because some values are bigger and some smaller than the actuals, so when they are added up they cancel out to 0.
This is correct, but we are trying to compare the sizes of the errors, not whether they are above or below the actual, so we need to lose the sign and make everything positive.
The way we do that is by multiplying each number by itself (squaring it), since -1 * -1 = 1, -2 * -2 = 4, and so on.
Adding all of these up across the set gives us the sums of squares.
71. We then divide the sum of squares of our estimator error by the sum of squares of our distance from average. The estimator errors are always lower than the variance of the data, and we subtract the result from 1 to give us a value like 0.80. We shouldn't describe this as "80% accurate", but you can think of it along these lines: 80% of the variance is explained by the model.
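As a plain C# sketch of the calculation just described, using small hypothetical arrays of actuals and predictions (not data from the deck):

using System;
using System.Linq;

double[] y    = { 900, 1100, 1250, 980, 1320 };   // actual values (hypothetical)
double[] yHat = { 950, 1050, 1200, 1000, 1300 };  // predictions (hypothetical)

double yBar  = y.Average();                                      // ȳ, the mean of the actuals
double ssRes = y.Zip(yHat, (a, p) => Math.Pow(a - p, 2)).Sum();  // Σ(y - ŷ)^2, estimator error
double ssTot = y.Select(a => Math.Pow(a - yBar, 2)).Sum();       // Σ(y - ȳ)^2, distance from average

double rSquared = 1 - ssRes / ssTot;  // the share of variance explained by the model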
72. Root Mean Squared Error
RMSE is the average of how far out we are on an estimation by estimation basis
76. What does it tell us?
• By comparing the average distance between the estimator ŷ and the
observed actual, we can get a measure of how close in real world
terms we are on average to the actual.
• Unlike R², which gives an abstract view of variance, RMSE gives the
bounds to the accuracy of our prediction.
• On average, the real value is +/- RMSE from the estimator.
77. How do we calculate it?
• It’s very easy and we’ve already done the hard work.
• The MSE part means average of the squared distance of the estimator
from the actual
• =AVERAGE(data[(y-ŷ)^2])
• Since the values were squared, this number is still large; square-rooting it gives us a real-world value (sketched in C# below).
• =SQRT(AVERAGE(data[(y-ŷ)^2]))
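The same calculation as a C# sketch, mirroring the spreadsheet formulas above (the arrays are the same hypothetical values used in the R² sketch earlier):

using System;
using System.Linq;

double[] y    = { 900, 1100, 1250, 980, 1320 };   // actual values (hypothetical)
double[] yHat = { 950, 1050, 1200, 1000, 1300 };  // predictions (hypothetical)

// MSE: the average of the squared estimator errors -> AVERAGE((y - ŷ)^2)
double mse = y.Zip(yHat, (a, p) => Math.Pow(a - p, 2)).Average();

// RMSE: square-root back to a real-world value     -> SQRT(AVERAGE((y - ŷ)^2))
double rmse = Math.Sqrt(mse);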
78. Take the average of the squared estimator distance from actual. Squaring it earlier was useful, as the non-squared values average out to zero!
79. Since the MSE is still in the order of magnitude of the squares, square-root it to give a real-world value; this is the Root Mean Square Error (RMSE).
In this example, the estimate is on average 147.673 away from the actual value.
80. This model can therefore be described as:
being able to describe 80% of variance
being on average 147.6 away from the
actual
82. What is Clustering
• Used to generate groups or clusters of similarities
• Identify relationships that are not evident to humans
• Examples may be:
• Who is a VIP customer?
• Who is cheating in a game?
• How many people are likely to drop out of class?
• Which components in manufacturing are likely to fail?
84. Supervised Learning
Definition
- You give the input data (X)
and an output variable (Y)
(labels), and you use an
algorithm to learn the
mapping function from
the input to the output.
Y = f(X)
Techniques
- Classification: you want to
classify a new input value.
- Regression
85. Unsupervised Learning
Definition
- You give the input data (X)
and no corresponding
output variables (labels).
Techniques
- Clustering: you want to
discover the inherent
groupings in the data.
- Association: you want to
discover rules that
describe large portions of
your data.
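A minimal clustering sketch in ML.NET, tying the clustering technique above back to code; it assumes the MLContext-based API, and the CustomerActivity shape, column names and values are illustrative assumptions, not taken from the deck.

using System.Collections.Generic;
using Microsoft.ML;

var mlContext = new MLContext();

// A handful of in-memory rows so the sketch is self-contained (values are made up)
var customers = new List<CustomerActivity>
{
    new CustomerActivity { Visits = 3,  Spend = 20f,  Tenure = 1 },
    new CustomerActivity { Visits = 40, Spend = 900f, Tenure = 5 },
    new CustomerActivity { Visits = 15, Spend = 150f, Tenure = 2 },
};
IDataView data = mlContext.Data.LoadFromEnumerable(customers);

// Concatenate the numeric columns into a Features vector and group into 3 clusters
var pipeline = mlContext.Transforms.Concatenate("Features", "Visits", "Spend", "Tenure")
    .Append(mlContext.Clustering.Trainers.KMeans("Features", numberOfClusters: 3));

var model = pipeline.Fit(data);

// Each row is assigned a PredictedLabel (cluster id) plus distances to each centroid
var clustered = model.Transform(data);

// Hypothetical shape of the behavioural data (not from the deck)
public class CustomerActivity
{
    public float Visits { get; set; }
    public float Spend { get; set; }
    public float Tenure { get; set; }
}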
92. TensorFlow since 0.5.0
• Additional work for:
• CNTK
• Torch
• This means ML.NET can be the open-source (OSS), cross-platform (XPLAT) host for many data science frameworks, anywhere that .NET Core runs.
• On devices with Xamarin, on edge devices, on web servers (a loading sketch follows below).
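A rough sketch of hosting a TensorFlow model from ML.NET (requires the Microsoft.ML.TensorFlow package); the model path and tensor names are placeholders for illustration, not values from the deck.

using Microsoft.ML;

var mlContext = new MLContext();

// Load a frozen TensorFlow model and score it as a stage in an ML.NET pipeline.
// "frozen_model.pb", "input" and "softmax" are placeholder names.
var pipeline = mlContext.Model.LoadTensorFlowModel("frozen_model.pb")
    .ScoreTensorFlowModel(
        new[] { "softmax" },   // output tensor(s)
        new[] { "input" },     // input tensor(s)
        addBatchDimensionInput: true);

// inputData would be an IDataView whose "input" column matches the model's input tensor
// var model = pipeline.Fit(inputData);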
95. Unified Cloud Data Environment
• Input: files and streams (Excel, XML and other formats)
• Landing: a true copy of the input data, used to trace lineage and rebuild data
• Staging: validated once and for all, optimised for processing (stored as Parquet)
• Enriched: results of machine learning experiments (stored as Parquet)
• Operationalised: any combination of modern data platforms to support visualisations and operational systems
• Built with Parquet.NET and ML.NET
97. The questions to answer
• Opportunity discovery
• What can we do?
• Adoption
• How do I get my company to use this?
• Sustainability
• How do we control costs and/or build a viable price point?
• Measurement
• How do we know it’s generating value?
Build
Use
Profit
98. The end goals
• Providing better customer service and experience
• Better products
• Self-healing and self-improving
• Improving operational processes and resource consumption
• Faster Time to Market
• Lighter Bill of Materials
• Stronger Supply Chain
• New business models
• Insights from data
• Different pricing models
100. Mental Models
In order to envision the transformation possibilities, adopt the
following mental models.
AI is an extremely focussed, entry-level clerical assistant
AI is exceptionally reactive
AI is auditable, repeatable, improvable
AI is resistant to external bias
101. One Million Clerical Workers
What is achievable with a million trained office
workers?
102. Billy the Kid Reaction Times
What if all workers can react and form a course of action in 1 millisecond?
103. Auditable Actions
What if all actions and their impact were
measurable and qualified?
Great companies have high cultures of accountability, it
comes with this culture of criticism […], and I think our
culture is strong on that.
Steve Ballmer
104. Cement the views of the Founders
What if the decision making process of the hyper-successful
founders was cemented and eternally referenceable? Resistant
to the distractions of hype and false promises.
105. All tools are Smart Tools
What if every tool used by a business was as smart as a person? A forklift could tell you the contents of a crate were the wrong weight. A machine could tell you it was feeling unwell. A room could tell you someone had left a dangerous tool in an unsafe state.
107. JIT Service Regimes
• Migrate from manufacturer led maintenance cycles to just-in-time
• Use data to establish mean time to failure
• Conditional maintenance
108. AzureML
• Telemetry roughly every minute through a dedicated cloud gateway, feeding reporting and alarms
• Machine Learning used in real time with "Anomaly Detection", which enables history to be assessed and false-positive spikes in behaviour to be discarded
• Model retraining can occur to help understand acceptable lowering of efficiency as equipment degrades over time, within acceptable parameters
• Predictions through AI become an integral part of business reporting
109. Supplier efficiency
• Use data to judge the reliability of assets
• Track fault recurrence and resolution speeds
• Contract with SLAs around reducing outages
111. JIT Warehousing
• Regarding Parts
• JIT Maintenance leads to JIT Warehousing
• Regarding Fuel
• Integrated Data streams including ERP and CRM
• Manage the supply chain and keep the right stock level
112. G4S Data Science Programme
• G4S plc is a British multinational security services company and operates an
integrated security business in more than 90 countries across the globe.
• They aim to differentiate G4S by providing industry leading security solutions that
are innovative, reliable and efficient.
• Aim: Predict anomalous card activity in monitored buildings
114. Data Science Conceptual Model – CRISP-DM Methodology
(Diagram: the data science team cycles through Business Understanding, Data Understanding, Data Preparation, Modelling and Evaluation, and Deployment around the Business Objective.)
Business Understanding
• IDENTIFYING YOUR BUSINESS GOALS
• A problem that your management wants to address
• The business goals
• Constraints (limitations on what you may do, the kinds of solutions that can be
used, when the work must be completed, and so on)
• Impact (how the problem and possible solutions fit in with the business)
• ASSESSING YOUR SITUATION
• Inventory of resources: A list of all resources available for the project.
• Requirements, assumptions, and constraints:
• Risks and contingencies:
• Terminology
• Costs and benefits:
• DEFINING YOUR DATA-MINING GOALS
• Data-mining goals: Define data-mining deliverables, such as models, reports,
presentations, and processed datasets.
• Data-mining success criteria: Define the data-mining technical criteria necessary
to support the business success criteria. Try to define these in quantitative terms
(such as model accuracy or predictive improvement compared to an existing
method).
• PRODUCING YOUR PROJECT PLAN
• Project plan: Outline your step-by-step action plan for the project. (for example,
modelling and evaluation usually call for several back-and-forth repetitions).
• Initial assessment of tools and techniques
115. Data Science Conceptual Model – CRISP-DM Methodology
Data Understanding
• GATHERING DATA
• Outline data requirements: Create a list of the types of data necessary to address
the data mining goals. Expand the list with details such as the required time
range and data formats.
• Verify data availability: Confirm that the required data exists, and that you can
use it.
• Define selection criteria: Identify the specific data sources (databases, files,
documents, and so on.)
• DESCRIBING DATA
• Now that you have data, prepare a general description of what you have.
• EXPLORING DATA
• Get familiar with the data.
• Spot signs of data quality problems.
• Set the stage for data preparation steps.
• VERIFYING DATA QUALITY
• The data you need doesn’t exist. (Did it never exist, or was it discarded? Can this
data be collected and saved for future use?)
• It exists, but you can’t have it. (Can this restriction be overcome?)
• You find severe data quality issues (lots of missing or incorrect values that can’t
be corrected).
116. Data Science Conceptual Model – CRISP-DM Methodology
Data Preparation
• SELECTING DATA
• Now you will decide which portion of the data that you have is actually going to
be used for data mining.
• The deliverable for this task is the rationale for inclusion and exclusion. In it,
you’ll explain what data will, and will not, be used for further data-mining work.
• You’ll explain the reasons for including or excluding each part of the data that
you have, based on relevance to your goals, data quality, and technical issues
• CLEANING DATA
• The data that you’ve chosen to use is unlikely to be perfectly clean (error-free).
• You’ll make changes, perhaps tracking down sources to make specific data
corrections, excluding some cases or individual cells (items of data), or replacing
some items of data with default values or replacements selected by a more
sophisticated modelling technique.
• CONSTRUCTING DATA
• You may need to derive some new fields (for example, use the delivery date and
the date when a customer placed an order to calculate how long the customer
waited to receive an order), aggregate data, or otherwise create a new form of
data.
• INTEGRATING DATA
• Your data may now be in several disparate datasets. You’ll need to merge some
or all of those disparate datasets together to get ready for the modelling phase.
• FORMATTING DATA
• Data often comes to you in formats other than the ones that are most
convenient for modelling. (Format changes are usually driven by the design of
your tools.) So convert those formats now.
117. Data Science Conceptual Model – CRISP-DM Methodology
Modelling and Evaluation (Modelling)
• SELECTING MODELING TECHNIQUES
• Modelling technique: Specify the technique(s) that you will use.
• Modelling assumptions: Many modelling techniques are based on certain
assumptions.
• DESIGNING TESTS
• The test in this task is the test that you’ll use to determine how well your model
works. It may be as simple as splitting your data into a group of cases for model
training and another group for model testing.
• Training data is used to fit mathematical forms to the data model, and test data
is used during the model-training process to avoid overfitting
• BUILDING MODEL(S)
• Parameter settings: When building models, most tools give you the option of
adjusting a variety of settings, and these settings have an impact on the structure
of the final model. Document these settings in a report.
• Model descriptions: Describe your models. State the type of model (such as
linear regression or neural network) and the variables used.
• Models: This deliverable is the models themselves. Some model types can be
easily defined with a simple equation; others are far too complex and must be
transmitted in a more sophisticated format.
• ASSESSING MODEL(S)
• Model assessment: Summarizes the information developed in your model
review. If you have created several models, you may rank them based on your
assessment of their value for a specific application.
• Revised parameter settings: You may choose to fine-tune settings that were used
to build the model and conduct another round of modelling and try to improve
your results.
118. Data Science Conceptual Model – CRISP-DM Methodology
Modelling and Evaluation (cont.): Evaluation
• EVALUATING RESULTS
• Assessment of results (for business goals): Summarize the results with respect to
the business success criteria that you established in the business-understanding
phase. Explicitly state whether you have reached the business goals defined at
the start of the project.
• Approved models: These include any models that meet the business success
criteria.
• REVIEWING THE PROCESS
• Now that you have explored data and developed models, take time to review
your process. This is an opportunity to spot issues that you might have
overlooked and that might draw your attention to flaws in the work that you’ve
done while you still have time to correct the problem before deployment. Also
consider ways that you might improve your process for future projects.
• DETERMINING THE NEXT STEPS
• List of possible actions: Describe each alternative action, along with the
strongest reasons for and against it.
• Decision: State the final decision on each possible action, along with the
reasoning behind the decision.
119. Data Science Conceptual Model – CRISP-DM Methodology
Deployment
• PLANNING DEPLOYMENT
• When your model is ready to use, you will need a strategy for putting it to work
in your business.
• PLANNING MONITORING AND MAINTENANCE
• Data-mining work is a cycle, so expect to stay actively involved with your models
as they are integrated into everyday use.
• REPORTING FINAL RESULTS
• Final report: The final report summarizes the entire project by assembling all the
reports created up to this point, and adding an overview summarizing the entire
project and its results.
• Final presentation: A summary of the final report is presented in a meeting with
management. This is also an opportunity to address any open questions.
• REVIEW PROJECT
• Finally, the data-mining team meets to discuss what worked and what didn’t,
what would be good to do again, and what should be avoided!
120. Data Science Continuous Integration and Improvement Cycle
The CRISP-DM loop repeats: starting from Business Understanding and a Business Objective, the data science team moves through Data Understanding, Data Preparation, Modelling and Evaluation, and Deployment; each deployment then sets a new objective, and the team runs the cycle again.
Editor's Notes
ANIMATED SLIDE – shows progression of entire .NET platform into .NET Core specific workloads.
Recap – assume no one knows anything about netcore
.NET Core is our cross-platform, open source implementation of .NET and is perfectly suited for requirements of cloud-native, cross-platform services. We’ve made significant investments in the core performance as well as the web stack so that you can easily take advantage of cloud patterns and scale.
.NET Core 3 will expand on the supported workloads to include IoT, AI and Windows Desktop.
.NET Framework will continue with highly-compatible targeted improvements like the last few releases and support will remain unchanged. You will be able to retarget existing apps to 4.8 and gain these improvements.
When you’re ready to move to .NET Core 3 (to take advantage of SxS install, interop between desktop app frameworks, and .NET Core performance) you can retarget and rebuild your apps to .NET Core.
Notebooks have a primary language: R, Python & Scala
Use magic to change language
Supports SQL, R, Python & Scala
Create table from dataframe in any language to share between spark contexts
Easy to pair to translate
Treat notebooks as pseudo-OO: call out to functions or values in other notebooks, which can be combined with if/switch logic, etc.
Chain notebooks passing data frames and resources from one to another
Schedule notebooks
Plots are made easy with a one-line display(dataframe)
GUI driven experience between plots with drag and drop
Use SQL, R, Python & Scala as primary language
Support for popular visual libraries including GGPlot & Matplotlib
Pip, R & JAR packages to add additional visuals
Interactive Visuals i.e. Plotly
Create a Dashboard of pinned visuals