DataMass Summit - Machine Learning for Big Data in SQL Server (Łukasz Grala)
A session covering Machine Learning Server (machine learning algorithms in R and Python), as well as working with JSON data in SQL Server and connecting to data stored in HDFS, Hadoop, or Spark through PolyBase in SQL Server, so that the data can be used for analysis and prediction with models written in R or Python.
A session on sentiment analysis and on the machine learning algorithms in Microsoft's libraries for R. The session was presented at the WhyR? conference in Warsaw.
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b... (Spark Summit)
Building a machine learning model is an iterative process. A data scientist will build many tens to hundreds of models before arriving at one that meets some acceptance criteria. However, the current style of model building is ad-hoc and there is no practical way for a data scientist to manage models that are built over time. In addition, there are no means to run complex queries on models and related data.
In this talk, we present ModelDB, a novel end-to-end system for managing machine learning (ML) models. Using client libraries, ModelDB automatically tracks and versions ML models in their native environments (e.g. spark.ml, scikit-learn). A common set of abstractions enables ModelDB to capture models and pipelines built across different languages and environments. The structured representation of models and metadata then provides a platform for users to issue complex queries across various modeling artifacts. Our rich web frontend provides a way to query ModelDB at varying levels of granularity.
ModelDB has been open-sourced at https://github.com/mitdbg/modeldb.
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ... (Databricks)
A long time ago, there was Caffe and Theano, then came Torch and CNTK and Tensorflow, Keras and MXNet and Pytorch and Caffe2… a sea of deep learning tools, but none for Spark developers to dip into. Finally, there was BigDL, a deep learning library for Apache Spark. While BigDL is integrated into Spark and extends its capabilities to address the challenges of big data developers, will a library alone be enough to simplify and accelerate the deployment of ML/DL workloads on production clusters? From high-level pipeline API support to feature transformers to pre-defined models and reference use cases, a rich repository of easy-to-use tools is now available with the ‘Analytics Zoo’. We’ll unpack the production challenges and opportunities with ML/DL on Spark and what the Zoo can do.
Superworkflow of Graph Neural Networks with K8S and Fugue (Databricks)
When machine learning models are productionized, they are commonly formed as workflows with multiple tasks, managed by a task scheduler such as Airflow or Prefect. Traditionally, each task within the same workflow uses similar computing frameworks (e.g. Python, Spark, and PyTorch) in the same backend computing environment (e.g. AWS EMR, Google Dataproc) with globally fixed settings (e.g. instances, cores, memory).
In complicated use cases, such traditional workflows create large resource and runtime inefficiencies, so it is highly desirable to use different computing frameworks in the same workflow in different computing environments. Such workflows can be called superworkflows. Fugue is an open-source abstraction layer on top of different computing frameworks that creates uniform interfaces for using these frameworks without dealing with the complexities associated with them. In this sense, Fugue can be viewed as a superframework.
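As a rough sketch of that pattern (assuming Fugue's documented transform() entry point; the function and data below are illustrative, not from the talk):

```python
# Minimal sketch of Fugue's framework-agnostic pattern: the same plain-pandas
# function can run locally or on Spark just by switching the engine.
import pandas as pd
from fugue import transform

def add_name_length(df: pd.DataFrame) -> pd.DataFrame:
    # Ordinary pandas logic; Fugue adapts it to the chosen backend.
    df["name_len"] = df["name"].str.len()
    return df

data = pd.DataFrame({"name": ["alice", "bob"]})

# Local (pandas) execution:
local_result = transform(data, add_name_length, schema="*,name_len:int")

# Distributed execution on Spark (assuming a live SparkSession `spark`):
# spark_result = transform(data, add_name_length, schema="*,name_len:int", engine=spark)
```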
In addition, Kubernetes (K8S) is a container orchestration system, and it is easy to create different computing environments (e.g. Spark, PyTorch) with different docker images as everything is containerized in K8S. It is natural to combine K8S and Fugue to create superworkflows for complicated machine learning problems. In this talk, we use a popular graph neural network named Node2Vec as an example to illustrate how to create an efficient superworkflow using Fugue and K8S on very large graphs with hundreds of millions of vertices and edges.
We also demonstrate how to partition the whole Node2Vec process into multiple tasks based on their complexities and parallelism. Benchmark testing is conducted for comparing performance and resource efficiency. Finally, it is easy to generalize this superworkflow concept to other deep learning problems.
Tech-Talk at Bay Area Spark Meetup
Apache Spark™ has rapidly become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications? Like all things in engineering, it depends.
In this meetup, we will discuss best practices from Databricks on how our customers productionize machine learning models and do a deep dive with actual customer case studies and live demos of a few example architectures and code in Python and Scala. We will also briefly touch on what is coming in Apache Spark 2.X with model serialization and scoring options.
Apache Spark's MLlib's Past Trajectory and New Directions (Databricks)
This talk discusses the trajectory of MLlib, the Machine Learning (ML) library for Apache Spark. We will review the history of the project, including major trends and efforts leading up to today. These discussions will provide perspective as we delve into ongoing and future efforts within the community. This talk is geared towards both practitioners and developers and will provide a deeper understanding of priorities, directions and plans for MLlib.
Since the original MLlib project was merged into Apache Spark, some of the most significant efforts have been in expanding algorithmic coverage, adding multiple language APIs, supporting ML Pipelines, improving DataFrame integration, and providing model persistence. At an even higher level, the project has evolved from building a standard ML library to supporting complex workflows and production requirements.
This momentum continues. We will discuss some of the major ongoing and future efforts in Apache Spark based on discussions, planning and development amongst the MLlib community. We (the community) aim to provide pluggable and extensible APIs usable by both practitioners and ML library developers. To take advantage of Projects Tungsten and Catalyst, we are exploring DataFrame-based implementations of ML algorithms for better scaling and performance. Finally, we are making continuous improvements to core algorithms in performance, functionality, and robustness. We will augment this discussion with statistics from project activity.
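As an illustrative sketch (not from the talk) of the DataFrame-based Pipelines API and the model persistence mentioned above:

```python
# Illustrative pyspark.ml Pipeline with model persistence (path is a placeholder).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [(1.0, 0.0, 0.0), (0.0, 1.0, 1.0), (1.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    ["f1", "f2", "label"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),  # DataFrame in, DataFrame out
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train)

# Model persistence, one of the efforts highlighted above:
model.write().overwrite().save("/tmp/mllib-pipeline-model")
reloaded = PipelineModel.load("/tmp/mllib-pipeline-model")
```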
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec... (Databricks)
I will share the vision and the production journey of how we build enterprise shared AI As A Service platforms with distributed deep learning technologies, covering these topics:
1) The vision of enterprise shared AI As A Service and typical AI service use cases in the FinTech industry
2) The high-level architecture design principles for AI As A Service
3) The technical evaluation journey to choose an enterprise deep learning framework, with comparisons, such as why we chose a deep learning framework based on the Spark ecosystem
4) Some production AI use cases, such as how we implemented new user-item propensity models with deep learning algorithms on Spark to improve the quality, performance, and accuracy of offer and campaign design, offer targeting, matching, and linking, etc.
5) Some experiences and tips for using deep learning technologies on top of Spark, such as how we brought Intel BigDL into real production.
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas (Databricks)
Does more data always improve ML models? Is it better to use distributed ML instead of single node ML?
In this talk I will show that while more data often improves DL models in high-variance problem spaces (with semi-structured or unstructured data) such as NLP, image, and video, more data does not significantly improve results in high-bias problem spaces where traditional ML is more appropriate. Additionally, even in the deep learning domain, single-node models can still outperform distributed models via transfer learning.
Data scientists have pain points around running many models in parallel, automating the experimental setup, and getting others (especially analysts) within an organization to use their models. Databricks addresses these problems using pandas UDFs, the ML Runtime, and MLflow.
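As a rough illustration of the per-group training pattern the abstract alludes to, using Spark's grouped pandas UDFs (column and group names here are hypothetical):

```python
# Train one scikit-learn model per group in parallel with a grouped pandas UDF.
import pandas as pd
from sklearn.linear_model import LinearRegression

def train_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group's rows arrive as a single pandas DataFrame on an executor.
    model = LinearRegression().fit(pdf[["x"]], pdf["y"])
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "coef": [float(model.coef_[0])],
                         "intercept": [float(model.intercept_)]})

# `df` is assumed to be a Spark DataFrame with columns group, x, y.
results = df.groupBy("group").applyInPandas(
    train_per_group, schema="group string, coef double, intercept double")
```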
Spark Summit 2017 - Transforming B2B Sales with Spark-Powered Sales Intelligence (Wei Di)
B2B sales intelligence has become an integral part of LinkedIn’s business to help companies optimize resource allocation and design effective sales and marketing strategies. This new trend of data-driven approaches has “sparked” a new wave of AI and ML needs in companies large and small. Given the tremendous complexity that arises from the multitude of business needs across different verticals and product lines, Apache Spark, with its rich machine learning libraries, scalable data processing engine and developer-friendly APIs, has been proven to be a great fit for delivering such intelligence at scale.
See how LinkedIn is utilizing Spark to build sales intelligence products. This session will introduce a comprehensive B2B intelligence system built on top of various open source stacks. The system puts advanced data science to work in a dynamic and complex scenario, in an easily controllable and interpretable way. Balancing flexibility and complexity, the system can deal with various problems in a unified manner and yield actionable insights to empower successful business. You will also learn about some impactful Spark ML-powered applications such as prospect prediction and prioritization, churn prediction, and model interpretation, as well as challenges and lessons learned at LinkedIn while building such a platform.
Simplify and Scale Data Engineering Pipelines with Delta Lake (Databricks)
We’re always told to ‘Go for the Gold!’, but how do we get there? This talk will walk you through the process of moving your data to the finish line to get that gold medal! A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (‘Bronze’ tables), transformation/feature engineering (‘Silver’ tables), and machine learning training or prediction (‘Gold’ tables). Combined, we refer to these tables as a ‘multi-hop’ architecture. It allows data engineers to build a pipeline that begins with raw data as a ‘single source of truth’ from which everything flows. In this session, we will show how to build a scalable data engineering pipeline using Delta Lake, so you can be the champion in your organization.
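A minimal sketch of that Bronze/Silver/Gold pattern in PySpark with Delta Lake (paths, columns, and transformations are illustrative, not from the talk):

```python
# Bronze -> Silver -> Gold multi-hop pipeline sketch (assumes a SparkSession
# `spark` with Delta Lake configured).
from pyspark.sql import functions as F

# Bronze: raw ingestion, kept as-is as the single source of truth.
raw = spark.read.json("/data/raw/events/")
raw.write.format("delta").mode("append").save("/delta/bronze/events")

# Silver: cleaned and conformed.
bronze = spark.read.format("delta").load("/delta/bronze/events")
silver = (bronze
          .dropDuplicates(["event_id"])
          .withColumn("event_date", F.to_date("timestamp")))
silver.write.format("delta").mode("overwrite").save("/delta/silver/events")

# Gold: aggregated, ready for ML training or BI.
gold = silver.groupBy("event_date").agg(F.count("*").alias("events"))
gold.write.format("delta").mode("overwrite").save("/delta/gold/daily_counts")
```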
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal... (Databricks)
As data grows in size and connectedness dramatically in all dimensions, the potential for graph-enriched machine learning grows likewise, but scalable technologies are needed to both build models and apply them in real-time. Real-time deep-link graph pattern matching and analytics provides new opportunities for enriching your machine learning models with graph features.
In addition to the real-time deep-link aspect, the ability to process large datasets in a production pipeline creates a synergistic approach for two distributed, performant platforms: Spark and TigerGraph. The TigerGraph graph database provides scalable real-time deep-link graph analytics and augments Spark with graph analytics and predictions for a wide range of machine learning use cases.
In this session, we will explain the architecture and technical implementation for a TigerGraph+Spark graph-enhanced Machine Learning pipeline: Use TigerGraph both before training to extract (graph and non-graph) features and after training to apply the model on streaming data; use Spark to train and tune machine learning models at scale. As an example, we will present a solution in production at China Mobile that detects and prevents phone-based scams using machine learning with TigerGraph.
Specifically, the solution generates 118 graph features for 600 million users, to feed a machine learning system which detects three types of unwanted phone calls. TigerGraph then helps to deploy the model by extracting these 118 features in real-time for up to 10,000 calls per second, to give customers a real-time diagnosis of their incoming calls.
Radical Speed for SQL Queries on Databricks: Photon Under the Hood (Databricks)
Join this session to hear the Photon product and engineering team talk about the latest developments with the project.
As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly ingest and analyze massive amounts and types of data. With their data lakes, organizations can store all their data assets in cheap cloud object storage. But data lakes alone lack robust data management and governance capabilities. Fortunately, Delta Lake brings ACID transactions to your data lakes – making them more reliable while retaining the open access and low storage cost you are used to.
Using Delta Lake as its foundation, the Databricks Lakehouse platform delivers a simplified and performant experience with first-class support for all your workloads, including SQL, data engineering, data science & machine learning. With a broad set of enhancements in data access and filtering, query optimization and scheduling, as well as query execution, the Lakehouse achieves state-of-the-art performance to meet the increasing demands of data applications. In this session, we will dive into Photon, a key component responsible for efficient query execution.
Photon was first introduced at Spark and AI Summit 2020 and is written from the ground up in C++ to take advantage of modern hardware. It uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is fully compatible with the Apache Spark™ DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. Come join us to learn more about how Photon can radically speed up your queries on Databricks.
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts... (Databricks)
Netflix is the world’s largest streaming service, with over 80 million members worldwide. Machine learning algorithms are used to recommend relevant titles to users based on their tastes.
At Netflix, we use Apache Spark to power our recommendation pipeline. Stages in the pipeline, such as label generation, data retrieval, feature generation, training, and validation, are based on the Spark ML PipelineStage framework. While this provides developers the flexibility to develop individual components as encapsulated pipeline stages, we find that coordination across stages can potentially provide significant performance gains.
In this talk, we discuss how our Spark-based machine learning pipeline has been improved over the years. Techniques such as predicate pushdown and wide-transformation minimization have led to significant runtime improvements and resource savings.
Tensors Are All You Need: Faster Inference with Hummingbird (Databricks)
The ever-increasing interest in deep learning and neural networks has led to a vast increase in processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g. GPUs). Traditional machine learning (ML) models, such as linear regressions and decision trees in scikit-learn, cannot currently be run on GPUs, missing out on the potential accelerations that deep learning and neural networks enjoy.
In this talk, we’ll show how you can use Hummingbird to achieve a 1000x speedup in inference on GPUs by converting your traditional ML models to tensor-based models (PyTorch and TVM). https://github.com/microsoft/hummingbird
This talk is for intermediate audiences that use traditional machine learning and want to speed up the time it takes to perform inference with these models. After watching the talk, the audience should be able to use ~5 lines of code to convert their traditional models to tensor-based models and try them out on GPUs.
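A hedged sketch of that "~5 lines of code" flow, using Hummingbird's documented convert() entry point (the model and data are placeholders; GPU use assumes a CUDA-enabled PyTorch install):

```python
# Convert a trained scikit-learn model to a tensor-based model and run it on GPU.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert

X = np.random.rand(1000, 20).astype(np.float32)
y = np.random.randint(2, size=1000)
skl_model = RandomForestClassifier(n_estimators=100).fit(X, y)

hb_model = convert(skl_model, "pytorch")  # tensor-based equivalent of the tree ensemble
hb_model.to("cuda")                       # move the converted model to the GPU
preds = hb_model.predict(X)               # same predict() interface as scikit-learn
```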
Outline:
Introduction of what ML inference is (and why it’s different than training)
Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
Why “traditional” ML methods are important
Introduction of what Hummingbird does and its main benefits
Deep dive on how traditional ML models are built
Brief intro on how the Hummingbird converter works
Example of how Hummingbird can convert a tree model into a tensor-based model
Other models
Demo
Status
Q&A
NLP Text Recommendation System Journey to Automated Training (Databricks)
This talk will cover how we built and productionized automated machine learning pipelines at Salesforce, moving from heuristics to automated retraining, using technologies including but not limited to Scala, Python, Apache Spark, Docker, and SageMaker for training and serving. We will walk through the generally applicable data prep, feature engineering, training, evaluation/comparisons, and continuous model training, including data feedback loops, in containerized environments with SageMaker. We will talk about our deployment and validation approach. Finally, we’ll draw lessons from iteratively building an enterprise ML product. Attendees will learn about the mental models for building end-to-end production ML pipelines and GA-ready products.
Distributed machine learning 101 using Apache Spark from a browser devoxx.b... (Andy Petrella)
A 3-hour session introducing the concepts of machine learning and distributed computing.
It includes many examples, run in notebooks on real data, exploring models such as linear models (LM), random forests (RF), k-means, and deep learning.
Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim Hu... (Databricks)
Deep Learning is now the standard in object detection, but it is not easy to analyze large amounts of images, especially in an interactive fashion. Traditionally, there has been a gap between Deep Learning frameworks, which excel at image processing, and more traditional ETL and data science tools, which are usually not designed to handle huge batches of complex data types such as images.
In this talk, we show how manipulating large corpora of images can be accomplished in a few lines of code because of recent developments in Apache Spark. Thanks to Spark’s unique ability to blend different libraries, we show how to start from satellite images and rapidly build complex queries on high level information such as houses or buildings. This is possible thanks to Magellan, a geospatial package, and Deep Learning Pipelines, a library that streamlines the integration of Deep Learning frameworks in Spark. At the end of this session, you will walk away with the confidence that you can solve your own image detection problems at any scale thanks to the power of Spark.
Artificial Intelligence and Deep Learning in Azure, CNTK and Tensorflow (Jen Stirrup)
Artificial Intelligence and Deep Learning in Azure, using Open Source technologies CNTK and Tensorflow. The tutorial can be found on GitHub here: https://github.com/Microsoft/CNTK/tree/master/Tutorials
and the CNTK video can be found here: https://youtu.be/qgwaP43ZIwA
Novi Sad AI is the first AI community in Serbia, with the goal of democratizing knowledge of AI. At our first event we talked about belief networks, deep learning, and much more.
This talk was presented at Startup Master Class 2017 - http://aaiitkblr.org/smc/ 2017 at Christ College Bangalore, hosted by the IIT Kanpur Alumni Association and co-presented by the IIT KGP Alumni Association, IITACB, PanIIT, and IIMA and IIMB alumni.
My co-presenter was Biswa Gourav Singh, and Navin Manaswi was a contributor.
http://dataconomy.com/2017/04/history-neural-networks/ - timeline for neural networks
Separating Hype from Reality in Deep Learning with Sameer Farooqui (Databricks)
Deep Learning is all the rage these days, but where does the reality of what Deep Learning can do end and the media hype begin? In this talk, I will dispel common myths about Deep Learning that are not necessarily true and help you decide whether you should practically use Deep Learning in your software stack.
I’ll begin with a technical overview of common neural network architectures like CNNs, RNNs, GANs and their common use cases like computer vision, language understanding or unsupervised machine learning. Then I’ll separate the hype from reality around questions like:
• When should you prefer traditional ML systems like scikit learn or Spark.ML instead of Deep Learning?
• Do you no longer need to do careful feature extraction and standardization if using Deep Learning?
• Do you really need terabytes of data when training neural networks or can you ‘steal’ pre-trained lower layers from public models by using transfer learning?
• How do you decide which activation function (like ReLU, leaky ReLU, ELU, etc) or optimizer (like Momentum, AdaGrad, RMSProp, Adam, etc) to use in your neural network?
• Should you randomly initialize the weights in your network or use more advanced strategies like Xavier or He initialization? (See the sketch after this list.)
• How easy is it to overfit/overtrain a neural network, and what are the common techniques to avoid overfitting (like L1/L2 regularization, dropout, and early stopping)?
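A small numpy sketch contrasting the two initialization strategies from the list above (the formulas are the standard Glorot/Xavier and He recipes; the sketch is not from the talk):

```python
# Xavier vs. He initialization for a weight matrix of shape (fan_in, fan_out).
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    # Glorot/Xavier: variance scaled by fan-in and fan-out; suits tanh/sigmoid.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng):
    # He: variance scaled by fan-in only; suits ReLU-family activations.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_xavier = xavier_init(784, 256, rng)
W_he = he_init(784, 256, rng)
print(W_xavier.std(), W_he.std())  # He weights are wider for the same fan-in
```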
Automatic Attendance using convolutional neural network Face Recognition (vatsal199567)
The Automatic Attendance System recognizes students' faces through the classroom camera and marks their attendance. It was built in Python using machine learning.
Webinar: Machine Learning for Microcontrollers (Embarcados)
In this webinar, we present concepts of artificial intelligence, along with available development tools integrated with MPLAB X and Harmony 3, and a demonstration of an anomaly-detection system using a microcontroller from the ATSAMD21 family (ARM Cortex M0+).
Startup.Ml: Using neon for NLP and Localization Applications (Intel Nervana)
Speaker: Arjun Bansal, co-founder of Nervana Systems
Arjun Bansal’s workshop focused on neon, an open-source Python-based deep learning framework that has been built from the ground up for speed and ease of use. The workshop highlights how to use neon, build recurrent neural networks to generate and analyze text, and build convolutional autoencoders to generate images and localize objects. Arjun also demoed the integration of neon with the Nervana cloud (in private beta) for multi-GPU training of deep networks.
Date: Monday, January 3, 2022
Lecture no. 143 of the #Communicate_Develop (#تواصل_تطوير) initiative, presented by Engineer Mohamed El-Rafei Tarabay, head of the Programmers Syndicate in Dakahlia, titled
"IT INDUSTRY"
How To Getting Into IT With Zero Experience
Monday, January 3, 2022, at 7 PM Cairo time / 8 PM Mecca time, attending via Zoom:
https://us02web.zoom.us/meeting/register/tZUpf-GsrD4jH9N9AxO39J013c1D4bqJNTcu
The lecture will also be streamed live on the Egyptian Engineers Association channels. We hope to offer something useful to engineers and the engineering profession in the Arab world.
To contact the initiative's organizers via the Telegram channel:
https://t.me/EEAKSA
To follow the initiative and the live stream on our various outlets:
LinkedIn and e-library: https://www.linkedin.com/company/eeaksa-egyptian-engineers-association/
Twitter: https://twitter.com/eeaksa
Facebook: https://www.facebook.com/EEAKSA
YouTube: https://www.youtube.com/user/EEAchannal
General lecture registration: https://forms.gle/vVmw7L187tiATRPw9
Note: free attendance certificates are available to those who fill in the evaluation form at the end of the lecture.
Similar to Cognitive Toolkit - Deep Learning framework from Microsoft
AnalyticsConf2016 - Innovation through intelligent information analysis - C... (Łukasz Grala)
A session introducing artificial intelligence, Cognitive Services, and machine learning: services at your fingertips for analyzing images, sound, and text; real-time image analysis; recognition of faces and people, motion, and emotions; reading text from video; and bots.
AnalyticsConf2016 - Advanced Analytics on the Azure HDInsight Platform (Łukasz Grala)
A session on Microsoft's Big Data Analytics solution: Hortonworks (Hadoop, HBase, Storm, Spark) together with the high-performance R Server, plus advanced analytics using RevoScaleR.
eRum2016 - RevoScaleR - Performance and Scalability in R (Łukasz Grala)
Conference eRum2016.
The European R users meeting (eRum) is an international conference that aims to bring together users of the R language. eRum 2016 will be a good chance to exchange experiences, broaden knowledge of R, and collaborate. One can participate in eRum 2016: (1) with a regular oral presentation, (2) with a lightning talk, (3) with a poster presentation, (4) or attending without a presentation or poster. Due to the space available at the conference venue, the organizers set the limit of participants at 250.
A session about RevoScaleR.
AzureDay North 2016, a conference about cloud solutions.
What is machine learning? Why do we need machine learning? Where and when do we use it? What are Azure Machine Learning and the R language? The session introduces the machine learning paradigm, data mining, classes of problems, and the fundamentals of the algorithms.
By Data Scientist as a Service.
AzureDay - Introduction to Big Data Analytics (Łukasz Grala)
AzureDay North 2016, a conference about cloud solutions.
What is analytics? What is big data? Why do we have big data in the cloud? What does Microsoft offer for big data analytics? How do you start with big data analytics or advanced analytics? The session introduces the fundamentals of big data and advanced analytics.
By Data Scientist as a Service
WyspaIT 2016 - Azure Stream Analytics and Azure Machine Learning in the analysis of str... (Łukasz Grala)
The growth of data arriving as streams has created a need to analyze streaming data in real time. The session demonstrated the combination of:
- Event Hub / IoT Hub
- Azure Stream Analytics
- Azure Machine Learning
20160405 Cloud Community Poznań - Cloud Analytics on Azure (Łukasz Grala)
Cloud analytics on the Azure platform: an overview of analytics, covering Azure Data Lake Storage & Analytics, Azure Stream Analytics, HDInsight, Hortonworks, Power BI, and more.
The first edition of the AzureDay Poland 2016 conference. This session covered the analysis of streaming data with Azure Stream Analytics, extended with machine learning algorithms executed in Azure Machine Learning.
An introduction to storing data in the cloud: from relational Azure SQL Database and Azure SQL Data Warehouse, through NoSQL with Azure DocumentDB, to HDInsight (Hadoop, Spark, HBase), Azure Search, and Azure Data Factory.
An introduction to analyzing data in the cloud, covering Azure Stream Analytics, Azure Data Lake Analytics, and Azure Machine Learning, as well as open-source solutions (Spark, Jupyter, Storm, Zeppelin).
A session about the types of analytics: descriptive, diagnostic, predictive, and prescriptive.
Conference DATA ANALYSIS DEVELOPMENT 2016 by RZECZPOSPOLITA.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Short notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation. The experiments cover:
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method... (2023240532)
Quantitative Data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
Adjusting OpenMP PageRank: SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments were conducted to implement PageRank in OpenMP using two different approaches: uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). The hybrid approach, on the other hand, runs certain primitives in sequential mode (i.e., sumAt, multiply).
The Building Blocks of QuestDB, a Time Series Database (Javier Ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open-source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Cognitive Toolkit - Deep Learning framework from Microsoft
1. Cognitive Toolkit - Deep Learning framework from Microsoft
Łukasz Grala
lukasz@tidk.pl | lukasz.grala@cs.put.poznan.pl
2. Łukasz Grala
• Data architect at TIDK
• Creator of "Data Scientist as a Service"
• Certified Microsoft trainer and university lecturer
• Author of advanced training courses and workshops, as well as numerous publications and webcasts
• Recognized with the Microsoft Data Platform MVP award since 2010
• PhD student at Poznań University of Technology, Faculty of Computing (databases, data mining, machine learning)
• Speaker at numerous conferences in Poland and around the world
• Holder of numerous certifications (MCT, MCSE, MCSA, MCITP, ...)
• Board member of the Polish Information Processing Society, Wielkopolska Branch
• Member and leader of Data Community Poland (formerly the Polish SQL Server User Group, PLSSUG)
• Passionate about data analysis, storage, and processing; lover of jazz and MTB
email lukasz@tidk.pl - lukasz.grala@cs.put.poznan.pl blog: grala.it
6. Machine Learning
1763 - The underpinnings of Bayes' Theorem
1805 - Least squares
1812 - Bayes' Theorem
1913 - Markov chains
1950 - Turing's learning machine
1951 - First neural network machine
1958 - Single-layer neural network on a room-size computer
1967 - Nearest neighbor
1982 - Recurrent neural network
1995 - Random forest algorithm; support vector machines
1997 - IBM Deep Blue beats Kasparov
2012 - Recognizing cats on YouTube
2016 - AlphaGo
7. Machine Learning
“Can machines do what we (as thinking entities) can do?”
Alan Turing, “Computing Machinery and Intelligence”, Mind, 1950
Turing’s test
8. Machine Learning
“Machine Learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.”
Arthur Samuel, “Some Studies in Machine Learning Using the Game of Checkers”, IBM Journal of Research and Development, 1959
9. Machine Learning
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”
Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997
10. Supervised
Each point in the training data is associated with a label or output. The task is to learn a model/hypothesis that predicts the output for points not in the training dataset.
Classification: given a set of features, predict discrete outputs (Fraud/Not Fraud)
Regression: given a set of features, predict continuous outputs (credit score, item price, ...)
Recommendation: given a set of {user, item, rating} triplets and, optionally, features about users and items, predict ratings for an item, items similar to a given item, and users similar to a given user
Anomaly detection: given a set of features for “normal” examples, predict normal vs. anomaly
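As a minimal illustration of the supervised setting (a scikit-learn sketch on synthetic data; not from the slides):

```python
# Learn from labeled points, then predict outputs for points not seen in training.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # labeled training data
print(clf.score(X_test, y_test))  # accuracy on points outside the training set
```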
11. Unsupervised
Points in the training dataset are not associated with known output values. The model learns the inherent structure of the training data.
Clustering: given a training dataset, find a small number of centers, ‘k’, that are “close” to points in the dataset; each point in the dataset is associated with at most a single center
Principal Component Analysis: given a training dataset with ‘N’ features, find a set of ‘k’ features that approximates the data with bounded error; the set of ‘k’ principal components is representative of the original dataset but with much lower dimensionality
Time series: ARIMA, ETS, ...
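A corresponding unsupervised sketch in scikit-learn (synthetic data; ‘k’ and dimensions are illustrative):

```python
# Clustering and PCA: learning structure without labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(300, 8))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_.shape)  # k centers; each point maps to one center

pca = PCA(n_components=2).fit(X)
X_low = pca.transform(X)              # N=8 features approximated by k=2 components
print(X_low.shape, pca.explained_variance_ratio_)
```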
16. Artificial Neural Networks
ANNs are processing devices (algorithms or actual hardware) that are loosely modeled after the neuronal structure of the mammalian cerebral cortex, but on much smaller scales.
The simplest definition of a neural network, more properly referred to as an ‘artificial’ neural network (ANN), is provided by the inventor of one of the first neurocomputers, Dr. Robert Hecht-Nielsen. He defines a neural network as:
“...a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs.”
19. Overfitting
The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and is likely to have a higher error rate on new, unseen data than the black line.
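A small sketch reproducing the figure's idea with scikit-learn (degree, noise, and regularization strength are illustrative):

```python
# An unregularized high-degree fit (the "green line") versus a regularized fit
# (the "black line") on noisy data; the regularized model usually generalizes better.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))[:, None]
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 30)

overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(x, y)
regular = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3)).fit(x, y)

x_new = rng.uniform(0, 1, 100)[:, None]            # unseen data
y_new = np.sin(2 * np.pi * x_new).ravel()
print(((overfit.predict(x_new) - y_new) ** 2).mean(),
      ((regular.predict(x_new) - y_new) ** 2).mean())
```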
20. Deep Learning Use Cases
Text: sentiment analysis, augmented search, fraud detection, NLP
Video and image: facial recognition, emotion recognition, image search, photo clustering, tags, motion detection
Sound: voice recognition, voice search, sentiment analysis, flaw detection
Time series: prediction, recommendation, risk detection
21. Convolutional network
A convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery.
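A minimal CNTK sketch of such a network (layer sizes and the input shape are assumptions, not from the slide):

```python
# A small ConvNet: convolution layers extract local features, pooling downsamples.
import cntk as C
from cntk.layers import Convolution, MaxPooling, Dense, Sequential

model = Sequential([
    Convolution((5, 5), 32, activation=C.relu, pad=True),  # 32 5x5 filters
    MaxPooling((2, 2), strides=(2, 2)),                    # spatial downsampling
    Convolution((5, 5), 64, activation=C.relu, pad=True),
    MaxPooling((2, 2), strides=(2, 2)),
    Dense(10, activation=None)                             # class scores
])
x = C.input_variable((1, 28, 28))  # e.g. grayscale 28x28 images (assumed)
z = model(x)
```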
24. ImageNet CNN
• The model that won the ImageNet competition in 2012
• 5 convolutional layers and 2 fully connected layers
• ReLU units and dropout at the top layer; 60 million parameters
• 1.2 million training images
• Classification into 1000 classes
• Trained on two GPUs for a week
• 16.4% error (second place: 26.2%)
26. Recurrent Neural Networks
A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle.
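A minimal CNTK sketch of a recurrent model over sequences (all dimensions are assumptions):

```python
# Recurrence() wires the LSTM's output back into its input, forming the
# directed cycle that defines an RNN.
import cntk as C
from cntk.layers import Embedding, Recurrence, LSTM, Dense, Sequential

model = Sequential([
    Embedding(150),         # map sparse word ids to dense vectors
    Recurrence(LSTM(300)),  # hidden state feeds back at every time step
    C.sequence.last,        # keep the final hidden state of the sequence
    Dense(5)                # e.g. 5 output classes
])
x = C.sequence.input_variable(10000)  # one-hot over a 10,000-word vocabulary (assumed)
z = model(x)
```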
40. Learners
Algorithm - Strengths
rxFastLinear - Fast, accurate linear learner with automatic L1 & L2
rxLogisticRegression - Logistic regression with L1 & L2
rxFastTree - Boosted decision tree from Bing; competitive with XGBoost; the most accurate learner for most cases
rxFastForest - Random forest
rxNeuralNet - GPU-accelerated Net# DNNs with convolutions
rxOneClassSvm - Anomaly detection or unbalanced binary classification
41. Learners - Scalability
• Streaming (not RAM bound)
• Billions of features
• Multi-proc
• GPU acceleration for DNNs
• Distributed on Hadoop/Spark via Ensembling
45. ONNX is a community project created by Facebook and Microsoft. ONNX provides a definition of an extensible computation graph model, as well as definitions of built-in operators and standard data types.
Each computation dataflow graph is structured as a list of nodes that form an acyclic graph. Nodes have one or more inputs and one or more outputs. Each node is a call to an operator. The graph also has metadata to help document its purpose, author, etc.
Operators are implemented externally to the graph, but the set of built-in operators is portable across frameworks. Every framework supporting ONNX will provide implementations of these operators on the applicable data types.
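A short sketch of inspecting such a graph from Python (the file name is a placeholder):

```python
# Load an ONNX model, validate it, and walk its acyclic list of operator nodes.
import onnx

model = onnx.load("model.onnx")
onnx.checker.check_model(model)         # structural validation

graph = model.graph
print(graph.name, model.producer_name)  # graph metadata (purpose, author, ...)
for node in graph.node:                 # each node is a call to an operator
    print(node.op_type, list(node.input), list(node.output))
```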
48. Cognitive Toolkit
• FFN, CNN, RNN/LSTM, batch normalization, sequence-to-sequence with attention, and more
• Reinforcement learning, generative adversarial networks, supervised and unsupervised learning
• Ability to add new user-defined core components on the GPU from Python
• Automatic hyperparameter tuning
• Built-in readers optimized for massive datasets
• Full APIs for defining networks, learners, readers, training, and evaluation from Python, C++, C#, and BrainScript
• Evaluate models with Python, C++, C#, R, and BrainScript
• Automatic shape inference based on your data
49. CNTK – Layers Library
A simple model with one hidden layer: the Dense() function
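A sketch of that model in CNTK Python (dimensions assumed):

```python
# One hidden layer built with Dense(): input -> hidden (ReLU) -> class scores.
import cntk as C
from cntk.layers import Dense

x = C.input_variable(784)               # e.g. flattened 28x28 image (assumed)
h = Dense(200, activation=C.relu)(x)    # hidden layer
z = Dense(10, activation=None)(h)       # output layer: unnormalized class scores
```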
50. CNTK – Layers Library
Alternative: Sequential()
A 2011-style feed-forward speech-recognition network with 6 hidden sigmoid layers of identical dimensions
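A sketch of that network using Sequential() and the For() helper (the input dimension and layer sizes are assumptions in the style of the slide):

```python
# Six identical sigmoid hidden layers followed by an output layer.
import cntk as C
from cntk.layers import Dense, Sequential, For

model = Sequential([
    For(range(6), lambda: Dense(2048, activation=C.sigmoid)),  # 6 hidden layers
    Dense(9000, activation=None)  # e.g. senone outputs in speech recognition
])
x = C.input_variable(40)  # e.g. 40-dimensional acoustic features (assumed)
z = model(x)
```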
62. Łukasz Grala, Microsoft MVP
CEO, Data Architect
Lukasz.grala@cs.put.poznan.pl
+48 663832323
http://tidk.pl
Lukasz.grala@tidk.pl
http://dsaas.co
Editor's Notes
AI:
ML (Bayesian learning, decision trees, rule sets)
Expert networks
Neural networks
Theorem proving
Decision-making with incomplete data
Logical/rational reasoning
Fuzzy logic
Evolutionary algorithms
Microsoft Azure VM description:
https://docs.microsoft.com/pl-pl/azure/machine-learning/data-science-virtual-machine/overview
VM:
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.linux-data-science-vm-ubuntu?tab=Overview
https://azuremarketplace.microsoft.com/en-us/marketplace/apps?search=Data%20Science%20Virtual%20Machine&page=1
Dense:
shape: output dimension of this layer
activation (default: None): pass a function here to be used as the activation function, such as activation=relu
input_rank: if given, number of trailing dimensions that are transformed by Dense() (map_rank must not be given)
map_rank: if given, the number of leading dimensions that are not transformed by Dense() (input_rank must not be given)
init (default: glorot_uniform()): initializer descriptor for the weights. See cntk.initializer for a full list of random-initialization options.
bias: if False, do not include a bias parameter
init_bias (default: 0): initializer for the bias
Embedding:
shape: the dimension of the desired embedding vector. Must not be None unless weights are passed
init: initializer descriptor for the weights to be learned. See cntk.initializer for a full list of initialization options.
weights (numpy array): if given, embeddings are not learned but specified by this array (which could be, e.g., loaded from a file) and not updated further during training
Convolution:
filter_shape: shape of receptive field of the filter, e.g. (5,5) for a 2D filter (not including the input feature-map depth)
num_filters: number of output channels (number of filters)
activation: optional non-linearity, e.g. activation=relu
init: initializer descriptor for the weights, e.g. glorot_uniform(). See cntk.initializer for a full list of random-initialization options.
pad: if False (default), then the filter will be shifted over the “valid” area of input, that is, no value outside the area is used. If pad is True on the other hand, the filter will be applied to all input positions, and values outside the valid region will be considered zero.
strides: increment when sliding the filter over the input. E.g. (2,2) to reduce the dimensions by 2
bias: if False, do not include a bias parameter
init_bias: initializer for the bias
use_correlation: currently always True and cannot be changed. It indicates that Convolution() actually computes the cross-correlation rather than the true convolution
Pooling:
filter_shape: receptive field (window) to pool over, e.g. (2,2) (not including the input feature-map depth)
strides: increment when sliding the pool over the input. E.g. (2,2) to reduce the dimensions by 2
pad: if False (default), then the pool will be shifted over the “valid” area of input, that is, no value outside the area is used. If pad is True on the other hand, the pool will be applied to all input positions, and values outside the valid region will be considered zero. For average pooling, count for average does not include padded values.
LSTM:
shape: dimension of the output
cell_shape (optional): the dimension of the LSTM’s cell. If None, the cell shape is identical to shape. If specified, an additional linear projection will be inserted to project from the cell dimension to the output shape.
use_peepholes (optional): if True, then use peephole connections in the LSTM
init: initializer descriptor for the weights. See cntk.initializer for a full list of initialization options.
enable_self_stabilization (optional): if True, insert a Stabilizer() for the hidden state and cell
BatchNormalization:
map_rank: if given then normalize only over this many leading dimensions. E.g. 1 to tie all (h,w) in a (C, H, W)-shaped input. Currently, the only allowed values are None (no pooling) and 1 (e.g. pooling across all pixel positions of an image)
normalization_time_constant (default 5000): time constant in samples of the first-order low-pass filter that is used to compute mean/variance statistics for use in inference
initial_scale: initial value of scale parameter
epsilon: small value that gets added to the variance estimate when computing the inverse
use_cntk_engine: if True, use CNTK’s native implementation. If false, use cuDNN’s implementation (GPU only).
disable_regularization: if True then disable regularization in BatchNormalization.
LayerNormalization:
initial_scale: initial value of scale parameter
initial_bias: initial value of bias parameter
Stabilizer:
steepness: sharpness of the knee of the softplus function