Topic: How to use big data to enhance AI
Outline:
1. Spark ETL
   Spark SQL
   Spark Streaming
2. Spark ML
   Spark ML pipeline
   Distributed model tuning
   Spark ML model and data lineage management
3. Spark XGBoost
   XGBoost introduction
   XGBoost with Spark
   XGBoost with GPU
4. Spark Deep Learning pipeline
   Transfer learning
   Build Spark ML pipeline with TensorFlow
   Model selection on distributed TF model
Jay Yagnik at AI Frontiers: A History Lesson on AI | AI Frontiers
We have reached a remarkable point in history with the evolution of AI, from applying this technology to incredible use cases in healthcare, to addressing the world's biggest humanitarian and environmental issues. Our ability to learn task-specific functions for vision, language, sequence and control tasks is getting better at a rapid pace. This talk will survey some of the current advances in AI, compare AI to other fields that have historically developed over time, and calibrate where we are in the relative advancement timeline. We will also speculate about the next inflection points and capabilities that AI can offer down the road, and look at how those might intersect with other emergent fields, e.g. Quantum computing.
Nikhil Garg, Engineering Manager, Quora at MLconf SF 2016 | MLconf
Building a Machine Learning Platform at Quora: Each month, over 100 million people use Quora to share and grow their knowledge. Machine learning has played a critical role in enabling us to grow to this scale, with applications ranging from understanding content quality to identifying users’ interests and expertise. By investing in a reusable, extensible machine learning platform, our small team of ML engineers has been able to productionize dozens of different models and algorithms that power many features across Quora.
In this talk, I’ll discuss the core ideas behind our ML platform, as well as some of the specific systems, tools, and abstractions that have enabled us to scale our approach to machine learning.
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15 | MLconf
Many Shades of Scale: Big Learning Beyond Big Data: In the machine learning research community, much of the attention devoted to ‘big data’ in recent years has been manifested as development of new algorithms and systems for distributed training on many examples. This focus has led to significant advances in the field, from basic but operational implementations on popular platforms to highly sophisticated prototypes in the literature. In the meantime, other aspects of scaling up learning have received relatively little attention, although they are often more pressing in practice. The talk will survey these less-studied facets of big learning: scaling to an extremely large number of features, to many components in predictive pipelines, and to multiple data scientists collaborating on shared experiments.
QCon Rio - Machine Learning for Everyone | Dhiana Deva
Supercomputers and teams of MIT PhDs are no longer required to build data-driven predictive models. We are witnessing innovations in machine learning that are making the field increasingly accessible.
This talk aims to demystify machine learning by presenting its concepts and a range of technologies.
It covers the types of problems in this area (classification, regression, clustering, dimensionality reduction, etc.), its stages (normalization, training, optimization, regularization, etc.), and its algorithms, from linear regression and k-means through decision trees to neural networks, always applied to real problems.
The talk also introduces tools such as Scikit-learn, Pandas, R, MATLAB, and Amazon Machine Learning, as well as a way to practice and experiment with these ideas through competitions such as Kaggle.
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016 | MLconf
DL4J and DataVec for Enterprise Deep Learning Workflows: Applications in NLP, sensor processing (IoT), image processing, and audio processing have all emerged as prime deep learning applications. In this session we will take a practical look at building secure Deep Learning workflows in the enterprise. We'll see how DL4J's DataVec tool enables scalable ETL and vectorization pipelines to be created for a single machine or scaled out to Spark on Hadoop. We'll also see how Deep Networks such as Recurrent Neural Networks are able to leverage DataVec to more quickly process data for modeling.
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal... | MLconf
Local Search Optimization for Hyper-Parameter Tuning: Many machine learning algorithms are sensitive to their hyper-parameter settings and lack good universal rule-of-thumb defaults. In this talk we discuss the use of black-box local search optimization (LSO) for machine learning hyper-parameter tuning. Viewed as a black-box objective function of hyper-parameters, machine learning algorithms create a difficult class of optimization problems. The corresponding objective functions tend to be nonsmooth, discontinuous, and unpredictably expensive to compute, and they require support for continuous, categorical, and integer variables. Further, evaluations can fail for a variety of reasons, such as early exits due to node failure or hitting the maximum time. Additionally, not all hyper-parameter combinations are compatible (creating so-called "hidden constraints"). In this context, we apply a parallel hybrid derivative-free optimization algorithm that can make progress despite these difficulties, providing significantly improved results over default settings with minimal user interaction. Further, we will address efficient parallel paradigms for different types of machine learning problems, while exploring the importance of validation to avoid overfitting and emphasizing that, even for small data problems, the need to perform cross-validation can create computationally intense functions that benefit from a distributed/threaded environment.
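To make the idea of black-box local search over mixed hyper-parameters concrete, here is a minimal serial sketch. It is not the parallel hybrid algorithm from the talk: it is a plain best-improvement neighborhood search, and the objective `validation_loss`, the parameter names, and the hidden constraint are all hypothetical stand-ins for a real train-and-validate run.

```python
import math


def validation_loss(params):
    """Hypothetical black-box objective standing in for training plus validation.

    Returns None when the evaluation "fails", which is how this sketch models
    a hidden constraint (an incompatible hyper-parameter combination).
    """
    lr = params["learning_rate"]
    depth = params["max_depth"]
    booster = params["booster"]
    if booster == "linear" and depth > 6:
        return None  # hidden constraint: this combination cannot be evaluated
    penalty = 0.0 if booster == "tree" else 0.3
    return (math.log10(lr) + 1.0) ** 2 + 0.05 * (depth - 5) ** 2 + penalty


def neighbors(params):
    """Yield nearby settings: one move per continuous, integer, and categorical variable."""
    for factor in (0.5, 2.0):  # continuous variable, searched on a log scale
        yield {**params, "learning_rate": params["learning_rate"] * factor}
    for step in (-1, 1):  # integer variable
        depth = params["max_depth"] + step
        if 1 <= depth <= 12:
            yield {**params, "max_depth": depth}
    for booster in ("tree", "linear"):  # categorical variable
        if booster != params["booster"]:
            yield {**params, "booster": booster}


def local_search(start, max_iters=50):
    """Move to the best neighbor until no neighbor improves; skip failed evaluations."""
    best, best_loss = start, validation_loss(start)
    if best_loss is None:
        raise ValueError("the starting point must be a valid configuration")
    for _ in range(max_iters):
        improved = False
        for candidate in neighbors(best):
            loss = validation_loss(candidate)
            if loss is None:
                continue  # failed evaluation: ignore it and keep searching
            if loss < best_loss:
                best, best_loss, improved = candidate, loss, True
        if not improved:
            break
    return best, best_loss


start = {"learning_rate": 0.8, "max_depth": 6, "booster": "linear"}
best_params, best_loss = local_search(start)
print(best_params, best_loss)
```

Because each neighbor evaluation is independent, the inner loop is the natural place to evaluate candidates in parallel, which is the setting the abstract describes.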
Online Machine Learning: introduction and examples | Felipe
In this talk I introduce the topic of Online Machine Learning, which deals with techniques for doing machine learning in an online setting, i.e. where you train your model a few examples at a time, rather than using the full dataset (off-line learning).
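As a rough illustration of training "a few examples at a time", the sketch below updates a linear model with one stochastic gradient step per incoming example and never stores the dataset. The class name `OnlineLinearRegressor`, the synthetic stream, and the learning rate are assumptions made for this example, not material from the talk.

```python
import random


class OnlineLinearRegressor:
    """Linear model updated one example at a time with SGD on squared error."""

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def partial_fit(self, x, y):
        """Take a single gradient step using only this example."""
        error = self.predict(x) - y
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error
        return error


def make_stream(n_examples, seed=0):
    """Hypothetical data stream: y = 2*x0 - 3*x1 + 0.5, without noise."""
    rng = random.Random(seed)
    for _ in range(n_examples):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 2.0 * x[0] - 3.0 * x[1] + 0.5


model = OnlineLinearRegressor(n_features=2)
for x, y in make_stream(5000):
    model.partial_fit(x, y)  # each example is seen once and then discarded

print(model.weights, model.bias)
```

An offline learner would instead collect all 5000 examples first and fit them in one pass over the stored dataset.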
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio... | MLconf
Why Machine Learning Algorithms Fall Short (And What You Can Do About It): Many think that machine learning is all about the algorithms. Want a self-learning system? Get your data, start coding or hire a PhD who will build you a model that will stand the test of time. Of course we know that this is not enough. Models degrade over time, algorithms that work great on yesterday's data may not be the best option, and new data sources and types become available. In short, your self-learning system may not be learning anything at all. In this session, we will examine how to overcome challenges in creating self-learning systems that perform better and are built to stand the test of time. We will show how to apply mathematical optimization algorithms that often prove superior to the local optimization methods favored by typical machine learning applications, and discuss why these methods can create better results. We will also examine the role of smart automation in the context of machine learning and how smart automation can create self-learning systems that are built to last.
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016 | MLconf
Applying Deep Learning at Facebook Scale: Facebook leverages Deep Learning for various applications including event prediction, machine translation, natural language understanding and computer vision at a very large scale. More than a billion users log on to Facebook every day, generating thousands of posts per second and uploading more than a billion images and videos every day. This talk will explain how Facebook scaled Deep Learning inference for realtime applications with latency budgets in the milliseconds.
Anima Anandkumar at AI Frontiers: Modern ML: Deep, distributed, Multi-dimen... | AI Frontiers
As the data and models scale, it becomes necessary to have multiple processing units for both training and inference. SignSGD is a gradient compression algorithm that only transmits the sign of the stochastic gradients during distributed training. This algorithm uses 32 times less communication per iteration than distributed SGD. We show that signSGD obtains a free lunch both in theory and practice: no loss in accuracy while yielding speedups. Pushing the current boundaries of deep learning also requires using multiple dimensions and modalities. These can be encoded into tensors, which are natural extensions of matrices. These functionalities are available in the TensorLy package with multiple backend interfaces for large-scale deep learning.
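The sign-only communication described above can be sketched in a few lines. This is a toy single-process simulation under stated assumptions: three simulated workers, a linear model on synthetic data, aggregation of the signs by majority vote (one common variant, assumed here rather than taken from the abstract), and 32-bit floats as the baseline gradient encoding, which is where the factor of 32 comes from.

```python
import random

FLOAT_BITS = 32  # assumption: plain distributed SGD sends 32-bit floats
SIGN_BITS = 1    # signSGD sends one bit per gradient coordinate


def sign(value):
    return 1 if value >= 0 else -1


def local_gradient(weights, batch):
    """Gradient of mean squared error for a linear model on one worker's batch."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * error * xi / len(batch)
    return grad


def majority_vote(sign_vectors):
    """Aggregate per-coordinate signs from all workers (assumed aggregation rule)."""
    return [sign(sum(column)) for column in zip(*sign_vectors)]


def train_signsgd(worker_batches, n_features, steps=500, learning_rate=0.01):
    weights = [0.0] * n_features
    for _ in range(steps):
        # Each worker transmits only the sign of each gradient coordinate.
        signs = [[sign(g) for g in local_gradient(weights, batch)]
                 for batch in worker_batches]
        update = majority_vote(signs)
        weights = [w - learning_rate * u for w, u in zip(weights, update)]
    return weights


def make_batch(rng, size, true_weights):
    """Hypothetical noiseless data for one worker."""
    batch = []
    for _ in range(size):
        x = [rng.uniform(-1, 1) for _ in true_weights]
        batch.append((x, sum(w * xi for w, xi in zip(true_weights, x))))
    return batch


rng = random.Random(0)
true_weights = [1.0, -2.0]
worker_batches = [make_batch(rng, 64, true_weights) for _ in range(3)]
weights = train_signsgd(worker_batches, n_features=2)
compression_ratio = FLOAT_BITS // SIGN_BITS
print(weights, compression_ratio)
```

With a fixed step size the weights end up oscillating in a small band around the true values, since every update has the same magnitude regardless of how small the gradient is.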
Native ads (ads that match the look and feel of the embedding page) have become a multi-billion dollar business in recent years. Gemini native is Yahoo’s native advertisement platform and this talk will overview some of the science behind its ad ranking.
The accurate prediction of an ad’s click-through rate (CTR) for a given impression is a key component of any such ad ranking system as it allows one to rank the ads according to their expected revenue. I will give a short overview of different CTR prediction models and deep dive into the major components of large-scale logistic regression models; a special focus will be given to implementing such a logistic regression model in Apache Spark.
Automated Hyperparameter Tuning, Scaling and Tracking | Databricks
Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
Apache PySpark MLlib integration with MLflow for automatically tracking tuning
Hyperopt integration with Apache Spark to distribute tuning and with MLflow for automatic tracking
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
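Two of the techniques surveyed in this webinar, grid search and random search, can be compared with a short sketch. The objective `validation_loss`, the hyperparameter names, and the search ranges are hypothetical; the code does not use Hyperopt, MLlib, or MLflow, and only illustrates how the two strategies choose trial points.

```python
import itertools
import math
import random


def validation_loss(learning_rate, reg):
    """Hypothetical objective, minimized at learning_rate=1e-2 and reg=1e-1."""
    return (math.log10(learning_rate) + 2.0) ** 2 + (math.log10(reg) + 1.0) ** 2


def grid_search(lr_values, reg_values):
    """Evaluate every combination of the listed values."""
    trials = [((lr, reg), validation_loss(lr, reg))
              for lr, reg in itertools.product(lr_values, reg_values)]
    return min(trials, key=lambda trial: trial[1]), len(trials)


def random_search(n_trials, seed=0):
    """Sample each hyperparameter log-uniformly from an assumed range."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, 0)
        reg = 10 ** rng.uniform(-3, 1)
        trials.append(((lr, reg), validation_loss(lr, reg)))
    return min(trials, key=lambda trial: trial[1]), len(trials)


(grid_params, grid_loss), grid_count = grid_search([1e-3, 1e-2, 1e-1],
                                                   [1e-2, 1e-1, 1.0])
(random_params, random_loss), random_count = random_search(n_trials=9)
print(grid_count, grid_params, grid_loss)
print(random_count, random_params, random_loss)
```

Both strategies spend the same budget of nine evaluations here. The grid happens to contain the optimum in this toy problem; in practice the grid values are guesses, and random search covers more distinct values per hyperparameter for the same number of trials.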
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two Machine Learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from the ground up to multiple millions of dollars in ARR. Yifan started his career as a researcher in quantum computing. Yifan received his B.S. from UC Berkeley and his master's degree from MIT.
Machine learning, also called statistical learning or predictive analytics, is leaving research labs and specialist circles and is increasingly used within companies, and not only startups. Evidence of this is the rise of the open-source toolkit Scikit-learn, which has spread very quickly worldwide as one of the new standards for this new way of building software, as well as the availability since July 2014 of Azure ML, the machine learning service of Microsoft Azure. In this session we offer an overview of developing statistical learning software in Python with Scikit-learn. We have invited one of the toolkit's main contributors, Olivier Grisel, a research engineer in the Inria PARIETAL team at Saclay, to present it in an interactive session based on numerous examples and demos. For more information: http://scikit-learn.org https://team.inria.fr/parietal/ https://twitter.com/ogrisel
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016 | MLconf
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, TensorFlow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn't have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B... | Databricks
We all know what they say: the bigger the data, the better. But when the data gets really big, how do you mine it, and which deep learning framework should you use? This talk will survey, from a developer's perspective, three of the most popular deep learning frameworks (TensorFlow, Keras, and PyTorch), as well as when to use their distributed implementations.
We’ll compare code samples from each framework and discuss their integration with distributed computing engines such as Apache Spark (which can handle massive amounts of data) as well as help you answer questions such as:
As a developer how do I pick the right deep learning framework?
Do I want to develop my own model or should I employ an existing one?
How do I strike a trade-off between productivity and control through low-level APIs?
What language should I choose?
In this session, we will explore how to build a deep learning application with TensorFlow, Keras, or PyTorch in under 30 minutes. After this session, you will walk away with the confidence to evaluate which framework is best for you.
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016 | MLconf
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as it requires training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
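The stacking procedure described in this abstract (cross-validate several base learners, then fit a metalearner on their out-of-fold predictions) can be sketched as follows. This is a deliberately small illustration, not H2O Ensemble: the two base learners, the synthetic data, and the metalearner (a grid search over one convex blending weight) are simplifying assumptions chosen to keep the example short.

```python
import random


def fit_linear(xs, ys):
    """Base learner 1: one-dimensional least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def fit_knn(xs, ys, k=3):
    """Base learner 2: average of the k nearest training targets."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def out_of_fold_predictions(fit, xs, ys, n_folds=5):
    """Predict each point with a model trained on the other folds only."""
    preds = [0.0] * len(xs)
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), n_folds):
            preds[i] = model(xs[i])
    return preds


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def fit_metalearner(preds_a, preds_b, ys, grid=21):
    """Simplified metalearner: choose the convex weight with the lowest CV error."""
    best_alpha, best_error = 0.0, float("inf")
    for step in range(grid):
        alpha = step / (grid - 1)
        blended = [alpha * a + (1 - alpha) * b for a, b in zip(preds_a, preds_b)]
        error = mse(blended, ys)
        if error < best_error:
            best_alpha, best_error = alpha, error
    return best_alpha, best_error


rng = random.Random(0)
xs = [rng.uniform(-2, 2) for _ in range(100)]
ys = [x + 0.5 * x * x + rng.gauss(0, 0.3) for x in xs]  # hypothetical data

linear_preds = out_of_fold_predictions(fit_linear, xs, ys)
knn_preds = out_of_fold_predictions(fit_knn, xs, ys)
alpha, stacked_error = fit_metalearner(linear_preds, knn_preds, ys)
print(mse(linear_preds, ys), mse(knn_preds, ys), alpha, stacked_error)
```

The computational cost the abstract mentions is visible even here: every base learner is trained once per fold before the metalearner sees any data, which is the part that benefits from distributed execution.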
Highly-scalable Reinforcement Learning RLlib for Real-world Applications | Bill Liu
website: https://learn.xnextcon.com/event/eventdetails/W20051110
video: https://www.youtube.com/watch?v=8tG8PJC6oaU
In reinforcement learning (RL), an agent learns how to optimize performance solely by collecting experience in the real world or via a simulator. RL is being applied to problems such as decision making, process optimization (e.g., manufacturing and supply chains), ad serving, recommendations, self-driving cars, and algorithmic trading.
In this talk, I will discuss RLlib, a reinforcement learning library built on Ray with a strong focus on large-scale execution and scalability, ease-of-use for general users, as well as customizability for developers and researchers.
RLlib offers autonomous task-learning via many common RL algorithms and it scales from a laptop to a cluster with hundreds of machines. It is used by dozens of organizations, from startups to research labs to large organizations. You will see RLlib in action with a live demo.
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017 | MLconf
Large Scale Graph Processing & Machine Learning Algorithms for Payment Fraud Prevention:
PayPal is at the forefront of applying large scale graph processing and machine learning algorithms to keep fraudsters at bay. In this talk, I'll present how advanced graph processing and machine learning algorithms such as Deep Learning and Gradient Boosting are applied at PayPal for fraud prevention. I'll elaborate on specific challenges in applying large scale graph processing and machine learning techniques to payment fraud prevention. I'll explain how we employ sophisticated machine learning tools, both open source and developed in-house.
I will also present results from experiments conducted on a very large graph data set containing millions of edges and vertices.
In this talk by AWeber's Michael Becker, you will get a brief overview of Machine Learning and scikit-learn. This is a scaled down version of this talk from Pycon 2013: http://github.com/jakevdp/sklearn_pycon2013
Kaz Sato, Evangelist, Google at MLconf ATL 2016 | MLconf
Machine Intelligence at Google Scale: TensorFlow and Cloud Machine Learning: The biggest challenge of Deep Learning technology is scalability. As long as you use a single GPU server, you have to wait hours or days to get the results of your work. This doesn't scale for a production service, so you eventually need distributed training in the cloud. Google has been building infrastructure for training large-scale neural networks in the cloud for years, and has now started to share the technology with external developers. In this session, we will introduce new pre-trained ML services such as the Cloud Vision API and Speech API that work without any training. We will also look at how TensorFlow and Cloud Machine Learning accelerate custom model training by 10x to 40x with Google's distributed training infrastructure.
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A... | Databricks
Project Hydrogen is a major Apache Spark initiative to bring state-of-the-art AI and Big Data solutions together. It contains three major projects: 1) barrier execution mode, 2) optimized data exchange, and 3) accelerator-aware scheduling. A basic implementation of barrier execution mode was merged into Apache Spark 2.4.0, and the community is working on the latter two. In this talk, we will present progress updates to Project Hydrogen and discuss the next steps.
First, we will review the barrier execution mode implementation from Spark 2.4.0. It enables developers to embed distributed training jobs properly on a Spark cluster. We will demonstrate distributed AI integrations built on top of it, e.g., Horovod and Distributed TensorFlow. We will also discuss the technical challenges of implementing those integrations and future work. Second, we will outline ongoing work on optimized data exchange. Its target scenario is distributed model inference. We will present how we do performance testing and profiling, where the bottlenecks are, and how to improve the overall throughput on Spark. If time allows, we might also give updates on accelerator-aware scheduling.
Online Machine Learning: introduction and examplesFelipe
In this talk I introduce the topic of Online Machine Learning, which deals with techniques for doing machine learning in an online setting, i.e. where you train your model a few examples at a time, rather than using the full dataset (off-line learning).
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...MLconf
Why Machine Learning Algorithms Fall Short (And What You Can Do About It): Many think that machine learning is all about the algorithms. Want a self-learning system? Get your data, start coding or hire a PhD that will build you a model that will stand the test of time. Of course we know that this is not enough. Models degrade over time, algorithms that work great on yesterday’s data may not be the best option, new data sources and types are made available. In short, your self-learning system may not be learning anything at all. In this session, we will examine how to overcome challenges in creating self-learning systems that perform better and are built to stand the test of time. We will show how to apply mathematical optimization algorithms that often prove superior to local optimization methods favored by typical machine learning applications and discuss why these methods can crate better results. We will also examine the role of smart automation in the context of machine learning and how smart automation can create self-learning systems that are built to last.
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016MLconf
Applying Deep Learning at Facebook Scale: Facebook leverages Deep Learning for various applications including event prediction, machine translation, natural language understanding and computer vision at a very large scale. There are more than a billion users logging on to Facebook every daily generating thousands of posts per second and uploading more than a billion images and videos every day. This talk will explain how Facebook scaled Deep Learning inference for realtime applications with latency budgets in the milliseconds.
Anima Anandkumar at AI Frontiers : Modern ML : Deep, distributed, Multi-dimen...AI Frontiers
As the data and models scale, it becomes necessary to have multiple processing units for both training and inference. SignSGD is a gradient compression algorithm that only transmits the sign of the stochastic gradients during distributed training. This algorithm uses 32 times less communication per iteration than distributed SGD. We show that signSGD obtains free lunch both in theory and practice: no loss in accuracy while yielding speedups. Pushing the current boundaries of deep learning also requires using multiple dimensions and modalities. These can be encoded into tensors, which are natural extensions of matrices. These functionalities are available in the Tensorly package with multiple backend interfaces for large-scale deep learning.
Native ads (ads that match the look and feel of the embedding page) have become a multi-billion dollar business in recent years. Gemini native is Yahoo’s native advertisement platform and this talk will overview some of the science behind its ad ranking.
The accurate prediction of an ad’s click-through rate (CTR) for a given impression is a key component of any such ad ranking system as it allows one to rank the ads according to their expected revenue. I will give a short overview of different CTR prediction models and deep dive into the major components of large-scale logistic regression models; a special focus will be given to implementing such a logistic regression model in Apache Spark.
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
Apache PySpark MLlib integration with MLflow for automatically tracking tuning
Hyperopt integration with Apache Spark to distribute tuning and with MLflow for automatic tracking
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two Machine Learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. Yifan received his B.S in UC Berkeley and Master from MIT.
Machine Learning (also called statistical learning or predictive analytics) is moving out of research labs and specialist circles and is increasingly used within companies, and not only startups. Evidence of this includes the rise of the open source toolkit scikit-learn, which quickly spread internationally as one of the new standards for this new way of building software, as well as the availability since July 2014 of Azure ML, the Machine Learning service of Microsoft Azure. In this session we offer an overview of developing statistical learning software in Python with scikit-learn. We have invited one of the main contributors to this toolkit, Olivier Grisel, research engineer in the Inria PARIETAL team at Saclay, to present an overview in an interactive session based on numerous examples and demos. To learn more: http://scikit-learn.org https://team.inria.fr/parietal/ https://twitter.com/ogrisel
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016MLconf
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, TensorFlow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...Databricks
We all know what they say – the bigger the data, the better. But when the data gets really big, how do you mine it, and which deep learning framework should you use? This talk will survey, from a developer’s perspective, three of the most popular deep learning frameworks—TensorFlow, Keras, and PyTorch—as well as when to use their distributed implementations.
We’ll compare code samples from each framework and discuss their integration with distributed computing engines such as Apache Spark (which can handle massive amounts of data) as well as help you answer questions such as:
As a developer how do I pick the right deep learning framework?
Do I want to develop my own model or should I employ an existing one?
How do I strike a trade-off between productivity and control through low-level APIs?
What language should I choose?
In this session, we will explore how to build a deep learning application with TensorFlow, Keras, or PyTorch in under 30 minutes. After this session, you will walk away with the confidence to evaluate which framework is best for you.
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016MLconf
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as it requires training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
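The stacking idea described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the H2O Ensemble API: the two base learners (`fit_mean`, `fit_line`), the toy data, and the metalearner (a convex weight chosen by a one-dimensional search over cross-validated predictions) are all assumptions made for the example.

```python
def fit_mean(xs, ys):
    """Base learner 1: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Base learner 2: one-dimensional least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def out_of_fold(fit, xs, ys, k):
    """Cross-validated predictions: each point is predicted by a model
    that never saw it during training."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), k):
            preds[i] = model(xs[i])
    return preds


def stack(fit_a, fit_b, xs, ys, k=5, steps=100):
    """Metalearning step: pick the weight w in [0, 1] that minimises the
    squared error of w * a + (1 - w) * b on the out-of-fold predictions."""
    za = out_of_fold(fit_a, xs, ys, k)
    zb = out_of_fold(fit_b, xs, ys, k)

    def loss(w):
        return sum((w * a + (1 - w) * b - y) ** 2 for a, b, y in zip(za, zb, ys))

    w = min((s / steps for s in range(steps + 1)), key=loss)
    model_a, model_b = fit_a(xs, ys), fit_b(xs, ys)  # refit on all data
    return w, lambda x: w * model_a(x) + (1 - w) * model_b(x)


# Toy data: y = 2x + 1 plus a small deterministic perturbation.
xs = [i / 10 for i in range(40)]
ys = [2 * x + 1 + 0.05 * ((i * 3) % 7 - 3) for i, x in enumerate(xs)]

weight_on_mean, ensemble = stack(fit_mean, fit_line, xs, ys)
print("weight on mean learner:", weight_on_mean)
print("ensemble prediction at x=2.0:", ensemble(2.0))
```

The computational cost mentioned in the abstract is visible here: every base learner is trained k times for the cross-validated predictions and once more on the full data, which is the work that distributed implementations spread across a cluster.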
Highly-scalable Reinforcement Learning RLlib for Real-world ApplicationsBill Liu
website: https://learn.xnextcon.com/event/eventdetails/W20051110
video: https://www.youtube.com/watch?v=8tG8PJC6oaU
In reinforcement learning (RL), an agent learns how to optimize performance solely by collecting experience in the real world or via a simulator. RL is being applied to problems such as decision making, process optimization (e.g., manufacturing and supply chains), ad serving, recommendations, self-driving cars, and algorithmic trading.
In this talk, I will discuss RLlib, a reinforcement learning library built on Ray with a strong focus on large-scale execution and scalability, ease-of-use for general users, as well as customizability for developers and researchers.
RLlib offers autonomous task-learning via many common RL algorithms and it scales from a laptop to a cluster with hundreds of machines. It is used by dozens of organizations, from startups to research labs to large organizations. You will see RLlib in action with a live demo.
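The experience-driven learning loop described above can be illustrated with tabular Q-learning on a tiny simulator. This sketch does not use RLlib or Ray; the five-state corridor environment, the reward of 1.0 at the goal, and all hyperparameter values are hypothetical choices made for the example.

```python
import random

N_STATES, GOAL = 5, 4   # states 0..4 in a corridor; state 4 ends the episode
ACTIONS = (0, 1)        # 0 = move left, 1 = move right


def step(state, action):
    """Simulator: return (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values purely from simulated experience."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Epsilon-greedy: explore at random, and also break ties at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, action)
            # Q-learning update toward reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print("greedy action per non-terminal state:", policy)
```

A library such as RLlib addresses what this sketch leaves out: collecting experience from many simulator instances in parallel and training function approximators instead of a small table.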
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017MLconf
Large Scale Graph Processing & Machine Learning Algorithms for Payment Fraud Prevention:
PayPal is at the forefront of applying large scale graph processing and machine learning algorithms to keep fraudsters at bay. In this talk, I’ll present how advanced graph processing and machine learning algorithms such as Deep Learning and Gradient Boosting are applied at PayPal for fraud prevention. I’ll elaborate on specific challenges in applying large scale graph processing and machine learning techniques to payment fraud prevention. I’ll explain how we employ sophisticated machine learning tools, both open source and developed in-house.
I will also present results from experiments conducted on a very large graph data set containing millions of edges and vertices.
In this talk by AWeber's Michael Becker, you will get a brief overview of Machine Learning and scikit-learn. This is a scaled down version of this talk from Pycon 2013: http://github.com/jakevdp/sklearn_pycon2013
Kaz Sato, Evangelist, Google at MLconf ATL 2016MLconf
Machine Intelligence at Google Scale: TensorFlow and Cloud Machine Learning: The biggest challenge of Deep Learning technology is scalability. As long as you are using a single GPU server, you have to wait hours or days to get the results of your work. This doesn’t scale for a production service, so you eventually need distributed training in the cloud. Google has been building infrastructure for training large-scale neural networks in the cloud for years, and has now started to share the technology with external developers. In this session, we will introduce new pre-trained ML services such as Cloud Vision API and Speech API that work without any training. We will also look at how TensorFlow and Cloud Machine Learning accelerate custom model training by 10x to 40x with Google’s distributed training infrastructure.
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Databricks
"Project Hydrogen is a major Apache Spark initiative to bring state-of-the-art AI and Big Data solutions together. It contains three major projects: 1) barrier execution mode 2) optimized data exchange and 3) accelerator-aware scheduling. A basic implementation of barrier execution mode was merged into Apache Spark 2.4.0, and the community is working on the latter two. In this talk, we will present progress updates to Project Hydrogen and discuss the next steps.
First, we will review the barrier execution mode implementation from Spark 2.4.0. It enables developers to embed distributed training jobs properly on a Spark cluster. We will demonstrate distributed AI integrations built on top of it, e.g., Horovod and Distributed TensorFlow. We will also discuss the technical challenges of implementing those integrations and future work. Second, we will outline ongoing work on optimized data exchange. Its target scenario is distributed model inference. We will present how we do performance testing and profiling, where the bottlenecks are, and how to improve the overall throughput on Spark. If time allows, we might also give updates on accelerator-aware scheduling."
Auto-Pilot for Apache Spark Using Machine LearningDatabricks
At Qubole, users run Spark at scale in the cloud (900+ concurrent nodes). At such scale, tuning Spark configurations is essential for running SLA-critical jobs efficiently. But it continues to be a difficult undertaking, largely driven by trial and error. In this talk, we will address the problem of auto-tuning SQL workloads on Spark. The same technique can also be adapted for non-SQL Spark workloads. In our earlier work [1], we proposed a model based on simple rules and insights. It was simple yet effective at optimizing queries and finding the right instance types to run queries. However, with respect to auto-tuning Spark configurations, we saw scope for improvement. On exploration, we found previous work addressing auto-tuning using machine learning techniques. One major drawback of the simple model [1] is that it cannot use multiple runs of a query to improve its recommendations, whereas the major drawback of machine learning techniques is that they lack domain-specific knowledge. Hence, we decided to combine both techniques. Our auto-tuner interacts with both models to arrive at good configurations. Once the user selects a query to auto-tune, the next configuration is computed from the models and the query is run with it. Metrics from the event log of the run are fed back to the models to obtain the next configuration. The auto-tuner continues exploring good configurations until it exhausts the fixed budget specified by the user. We found that in practice, this method gives much better configurations than those chosen even by experts on real workloads, and converges quickly to the optimal configuration. In this talk, we will present a novel ML model technique and the way it was combined with our earlier approach. Results on real workloads will be presented along with limitations and challenges in productionizing them. [1] Margoor et al., 'Automatic Tuning of SQL-on-Hadoop Engines', 2018, IEEE CLOUD
Monitoring of GPU Usage with Tensorflow Models Using PrometheusDatabricks
Understanding the dynamics of GPU utilization and workloads in containerized systems is critical to creating efficient software systems. We create a set of dashboards to monitor and evaluate GPU performance in the context of TensorFlow. We monitor performance in real time to gain insight into GPU load, GPU memory and temperature metrics in a GPU-enabled Kubernetes system. Visualizing TensorFlow training job metrics in real time using Prometheus allows us to tune and optimize GPU usage. Also, because TensorFlow jobs can have both GPU and CPU implementations, it is useful to view detailed real-time performance data from each implementation and choose the best one. To illustrate our system, we will show a live demo gathering and visualizing GPU metrics on a GPU-enabled Kubernetes cluster with Prometheus and Grafana.
BlazingSQL gave a talk at the GPU Technology Conference in San Jose in 2019 about our GPU-accelerated SQL engine on the open source RAPIDS AI stack. Run SQL directly on any data source and raw files, loading them into GPU memory, with BlazingSQL and 5 lines of Python code!
In this talk I'll discuss how we can combine the power of PostgreSQL with TensorFlow to perform data analysis. By using the pl/python3 procedural language we can integrate machine learning libraries such as TensorFlow with PostgreSQL, opening the door for powerful data analytics combining SQL with AI. Typical use-cases might involve regression analysis to find relationships in an existing dataset and to predict results based on new inputs, or to analyse time series data and extrapolate future data taking into account general trends and seasonal variability whilst ignoring noise. Python is an ideal language for building custom systems to do this kind of work as it gives us access to a rich ecosystem of libraries such as Pandas and Numpy, in addition to TensorFlow itself.
Divya Jain at AI Frontiers : Video SummarizationAI Frontiers
As video content is becoming mainstream, video summarization is becoming a hot research topic in academia and industry. Video thumbnail generation and summarization have been worked on for years, but deep learning and reinforcement learning are changing the landscape and emerging as the winners for optimal frame selection. Recent advances in GANs are improving the quality, aesthetics and relevancy of the frames chosen to represent the original videos. Come join this session to get an understanding of the various challenges and emerging solutions around video summarization.
Training at AI Frontiers 2018 - Ni Lao: Weakly Supervised Natural Language Un...AI Frontiers
In this tutorial I will introduce recent work in applying weak supervision and reinforcement learning to Question Answering (QA) systems. Specifically, we discuss the semantic parsing task, in which natural language queries are converted to computation steps on knowledge graphs or data tables and produce the expected answers. State-of-the-art results can be achieved with a novel memory structure for sequence models and improvements in reinforcement learning algorithms. Related code and experiment setup can be found at https://github.com/crazydonkey200/neural-symbolic-machines. Related paper: https://openreview.net/pdf?id=SyK00v5xx.
Training at AI Frontiers 2018 - Udacity: Enhancing NLP with Deep Neural NetworksAI Frontiers
Instructor: Mat Leonard
Outline
1. Text Processing
Using Python + NLTK
Cleaning
Normalization
Tokenization
Part-of-speech Tagging
Stemming and Lemmatization
2. Feature Extraction
Bag of Words
TF-IDF
Word Embeddings
Word2Vec
GloVe
3. Topic Modeling
Latent Variables
Beta and Dirichlet Distributions
Latent Dirichlet Allocation
4. NLP with Deep Learning
Neural Networks
Recurrent Neural Networks (RNNs)
Word Embeddings
Sentiment Analysis with RNNs
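As an illustration of the tokenization and TF-IDF items in the outline above, here is a minimal sketch in plain Python. It is not taken from the course materials and does not use NLTK; the regular-expression tokenizer, the sample sentences, and the unsmoothed `log(N / df)` weighting are assumptions made for the example (libraries typically apply smoothing and normalization that differ from this).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cats and the dogs",
]
vectors = tf_idf(docs)
for doc, vec in zip(docs, vectors):
    top = sorted(vec, key=vec.get, reverse=True)[:3]
    print(doc, "->", top)
```

A term that occurs in every document, such as "the" here, receives a weight of zero, while terms unique to one document are weighted most heavily. Stemming or lemmatization, listed earlier in the outline, would additionally merge "cat" and "cats" before counting.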
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...AI Frontiers
Sequence to sequence learning is a powerful way to train deep networks for machine translation, various NLP tasks, but also image generation and recently video and music generation. We will give a hands-on tutorial showing how to use the open-source Tensor2Tensor library to train state-of-the-art models for translation, image generation, and a task of your choice!
Percy Liang at AI Frontiers : Pushing the Limits of Machine LearningAI Frontiers
In recent years, machine learning has undoubtedly been hugely successful in driving progress in AI applications. However, as we will explore in this talk, even state-of-the-art systems have "blind spots" which make them generalize poorly out of domain and render them vulnerable to adversarial examples. We then suggest that more unsupervised learning settings can encourage the development of more robust systems. We show positive results on two tasks: (i) text style and attribute transfer, the task of converting a sentence with one attribute (e.g., sentiment) to one with another; and (ii) solving SAT instances (classical problems requiring logical reasoning) using end-to-end neural networks.
Ilya Sutskever at AI Frontiers : Progress towards the OpenAI missionAI Frontiers
I will present several advances in deep learning from OpenAI. First, I will present OpenAI Five, a neural network that learned to play on par with some of the strongest professional Dota 2 teams in the world in an 18-hero version of the game. Next, I will present Dactyl, a human-like robot hand trained entirely in simulation with reinforcement learning that has achieved unprecedented dexterity on a physical robot. I will also present our results on unsupervised learning in language, that show that pre-training and finetuning can achieve a significant improvement over state of the art. Finally, I will present an overview of the historical progress in the field.
Mario Munich at AI Frontiers : Consumer robotics: embedding affordable AI in ...AI Frontiers
The availability of affordable electronics components, powerful embedded microprocessors, and ubiquitous internet access and WiFi in the household has enabled a new generation of connected consumer robots. In 2015, iRobot launched the Roomba 980, introducing intelligent visual navigation to its successful line of vacuum cleaning robots. In 2018, iRobot launched the Roomba i7, equipped with the latest mapping and navigation technology that provides spatial information to the broader ecosystem of connected devices in the home. In this talk, I will describe the challenges and the potential of introducing consumer robots capable of developing spatial context by exploring the physical space of the home, and I will elaborate on the impact of AI in the future of robotics applications. Moreover, I will describe our vision of the Smart Home, an AI-powered home that maintains itself and magically just does the right thing in anticipation of occupant needs. This home will be built on an ecosystem of connected and coordinated robots, sensors, and devices that provides the occupants with a high quality of life by seamlessly responding to the needs of daily living – from comfort to convenience to security to efficiency.
Sumit Gupta at AI Frontiers : AI for EnterpriseAI Frontiers
The use of AI for voice search and image recognition is talked about often. Enterprises, however, have different challenges and requirements. In this talk, we will focus on use cases in the enterprise and challenges in building out AI solutions. We will talk about how PowerAI Vision, automated machine learning software for videos and images, enables quick AI model training and deployment for various enterprise use cases.
Yuandong Tian at AI Frontiers : Planning in Reinforcement LearningAI Frontiers
Deep Reinforcement Learning (DRL) has made strong progress in many tasks, such as board games, robotics, navigation, and neural architecture search. I will present our recently open-sourced DRL frameworks to facilitate game research and development. Our framework is scalable, so we can reproduce AlphaGoZero and AlphaZero using 2000 GPUs, achieving super-human performance with a Go AI that beats 4 top-30 professional players. We also show the usability of our platform by training agents in real-time strategy games, and show interesting behaviors with a small amount of resources.
Alex Ermolaev at AI Frontiers : Major Applications of AI in HealthcareAI Frontiers
The latest AI advances have the potential to massively improve our health and well-being. However, most of the work is yet to be done. In this talk, we will explore the most important opportunities for AI in healthcare. For example, we will explore how AI can diagnose major life-threatening conditions even before those conditions emerge. We will talk about AI's ability to recommend dramatically more effective and less harmful treatment plans based on its understanding of a patient's medical history and current conditions. Finally, we will talk about AI's role in making our healthcare system effective and affordable for everyone.
Long Lin at AI Frontiers : AI in GamingAI Frontiers
Games have been leveraging AI since the 1950s, when people built a rules-based AI engine that played tic-tac-toe. With technological advances over the years, AI has become increasingly popular and widely used in the gaming industry. The typical characteristics of games and game development make them an ideal playground for practicing and implementing AI techniques, especially deep learning and reinforcement learning. Most games are well scoped; it is relatively easy to generate and use the data; and states/actions/rewards are relatively clear. In this talk, I will show a couple of use cases where ML/AI helps in game development and enhances the player experience. Examples include AI agents playing games and services that provide a personalized experience to players.
Melissa Goldman at AI Frontiers : AI & FinanceAI Frontiers
AI in finance is having a wide-ranging impact and solving some of the most critical societal problems. The talk gives an overview of the opportunities for applying AI in finance with specific examples, and highlights some of the unique challenges financial services firms face in deploying AI at scale.
Li Deng at AI Frontiers : From Modeling Speech/Language to Modeling Financial...AI Frontiers
I will first survey how deep learning has disrupted speech and language processing industries since 2009. Then I will draw connections between the techniques for modeling speech and language and those for financial markets. Finally, I will address three unique technical challenges to financial investment.
Ashok Srivastava at AI Frontiers : Using AI to Solve Complex Economic ProblemsAI Frontiers
Nearly half of all small businesses fail within their first 5 years. However, AI-driven solutions can help solve complex economic problems for consumers and small businesses like missed bill payments, insufficient capital, overinvestment in fixed assets, and more.
Ashok Srivastava discusses technology's role in solving economic problems and details how Intuit is using its unrivaled financial dataset to power prosperity around the world. Leveraging technology enablers like deep learning, natural language processing, and automated reasoning and combining with a delightful end-user experience and sophisticated UX, Intuit is using technology to help its users have more confidence in their financial management.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information Retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for technology and making things work, along with a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I wondered, as an “infrastructure container Kubernetes guy”, how this fancy AI technology can be managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
6. Learn more about CS job hunting
Scan the QR code to follow us on WeChat
www.laioffer.com laiofferhelper2
Application-driven big data
⬢ The Hadoop ecosystem
○ How does Facebook use Hadoop?
■ Hive for OLAP query processing
■ HBase for tracking the activities of billions of users
○ How does Twitter use Hadoop?
■ Storm: stream processing for Twitter's streaming data
○ How does LinkedIn use Hadoop?
■ Kafka to subscribe to users' streaming data
○ How does Hadoop come together?
■ Ambari: for node management and for deploying the different components
7. The leading data science platform for big data
[Diagram: Apache Spark and Hadoop; interactive, streaming and batch workloads; NoSQL and TensorFlow]
⬢ Apache Spark
○ Driven by machine learning applications
○ The leading computation engine for big data processing
○ A data pipeline across different data sources and other computation engines
○ Uniform data processing objects: RDD and DataFrame
○ Memory-based
8. Data pipeline for machine learning
[Diagram: a Resilient Distributed Dataset spread over four servers feeds three stages: ETL (structured data, raw data processing), Exploration (interactive, OLAP, Spark SQL) and Machine learning (feature engineering, model training), leading to data products and visualization]
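The stages of this data pipeline (ETL, exploration with Spark SQL, feature engineering) can be sketched in a few lines of Spark. This is a minimal illustration, not the speaker's code: the column names (`grade`, `amount`, `income`), the view name `loans` and the sample rows are all hypothetical, and it assumes Spark SQL and Spark ML are on the classpath.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataPipelineSketch {
  // ETL: drop rows that contain missing values.
  def etl(raw: DataFrame): DataFrame = raw.na.drop()

  // Exploration: an OLAP-style aggregate expressed in Spark SQL.
  def explore(spark: SparkSession, clean: DataFrame): DataFrame = {
    clean.createOrReplaceTempView("loans") // hypothetical view name
    spark.sql(
      "SELECT grade, AVG(amount) AS avg_amount FROM loans GROUP BY grade ORDER BY grade")
  }

  // Feature engineering: pack the numeric columns into one vector column.
  def features(clean: DataFrame): DataFrame =
    new VectorAssembler()
      .setInputCols(Array("amount", "income"))
      .setOutputCol("features")
      .transform(clean)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical raw data; None marks a missing value.
    val raw = Seq(
      ("A", Some(1000.0), Some(50000.0)),
      ("B", None, Some(42000.0)),
      ("A", Some(3000.0), Some(61000.0))
    ).toDF("grade", "amount", "income")

    val clean = etl(raw)
    explore(spark, clean).show()
    features(clean).select("grade", "features").show(false)
    spark.stop()
  }
}
```

The `features` column produced by the last step is the input that a Spark ML model training stage would expect.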
13. Motivation
⬢ XGBoost is the state-of-the-art approach on Kaggle for structured data
○ 80% of winning teams base their solution on XGBoost
○ A tree-based model
○ Excellent at classification and regression
○ Ref: http://xgboost.readthedocs.io/en/latest/model.html
18. Motivation
⬢ From a single machine to parallel computation
○ Distributed training
○ GPU support
○ Works with the big data ecosystem
⬢ How to provide an end-to-end solution for data science?
○ Front end
■ An easy and efficient way to run parallel XGBoost computation
■ Notebook front end for model visualization
○ Back end
■ YARN to allocate resources to applications (CPU, memory, GPU)
■ Docker support
20. How Spark enhances XGBoost
⬢ Each XGBoost node needs Rabit to communicate with the other nodes
○ Efficient, but Rabit is not easy to manage
[Diagram: training data partitions 1-4 and XGBoost workers 1-4; the workers sync statistics to agree on the optimal split value]
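The statistic sync can be illustrated without any cluster. In the sketch below, each worker builds a histogram of gradient sums over its own partition, the histograms are summed (a stand-in for the allreduce that Rabit performs), and the merged histogram is scanned for the split with the highest gain, using the regularized gain formula from the XGBoost documentation. All names here (`Hist`, `localHist`, `allReduce`, `bestSplit`) are hypothetical and this is not XGBoost's actual implementation.

```scala
object SplitSyncSketch {
  // Per-bin sums of first-order (g) and second-order (h) gradients.
  final case class Hist(g: Vector[Double], h: Vector[Double]) {
    def +(o: Hist): Hist =
      Hist(g.zip(o.g).map { case (a, b) => a + b },
           h.zip(o.h).map { case (a, b) => a + b })
  }

  // Local step: one worker bins its own partition.
  // Each sample is (binIndex, gradient, hessian).
  def localHist(samples: Seq[(Int, Double, Double)], bins: Int): Hist =
    samples.foldLeft(Hist(Vector.fill(bins)(0.0), Vector.fill(bins)(0.0))) {
      case (acc, (bin, g, h)) =>
        Hist(acc.g.updated(bin, acc.g(bin) + g), acc.h.updated(bin, acc.h(bin) + h))
    }

  // Sync step: stand-in for an allreduce, summing every worker's histogram.
  def allReduce(hists: Seq[Hist]): Hist = hists.reduce(_ + _)

  // Global step: scan the merged histogram for the highest-gain split.
  // Returns (index of the last bin that goes to the left child, gain).
  def bestSplit(hist: Hist, lambda: Double = 1.0): (Int, Double) = {
    def score(g: Double, h: Double): Double = g * g / (h + lambda)
    val gTotal = hist.g.sum
    val hTotal = hist.h.sum
    val gLeft = hist.g.scanLeft(0.0)(_ + _).tail // prefix sums per bin
    val hLeft = hist.h.scanLeft(0.0)(_ + _).tail
    (0 until hist.g.length - 1).map { i =>
      val gain = 0.5 * (score(gLeft(i), hLeft(i)) +
        score(gTotal - gLeft(i), hTotal - hLeft(i)) -
        score(gTotal, hTotal))
      (i, gain)
    }.maxBy(_._2)
  }

  def main(args: Array[String]): Unit = {
    // Two hypothetical partitions, one per worker.
    val partitions = Seq(
      Seq((0, -1.0, 1.0), (1, -1.0, 1.0)),
      Seq((2, 1.0, 1.0), (3, 1.0, 1.0))
    )
    val merged = allReduce(partitions.map(p => localHist(p, 4)))
    println(bestSplit(merged))
  }
}
```

Only the small per-bin histograms cross the network, never the training rows, which is why this style of communication is efficient.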
21. XGBoost on Spark ML pipeline
⬢ Distributed XGBoost inside a Spark ML pipeline
⬢ XGBoost estimator
○ Extends the Spark ML Estimator
⬢ XGBoost model
○ Extends the Spark ML Model
○ Works naturally inside a Spark ML Pipeline for model materialization
⬢ XGBoost parameters
○ Extend the Spark ML parameters
○ Enable automatic parameter tuning
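Because the parameters are ordinary Spark ML parameters, tuning can go through Spark's standard `ParamGridBuilder` and `CrossValidator`. The sketch below shows the pattern with Spark's built-in `GBTRegressor` as a stand-in, since the exact parameter names exposed by the XGBoost estimator depend on the XGBoost4J-Spark version. The column names `f1`, `f2` and `label` are hypothetical.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object TuningSketch {
  // Hypothetical input columns; replace with the real feature names.
  val assembler = new VectorAssembler()
    .setInputCols(Array("f1", "f2"))
    .setOutputCol("features")

  // Stand-in estimator: Spark's gradient-boosted trees, not XGBoost.
  val gbt = new GBTRegressor()
    .setFeaturesCol("features")
    .setLabelCol("label")

  val pipeline = new Pipeline().setStages(Array(assembler, gbt))

  // 2 x 2 = 4 candidate parameter settings.
  val paramGrid = new ParamGridBuilder()
    .addGrid(gbt.maxDepth, Array(2, 6))
    .addGrid(gbt.stepSize, Array(0.1, 0.3))
    .build()

  // Each candidate is scored by 3-fold cross-validation on RMSE.
  val crossValidator = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new RegressionEvaluator().setLabelCol("label").setMetricName("rmse"))
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  // With a DataFrame `training` holding f1, f2 and label columns:
  //   val bestModel = crossValidator.fit(training).bestModel
}
```

An estimator that extends the Spark ML Estimator should slot into the same pipeline in place of `gbt`, with its own parameters supplied to `addGrid`.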
22. XGBoost on Spark ML pipeline
⬢ Distributed XGBoost
○ Parameters
■ val paramMap = List("eta" -> 0.1f, "max_depth" -> 2, "objective" -> "binary:logistic").toMap
○ Training
■ val xgboostModelRDD = XGBoost.train(trainRDD, paramMap, 1, 4, useExternalMemory = true)
■ val xgboostModelDF = XGBoost.trainWithDataFrame(trainDF, paramMap, 1, 4, useExternalMemory = true)
○ Prediction
■ val xgboostPredictionRDD = xgboostModelRDD.predict(trainRDD.map { x => x.features })
○ XGBoost inside an ML pipeline
■ val xgboostEstimator = new XGBoostEstimator(Map[String, Any]("num_round" -> 30, "nworkers" -> 10, "objective" -> "reg:linear", "eta" -> 0.3, "max_depth" -> 6, "early_stopping_rounds" -> 10))
■ val pipeline = new Pipeline().setStages(Array(assembler, xgboostEstimator))
■ val pipelineData = dataset.withColumnRenamed("PE", "label")
■ val pipelineModel = pipeline.fit(pipelineData)
26. GPU speedup for XGBoost
⬢ GPUs are good, but managing a GPU cluster is not easy
○ Different GPUs need different driver versions
○ Users have to build XGBoost with GPU support themselves
○ GPU resources are hard to manage
○ GPU resources cannot be shared
⬢ An ideal environment has everything included
○ Spark is an efficient distributed engine for data processing
○ Spark ML pipeline for model tuning
○ GPUs speed up XGBoost training
○ YARN manages the cluster's resources
○ Notebooks are the interface for end users
27. What you can learn from this notebook
⬢ Combine Spark and XGBoost
○ Train and deploy XGBoost models in a unified data platform
○ Automatically tune the XGBoost model with the Spark ML pipeline
○ Speed up XGBoost training with distributed computation and GPUs
○ Multiple users can share the same cluster with GPUs and Spark
⬢ Benefits
○ End-to-end solution for an ML pipeline with XGBoost support
○ No need to care about GPU management
○ Train XGBoost with the Spark ML APIs
○ Visualize the prediction results in a notebook
28. Spark and XGBoost for Fintech
⬢ Lending Club data
⬢ Spark DataFrame for ETL
⬢ Spark SQL for OLAP
⬢ Spark ML for automatic model tuning and model serving
⬢ Notebook links (use Databricks Community Edition):
○ Part 1: (https://bit.ly/2QuLQ9b) https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/27242371102049/8135547933712821/latest.html
○ Part 2: (https://bit.ly/2AZJI3Z) https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/27242371102070/8135547933712821/latest.html
⬢ Acknowledgment: https://databricks.com/blog/2018/08/09/loan-risk-analysis-with-xgboost-and-databricks-runtime-for-machine-learning.html
32. What is deep learning
⬢ A set of machine learning techniques that can learn useful representations of features directly from images, text and sound
⬢ Achievements
○ ImageNet
○ Google Neural Machine Translation
○ AlphaGo/AlphaZero
⬢ Benefits from big data and GPUs
40. Deep Learning in Spark MLlib Pipeline
⬢ Spark MLlib pipeline
○ Sequence of Transformers and Estimators
○ Simple, concise API and ease of use
⬢ Integrates with Spark APIs
○ Spark is great at scaling out computations
○ Image representation and reader in Spark DataFrame/Dataset (new in Spark 2.3)
⬢ Spark Deep Learning Pipelines (github.com/databricks/spark-deep-learning)
○ Plug in your own TensorFlow Graph or Keras Model as Transformers
○ Open source under Apache 2.0 license
41. Auto ML in Spark ML pipeline
⬢ Spark to prepare the data
○ Spark Streaming
○ Spark SQL
⬢ Spark for model parameter tuning
○ Hyperparameters
○ Reduced memory usage
⬢ TensorFlow for automatic network structure tuning
○ Reinforcement learning
○ Transfer learning
⬢ Deploy the model as a service
43. What you can learn in this section
⬢ How to combine deep learning (DL) and Spark
⬢ Use DL as an operator in a Spark ML pipeline
⬢ Transfer learning with a DL model
⬢ DL model parameter tuning
⬢ Apply a DL model in Spark SQL
⬢ Notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/4324977500035919/8135547933712821/latest.html
⬢ Acknowledgment: https://docs.databricks.com/applications/deep-learning/deep-learning-pipelines.html