This document discusses scalable out-of-core data structures for data science. It introduces SFrame and SGraph, which allow machine learning on large datasets that exceed memory by using compressed columnar storage and lazy evaluation. SFrame provides a Python API for feature engineering and vectorized operations on tabular data. SGraph supports graph algorithms like PageRank on very large graphs with billions of nodes and edges. These tools are open source and support HDFS, S3, and other storage backends to enable scalable machine learning.
Scalable tabular (SFrame, SArray) and graph (SGraph) data-structures built for out-of-core data analysis.
The SFrame package provides the complete implementation of:
SFrame
SArray
SGraph
The C++ SDK surface area (gl_sframe, gl_sarray, gl_sgraph)
Scalable tabular (SFrame, SArray) and graph (SGraph) data-structures built for out-of-core data analysis.
The SFrame package provides the complete implementation of:
SFrame
SArray
SGraph
The C++ SDK surface area (gl_sframe, gl_sarray, gl_sgraph)
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16MLconf
Smarter Search With Spark-Solr: Search gets smarter when you know more about your documents and their relationship to each other (think: PageRank) and the users (i.e. popularity), in addition to what you already know about their content (text search). It also gets smarter when you know more about your users (personalization) and both their affinity for certain kinds of content and their similarities to each other (collaborative filtering recommenders).
Building all of these pieces typically requires a big mix of batch workloads to do log processing, as well as training machine-learned models to use during realtime querying, and are highly domain specific, but many techniques are fairly universal: we will discuss how Spark can interface with a Solr Cloud cluster to efficiently perform many of the pieces to this puzzle in one relatively self-contained package (no HDFS/S3, all data stored in Solr!), and introduce “spark-solr” – an open-source JVM library to facilitate this.
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Databricks
As machine learning matures, the standard supervised learning setup is no longer sufficient. Instead of making and serving a single prediction as a function of a data point, machine learning applications increasingly must operate in dynamic environments, react to changes in the environment, and take sequences of actions to accomplish a goal. These modern applications are better framed within the context of reinforcement learning (RL), which deals with learning to operate within an environment. RL-based applications have already led to remarkable results, such as Google’s AlphaGo beating the Go world champion, and are finding their way into self-driving cars, UAVs, and surgical robotics.
These applications have very demanding computational requirements–at the high end, they may need to execute millions of tasks per second with millisecond level latencies, and support heterogeneous and dynamic computation graphs. In this talk, we present Ray, a new cluster computing framework that meets these requirements, give some application examples, and discuss how it can be integrated with Apache Spark.
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn? At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting; which would you use in production?
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
This session explores building graph databases on AWS, examining common use cases, design patterns, and best practices. We then discuss the main options for running graph databases on AWS and go deeper into the Amazon DynamoDB storage backend plugin for Titan launched earlier this year. The Amazon Fulfillment team will share their story of running the Titan graph database on DynamoDB to track inventory going in and out of the company's fulfillment network. They provide best practices on running an efficient graph database at massive scale.
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
Build, Scale, and Deploy Deep Learning Pipelines with EaseDatabricks
Deep Learning has shown a tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner.
This webinar is the first of a series in which we survey the state of Deep Learning at scale, and where we introduce the Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways:
1. It has a simple API that integrates well with enterprise Machine Learning pipelines.
2. It automatically scales out common Deep Learning patterns, thanks to Spark.
3. It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL.
In this webinar, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show:
* how to build deep learning models in a few lines of code;
* how to scale common tasks like transfer learning and prediction; and
* how to publish models in Spark SQL.
Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...Databricks
Reinforcement learning (RL) algorithms involve the deep nesting of distinct components, where each component typically exhibits opportunities for distributed computation. Current RL libraries offer parallelism at the level of the entire program, coupling all the components together and making existing implementations difficult to extend, combine, and reuse.
We argue for building composable RL components by encapsulating parallelism and resource requirements within individual components, which can be achieved by building on top of a flexible task-based programming model. We demonstrate this principle by building Ray RLLib on top of the the Ray distributed execution engine and show that we can implement a wide range of state-of-the-art algorithms by composing and reusing a handful of standard components. This composability does not come at the cost of performance — in our experiments, RLLib matches or exceeds the performance of highly optimized reference implementations.
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16BigMine
Apache Spark has become the most active open source Big Data project, and its Machine Learning library MLlib has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to 100’s or 1000’s of machines. This talk will describe ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. This talk will discuss the goals, the challenges, and the benefits for MLlib users and developers. More generally, we will reflect on the importance of integrating ML with the many other aspects of big data analysis.
About MLlib: MLlib is a general Machine Learning library providing many ML algorithms, feature transformers, and tools for model tuning and building workflows. The library benefits from integration with the rest of Apache Spark (SQL, streaming, Graph, core), which facilitates ETL, streaming, and deployment. It is used in both ad hoc analysis and production deployments throughout academia and industry.
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyDatabricks
This talk discusses the trajectory of MLlib, the Machine Learning (ML) library for Apache Spark. We will review the history of the project, including major trends and efforts leading up to today. These discussions will provide perspective as we delve into ongoing and future efforts within the community. This talk is geared towards both practitioners and developers and will provide a deeper understanding of priorities, directions and plans for MLlib.
Since the original MLlib project was merged into Apache Spark, some of the most significant efforts have been in expanding algorithmic coverage, adding multiple language APIs, supporting ML Pipelines, improving DataFrame integration, and providing model persistence. At an even higher level, the project has evolved from building a standard ML library to supporting complex workflows and production requirements.
This momentum continues. We will discuss some of the major ongoing and future efforts in Apache Spark based on discussions, planning and development amongst the MLlib community. We (the community) aim to provide pluggable and extensible APIs usable by both practitioners and ML library developers. To take advantage of Projects Tungsten and Catalyst, we are exploring DataFrame-based implementations of ML algorithms for better scaling and performance. Finally, we are making continuous improvements to core algorithms in performance, functionality, and robustness. We will augment this discussion with statistics from project activity.
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
These slides support the GraphFrames: DataFrame-based graphs for Apache Spark webinar. In this webinar, the developers of the GraphFrames package will give an overview, a live demo, and a discussion of design decisions and future plans. This talk will be generally accessible, covering major improvements from GraphX and providing resources for getting started. A running example of analyzing flight delays will be used to explain the range of GraphFrame functionality: simple SQL and graph queries, motif finding, and powerful graph algorithms.
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Databricks
Deep Reinforcement Learning (DRL) is a thriving area in the current AI battlefield. AlphaGO by DeepMind is a very successful application of DRL which has drawn the attention of the entire world. Besides playing games, DRL also has many practical use in industry, e.g. autonomous driving, chatbots, financial investment, inventory management, and even recommendation systems. Although DRL applications has something in common with supervised Computer Vision or Natural Language Processing tasks, they are unique in many ways.
For example, they have to interact (explore) with the environment to obtain training samples along the optimization, and the method to improve the model is usually different from common supervised applications. In this talk we will share our experience of building Deep Reinforcement Learning applications on BigDL/Spark. BigDL is a well-developed deep learning library on Spark which is handy for Big Data users, but it has been mostly used for supervised and unsupervised machine learning. We have made extensions particularly for DRL algorithms (e.g. DQN, PG, TRPO and PPO, etc.), implemented classical DRL algorithms, built applications with them and did performance tuning. We are happy to share what we have learnt during this process.
We hope our experience will help our audience learn how to build a RL application on their own for in their production business.
Practical Machine Learning Pipelines with MLlibDatabricks
This talk from 2015 Spark Summit East discusses Pipelines and related concepts introduced in Spark 1.2 which provide a simple API for users to set up complex ML workflows.
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
Snorkel is open source on github and available from Snorkel.Stanford.edu.
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Spark Summit
This talk tells the story of implementation and optimization of a sparse logistic regression algorithm in spark. I would like to share the lessons I learned and the steps I had to take to improve the speed of execution and convergence of my initial naive implementation. The message isn’t to convince the audience that logistic regression is great and my implementation is awesome, rather it will give details about how it works under the hood, and general tips for implementing an iterative parallel machine learning algorithm in spark. The talk is structured as a sequence of “lessons learned” that are shown in form of code examples building on the initial naive implementation. The performance impact of each “lesson” on execution time and speed of convergence is measured on benchmark datasets. You will see how to formulate logistic regression in a parallel setting, how to avoid data shuffles, when to use a custom partitioner, how to use the ‘aggregate’ and ‘treeAggregate’ functions, how momentum can accelerate the convergence of gradient descent, and much more. I will assume basic understanding of machine learning and some prior knowledge of spark. The code examples are written in scala, and the code will be made available for each step in the walkthrough.
Machine learning techniques are powerful, but building and deploying such models for production use require a lot of care and expertise.
A lot of books, articles, and best practices have been written and discussed on machine learning techniques and feature engineering, but putting those techniques into use on a production environment is usually forgotten and under- estimated , the aim of this talk is to shed some lights on current machine learning deployment practices, and go into details on how to deploy sustainable machine learning pipelines.
Graph theory could potentially make a big impact on how we conduct businesses. Imagine the case where you wish to maximize the reach of your promotion via leveraging your customers' influence, to advocate your products and bring their friends on board. The same logic of harnessing one's networks can be applied to purchase recommendation, customer behavior, and fraud detection.
Running analyses on large graphs was not trivial for many companies - until recently. The field has made significant steps in the last five years and scalable graph computations are now the norm. You can now run graph computations out-of-core (no memory constraints) and in parallel (multiple machines), especially in Spark which is spreading like wildfire.
A lot of people are familiar with graphX, a pretty solid implementation of scalable graphs in Spark. GraphX is pretty interesting but the project seems to be orphaned. The good news is, there is now an alternative: Graphframes. They are a new data structure that takes the best parts of dataframes and graphs
In this talk, I will be explaining how to use Graphframes from Python, a new data structure in Spark 2.0 that takes the best parts of dataframes and graphs, with an example using personalized pagerank for recommendations.
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16MLconf
Smarter Search With Spark-Solr: Search gets smarter when you know more about your documents and their relationship to each other (think: PageRank) and the users (i.e. popularity), in addition to what you already know about their content (text search). It also gets smarter when you know more about your users (personalization) and both their affinity for certain kinds of content and their similarities to each other (collaborative filtering recommenders).
Building all of these pieces typically requires a big mix of batch workloads to do log processing, as well as training machine-learned models to use during realtime querying, and are highly domain specific, but many techniques are fairly universal: we will discuss how Spark can interface with a Solr Cloud cluster to efficiently perform many of the pieces to this puzzle in one relatively self-contained package (no HDFS/S3, all data stored in Solr!), and introduce “spark-solr” – an open-source JVM library to facilitate this.
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Databricks
As machine learning matures, the standard supervised learning setup is no longer sufficient. Instead of making and serving a single prediction as a function of a data point, machine learning applications increasingly must operate in dynamic environments, react to changes in the environment, and take sequences of actions to accomplish a goal. These modern applications are better framed within the context of reinforcement learning (RL), which deals with learning to operate within an environment. RL-based applications have already led to remarkable results, such as Google’s AlphaGo beating the Go world champion, and are finding their way into self-driving cars, UAVs, and surgical robotics.
These applications have very demanding computational requirements–at the high end, they may need to execute millions of tasks per second with millisecond level latencies, and support heterogeneous and dynamic computation graphs. In this talk, we present Ray, a new cluster computing framework that meets these requirements, give some application examples, and discuss how it can be integrated with Apache Spark.
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn? At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting; which would you use in production?
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
This session explores building graph databases on AWS, examining common use cases, design patterns, and best practices. We then discuss the main options for running graph databases on AWS and go deeper into the Amazon DynamoDB storage backend plugin for Titan launched earlier this year. The Amazon Fulfillment team will share their story of running the Titan graph database on DynamoDB to track inventory going in and out of the company's fulfillment network. They provide best practices on running an efficient graph database at massive scale.
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
Build, Scale, and Deploy Deep Learning Pipelines with EaseDatabricks
Deep Learning has shown a tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner.
This webinar is the first of a series in which we survey the state of Deep Learning at scale, and where we introduce the Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways:
1. It has a simple API that integrates well with enterprise Machine Learning pipelines.
2. It automatically scales out common Deep Learning patterns, thanks to Spark.
3. It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL.
In this webinar, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show:
* how to build deep learning models in a few lines of code;
* how to scale common tasks like transfer learning and prediction; and
* how to publish models in Spark SQL.
Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...Databricks
Reinforcement learning (RL) algorithms involve the deep nesting of distinct components, where each component typically exhibits opportunities for distributed computation. Current RL libraries offer parallelism at the level of the entire program, coupling all the components together and making existing implementations difficult to extend, combine, and reuse.
We argue for building composable RL components by encapsulating parallelism and resource requirements within individual components, which can be achieved by building on top of a flexible task-based programming model. We demonstrate this principle by building Ray RLLib on top of the the Ray distributed execution engine and show that we can implement a wide range of state-of-the-art algorithms by composing and reusing a handful of standard components. This composability does not come at the cost of performance — in our experiments, RLLib matches or exceeds the performance of highly optimized reference implementations.
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16BigMine
Apache Spark has become the most active open source Big Data project, and its Machine Learning library MLlib has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to 100’s or 1000’s of machines. This talk will describe ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. This talk will discuss the goals, the challenges, and the benefits for MLlib users and developers. More generally, we will reflect on the importance of integrating ML with the many other aspects of big data analysis.
About MLlib: MLlib is a general Machine Learning library providing many ML algorithms, feature transformers, and tools for model tuning and building workflows. The library benefits from integration with the rest of Apache Spark (SQL, streaming, Graph, core), which facilitates ETL, streaming, and deployment. It is used in both ad hoc analysis and production deployments throughout academia and industry.
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyDatabricks
This talk discusses the trajectory of MLlib, the Machine Learning (ML) library for Apache Spark. We will review the history of the project, including major trends and efforts leading up to today. These discussions will provide perspective as we delve into ongoing and future efforts within the community. This talk is geared towards both practitioners and developers and will provide a deeper understanding of priorities, directions and plans for MLlib.
Since the original MLlib project was merged into Apache Spark, some of the most significant efforts have been in expanding algorithmic coverage, adding multiple language APIs, supporting ML Pipelines, improving DataFrame integration, and providing model persistence. At an even higher level, the project has evolved from building a standard ML library to supporting complex workflows and production requirements.
This momentum continues. We will discuss some of the major ongoing and future efforts in Apache Spark based on discussions, planning and development amongst the MLlib community. We (the community) aim to provide pluggable and extensible APIs usable by both practitioners and ML library developers. To take advantage of Projects Tungsten and Catalyst, we are exploring DataFrame-based implementations of ML algorithms for better scaling and performance. Finally, we are making continuous improvements to core algorithms in performance, functionality, and robustness. We will augment this discussion with statistics from project activity.
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
These slides support the GraphFrames: DataFrame-based graphs for Apache Spark webinar. In this webinar, the developers of the GraphFrames package will give an overview, a live demo, and a discussion of design decisions and future plans. This talk will be generally accessible, covering major improvements from GraphX and providing resources for getting started. A running example of analyzing flight delays will be used to explain the range of GraphFrame functionality: simple SQL and graph queries, motif finding, and powerful graph algorithms.
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Databricks
Deep Reinforcement Learning (DRL) is a thriving area in the current AI battlefield. AlphaGO by DeepMind is a very successful application of DRL which has drawn the attention of the entire world. Besides playing games, DRL also has many practical use in industry, e.g. autonomous driving, chatbots, financial investment, inventory management, and even recommendation systems. Although DRL applications has something in common with supervised Computer Vision or Natural Language Processing tasks, they are unique in many ways.
For example, they have to interact (explore) with the environment to obtain training samples along the optimization, and the method to improve the model is usually different from common supervised applications. In this talk we will share our experience of building Deep Reinforcement Learning applications on BigDL/Spark. BigDL is a well-developed deep learning library on Spark which is handy for Big Data users, but it has been mostly used for supervised and unsupervised machine learning. We have made extensions particularly for DRL algorithms (e.g. DQN, PG, TRPO and PPO, etc.), implemented classical DRL algorithms, built applications with them and did performance tuning. We are happy to share what we have learnt during this process.
We hope our experience will help our audience learn how to build a RL application on their own for in their production business.
Practical Machine Learning Pipelines with MLlibDatabricks
This talk from 2015 Spark Summit East discusses Pipelines and related concepts introduced in Spark 1.2 which provide a simple API for users to set up complex ML workflows.
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
Snorkel is open source on github and available from Snorkel.Stanford.edu.
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Spark Summit
This talk tells the story of implementation and optimization of a sparse logistic regression algorithm in spark. I would like to share the lessons I learned and the steps I had to take to improve the speed of execution and convergence of my initial naive implementation. The message isn’t to convince the audience that logistic regression is great and my implementation is awesome, rather it will give details about how it works under the hood, and general tips for implementing an iterative parallel machine learning algorithm in spark. The talk is structured as a sequence of “lessons learned” that are shown in form of code examples building on the initial naive implementation. The performance impact of each “lesson” on execution time and speed of convergence is measured on benchmark datasets. You will see how to formulate logistic regression in a parallel setting, how to avoid data shuffles, when to use a custom partitioner, how to use the ‘aggregate’ and ‘treeAggregate’ functions, how momentum can accelerate the convergence of gradient descent, and much more. I will assume basic understanding of machine learning and some prior knowledge of spark. The code examples are written in scala, and the code will be made available for each step in the walkthrough.
Machine learning techniques are powerful, but building and deploying such models for production use require a lot of care and expertise.
A lot of books, articles, and best practices have been written and discussed on machine learning techniques and feature engineering, but putting those techniques into use on a production environment is usually forgotten and under- estimated , the aim of this talk is to shed some lights on current machine learning deployment practices, and go into details on how to deploy sustainable machine learning pipelines.
Graph theory could potentially make a big impact on how we conduct businesses. Imagine the case where you wish to maximize the reach of your promotion via leveraging your customers' influence, to advocate your products and bring their friends on board. The same logic of harnessing one's networks can be applied to purchase recommendation, customer behavior, and fraud detection.
Running analyses on large graphs was not trivial for many companies - until recently. The field has made significant steps in the last five years and scalable graph computations are now the norm. You can now run graph computations out-of-core (no memory constraints) and in parallel (multiple machines), especially in Spark which is spreading like wildfire.
A lot of people are familiar with graphX, a pretty solid implementation of scalable graphs in Spark. GraphX is pretty interesting but the project seems to be orphaned. The good news is, there is now an alternative: Graphframes. They are a new data structure that takes the best parts of dataframes and graphs
In this talk, I will be explaining how to use Graphframes from Python, a new data structure in Spark 2.0 that takes the best parts of dataframes and graphs, with an example using personalized pagerank for recommendations.
As someone who likes to travel, I know that good road games is a good way to entertain everyone in the vehicle, keep the driver focused on what really matters which is getting everyone from A to B safe and sound. These games are especially important when there may not be a DVD player in the back of a seat to keep everyone entertained.
One day a driver was on the freeway and noticed the flashing red light of a police car behind them. The driver promptly pulled over to the side of the freeway. While waiting for the officer, the driver sat and wondered why in the world the officer had just pulled them over.
Hello folks!
This is my first time uploading any presentation, its about how you can make an effective website by avoiding certain mistakes. I hope you like it :))
Included in this article remain a few car tips for a vehicle’s survival during cold weather. Most people would not desire a break-down in the middle of a snowy winter. These tips aim at helping foresee caution before driving extensively. If unsure of how to check several items, seek the help of an oil changing business. Such establishments often offer service in areas other than just oil, and they may even specialize in outfitting an automobile with the majority of proper safety necessities.
Nash Community College BDF Program Presentation - Local Economic Outlook Lunc...rmtjaycees
Presentation slides for Nash Community College Brewing, Distillation and Fermentation Program, presented by Dr. Trent Mohrbutter, during Rocky Mount Area Jaycees Local Economic Outlook Luncheon, March 10, 2015.
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...Databricks
Netflix is the world’s largest streaming service, with over 80 million members worldwide. Machine learning algorithms are used to recommend relevant titles to users based on their tastes.
At Netflix, we use Apache Spark to power our recommendation pipeline. Stages in the pipeline, such as label generation, data retrieval, feature generation, training, validation, are based on Spark ML PipleStage framework. While this provides developers the flexibility to develop individual components as encapsulated pipeline stages, we find that coordination across stages can potentially provide significant performance gains.
In this talk, we discuss how our machine learning pipeline based on Spark has been improved over the years. Techniques such as predicate pushdown, wide transformation minimization, have lead to significant run time improvement and resource savings.
Updated version of "Big Data Science in Scala" featuring Spark pipelines and hyper-parameter optimization techniques. This talk presents you how three scala libraries - Smile, Saddle and Spark ML - satisfy requirements of new Big Data Science projects. Let's see it on example of click-through rate prediction.
AI from your data lake: Using Solr for analyticsDataWorks Summit
Introductory technical session on Apache Solr's (HDP Search) artificial intelligence and machine learning features to discover relationships and insights across big data in the enterprise. Discussions will include how Solr performs graph traversal, anomaly detection, NLP and time-series analysis, and how you can display this data to users with easy-to-create dashboards.
This technical session will review Apache Solr’s streaming expressions, which were introduced in Solr 6.5. With over 100 expressions and evaluators, conditional logic, variables and data structures these functions form the basis of a new paradigm that brings many of the features from the relational world into search. These new capabilities form the basis of a powerful functional programming language that enables the implementation of many parallel computing use cases such as anomaly detection, streaming NLP, graph traversal and time-series analysis.
In order to discover and analyze big data, third party tools such as Jupyter, Tableau, and Lucidworks Insights will be reviewed.
Speaker
Cassandra Targett, Lucidworks, Director of Engineering
Marcelline Saunders, Lucidworks, Director, Global Partner Enablement
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs. Other solutions include running deep learning frameworks in an Apache Spark cluster, or use workflow orchestrators like Kubeflow to stitch distributed programs. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP which allows you to start an Apache Spark job on Ray in your python program and utilize Ray’s in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
Enrich Search User Experience Using Amazon CloudSearch (SVC302) | AWS re:Inve...Amazon Web Services
Today's applications work across many different data assets - documents stored in Amazon S3, metadata stored in NoSQL data stores, catalogs and orders stored in relational database systems, raw files in filesystems, etc. Building a great search experience across all these disparate datasets and contexts can be daunting. Amazon CloudSearch provides simple, low-cost search, enabling your users to find the information they are looking for. In this session, we will show you how to integrate search with your application, including key areas such as data preparation, domain creation and configuration, data upload, integration of search UI, search performance and relevance tuning. We will cover search applications that are deployed for both desktop and mobile devices.
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology and Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. The combination of the two can provide a solution to power advanced analytics for not only what has happened in the past, but make intelligent predictions about the future. Please join this webinar to learn how get the most value from your data for your data driven business.
Learning Objectives:
How to scale your Redshift queries with user-defined functions (UDFs)
How to apply Machine learning to historical data in Amazon Redshift
How to visualize your data with Amazon QuickSight
Present a reference architecture for advanced analytics
Who Should Attend:
Application developers looking to add UDFs, or predictive analytics to their applications, database administrators that need to meet the demand of data driven organizations, decision makers looking to derive more insight from their data
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendatio...Amazon Web Services
Customers are adopting Apache Spark ‒ an open-source distributed processing framework ‒ on Amazon EMR for large-scale machine learning workloads, especially for applications that power customer segmentation and content recommendation. By leveraging Spark ML, a set of machine learning algorithms included with Spark, customers can quickly build and execute massively parallel machine learning jobs. Additionally, Spark applications can train models in streaming or batch contexts, and can access data from Amazon S3, Amazon Kinesis, Amazon Redshift, and other services. This session explains how to quickly and easily create scalable Spark clusters with Amazon EMR, build and share models using Apache Zeppelin and Jupyter notebooks, and use the Spark ML pipelines API to manage your training workflow. In addition, Jasjeet Thind, Senior Director of Data Science and Engineering at Zillow Group, will discuss his organization's development of personalization algorithms and platforms at scale using Spark on Amazon EMR.
Machine Learning in 2016: Live Q&A with Carlos GuestrinTuri, Inc.
Live webinar session with Carlos Guestrin, Dato CEO and Amazon Professor of Machine Learning at University of Washington. Carlos reviewed 2015 highlights, previewed the Dato roadmap, and answered real-time questions from participants about use cases, algorithms, and resources.
Tutorial for Machine Learning 101 (an all-day tutorial at Strata + Hadoop World, New York City, 2015)
The course is designed to introduce machine learning via real applications like building a recommender image analysis using deep learning.
In this talk we cover deployment of machine learning models.
Overview of Machine Learning and Feature EngineeringTuri, Inc.
Machine Learning 101 Tutorial at Strata NYC, Sep 2015
Overview of machine learning models and features. Visualization of feature space and feature engineering methods.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
2. • Background
- Machine Learning (ML) Research.
- Ph.D Numerical Optimization @Wisconsin
• Now
- Build ML tools for data-scientists & developers @Dato.
- Help deploy ML algorithms.
@krishna_srd, @DatoInc
About Me
12. SFrame Python API
Make a little SFrame of 1 column and 5 values:
>> sf = gl.SFrame({‘x’:[1,2,3,4,5]})
Normalizes the column x:
>> sf[‘x’] = sf[‘x’] / sf[‘x’].sum()
Uses a python lambda to create a new column:
>> sf[‘x-squared’] = sf[‘x’].apply(lambda x: x*x if x > 0 else 0)
Create a new column using a vectorized operator:
>> sf[‘x-cubed’] = sf[‘x-squared’] * sf[‘x’]
Create a new SFrame taking only 2 of the columns:
>> sf2 = sf[[‘x’,’x-squared’]]
14. nrating
sf[‘nrating’]-=-sf2[‘rating’]
What is the SFrame?
sf#=#gl.SFrame(‘netflix_tr.frame’)
user movie rating
netflix_tr.frame
sf
user
item
rating
sf2$=$gl.SFrame(‘netflix_norm.frame’)
user movie rating
netflix_norm.frame
sf2
user
item
rating
15. nrating
sf[‘nrating’]-=-sf2[‘rating’]
What is the SFrame?
sf#=#gl.SFrame(‘netflix_tr.frame’)
user movie rating
netflix_tr.frame
sf
user
item
rating
sf2$=$gl.SFrame(‘netflix_norm.frame’)
user movie rating
netflix_norm.frame
sf2
user
item
rating
diff
anonymous
diff$=$sf[‘rating’]$0 sf2[‘rating’]
16. What is the SFrame?
Filtering
sf[sf[‘rating’]->=-3]
Joins
Sf.join(user_table,-on=‘user_id’)
Random/Array3indexing
row10-=-sf[10]
Table_with_every_other_row =-sf[::2]
Rather3Fast3Parallelized3UDFs3(Interproc SHM)
sf[‘rating’].apply(lambda-x:-x*x)
Not a SQL
Frontend
17. SArray Column Types
Boring Scalar Types
- int64, double, string
Interesting Scalar Types
- Datetime, image
Mathematician Type
- array(‘d’)
Industrial Data Scientist Type
- list, dict
21. Cross Platform?
Python Bindings
- Our oldest binding
- Via Cython + Interprocessing communication to a C++ binary
R Bindings
- Via RCpp
- In Beta. Soon to be released.
C++ Bindings
- Used for internal development of
Julia Bindings
- “Hackathon” mock project mature
22. SGraph: Common Crawl
1x r3.8xlarge ! using 1x SSD.
PageRank:)9 min%per%iteration.
Connected)Components:))~%1%hr.
There)isn’t)any)general)purpose)library)out)there)capable)of)this.
3.5 billion Nodes and 128 billion Edges