GraphLab Create uses a user-first architecture optimized for interactive work. It features SFrame for scalable tabular data manipulation and SGraph for scalable graph manipulation, built by data scientists for data scientists, with easy translation between tabular and graph representations. SFrame uses a columnar format for efficient feature engineering and vectorized operations on large datasets.
Scalable data structures for data science (Turi, Inc.)
This document discusses scalable out-of-core data structures for data science. It introduces SFrame and SGraph, which allow machine learning on large datasets that exceed memory by using compressed columnar storage and lazy evaluation. SFrame provides a Python API for feature engineering and vectorized operations on tabular data. SGraph supports graph algorithms like PageRank on very large graphs with billions of nodes and edges. These tools are open source and support HDFS, S3, and other storage backends to enable scalable machine learning.
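The lazy evaluation described above can be made concrete with a small sketch. This is an illustrative pure-Python toy, not the SFrame API: a column records its pending operations and only materializes results when explicitly asked, so a whole pipeline needs just one pass over storage.

```python
# Toy lazily evaluated column (hypothetical, not the SFrame API):
# operations are recorded, not executed, until materialize() is called.

class LazyColumn:
    def __init__(self, data, ops=None):
        self._data = data          # backing values (on disk in a real system)
        self._ops = ops or []      # deferred element-wise operations

    def apply(self, fn):
        # Record the operation instead of executing it immediately.
        return LazyColumn(self._data, self._ops + [fn])

    def materialize(self):
        out = []
        for value in self._data:
            for fn in self._ops:   # run the whole pipeline per element,
                value = fn(value)  # so storage is scanned only once
            out.append(value)
        return out

col = LazyColumn([1, 2, 3]).apply(lambda x: x * 10).apply(lambda x: x + 1)
print(col.materialize())  # [11, 21, 31]
```

In a real out-of-core system the backing values would live in compressed columnar segments on disk; the deferred-pipeline idea is the same.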
New Capabilities in the PyData Ecosystem (Turi, Inc.)
This document summarizes new capabilities in the PyData ecosystem of tools for scientific computing and data science in Python. It focuses on Bokeh and Dask, which enable interactive visualization and out-of-core parallel computing respectively. Bokeh allows creating interactive web-based visualizations without writing JavaScript, while Dask enables parallel computing on large datasets that exceed memory using task scheduling. The document also briefly mentions related tools like Blaze, NumPy, Pandas, Jupyter notebooks, and conda for package and environment management.
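Dask's task-scheduling approach can be sketched in a few lines. This is a hedged toy in the spirit of Dask's dict-of-tuples task graphs, not the Dask API itself: each key maps to either a literal value or a (function, *dependency_keys) tuple, and a recursive getter resolves dependencies with caching.

```python
# Toy task-graph evaluator in the spirit of Dask (illustrative only):
# tasks are (function, *dependency_keys) tuples keyed by name.

from operator import add, mul

def get(graph, key, cache=None):
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple):
        fn, *deps = task
        # Resolve each dependency first, then apply the function.
        result = fn(*(get(graph, d, cache) for d in deps))
    else:
        result = task  # literal value
    cache[key] = result
    return result

graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),
    "w": (mul, "z", "z"),
}
print(get(graph, "w"))  # 9
```

A real scheduler additionally chooses evaluation order to bound memory and runs independent tasks in parallel, which is what makes larger-than-memory computation practical.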
Scalable tabular (SFrame, SArray) and graph (SGraph) data-structures built for out-of-core data analysis.
The SFrame package provides the complete implementation of:
SFrame
SArray
SGraph
The C++ SDK surface area (gl_sframe, gl_sarray, gl_sgraph)
What’s New in the Berkeley Data Analytics Stack (Turi, Inc.)
The document discusses the Berkeley Data Analytics Stack (BDAS) developed by UC Berkeley's AMPLab. It summarizes the key components of the BDAS including Spark, Mesos, Tachyon, MLlib, and Velox. It describes how the BDAS provides a unified platform for batch, iterative, and streaming analytics using in-memory techniques. It also discusses recent developments like KeystoneML/ML Pipelines for scalable machine learning and SampleClean for human-in-the-loop analytics. The goal is to make it easier to build and deploy advanced analytics applications on large datasets.
DeepLearning4J: Open Source Neural Net Platform (Turi, Inc.)
The document discusses neural networks and deep learning. It describes how neural networks can be used for tasks like natural language processing, computer vision, and recommender systems. It also discusses how platforms like Skymind's DeepLearning4J provide tools for building and training neural networks using Java and Scala. DeepLearning4J is open source and can run neural networks across CPUs and GPUs for large scale deep learning.
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16 (MLconf)
Smarter Search With Spark-Solr: Search gets smarter when you know more about your documents and their relationship to each other (think: PageRank) and the users (i.e. popularity), in addition to what you already know about their content (text search). It also gets smarter when you know more about your users (personalization) and both their affinity for certain kinds of content and their similarities to each other (collaborative filtering recommenders).
Building all of these pieces typically requires a big mix of batch workloads for log processing, as well as training machine-learned models to use during real-time querying. These systems are highly domain specific, but many techniques are fairly universal: we will discuss how Spark can interface with a SolrCloud cluster to efficiently perform many of the pieces of this puzzle in one relatively self-contained package (no HDFS/S3; all data stored in Solr!), and introduce “spark-solr”, an open-source JVM library that facilitates this.
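The PageRank signal mentioned above is easy to state concretely. Below is a minimal power-iteration PageRank in pure Python; the link structure and parameter values are illustrative, not from the talk.

```python
# Minimal power-iteration PageRank over an adjacency dict (illustrative).

def pagerank(links, damping=0.85, iters=50):
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:  # dangling node: spread its rank uniformly
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]})
print(max(ranks, key=ranks.get))  # "b": it is linked from both "a" and "c"
```

In a search stack this score would be computed offline in a batch job and stored as a per-document boost field for query-time ranking.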
What's new in pandas and the SciPy stack for financial users (Wes McKinney)
Wes McKinney discusses updates and planned improvements to Python packages for financial analysis, including pandas, NumPy, IPython, Cython, matplotlib, and statsmodels. Major changes include a redesign of pandas' DataFrame internals, hierarchical indexing, time series functionality in statsmodels, and performance optimizations. McKinney aims to make pandas the foundation for rich statistical computing and leverage the best of other languages in Python.
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16 (BigMine)
Apache Spark has become the most active open source Big Data project, and its Machine Learning library MLlib has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to hundreds or thousands of machines. This talk will describe ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. This talk will discuss the goals, the challenges, and the benefits for MLlib users and developers. More generally, we will reflect on the importance of integrating ML with the many other aspects of big data analysis.
About MLlib: MLlib is a general Machine Learning library providing many ML algorithms, feature transformers, and tools for model tuning and building workflows. The library benefits from integration with the rest of Apache Spark (SQL, streaming, Graph, core), which facilitates ETL, streaming, and deployment. It is used in both ad hoc analysis and production deployments throughout academia and industry.
Better {ML} Together: GraphLab Create + Spark (Turi, Inc.)
This document discusses using GraphLab Create and Apache Spark together for machine learning applications. It provides an overview of Spark and how to create resilient distributed datasets (RDDs) and perform parallel operations on clusters. It then lists many machine learning algorithms available in GraphLab Create, including recommender systems, classification, regression, text analysis, image analysis, and graph analytics. The document proposes using notebooks to build data science products that help deliver personalized experiences through ML and intelligent automation. It demonstrates clustering customer transactions from an expense reporting dataset to identify customer behavior patterns.
What's New in Apache Spark 2.3 & Why Should You Care (Databricks)
The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
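The vectorization point above comes down to call overhead: instead of invoking a Python function once per row, the engine hands the function whole batches. The sketch below illustrates that idea without PySpark; all names are hypothetical stand-ins, and a real vectorized UDF would also operate on Arrow/pandas batches rather than plain lists.

```python
# Illustrative sketch of row-at-a-time vs batch-at-a-time UDF invocation
# (hypothetical names; not the PySpark API).

calls = {"scalar": 0, "vectorized": 0}

def scalar_udf(x):
    calls["scalar"] += 1        # invoked once per row
    return x * 2 + 1

def vectorized_udf(batch):
    calls["vectorized"] += 1    # invoked once per batch
    return [x * 2 + 1 for x in batch]

rows = list(range(1000))
per_row = [scalar_udf(x) for x in rows]
per_batch = vectorized_udf(rows)

assert per_row == per_batch
print(calls)  # {'scalar': 1000, 'vectorized': 1}
```

The results are identical, but the batch form crosses the engine/Python boundary once per batch instead of once per row, which is where the speedup comes from.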
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr... (Turi, Inc.)
The document discusses declarative machine learning and the SystemML project. SystemML allows users to write machine learning algorithms in a declarative syntax and handles compiling the code and optimizing execution across single-node, Hadoop, and Spark backends. It provides speedups of 2-10x over traditional frameworks by leveraging optimizations across the entire compilation and execution chain.
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5: Real-time, Advanc... (Chris Fregly)
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
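Point 4's probabilistic data structures can be illustrated with a tiny Count-Min Sketch, one of the structures Algebird provides. This is a hedged pure-Python toy with toy-sized dimensions, not the Algebird implementation.

```python
# Toy Count-Min Sketch (illustrative): approximate frequency counts in
# bounded memory; estimates never under-count but may over-count.

import hashlib

class CountMinSketch:
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        # One independent-ish hash per row, derived by salting with the row index.
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, item, count=1):
        for row, idx in enumerate(self._hashes(item)):
            self.table[row][idx] += count

    def estimate(self, item):
        # Minimum across rows: collisions only inflate counts, never deflate.
        return min(self.table[row][idx]
                   for row, idx in enumerate(self._hashes(item)))

cms = CountMinSketch()
for word in ["spark"] * 5 + ["kafka"] * 2:
    cms.add(word)
print(cms.estimate("spark"))  # >= 5 (exactly 5 unless a collision occurs)
```

Structures like this keep memory bounded during high-scale streaming ingestion at the cost of a small, one-sided error.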
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Presentation I gave at the Bay Area Clojure Meetup Group on May 6th, 2010. Also demoed examples from the introductory tutorial:
http://nathanmarz.com/blog/introducing-cascalog/
Practical Machine Learning Pipelines with MLlib (Databricks)
This talk from 2015 Spark Summit East discusses Pipelines and related concepts introduced in Spark 1.2 which provide a simple API for users to set up complex ML workflows.
As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. While recomputing the entire RDD makes sense for dealing with machine failure, if your nodes are being removed more frequently, you can end up in a loop-like scenario: you scale down and need to recompute the expensive part of your computation, scale back up, and then need to scale back down again.
Even if you aren’t in a serverless-like environment, preemptible or spot instances can encounter similar issues with large decreases in workers, potentially triggering large recomputes. In this talk, we explore approaches for improving the scale-down experience on open source cluster managers, such as YARN and Kubernetes: everything from how to schedule jobs to the location of blocks and their impact (shuffle and otherwise).
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R (Databricks)
This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark’s distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib’s integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.
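The first bottleneck named above, hyperparameter tuning, is the easiest to parallelize because each candidate setting is scored independently. A minimal stdlib sketch of that step follows; the grid, model, and scoring function are stand-ins, not from the talk, and Spark would distribute the map across a cluster rather than a local pool.

```python
# Sketch of parallel hyperparameter search (illustrative stand-ins):
# independent candidate settings scored concurrently, best one kept.

from concurrent.futures import ThreadPoolExecutor

def score(params):
    # Stand-in for "train a model with these params, return validation score".
    alpha, depth = params
    return -((alpha - 0.1) ** 2) - ((depth - 5) ** 2)

grid = [(alpha, depth) for alpha in (0.01, 0.1, 1.0) for depth in (3, 5, 7)]

with ThreadPoolExecutor() as pool:
    scores = list(pool.map(score, grid))

best = grid[scores.index(max(scores))]
print(best)  # (0.1, 5)
```

Swapping the local pool for a distributed map is exactly the kind of lazy migration the talk describes: the workflow's shape stays the same while one stage moves to the cluster.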
Resource-Efficient Deep Learning Model Selection on Apache Spark (Databricks)
Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection.
Spark Summit EU 2015: Lessons from 300+ production users (Databricks)
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
A full Machine Learning pipeline in Scikit-learn vs in scala-Spark: pros and ... (Jose Quesada)
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model (exploration, data cleaning, feature engineering, and model fitting) in several different frameworks. We'll show what it's like to work with native Spark.ml and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance. Which would you use in production?
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter (Databricks)
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into:
* the various use cases of Deep Learning Pipelines, such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code;
* how to work with complex data such as images in Spark and Deep Learning Pipelines;
* how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts.
Finally, we discuss integration with popular deep learning frameworks.
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust (Spark Summit)
This document summarizes Spark's structured APIs including SQL, DataFrames, and Datasets. It discusses how structuring computation in Spark enables optimizations by limiting what can be expressed. The structured APIs provide type safety, avoid errors, and share an optimization and execution pipeline. Functions allow expressing complex logic on columns. Encoders map between objects and Spark's internal data format. Structured streaming provides a high-level API to continuously query streaming data similar to batch queries.
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ... (Databricks)
As machine learning matures, the standard supervised learning setup is no longer sufficient. Instead of making and serving a single prediction as a function of a data point, machine learning applications increasingly must operate in dynamic environments, react to changes in the environment, and take sequences of actions to accomplish a goal. These modern applications are better framed within the context of reinforcement learning (RL), which deals with learning to operate within an environment. RL-based applications have already led to remarkable results, such as Google’s AlphaGo beating the Go world champion, and are finding their way into self-driving cars, UAVs, and surgical robotics.
These applications have very demanding computational requirements: at the high end, they may need to execute millions of tasks per second with millisecond-level latencies, and support heterogeneous and dynamic computation graphs. In this talk, we present Ray, a new cluster computing framework that meets these requirements, give some application examples, and discuss how it can be integrated with Apache Spark.
This session explores building graph databases on AWS, examining common use cases, design patterns, and best practices. We then discuss the main options for running graph databases on AWS and go deeper into the Amazon DynamoDB storage backend plugin for Titan launched earlier this year. The Amazon Fulfillment team will share their story of running the Titan graph database on DynamoDB to track inventory going in and out of the company's fulfillment network. They provide best practices on running an efficient graph database at massive scale.
Spark's Role in the Big Data Ecosystem (Spark Summit 2014) (Databricks)
This document summarizes the growth and development of the Spark project. It notes that Spark has grown significantly over the past year in terms of contributors, companies involved, and lines of code. Spark is now one of the most active projects within the Apache Hadoop ecosystem. The document outlines major new additions to Spark including Spark SQL for structured data, MLlib for machine learning algorithms, and Java 8 APIs. It discusses the vision for Spark as a unified platform and standard library for big data applications.
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley (Databricks)
- MLlib has rapidly developed over the past 5 years, growing from a few algorithms to over 50 algorithms and featurizers for classification, regression, clustering, recommendation, and more.
- This growth has shifted from just adding algorithms to improving algorithms, infrastructure, and integrating ML workflows with Spark's broader capabilities like SQL, DataFrames, and streaming.
- Going forward, areas of focus include continued scalability improvements, enhancing core algorithms, extensible APIs, and making MLlib a more comprehensive standard library.
Continuous Evaluation of Deployed Models in Production (Databricks)
Many high-tech industries rely on machine-learning systems in production environments to automatically classify and respond to vast amounts of incoming data. Despite their critical roles, these systems are often not actively monitored. When a problem first arises, it may go unnoticed for some time. Once it is noticed, investigating its underlying cause is a time-consuming, manual process. Wouldn’t it be great if the model’s output were automatically monitored? If it could be visualized, sliced by different dimensions? If the system could automatically detect performance degradation and trigger alerts? In this presentation, we describe our experience from building one such core machine-learning service: Model Evaluation.
Our service provides automated, continuous evaluation of the performance of a deployed model over commonly used metrics like the area under the curve (AUC) and root mean square error (RMSE). In addition, summary statistics about the model’s output and their distributions are computed. The service also provides a dashboard to visualize the performance metrics, summary statistics, and distributions of a model over time, along with REST APIs to retrieve these metrics programmatically.
These metrics can be sliced by input features (e.g. Geography, Product type) to provide insights into model performance over different segments. The talk will describe various components that are required in building such a service and metrics of interest. Our system has a backend component built with spark on Azure Databricks. The backend can scale to analyze TBs of data to generate model evaluation metrics.
We will talk about how we modified Spark MLlib to compute AUC sliced by different dimensions, and about other optimizations in Spark to improve compute and performance. Our front end and middle tier, built with Docker and Azure Web App, provide visuals and REST APIs to retrieve the above metrics. This talk will cover various aspects of building, deploying, and using the above system.
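The two metrics the service centers on, RMSE and AUC, have compact definitions. The sketch below states them in pure Python; a service like the one described would compute the same quantities at scale in Spark.

```python
# Reference definitions of RMSE and ROC AUC (illustrative, single-machine).

import math

def rmse(y_true, y_pred):
    # Root of the mean squared prediction error.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

def auc(labels, scores):
    # Rank-statistic form of ROC AUC: the probability that a random
    # positive is scored above a random negative (ties count half).
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))   # ~1.155
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```

The pairwise AUC formulation is quadratic in the number of examples; at scale one sorts by score instead, which is also what makes slicing by dimensions a natural Spark aggregation.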
This document summarizes new and improved features in the pandas library. It describes how pandas provides labeled, structured arrays for heterogeneous data and time series functionality. Major updates include merging DataFrame and DataMatrix into a single optimized DataFrame, better handling of missing data, more robust IO capabilities, and improved pivoting and reshaping of data. Future plans include more powerful group by operations and integration with statsmodels.
TensorFlow & TensorFrames w/ Apache Spark, presented by Marco Saviano, discusses numerical computing with Apache Spark and Google TensorFlow. TensorFrames allows manipulating Spark DataFrames with TensorFlow programs. It provides most operations in row-based and block-based versions. Row-based versions process rows individually, while block-based versions process blocks of rows together for better efficiency. Reduction operations coalesce rows until one row remains. Future work may improve communication between Spark and TensorFlow through direct memory copying and using columnar storage formats.
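The reduction behavior described above, coalescing rows until one remains, can be sketched generically. This is an illustrative pure-Python toy, not the TensorFrames API:

```python
# Toy block-based reduction (illustrative): repeatedly combine blocks of
# rows until a single row remains.

def block_reduce(rows, combine, block_size=2):
    while len(rows) > 1:
        rows = [combine(rows[i:i + block_size])
                for i in range(0, len(rows), block_size)]
    return rows[0]

total = block_reduce([1, 2, 3, 4, 5], sum)
print(total)  # 15
```

Processing blocks rather than single rows amortizes per-call overhead, which is the same efficiency argument made for the block-based operation variants.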
Machine Learning in 2016: Live Q&A with Carlos Guestrin (Turi, Inc.)
Live webinar session with Carlos Guestrin, Dato CEO and Amazon Professor of Machine Learning at University of Washington. Carlos reviewed 2015 highlights, previewed the Dato roadmap, and answered real-time questions from participants about use cases, algorithms, and resources.
Making Machine Learning Scale: Single Machine and Distributed (Turi, Inc.)
This document summarizes machine learning scalability from single machine to distributed systems. It discusses how true scalability is about how long it takes to reach a target accuracy level using any available hardware resources. It introduces GraphLab Create and SFrame/SGraph for scalable machine learning and graph processing. Key points include distributed optimization techniques, graph partitioning strategies, and benchmarks showing GraphLab Create can solve problems faster than other systems by using fewer machines.
Better {ML} Together: GraphLab Create + Spark Turi, Inc.
This document discusses using GraphLab Create and Apache Spark together for machine learning applications. It provides an overview of Spark and how to create resilient distributed datasets (RDDs) and perform parallel operations on clusters. It then lists many machine learning algorithms available in GraphLab Create, including recommender systems, classification, regression, text analysis, image analysis, and graph analytics. The document proposes using notebooks to build data science products that help deliver personalized experiences through ML and intelligent automation. It demonstrates clustering customer transactions from an expense reporting dataset to identify customer behavior patterns.
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...Turi, Inc.
The document discusses declarative machine learning and the SystemML project. SystemML allows users to write machine learning algorithms in a declarative syntax and handles compiling the code and optimizing execution across single-node, Hadoop, and Spark backends. It provides speedups of 2-10x over traditional frameworks by leveraging optimizations across the entire compilation and execution chain.
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5: Real-time, Advanc...Chris Fregly
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Presentation I gave at Bay Area Clojure Meetup Group on May 6th, 2010. Also demoed examples from introductory tutorial:
http://nathanmarz.com/blog/introducing-cascalog/
Practical Machine Learning Pipelines with MLlibDatabricks
This talk from 2015 Spark Summit East discusses Pipelines and related concepts introduced in Spark 1.2 which provide a simple API for users to set up complex ML workflows.
As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. While recomputing the entire RDD makes sense for dealing with machine failure, if your nodes are being removed more frequently, you can end up in a loop-like scenario where you scale down, need to recompute the expensive part of your computation, scale back up, and then need to scale back down again.
Even if you aren’t in a serverless-like environment, preemptible or spot instances can encounter similar issues with large decreases in workers, potentially triggering large recomputes. In this talk, we explore approaches for improving the scale-down experience on open source cluster managers such as Yarn and Kubernetes: everything from how to schedule jobs to the location of blocks and their impact (shuffle and otherwise).
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark’s distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib’s integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.
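The first bottleneck the talk describes, hyperparameter tuning, is embarrassingly parallel: each candidate setting can be evaluated independently, so it is the natural first step to distribute. A toy sketch of the pattern (the linear "model" and grid values are illustrative; in Spark the `pool.map` fan-out would instead run across the cluster):

```python
# Toy sketch of parallelizing hyperparameter tuning, the first
# bottleneck hit in the workflow above. Each grid point is scored
# independently; in Spark the same fan-out would run cluster-wide.
from concurrent.futures import ThreadPoolExecutor

DATA = [(x, 3.0 * x + 1.0) for x in range(50)]   # y = 3x + 1

def mse_for_slope(slope):
    """Evaluate one candidate hyperparameter on the full dataset."""
    return slope, sum((slope * x + 1.0 - y) ** 2 for x, y in DATA) / len(DATA)

grid = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0]
with ThreadPoolExecutor() as pool:               # independent evaluations
    results = list(pool.map(mse_for_slope, grid))
best_slope, best_err = min(results, key=lambda r: r[1])
print(best_slope, best_err)   # 3.0 0.0 -- the true slope wins
```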
Resource-Efficient Deep Learning Model Selection on Apache SparkDatabricks
Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection.
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance. Which would you use in production?
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into:
- the various use cases of Deep Learning Pipelines, such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code
- how to work with complex data such as images in Spark and Deep Learning Pipelines
- how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts
Finally, we discuss integration with popular deep learning frameworks.
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
This document summarizes Spark's structured APIs including SQL, DataFrames, and Datasets. It discusses how structuring computation in Spark enables optimizations by limiting what can be expressed. The structured APIs provide type safety, avoid errors, and share an optimization and execution pipeline. Functions allow expressing complex logic on columns. Encoders map between objects and Spark's internal data format. Structured streaming provides a high-level API to continuously query streaming data similar to batch queries.
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Databricks
As machine learning matures, the standard supervised learning setup is no longer sufficient. Instead of making and serving a single prediction as a function of a data point, machine learning applications increasingly must operate in dynamic environments, react to changes in the environment, and take sequences of actions to accomplish a goal. These modern applications are better framed within the context of reinforcement learning (RL), which deals with learning to operate within an environment. RL-based applications have already led to remarkable results, such as Google’s AlphaGo beating the Go world champion, and are finding their way into self-driving cars, UAVs, and surgical robotics.
These applications have very demanding computational requirements–at the high end, they may need to execute millions of tasks per second with millisecond level latencies, and support heterogeneous and dynamic computation graphs. In this talk, we present Ray, a new cluster computing framework that meets these requirements, give some application examples, and discuss how it can be integrated with Apache Spark.
This session explores building graph databases on AWS, examining common use cases, design patterns, and best practices. We then discuss the main options for running graph databases on AWS and go deeper into the Amazon DynamoDB storage backend plugin for Titan launched earlier this year. The Amazon Fulfillment team will share their story of running the Titan graph database on DynamoDB to track inventory going in and out of the company's fulfillment network. They provide best practices on running an efficient graph database at massive scale.
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
This document summarizes the growth and development of the Spark project. It notes that Spark has grown significantly over the past year in terms of contributors, companies involved, and lines of code. Spark is now one of the most active projects within the Apache Hadoop ecosystem. The document outlines major new additions to Spark including Spark SQL for structured data, MLlib for machine learning algorithms, and Java 8 APIs. It discusses the vision for Spark as a unified platform and standard library for big data applications.
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyDatabricks
- MLlib has rapidly developed over the past 5 years, growing from a few algorithms to over 50 algorithms and featurizers for classification, regression, clustering, recommendation, and more.
- This growth has shifted from just adding algorithms to improving algorithms, infrastructure, and integrating ML workflows with Spark's broader capabilities like SQL, DataFrames, and streaming.
- Going forward, areas of focus include continued scalability improvements, enhancing core algorithms, extensible APIs, and making MLlib a more comprehensive standard library.
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Databricks
Many high-tech industries rely on machine-learning systems in production environments to automatically classify and respond to vast amounts of incoming data. Despite their critical roles, these systems are often not actively monitored. When a problem first arises, it may go unnoticed for some time. Once it is noticed, investigating its underlying cause is a time-consuming, manual process. Wouldn’t it be great if the model’s output were automatically monitored? If it could be visualized, sliced by different dimensions? If the system could automatically detect performance degradation and trigger alerts? In this presentation, we describe our experience from building one such core machine-learning service: Model Evaluation.
Our service provides automated, continuous evaluation of the performance of a deployed model over commonly-used metrics such as area-under-the-curve (AUC) and root-mean-square-error (RMSE). In addition, summary statistics about the model’s output and its distributions are also computed. The service also provides a dashboard to visualize the performance metrics, summary statistics and distributions of a model over time, along with REST APIs to retrieve these metrics programmatically.
These metrics can be sliced by input features (e.g. Geography, Product type) to provide insights into model performance over different segments. The talk will describe various components that are required in building such a service and metrics of interest. Our system has a backend component built with spark on Azure Databricks. The backend can scale to analyze TBs of data to generate model evaluation metrics.
We will talk about how we modified Spark MLlib to compute AUC sliced by different dimensions, and about other optimizations in Spark to improve compute and performance. Our front-end and middle-tier, built with Docker and Azure Webapp, provide visuals and REST APIs to retrieve the above metrics. This talk will cover various aspects of building, deploying and using the above system.
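Slicing a metric like AUC by an input feature amounts to grouping scored records by a segment key and computing the metric per group. A small illustrative sketch using the rank-sum (Mann-Whitney) formulation of AUC (not the production Spark code described in the talk):

```python
# Sketch of "AUC sliced by a dimension": group scored records by a
# segment key (e.g. geography) and compute AUC per group with the
# rank-sum (Mann-Whitney) formula. Illustrative only; tied scores
# are not handled.
from collections import defaultdict

def auc(scores, labels):
    """Rank-based AUC for 0/1 labels (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels))                 # ascending score
    pos_rank_sum = sum(i + 1 for i, (_, y) in enumerate(ranked) if y == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def auc_by_segment(records):
    """records: iterable of (segment, score, label) triples."""
    groups = defaultdict(lambda: ([], []))
    for seg, score, label in records:
        groups[seg][0].append(score)
        groups[seg][1].append(label)
    return {seg: auc(s, y) for seg, (s, y) in groups.items()}

records = [
    ("US", 0.9, 1), ("US", 0.8, 1), ("US", 0.3, 0), ("US", 0.2, 0),
    ("EU", 0.6, 0), ("EU", 0.4, 1), ("EU", 0.7, 1), ("EU", 0.5, 0),
]
print(auc_by_segment(records))   # {'US': 1.0, 'EU': 0.5}
```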
This document summarizes new and improved features in the pandas library. It describes how pandas provides labeled, structured arrays for heterogeneous data and time series functionality. Major updates include merging DataFrame and DataMatrix into a single optimized DataFrame, better handling of missing data, more robust IO capabilities, and improved pivoting and reshaping of data. Future plans include more powerful group by operations and integration with statsmodels.
TensorFlow & TensorFrames w/ Apache Spark, presented by Marco Saviano, discusses numerical computing with Apache Spark and Google TensorFlow. TensorFrames allows manipulating Spark DataFrames with TensorFlow programs. It provides most operations in row-based and block-based versions. The row-based version processes rows individually, while the block-based version processes blocks of rows together for better efficiency. Reduction operations coalesce rows until one row remains. Future work may improve communication between Spark and TensorFlow through direct memory copying and using columnar storage formats.
Machine Learning in 2016: Live Q&A with Carlos GuestrinTuri, Inc.
Live webinar session with Carlos Guestrin, Dato CEO and Amazon Professor of Machine Learning at University of Washington. Carlos reviewed 2015 highlights, previewed the Dato roadmap, and answered real-time questions from participants about use cases, algorithms, and resources.
Making Machine Learning Scale: Single Machine and DistributedTuri, Inc.
This document summarizes machine learning scalability from single machine to distributed systems. It discusses how true scalability is about how long it takes to reach a target accuracy level using any available hardware resources. It introduces GraphLab Create and SFrame/SGraph for scalable machine learning and graph processing. Key points include distributed optimization techniques, graph partitioning strategies, and benchmarks showing GraphLab Create can solve problems faster than other systems by using fewer machines.
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakPyData
This document discusses using deep learning and deep features to build an app that finds similar images. It begins with an overview of deep learning and how neural networks can learn complex patterns in data. The document then discusses how pre-trained neural networks can be used as feature extractors for other domains through transfer learning. This reduces data and tuning requirements compared to training new deep learning models. The rest of the document focuses on building an image similarity service using these techniques, including training a model with GraphLab Create and deploying it as a web service with Dato Predictive Services.
This document provides an overview of deep learning and its applications. It discusses how deep learning can be used for image classification and how neural networks learn hierarchical representations from data. The document highlights some of the challenges of deep learning, such as the large amounts of data and computation required. It also covers how deep learning models can be deployed in production using services like Amazon Web Services to ensure low latency, high availability, and continuous learning.
The document provides an overview of various machine learning algorithms and methods. It begins with an introduction to predictive modeling and supervised vs. unsupervised learning. It then describes several supervised learning algorithms in detail including linear regression, K-nearest neighbors (KNN), decision trees, random forest, logistic regression, support vector machines (SVM), and naive Bayes. It also briefly discusses unsupervised learning techniques like clustering and dimensionality reduction methods.
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...Databricks
Netflix is the world’s largest streaming service, with over 80 million members worldwide. Machine learning algorithms are used to recommend relevant titles to users based on their tastes.
At Netflix, we use Apache Spark to power our recommendation pipeline. Stages in the pipeline, such as label generation, data retrieval, feature generation, training, and validation, are based on the Spark ML PipelineStage framework. While this gives developers the flexibility to develop individual components as encapsulated pipeline stages, we find that coordination across stages can potentially provide significant performance gains.
In this talk, we discuss how our machine learning pipeline based on Spark has been improved over the years. Techniques such as predicate pushdown and wide-transformation minimization have led to significant run time improvement and resource savings.
This document provides an overview of SFrame, a scalable dataframe for machine learning developed by Dato. SFrame was created to handle large datasets and enable fast machine learning. It uses a columnar storage format and lazy evaluation to optimize performance. SFrame can handle datasets with billions of rows and columns efficiently using its out-of-core design. It also includes an SGraph extension to handle graph analytics on very large graphs with billions of edges. A variety of machine learning algorithms are built on SFrame to leverage its scalability.
Performance & Scalability Improvements in PerforcePerforce
Recent releases of Perforce include improvements targeting performance and scalability. The edge/commit server architecture and lockless reads are two such improvements. This presentation will detail the effect of these improvements as measured in concurrency simulations and production deployments. Some of the server internals implemented to achieve these gains in performance and scalability will also be discussed.
Cassandra's Sweet Spot - an introduction to Apache CassandraDave Gardner
Slides from my NoSQL Exchange 2011 talk introducing Apache Cassandra. This talk explained the fundamental concepts of Cassandra and then demonstrated how to build a simple ad-targeting application using PHP, with a focus on data modeling.
Video of talk: http://skillsmatter.com/podcast/home/cassandra/js-2880
The document compares the performance of different data serialization formats (JSON, Apache Avro, Protocol Buffers) for real-time applications. It describes building a pipeline to ingest, process, and cache serialized data. Benchmark results show JSON has the highest throughput but also the highest latency, while Protocol Buffers has the lowest throughput but lowest latency. The document recommends JSON for latency-critical, small data and Protocol Buffers for data-heavy, real-time applications relying on Google services. It also provides information about monitoring throughput patterns and the presenter's background and skills.
This document discusses various technologies related to architectures, frameworks, infrastructure, services, data stores, analytics, logging and metrics. It covers Java 8 features like lambda expressions and method references. It also discusses microservices, Spring Boot basics and features, Gradle vs Maven, Swagger, AngularJS, Gulp, Jasmine, Karma, Nginx, CloudFront, Couchbase, Lambda Architecture, logging with Fluentd and Elasticsearch, metrics collection with Collectd and Statsd, and visualization with Graphite and Grafana.
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
Machine Learning at the Limit
John Canny, UC Berkeley
How fast can machine learning and graph algorithms be? In "roofline" design, every kernel is driven toward the limits imposed by CPU, memory, network, etc. This can lead to dramatic improvements: BIDMach is a toolkit for machine learning that uses rooflined design and GPUs to achieve two to three orders of magnitude improvements over other toolkits on single machines. These speedups are larger than have been reported for *cluster* systems (e.g. Spark/MLlib, PowerGraph) running on hundreds of nodes, and BIDMach with a GPU outperforms these systems for most common machine learning tasks.
For algorithms (e.g. graph algorithms) which do require cluster computing, we have developed a rooflined network primitive called "Kylix". We can show that Kylix approaches the roofline limits for sparse Allreduce, and empirically holds the record for distributed PageRank.
Beyond rooflining, we believe there are great opportunities from deep algorithm/hardware codesign. Gibbs Sampling (GS) is a very general tool for inference, but is typically much slower than alternatives. SAME (State Augmentation for Marginal Estimation) is a variation of GS which was developed for marginal parameter estimation. We show that it has high parallelism and a fast GPU implementation. Using SAME, we developed a GS implementation of Latent Dirichlet Allocation whose running time is 100x faster than other samplers, and within 3x of the fastest symbolic methods. We are extending this approach to general graphical models, an area where there is currently a void of (practically) fast tools. It seems at least plausible that a general-purpose solution based on these techniques can closely approach the performance of custom algorithms.
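The roofline bound that drives this design can be stated in one line: attainable throughput is the lesser of peak compute and memory bandwidth times the kernel's arithmetic intensity (FLOPs per byte moved). A tiny sketch with hypothetical hardware numbers:

```python
# The roofline bound in one line: attainable FLOP/s is capped by
# either peak compute or memory bandwidth times the kernel's
# arithmetic intensity (FLOPs per byte moved). The hardware
# numbers below are hypothetical.

def roofline_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

PEAK, BW = 4000.0, 300.0     # e.g. 4 TFLOP/s peak, 300 GB/s bandwidth

# Sparse matrix-vector multiply: low intensity, bandwidth-bound.
print(roofline_gflops(PEAK, BW, 0.25))   # 75.0, far below peak
# Blocked dense matrix multiply: high intensity, compute-bound.
print(roofline_gflops(PEAK, BW, 50.0))   # 4000.0, hits the compute roof
```

A kernel well below the roof is leaving hardware on the table, which is what rooflined design works to eliminate.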
Bio
John Canny is a professor in computer science at UC Berkeley. He is an ACM dissertation award winner and a Packard Fellow. He is currently a Data Science Senior Fellow in Berkeley's new Institute for Data Science and holds an INRIA (France) International Chair. Since 2002, he has been developing and deploying large-scale behavioral modeling systems. He designed and prototyped production systems for Overstock.com, Yahoo, Ebay, Quantcast and Microsoft. He currently works on several applications of data mining for human learning (MOOCs and early language learning), health and well-being, and applications in the sciences.
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDatabricks
This document discusses challenges in deploying machine learning models for scoring in streaming applications using Apache Spark. It describes how ML Pipelines and Structured Streaming in Spark can be used to build an application that monitors web sessions for bots in real-time. However, there are issues with two-pass transformers and handling invalid data in Spark 2.2. Spark 2.3 includes fixes that allow most transformers and models to work for both batch and streaming scoring, and improves handling of invalid values. The talk provides tips on updating pipelines to work with streaming and testing them.
This document discusses end-to-end processing of 3.7 million telemetry events per second using a lambda architecture at Symantec. It provides an overview of Symantec's security data lake infrastructure, the telemetry data processing architecture using Kafka, Storm and HBase, tuning targets for the infrastructure components, and performance benchmarks for Kafka, Storm and Hive.
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...Flink Forward
The Apache Beam programming model is designed to support several advanced data processing features such as autoscaling and dynamic work rebalancing. In this talk, we will first explain how dynamic work rebalancing not only provides a general and robust solution to the problem of stragglers in traditional data processing pipelines, but also how it allows autoscaling to be truly effective. We will then present how dynamic work rebalancing works as implemented in the Google Cloud Dataflow runner, and which path other Apache Beam runners like Apache Flink can follow to benefit from it.
KSQL is an open-source streaming SQL engine for Apache Kafka. It allows users to easily interact with and analyze streaming data in Kafka using SQL-like queries. KSQL builds upon Kafka Streams to provide stream processing capabilities with exactly-once processing semantics. It aims to expand access to stream processing beyond coding by providing an interactive SQL interface for tasks like streaming ETL, anomaly detection, real-time monitoring, and simple topic transformations. KSQL can be run in standalone, client-server, or application deployment modes.
Streaming in Practice - Putting Apache Kafka in Productionconfluent
This presentation focuses on how to integrate all these components into an enterprise environment and what things you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production
A presentation at Twitter's official developer conference, Chirp, about why we use the Scala programming language and how we build services in it. Provides a tour of a number of libraries and tools, both developed at Twitter and otherwise.
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016Metosin Oy
The document discusses Schema and Clojure.spec, two libraries for data validation and specification in Clojure. It provides an overview of Schema's capabilities for runtime validation, documentation, and data transformations. It also summarizes Clojure.spec's focus on specification through generative testing. The document compares the two approaches and argues that Schema is currently better for runtime validation in web applications, while Clojure.spec will be important for system-level specification.
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
This document discusses optimizations made to Apache Spark MLlib algorithms to better support sparse data at large scale. It describes how KMeans, linear methods, and other ML algorithms were modified to use sparse vector representations to reduce memory usage and improve performance when working with sparse data, including optimizations made for clustering large, high-dimensional datasets. The optimizations allow these algorithms to be applied to much larger sparse datasets and high-dimensional problems than was previously possible with MLlib.
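The core sparse-vector trick is to store only nonzero entries, precompute squared norms, and expand the squared distance as ||x - c||^2 = ||x||^2 - 2*(x . c) + ||c||^2, so the per-point cost in KMeans scales with the number of nonzeros rather than the dimension. An illustrative pure-Python sketch (dict-based sparse vectors stand in for MLlib's SparseVector; values are made up):

```python
# Sketch of the sparse KMeans distance trick: store only nonzero
# entries, precompute squared norms, and expand
# ||x - c||^2 = ||x||^2 - 2*(x . c) + ||c||^2 so the per-point cost
# scales with the nonzeros, not the dimension.

def sparse_dot(sparse, dense):
    """Iterate only over the sparse vector's nonzero entries."""
    return sum(v * dense[i] for i, v in sparse.items())

def sq_dist(sparse, sparse_norm2, center, center_norm2):
    return sparse_norm2 - 2.0 * sparse_dot(sparse, center) + center_norm2

dim = 1_000_000                          # high-dimensional, mostly zeros
x = {0: 3.0, 999_999: 4.0}               # only two nonzero entries
x_norm2 = sum(v * v for v in x.values())

center = [0.0] * dim                     # dense cluster center
center[0] = 3.0
center_norm2 = sum(v * v for v in center)

# Only two multiplications are needed despite the million dimensions.
print(sq_dist(x, x_norm2, center, center_norm2))   # 16.0
```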
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
By Rajiv Kurian, software engineer at SignalFx.
At SignalFx, we deal with high-volume high-resolution data from our users. This requires a high performance ingest pipeline. Over time we’ve found that we needed to adapt architectural principles from specialized fields such as HPC to get beyond performance plateaus encountered with more generic approaches. Some key examples include:
* Write very simple single threaded code, instead of complex algorithms
* Parallelize by running multiple copies of simple single threaded code, instead of using concurrent algorithms
* Separate the data plane from the control plane, instead of slowing data for control
* Write compact, array-based data structures with minimal indirection, instead of pointer-based data structures and uncontrolled allocation
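The last principle, compact array-based layout, can be seen even in Python: a list of objects keeps a pointer per field per record, while flat typed arrays pack raw values contiguously with no per-record indirection. An illustrative sketch (the Point record is a toy stand-in, not SignalFx's actual data structure):

```python
# Sketch of the last principle above: a compact array-based layout
# versus a pointer-based one. A list of Python objects keeps a
# pointer per field per record; flat typed arrays pack raw values
# contiguously with no per-record indirection.
import sys
from array import array

N = 10_000

class Point:                       # pointer-based: one object per record
    __slots__ = ("ts", "value")
    def __init__(self, ts, value):
        self.ts, self.value = ts, value

points = [Point(i, float(i)) for i in range(N)]

# Array-based "struct of arrays": two flat, contiguous columns.
ts_col = array("q", range(N))                          # 8-byte ints
value_col = array("d", (float(i) for i in range(N)))   # 8-byte floats

object_bytes = sum(sys.getsizeof(p) for p in points)
column_bytes = ts_col.itemsize * N + value_col.itemsize * N
print(column_bytes < object_bytes)   # True: the columns are far smaller
```

The gap understates the real difference, since the object count above does not even include the boxed ints and floats each record points at.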
Hadoop and HBase experiences in perf log projectMao Geng
This document discusses experiences using Hadoop and HBase in the Perf-Log project. It provides an overview of the Perf-Log data format and architecture, describes how Hadoop and HBase were configured, and gives examples of using MapReduce jobs and HBase APIs like Put and Scan to analyze log data. Key aspects covered include matching Hadoop and HBase versions, running MapReduce jobs, using column families in HBase, and filtering Scan results.
Similar to GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGraph
This document discusses analyzing video data with GraphLab Create. It introduces Dato's products for ingesting, transforming, modeling and deploying machine learning models on unstructured data like images, text, graphs and tabular data. It then outlines a demo of using computer vision and face recognition techniques to match actors' faces from movie frames to subtitles and screenplay text. Instructions are provided for installing GraphLab Create and links shared for additional resources.
The document discusses using machine learning to assess patient readmission risk and reduce avoidable hospital readmissions. It begins with an introduction of the speaker and an overview of the problem of high readmission rates. It then discusses current analytic approaches and their limitations, and how machine learning can leverage complex data sources like EMRs to provide more precise, real-time risk scoring and insights. The rest of the document focuses on demonstrating Dato's machine learning platform and capabilities for building such applications for predictive readmission risk at scale.
Webinar - Know Your Customer - Arya (20160526)Turi, Inc.
Rajat Arya discusses using machine learning for lead scoring to improve sales conversions and marketing campaigns. Lead scoring uses customer data and machine learning models to predict the likelihood of leads converting and prioritize sales and marketing efforts. Implementing lead scoring can increase conversion rates, shorten sales cycles, and boost revenue. Machine learning approaches for lead scoring learn patterns from historical customer data to understand what attributes and behaviors indicate a lead's propensity to become a customer.
Webinar - Product Matching - Palombo (20160428)Turi, Inc.
This webinar discusses product matching using Dato's tools. The presenter is Alon Palombo, a Data Scientist from Dato. The webinar agenda includes an introduction to Dato, an overview of the data science workflow, a definition of product matching, and a demo of product matching using real public data. The webinar aims to explain how product matching is important for e-commerce and how Dato's tools can help with tasks like entity resolution, record linking, and de-duplication.
Webinar - Pattern Mining Log Data - Vega (20160426)Turi, Inc.
The document discusses churn prediction using log data. It describes how churn prediction works by observing past user behavior patterns in log data to predict the probability of users stopping engagement. It provides guidance on choosing time boundaries and lookback periods to extract meaningful features for modeling, and how to interpret the results to identify users for retention actions. The key steps are feature generation by analyzing log data patterns before time boundaries, label generation based on engagement after boundaries, and using the predictions to guide targeted retention efforts.
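The feature/label split described here is mechanical once a time boundary is chosen: events in the lookback window before the boundary become per-user features, and silence after the boundary becomes the churn label. A toy sketch (the event schema and values are illustrative):

```python
# Toy sketch of the feature/label generation described above:
# events in the lookback window before the time boundary become
# per-user features, and silence after the boundary becomes the
# churn label.
from collections import Counter

events = [                    # (user, day of event)
    ("alice", 1), ("alice", 5), ("alice", 12),   # active after boundary
    ("bob", 2), ("bob", 6),                      # silent after boundary
]
BOUNDARY, LOOKBACK = 10, 10

def features_and_labels(events, boundary, lookback):
    feats = Counter(u for u, d in events
                    if boundary - lookback <= d < boundary)
    active_after = {u for u, d in events if d >= boundary}
    users = {u for u, _ in events}
    # feature = activity count in window, label = True if churned
    return {u: (feats[u], u not in active_after) for u in users}

result = features_and_labels(events, BOUNDARY, LOOKBACK)
print(sorted(result.items()))
# [('alice', (2, False)), ('bob', (2, True))] -- bob churned
```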
Webinar - Fraud Detection - Palombo (20160428)Turi, Inc.
The document outlines a webinar presented by Alon Palombo of Dato on fraud detection. The webinar agenda includes an introduction of Dato, an overview of the data science workflow and what constitutes fraud, a live demo of fraud detection using real data, and time for questions. Various techniques for fraud detection are discussed, including classification, graph analytics, time series analysis, and anomaly detection.
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
This document discusses benchmarks for GraphLab Create, a machine learning library. It summarizes benchmarking GraphLab Create on large datasets by running PageRank on a graph with 3.5 billion nodes and 128 billion links, and gradient boosted trees on a dataset with 4.3 billion rows and 39 features. The document also provides instructions for instantiating an Amazon EC2 instance with 32 cores and 244GB RAM to run the benchmarks, and includes a link to download GraphLab Create and access the benchmark notebooks on GitHub.
Pattern Mining: Extracting Value from Log DataTuri, Inc.
Pattern mining is an unsupervised machine learning technique used to discover frequent patterns and relationships in log data. It involves finding the top frequent sets of items that occur together in the data at least a minimum number of times. There are two main approaches - candidate generation which generates and filters candidate patterns in multiple passes over the data, and pattern growth which constructs conditional databases to avoid multiple full scans. Pattern mining can be used to find commonly purchased itemsets, extract features from log data, and derive rules for recommendations.
Intelligent Applications with Machine Learning ToolkitsTuri, Inc.
Shawn Scully from Dato discusses how their machine learning toolkits can help developers quickly build intelligent applications. Their toolkits provide pre-built models for common tasks like recommendation, sentiment analysis, similarity search, churn prediction, and data matching. Developers can easily create applications with just a few lines of code, deploy models as microservices, and iteratively improve applications based on feedback. Dato aims to accelerate innovators by providing agile machine learning tools.
The document discusses text analysis with machine learning. It begins with introductions and then covers applications of text analysis like product reviews and social media. The bulk of the document discusses fundamentals of text processing like tokenization and feature engineering. It also discusses machine learning toolkits and task-oriented tools like sentiment analysis. Advanced topics like topic models and word embeddings are briefly introduced. The presentation aims to provide an overview of text analysis and point to further resources.
This document introduces Dato and its machine learning platform. Dato provides intuitive APIs and toolkits that allow developers to easily create intelligent applications for tasks like recommendation, sentiment analysis, churn prediction, and more. It offers scalable data structures, high performance algorithms, and the ability to quickly develop and deploy machine learning models and services. Customers across various industries have been able to build and operationalize intelligent solutions faster using Dato to solve problems in fraud detection, data matching, recommendations, and other domains.
Machine Learning in Production with Dato Predictive ServicesTuri, Inc.
The document discusses Dato Predictive Services, a machine learning platform that helps deploy, serve, monitor, and manage machine learning models in production. It provides an overview of key capabilities like deploying models through different options, monitoring model performance and product usage, and evaluating models with online experiments. These capabilities aim to address common challenges of machine learning in production like deploying trained models, monitoring their behavior, and continuously improving them. The presentation includes a demo of a book recommender application built with Dato Predictive Services.
Tutorial for Machine Learning 101 (an all-day tutorial at Strata + Hadoop World, New York City, 2015)
The course is designed to introduce machine learning via real applications, like building a recommender and image analysis using deep learning.
In this talk we cover deployment of machine learning models.
Overview of Machine Learning and Feature EngineeringTuri, Inc.
Machine Learning 101 Tutorial at Strata NYC, Sep 2015
Overview of machine learning models and features. Visualization of feature space and feature engineering methods.
Building Personalized Data Products with DatoTuri, Inc.
This document discusses building personalized data products and recommender systems using implicit and explicit user data. It describes how recommender systems work by using matrix factorization to learn latent factors about users and items from interaction data in order to predict ratings and rankings to drive personalized recommendations. The document also notes that recommender systems are commonly used by Netflix, Spotify, LinkedIn and Facebook to power personalized experiences and that even small improvements in recommendation quality can lead to significant business value.
Dato aims to accelerate the creation of intelligent applications by making sophisticated machine learning as easy as "Hello world." The company provides an integrated machine learning platform that handles data engineering, advanced ML techniques, and deployment of models as predictive services. This allows small teams to be highly productive in building intelligent applications like recommenders, fraud detection, and personalized medicine. Dato's platform provides out-of-core computation, tools for feature engineering, rich data type support, and scalable models to help customers in various industries rapidly iterate and deploy ML applications.
Towards a Comprehensive Machine Learning BenchmarkTuri, Inc.
This document presents a framework for developing a comprehensive machine learning benchmark. It discusses identifying the core building blocks of machine learning algorithms, such as linear algebra, data characteristics, and memory access. It proposes evaluating these building blocks using representative algorithms, datasets, and configurations. Thousands of executions are clustered into a smaller set capturing different software and hardware behaviors. The resulting benchmark suite of 50 workloads incorporates the main building blocks and bottlenecks to help evaluate machine learning performance.
This document discusses the challenges of machine learning development circa 2013 and outlines Dato's approach to addressing these challenges. In 2013, machine learning development was difficult, slow, and expensive. It required specialized knowledge and infrastructure. Dato aims to accelerate the creation of intelligent applications by making sophisticated machine learning as easy as "Hello world" through high-level toolkits, auto feature engineering, automated machine learning (AutoML), and scalable data structures. The document demonstrates how Dato's tools can build an intelligent application with just a few lines of code and handle large datasets by leveraging out-of-core computation.
12. SFrame Design
Jobs fail because:
• Machine runs out of memory
• Java heap size was not set correctly
• Resource configuration X needs to be bigger
Pain Point #1: Resource Limits
13. SFrame Design
• Graceful Degradation as 1st principle
• Always Works
Pain Point #2: Too Strict or Too Weak Schemas
We want strong schema types.
We also want weak schema types.
Missing Values
15. SFrame Design
• Graceful Degradation as 1st principle
• Always Works
• Rich Datatypes
• Strong schema types: int, double, string...
• Weak schema types: list, dictionary
Pain Point #3: Feature Manipulation
Difficult or costly to inspect existing features and
create new features.
Hard to perform data exploration.
17. SFrame Python API Example
Make a small SFrame of 1 column and 5 values:
sf = gl.SFrame({'x': [1, 2, 3, 4, 5]})
Normalize the column x:
sf['x'] = sf['x'] / sf['x'].sum()
Use a Python lambda to create a new column:
sf['x-squared'] = sf['x'].apply(lambda x: x * x)
Create a new column using a vectorized operator:
sf['x-cubed'] = sf['x-squared'] * sf['x']
Create a new SFrame taking only 2 of the columns:
sf2 = sf[['x', 'x-squared']]
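The operations on this slide can be spelled out in plain Python to show what each step computes. This is an illustrative sketch only; the real gl.SFrame evaluates these operations lazily and out of core:

```python
# Plain-Python sketch of the SFrame pipeline above (illustration only).
x = [1, 2, 3, 4, 5]

# sf['x'] = sf['x'] / sf['x'].sum()  -> normalize the column
total = sum(x)
x = [v / total for v in x]

# sf['x-squared'] = sf['x'].apply(lambda x: x * x)  -> per-element lambda
x_squared = [v * v for v in x]

# sf['x-cubed'] = sf['x-squared'] * sf['x']  -> vectorized elementwise product
x_cubed = [a * b for a, b in zip(x_squared, x)]

# sf2 = sf[['x', 'x-squared']]  -> column subselection
sf2 = {"x": x, "x-squared": x_squared}
```

In the real SFrame, the subselection step copies no data because columns are immutable; the dictionary above only mimics the resulting shape.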
18. SFrame Querying
Supports most typical SQL SELECT operations using a
Pythonic syntax.
SQL
SELECT Book.title AS title, COUNT(*) AS authors
FROM Book
JOIN Book_author ON Book.isbn = Book_author.isbn
GROUP BY Book.title;
SFrame Python
Book.join(Book_author, on='isbn')
    .groupby('title', {'authors': gl.aggregate.COUNT})
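The same query can be sketched in plain Python to make the join and groupby semantics concrete (the rows below are invented for this example):

```python
# Toy tables (hypothetical data).
book = [{"isbn": "111", "title": "SFrames"},
        {"isbn": "222", "title": "SGraphs"}]
book_author = [{"isbn": "111", "author": "a"},
               {"isbn": "111", "author": "b"},
               {"isbn": "222", "author": "c"}]

# JOIN Book_author ON Book.isbn = Book_author.isbn
titles = {row["isbn"]: row["title"] for row in book}
joined = [{"title": titles[r["isbn"]], "author": r["author"]}
          for r in book_author if r["isbn"] in titles]

# GROUP BY Book.title, counting rows per title (COUNT(*))
authors = {}
for r in joined:
    authors[r["title"]] = authors.get(r["title"], 0) + 1
```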
19. SFrame Columnar Encoding
user movie rating
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
20. SFrame Columnar Encoding
user movie rating Type aware compression:
• Variable Bit length Encode
• Frame Of Reference Encode
• ZigZag Encode
• Delta / Delta ZigZag Encode
• Dictionary Encode
• General Purpose LZ4
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
SFrame File
21. SFrame Columnar Encoding
user movie rating
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
SFrame File (per column):
• User: 176 MB (14.2 bits/int)
• Movie: 257 KB (0.02 bits/int)
• Rating: 47 MB (3.8 bits/int)
• Total: 223 MB
22. SFrame Columnar Encoding
Time to read the CSV and encode: 10s
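To make the listed encodings concrete, here is a minimal sketch of delta + ZigZag encoding (my own illustration, not Turi's implementation). On a sorted column, such as the movie column above, successive deltas are tiny non-negative integers, which is how a 99M-row column can shrink to a fraction of a bit per value:

```python
def zigzag(n: int) -> int:
    # Map signed to unsigned so small magnitudes stay small:
    # 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4  (64-bit convention)
    return (n << 1) ^ (n >> 63)

def unzigzag(z: int) -> int:
    # Inverse mapping back to a signed integer.
    return (z >> 1) ^ -(z & 1)

def delta_zigzag(values):
    # Encode each value as the zigzagged difference from its predecessor;
    # a sorted column yields small deltas needing very few bits each.
    out, prev = [], 0
    for v in values:
        out.append(zigzag(v - prev))
        prev = v
    return out
```

A variable bit-length or varint stage would then pack these small integers densely; that stage is omitted here for brevity.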
23. SFrames Distributed
• Distributed Dataflow
• Columnar Query Optimizations
• Communicate columnar compressed blocks
rather than row tuples.
The choice of distributed or local execution is a
question of query optimization.
37. Deep Integration of SFrames and SGraphs
• Seamless interaction between graph data
and table data.
• Queries can be performed easily across
graph and tables.
39. SFrame: Scalable Tabular Data Manipulation
SGraph: Scalable Graph Manipulation
[Diagram: Users, Discussions, and Comments (Title, Body) connecting graph and table views]
User-first architecture.
Built by data scientists,
for data scientists.
Editor's Notes
My name is ... I am one of the co-founders and currently the chief architect at GraphLab.
Talk about
- architecture philosophy @ graphlab
- and what we have built into GraphLab Create.
The basic architecture philosophy at GraphLab is what I like to call “Users-First” architecture. What do I mean by that?
The objective of a systems Architecture is to connect systems [] to users []. To provide users with a means to access data, compute resources, and so on.
And of course, there are many ways to achieve this.
[] Some architectures are designed with a bottom up nature:
- Given particular systems constraints, what can we perform most efficiently?
- Users can then try to develop whatever applications they need around the system.
Other architectures are designed top down:
- We begin by defining an interaction model with users. SQL for instance is a great example.
- Then we figure out how to design an architecture that supports all these capabilities efficiently.
The question we are trying to answer here at GraphLab is: “what is the Users-First Architecture for Data Science? For Machine Learning?”
We think we have an answer.
----- Meeting Notes (7/21/14 03:15) -----
1 min
We designed two core datastructures we call...
<T> Now what are SFrames and SGraphs?
- SFrame is our scalable datastructure for table manipulation
- SGraph is our scalable datastructure for graph manipulation.
And one design objective is to enable ....
I am going to first talk about the design of SFrames, our datastructure for table manipulation
The SFrame design is governed by a series of common pain points we have observed.
The first is fundamental. It is *extremely* annoying when we start a job / some task, have it run for a while, then have it fail. [] because...
----- Meeting Notes (7/21/14 03:15) -----
2 min
- graceful degradation as the first core principle in the SFrame design
- Instead of demanding more resources, we try to ensure that things always work even when there is insufficient memory. We put a lot of work into bounded-memory algorithms so that when memory is insufficient, we might run slower, but we will always work.
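As a toy illustration of the bounded-memory idea (the names and interfaces below are hypothetical, not SFrame's API), an aggregate can be computed by streaming fixed-size chunks instead of materializing a whole column in RAM:

```python
def chunked_column_sum(read_chunk, chunk_rows=4096):
    # Bounded-memory aggregation sketch: only one chunk of the column
    # is ever resident; total memory is O(chunk_rows), not O(column).
    # `read_chunk` is a hypothetical callable yielding lists of values.
    total = 0
    for chunk in read_chunk(chunk_rows):
        total += sum(chunk)
    return total

# Stand-in for external storage: serve a large column chunk by chunk.
def make_reader(values):
    def read_chunk(n):
        for i in range(0, len(values), n):
            yield values[i:i + n]
    return read_chunk
```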
- We want strong types, because sometimes it is *very* useful to have a priori guarantees that my "rating" column always contains an integer, even in the 1 billionth row.
- We want weak types because sometimes data is unstructured and I need some way to manage it.
Our SFrames support a rich family of datatypes. From some strong types like int, double, string to weak types like arbitrary lists and dictionaries. Our lists and dictionaries are self-recursive and hence are expressive enough to hold arbitrary JSON.
----- Meeting Notes (7/21/14 03:15) -----
4 min
Key aspects of doing ML.
I need to create new features or delete existing ones. I need to be able to deeply explore a single feature, or small sets of features, at a time.
----- Meeting Notes (7/21/14 03:22) -----
efficiently create.
This leads to the SFrame use of a columnar architecture.
- By representing the data column-wise, we can support easy feature engineering.
- By further using immutable columns and lazy evaluation, we can push through a large number of pipelining optimizations.
- easily visualize or sketch statistics about single features
The API looks somewhat Pandas-like and carries very similar ideas.
For instance: It is as easy to load an SFrame with a billion rows as it is to construct a tiny SFrame here of 1 column and 5 values.
2) We can easily reassign the value of an entire column: here we normalize the column by the sum of the values.
3) We can easily create new columns by applying a python lambda operation to each entry. This lambda operation is parallelized behind the scenes
4) We can also create new columns by using the vectorized operators.
5) We can easily subselect columns to create new SFrames, And due to the immutable nature of columns, this operation is essentially free.
So The SFrame can be used like a general purpose table, but having a carefully curated set of scalable operations.
Furthermore, SFrames support most of the key query operators like groupby, join, etc. Most typical SQL SELECT operations have a natural conversion to a Pythonic syntax using SFrames.
I mentioned columnar architecture briefly in an earlier slide but I did not mention one of the fundamental benefits of columnar encoding:
- aggressive compression is possible.
But it is hard to understand how much compression can do without an actual demonstration.
Data is 99M rows, 3 columns of integers of user, movie and ratings. This is ... Size ...
----- Meeting Notes (7/21/14 03:22) -----
7 min
- While ingesting the data, the SFrame encoder adaptively slices the table into blocks, targeting a fixed block size after compression.
- For compression, a collection of very high throughput columnar compression techniques are used to reduce the size of the data. Integers are particularly heavily compressed with a family of encoding algorithms.
- The on disk SFrame representation ends up as ... (movie is sorted)
- The total size is smaller than the gzip compression.
- This allows us to do more on one machine.
- Aggressive compression allows us to keep more in application cache / file system cache
- faster, more throughput even though its external memory.
- Oh, how long did it take to read the CSV and encode the entire dataset into an external memory, efficiently queryable SFrame?
- 10s on a 4 core desktop. (i73770K )
Now a natural question to bring up then is how about going Distributed?
We take a somewhat novel perspective on this: the user-facing Python API never changes. Instead,
- In other words, the decision to go distributed is simply a question of query optimization.
- If you are doing something small. It is going to be more efficient locally than to pay the distributed cost.
- We are building a distributed dataflow system, taking advantage of columnar query optimizations. And when communication is necessary, we can reduce it by sending columnar compressed blocks rather than row tuples.
----- Meeting Notes (7/21/14 03:22) -----
9 min – 10 min
Next I will talk about SGraphs
10 min
Our scalable datastructure for graph manipulation
The layout works this way. First, we partition the set of vertices into a collection of SFrames. This partitioning can be arbitrary; we use a simple hash function. Each vertex SFrame then contains the vertex ID and all the properties associated with its vertices. Note that this is an SFrame, and hence the vertex attributes are stored column-wise.
We next partition the edges into 16 SFrames. The layout is based on the adjacency matrix. For instance, edge partition (2,4) contains all the edges that connect vertices in partition 2 to vertices in partition 4. This means that if computation is to be performed on edge partition (2,4), only vertex sets 2 and 4 need to be in memory.
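The partitioning scheme described above can be sketched in a few lines. This is a toy illustration with integer vertex IDs and a modulo hash; the real implementation is more involved:

```python
NUM_PARTITIONS = 4  # a 4x4 layout gives the 16 edge SFrames mentioned above

def vertex_partition(vid: int) -> int:
    # Any simple hash works; the partitioning can be arbitrary.
    return vid % NUM_PARTITIONS

def edge_block(src: int, dst: int) -> tuple:
    # Edge (src, dst) lives in block (partition(src), partition(dst)),
    # so computing on block (i, j) needs only vertex sets i and j in memory.
    return (vertex_partition(src), vertex_partition(dst))
```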
The magic behind this layout is that, ultimately, underlying the SGraph are SFrames, which are stored in columnar fashion.
This representation hence allows us to easily separate structure and data.
For instance, we can, without any copying, create new graphs which share the same structure but none of the data.
Or introduce new features without rewriting anything.
Finally, since the vertex and edge attributes are all simply SFrames, through a trick of lazy evaluation we can make both the vertices and the edges appear as a single SFrame to the user.
This deep integration of SFrames and SGraphs allow seamless interaction between graph data and table data, and queries can be performed easily across both graphs and tables.
Now, this is not so easy to understand just from slides, so I will give a quick demo.
13 min
Now let's briefly recap what we just did during the demo. We took tabular data, converted it to a graph, ran a basic graph algorithm on it, and asked graph questions about it. Then we took the graph and directly interpreted it as a table, joining it with other tables to extract some intelligence with very little code and very little friction, all within a few lines of Python. This is what we have achieved by trying to understand data science from a user-first perspective: a data structure that is easy to use and powerful, and that enables the machine learning algorithms in GraphLab Create to scale easily to terabyte datasets on a single machine. Thank you.
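The demo's shape can be sketched in plain Python (the data below is invented for illustration): build a graph from an edge table, run a basic graph computation, then treat the result as a table again and join it with other tabular data.

```python
# Tabular edge data (hypothetical).
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]

# Table -> graph: build an adjacency structure from the edge list.
adj = {}
for src, dst in edges:
    adj.setdefault(src, []).append(dst)

# A basic graph computation: out-degree per vertex.
out_degree = {v: len(ns) for v, ns in adj.items()}

# Graph -> table: interpret the result as rows and join with another table.
display_names = {"alice": "Alice", "bob": "Bob", "carol": "Carol"}
report = sorted((display_names[v], d) for v, d in out_degree.items())
```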