Apache Hivemall is a scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig.
Hivemall provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through UDFs/UDAFs/UDTFs of Hive.
We have released the first Apache release (v0.5.0-incubating) on Mar 5, 2018 and the project plans to release v0.5.2 in Q2, 2018.
We will first give a quick walk-through of features, usages, what's new in v0.5.0, and future roadmaps of Apache Hivemall. Next, we will introduce Hivemall on Apache Spark in depth such as DataFrame integration and Spark 2.3 supports in Hivemall.
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiDatabricks
Modern Data-Intensive Scalable Computing (DISC) systems such as Apache Spark do not support sophisticated cost-based query optimizers because they are specifically designed to process data that resides in external storage systems (e.g. HDFS), or they lack the necessary data statistics. Consequently, many crucial optimizations, such as join order and plan selection, are presently out-of-scope in these DISC system optimizers. Yet, join order is one of the most important decisions a cost-optimizer can make because wrong orders can result in a query response time that can become more than an order-of-magnitude slower compared to the better order.
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Spark Summit
Recent workload trends indicate rapid growth in the deployment of machine learning, genomics and scientific workloads using Apache Spark. However, efficiently running these applications on
cloud computing infrastructure like Amazon EC2 is challenging and we find that choosing the right hardware configuration can significantly
improve performance and cost. The key to address the above challenge is having the ability to predict performance of applications under
various resource configurations so that we can automatically choose the optimal configuration. We present Ernest, a performance prediction
framework for large scale analytics. Ernest builds performance models based on the behavior of the job on small samples of data and then
predicts its performance on larger datasets and cluster sizes. Our evaluation on Amazon EC2 using several workloads shows that our prediction error is low while having a training overhead of less than 5% for long-running jobs.
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
Real-world graphs are seldom static. Applications that generate
graph-structured data today do so continuously, giving rise to an underlying graph whose structure evolves over time. Mining these time-evolving graphs can be insightful, both from research and business perspectives. While several works have focused on some individual aspects, there exists no general purpose time-evolving graph processing engine.
We present Tegra, a time-evolving graph processing system built
on a general-purpose dataflow framework. We introduce Timelapse, a flexible abstraction that enables efficient analytics on evolving graphs by allowing graph-parallel stages to iterate over complete history of nodes. We use Timelapse to present two computational models, a temporal analysis model for performing computations on multiple snapshots of an evolving graph, and a generalized incremental computation model for efficiently updating results of computations.
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.
Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.
If your Hadoop batch-oriented approach with MapReduce works reasonably well, for scalable graph processing you have to embrace an in-memory, explorative, and iterative approach. One of the best ways to tame this complexity is known as the Bulk synchronous parallel approach. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook), and Apache GraphX (as part of a Spark project).
In this talk we will focus on practical advice on how to get up and running with Apache Giraph and GraphX; start analyzing simple datasets with built-in algorithms; and finally how to implement your own graph processing applications using the APIs provided by the projects. We will finally compare and contrast the two, and try to lay out some principles of when to use one vs. the other.
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Databricks
Many high-tech industries rely on machine-learning systems in production environments to automatically classify and respond to vast amounts of incoming data. Despite their critical roles, these systems are often not actively monitored. When a problem first arises, it may go unnoticed for some time. Once it is noticed, investigating its underlying cause is a time-consuming, manual process. Wouldn’t it be great if the model’s output were automatically monitored? If they could be visualized, sliced by different dimensions? If the system could automatically detect performance degradation and trigger alerts? In this presentation, we describe our experience from building such a core machine-learning services: Model Evaluation.
Our service provides automated, continuous evaluation of the performance of a deployed model over commonly-used metrics like the area-under-the-curve (AUC), root-mean-square-error (RMSE) etc. In addition, summary statistics about the model’s output, their distributions are also computed. The service also provides a dashboard to visualize the performance metrics, summary statistics and distributions of a model over time along with REST APIs to retrieve these metrics programmatically.
These metrics can be sliced by input features (e.g. Geography, Product type) to provide insights into model performance over different segments. The talk will describe various components that are required in building such a service and metrics of interest. Our system has a backend component built with spark on Azure Databricks. The backend can scale to analyze TBs of data to generate model evaluation metrics.
We will talk about how we modified Spark MLLib for computing AUC sliced by different dimensions and other optimizations in Spark to improve compute and performance. Our front-end and middle-tier, built with Docker and Azure Webapp provides visuals and REST APIs to retrieve the above metrics. This talk will cover various aspects of building, deploying and using the above system.
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiDatabricks
Modern Data-Intensive Scalable Computing (DISC) systems such as Apache Spark do not support sophisticated cost-based query optimizers because they are specifically designed to process data that resides in external storage systems (e.g. HDFS), or they lack the necessary data statistics. Consequently, many crucial optimizations, such as join order and plan selection, are presently out-of-scope in these DISC system optimizers. Yet, join order is one of the most important decisions a cost-optimizer can make because wrong orders can result in a query response time that can become more than an order-of-magnitude slower compared to the better order.
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Spark Summit
Recent workload trends indicate rapid growth in the deployment of machine learning, genomics and scientific workloads using Apache Spark. However, efficiently running these applications on
cloud computing infrastructure like Amazon EC2 is challenging and we find that choosing the right hardware configuration can significantly
improve performance and cost. The key to address the above challenge is having the ability to predict performance of applications under
various resource configurations so that we can automatically choose the optimal configuration. We present Ernest, a performance prediction
framework for large scale analytics. Ernest builds performance models based on the behavior of the job on small samples of data and then
predicts its performance on larger datasets and cluster sizes. Our evaluation on Amazon EC2 using several workloads shows that our prediction error is low while having a training overhead of less than 5% for long-running jobs.
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
Real-world graphs are seldom static. Applications that generate
graph-structured data today do so continuously, giving rise to an underlying graph whose structure evolves over time. Mining these time-evolving graphs can be insightful, both from research and business perspectives. While several works have focused on some individual aspects, there exists no general purpose time-evolving graph processing engine.
We present Tegra, a time-evolving graph processing system built
on a general-purpose dataflow framework. We introduce Timelapse, a flexible abstraction that enables efficient analytics on evolving graphs by allowing graph-parallel stages to iterate over complete history of nodes. We use Timelapse to present two computational models, a temporal analysis model for performing computations on multiple snapshots of an evolving graph, and a generalized incremental computation model for efficiently updating results of computations.
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.
Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.
If your Hadoop batch-oriented approach with MapReduce works reasonably well, for scalable graph processing you have to embrace an in-memory, explorative, and iterative approach. One of the best ways to tame this complexity is known as the Bulk synchronous parallel approach. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook), and Apache GraphX (as part of a Spark project).
In this talk we will focus on practical advice on how to get up and running with Apache Giraph and GraphX; start analyzing simple datasets with built-in algorithms; and finally how to implement your own graph processing applications using the APIs provided by the projects. We will finally compare and contrast the two, and try to lay out some principles of when to use one vs. the other.
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Databricks
Many high-tech industries rely on machine-learning systems in production environments to automatically classify and respond to vast amounts of incoming data. Despite their critical roles, these systems are often not actively monitored. When a problem first arises, it may go unnoticed for some time. Once it is noticed, investigating its underlying cause is a time-consuming, manual process. Wouldn’t it be great if the model’s output were automatically monitored? If they could be visualized, sliced by different dimensions? If the system could automatically detect performance degradation and trigger alerts? In this presentation, we describe our experience from building such a core machine-learning services: Model Evaluation.
Our service provides automated, continuous evaluation of the performance of a deployed model over commonly-used metrics like the area-under-the-curve (AUC), root-mean-square-error (RMSE) etc. In addition, summary statistics about the model’s output, their distributions are also computed. The service also provides a dashboard to visualize the performance metrics, summary statistics and distributions of a model over time along with REST APIs to retrieve these metrics programmatically.
These metrics can be sliced by input features (e.g. Geography, Product type) to provide insights into model performance over different segments. The talk will describe various components that are required in building such a service and metrics of interest. Our system has a backend component built with spark on Azure Databricks. The backend can scale to analyze TBs of data to generate model evaluation metrics.
We will talk about how we modified Spark MLLib for computing AUC sliced by different dimensions and other optimizations in Spark to improve compute and performance. Our front-end and middle-tier, built with Docker and Azure Webapp provides visuals and REST APIs to retrieve the above metrics. This talk will cover various aspects of building, deploying and using the above system.
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...Accumulo Summit
Talk Abstract
The Resource Description Framework (RDF) is a standard model for expressing graph data for the World Wide Web. Developed by the W3C, RDF and related technologies such as OWL and SKOS provide a rich vocabulary for exchanging graph data in a machine understandable manner. As the size of available data continues to grow, there has been an increased desire for methods of storing very large RDF graphs within big data architectures. Rya is a government open source scalable RDF triple store built on top of Apache Accumulo. Originally developed by the Laboratory for Telecommunication Sciences and US Naval Academy, Rya is currently being used by a number of government agencies for storing, inferencing, and querying large amounts of RDF data.
As Rya’s user base has grown, there has been a stronger requirement for near real time query responsiveness over massive RDF graphs. In this talk, we detail several query optimization strategies the Rya team has pursued to better satisfy this requirement. We describe recent work allowing for the use of additional indices to eliminate large common joins within complex SPARQL queries. Additionally, we explain a number of statistics based optimizations to improve query planning. Specifically, we detail extensions to existing methods of estimating the selectivity of individual statement patterns (cardinality) and the selectivity of joining two statement patterns (join selectivity) to better fit a “big data” paradigm and utilize Accumulo. Finally, we share preliminary performance evaluation results for the optimizations that have been pursued.
Speaker
Caleb Meier
Engineer/Algorithm Developer, Parsons Corporation
Dr. Caleb Meier received a PhD from the University of California San Diego (UCSD) in Mathematics in 2012. For the past two years, he was a postdoctoral fellow at UCSD's Math department specializing in non-linear elliptic systems of partial differential equations. He received his undergraduate degree in Mathematics from Yale University in 2006. Dr. Meier is currently working as an engineer at Parsons Corporation, specializing in query optimization algorithms for large scale RDF graphs. He is an expert in semantic technologies, Accumulo, the Hadoop Ecosystem, and is actually more fun to be around than his bio suggests.
On-Prem Solution for the Selection of Wind Energy ModelsDatabricks
The renewable energy industry has only recently started to rely on data-driven models on applications that have traditionally required complex physical solutions. In this talk, we would like to show how we leverage Spark, Keras and (in our case, on-prem) high performance computing (HPC) infrastructure to potentially tackle common and interesting problems in the wind-related industry (saving hours of CPU-consuming simulations).
We use:
Apache Spark and Hive for data preparation and a combination of different data sources (some of them in the range of the petabytes scale).
Keras for model training/generation.
HPC for coordination and node-wide training of hyperparameters.
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBCarol McDonald
Apache Spark GraphX made it possible to run graph algorithms within Spark, GraphFrames integrates GraphX and DataFrames and makes it possible to perform Graph pattern queries without moving data to a specialized graph database.
This presentation will help you get started using Apache Spark GraphFrames Graph Algorithms and Graph Queries with MapR-DB JSON document database.
Large Scale Machine Learning with Apache SparkCloudera, Inc.
Spark offers a number of advantages over its predecessor MapReduce that make it ideal for large-scale machine learning. For example, Spark includes MLLib, a library of machine learning algorithms for large data. The presentation will cover the state of MLLib and the details of some of the scalable algorithms it includes.
Balancing Automation and Explanation in Machine LearningDatabricks
For a machine learning application to be successful, it is not enough to give highly accurate predictions: Customers also want to know why the model has made that prediction, so they can compare it against their intuition and (hopefully) gain trust in the model. However, there is a trade-off between model accuracy and explainability - for example, the more complex your feature transformations become, the harder it is to explain what the resulting features mean to the end customer. However, with the right system design this doesn't mean it has to be a binary choice between these two goals. It is possible to combine complex, even automatic, feature engineering with highly accurate models and explanations. We will describe how we are using lineage tracing to solve this issue at Salesforce Einstein, allowing good model explanations to coexist with automatic feature engineering and model selection. By building this into an open source AutoML library TransmogrifAI, an extension to SparkMlLib, it is easy to ensure a consistent level of transparency in all of our ML applications. As model explanations are provided out of the box, data scientists don't need to re-invent the wheel when model explanations need to be surfaced.
Ge aviation spark application experience porting analytics into py spark ml p...Databricks
GE is a world leader in the manufacture of commercial jet engines, offering products for many of the best-selling commercial airframes. With more than 33,000 engines in service, GE Aviation has a history of developing analytics for monitoring its commercial engines fleets. In recent years, GE Aviation Digital has developed advanced analytic solutions for engine monitoring, with the target of improving detection and reducing false alerts, when compared to conventional analytic approaches. The advanced analytics are implemented in a real-time monitoring system which notifies GE’s Fleet Support team on a per flight basis. These analytics are developed and validated using large, historical datasets.
Analytic tools such as SQL Server and MATLAB were used until recently, when GE’s data was moved to an Apache Spark environment. Consequently, our advanced analytics are now being migrated to Spark, where there should also be performance gains with bigger data sets. In this talk we will share experiences of converting our advanced algorithms to custom Spark ML pipelines, as well as outlining various case studies.
With Honor Powrie and Peter Knight
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit
Talk Abstract
The ability to collect and analyze large amounts of data is a growing problem amongst the scientific community. The growing gap between data and users calls for innovative tools that address the challenges faced by big data: volume, velocity and variety.
This tutorial aims to provide researchers and practitioners with a range of tools and techniques that they can use in conjunction with Apache Accumulo to close this gap. The proposed tutorial will focus on building solid fundamentals using a rapid prototyping tool – the Dynamic Distributed Dimensional Data Model (D4M) – to quickly prototype new algorithms that can be tested with Apache Accumulo. The tutorial will be suitable for participants from all levels of experience using Apache Accumulo. The tutorial will begin with a general introduction of the big data landscape in order to align terminology and provide a unified view of the system regardless of participant background. The tutorial will then discuss systems engineering and how it applies to big data systems. We will then introduce D4M and provide examples of D4M being used for analytics such as dimensional analysis and background model fitting. We will then discuss current areas of research on security and privacy as well as graph algorithms. Tutorial slides will be distributed to participants and brief demonstrations will be used to reinforce concepts.
The goals of the tutorial are 1) to provide participants with a theoretical foundation of big data; 2) to demonstrate how Accumulo can be used to solve real problems from diverse domains; and 3) describe future avenues of research. This tutorial provides a deep dive into the topics presented at the 2014 Accumulo Summit in the presentation entitled: “Addressing Big Data Challenges through Innovative Architecture, Databases and Software”.
Speakers
Vijay Gadepally
Technical Staff, Lincoln Laboratory, MIT
Lauren Edwards
Associate Technical Staff, Lincoln Laboratory, MIT
Jeremy Kepner
Senior Technical Staff, Lincoln Laboratory, MIT
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation.
Storm often coexists in Big Data architectures with Hadoop. We will talk about different approaches to this interoperability between the systems, their benefits & trade-offs, and a new open source library available for high throughput use.
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you'll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries; and hear about real-world applications.
Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries o...Accumulo Summit
Talk Abstract
The Resource Description Framework (RDF) is a standard model for expressing graph data for the World Wide Web. Developed by the W3C, RDF and related technologies such as OWL and SKOS provide a rich vocabulary for exchanging graph data in a machine understandable manner. As the size of available data continues to grow, there has been an increased desire for methods of storing very large RDF graphs within big data architectures. Rya is a government open source scalable RDF triple store built on top of Apache Accumulo. Originally developed by the Laboratory for Telecommunication Sciences and US Naval Academy, Rya is currently being used by a number of government agencies for storing, inferencing, and querying large amounts of RDF data.
As Rya’s user base has grown, there has been a stronger requirement for near real time query responsiveness over massive RDF graphs. In this talk, we detail several query optimization strategies the Rya team has pursued to better satisfy this requirement. We describe recent work allowing for the use of additional indices to eliminate large common joins within complex SPARQL queries. Additionally, we explain a number of statistics based optimizations to improve query planning. Specifically, we detail extensions to existing methods of estimating the selectivity of individual statement patterns (cardinality) and the selectivity of joining two statement patterns (join selectivity) to better fit a “big data” paradigm and utilize Accumulo. Finally, we share preliminary performance evaluation results for the optimizations that have been pursued.
Speaker
Caleb Meier
Engineer/Algorithm Developer, Parsons Corporation
Dr. Caleb Meier received a PhD from the University of California San Diego (UCSD) in Mathematics in 2012. For the past two years, he was a postdoctoral fellow at UCSD's Math department specializing in non-linear elliptic systems of partial differential equations. He received his undergraduate degree in Mathematics from Yale University in 2006. Dr. Meier is currently working as an engineer at Parsons Corporation, specializing in query optimization algorithms for large scale RDF graphs. He is an expert in semantic technologies, Accumulo, the Hadoop Ecosystem, and is actually more fun to be around than his bio suggests.
On-Prem Solution for the Selection of Wind Energy ModelsDatabricks
The renewable energy industry has only recently started to rely on data-driven models on applications that have traditionally required complex physical solutions. In this talk, we would like to show how we leverage Spark, Keras and (in our case, on-prem) high performance computing (HPC) infrastructure to potentially tackle common and interesting problems in the wind-related industry (saving hours of CPU-consuming simulations).
We use:
Apache Spark and Hive for data preparation and a combination of different data sources (some of them in the range of the petabytes scale).
Keras for model training/generation.
HPC for coordination and node-wide training of hyperparameters.
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBCarol McDonald
Apache Spark GraphX made it possible to run graph algorithms within Spark, GraphFrames integrates GraphX and DataFrames and makes it possible to perform Graph pattern queries without moving data to a specialized graph database.
This presentation will help you get started using Apache Spark GraphFrames Graph Algorithms and Graph Queries with MapR-DB JSON document database.
Large Scale Machine Learning with Apache SparkCloudera, Inc.
Spark offers a number of advantages over its predecessor MapReduce that make it ideal for large-scale machine learning. For example, Spark includes MLLib, a library of machine learning algorithms for large data. The presentation will cover the state of MLLib and the details of some of the scalable algorithms it includes.
Balancing Automation and Explanation in Machine LearningDatabricks
For a machine learning application to be successful, it is not enough to give highly accurate predictions: Customers also want to know why the model has made that prediction, so they can compare it against their intuition and (hopefully) gain trust in the model. However, there is a trade-off between model accuracy and explainability - for example, the more complex your feature transformations become, the harder it is to explain what the resulting features mean to the end customer. However, with the right system design this doesn't mean it has to be a binary choice between these two goals. It is possible to combine complex, even automatic, feature engineering with highly accurate models and explanations. We will describe how we are using lineage tracing to solve this issue at Salesforce Einstein, allowing good model explanations to coexist with automatic feature engineering and model selection. By building this into an open source AutoML library TransmogrifAI, an extension to SparkMlLib, it is easy to ensure a consistent level of transparency in all of our ML applications. As model explanations are provided out of the box, data scientists don't need to re-invent the wheel when model explanations need to be surfaced.
Ge aviation spark application experience porting analytics into py spark ml p...Databricks
GE is a world leader in the manufacture of commercial jet engines, offering products for many of the best-selling commercial airframes. With more than 33,000 engines in service, GE Aviation has a history of developing analytics for monitoring its commercial engines fleets. In recent years, GE Aviation Digital has developed advanced analytic solutions for engine monitoring, with the target of improving detection and reducing false alerts, when compared to conventional analytic approaches. The advanced analytics are implemented in a real-time monitoring system which notifies GE’s Fleet Support team on a per flight basis. These analytics are developed and validated using large, historical datasets.
Analytic tools such as SQL Server and MATLAB were used until recently, when GE’s data was moved to an Apache Spark environment. Consequently, our advanced analytics are now being migrated to Spark, where there should also be performance gains with bigger data sets. In this talk we will share experiences of converting our advanced algorithms to custom Spark ML pipelines, as well as outlining various case studies.
With Honor Powrie and Peter Knight
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit
Talk Abstract
The ability to collect and analyze large amounts of data is a growing problem amongst the scientific community. The growing gap between data and users calls for innovative tools that address the challenges faced by big data: volume, velocity and variety.
This tutorial aims to provide researchers and practitioners with a range of tools and techniques that they can use in conjunction with Apache Accumulo to close this gap. The proposed tutorial will focus on building solid fundamentals using a rapid prototyping tool – the Dynamic Distributed Dimensional Data Model (D4M) – to quickly prototype new algorithms that can be tested with Apache Accumulo. The tutorial will be suitable for participants from all levels of experience using Apache Accumulo. The tutorial will begin with a general introduction of the big data landscape in order to align terminology and provide a unified view of the system regardless of participant background. The tutorial will then discuss systems engineering and how it applies to big data systems. We will then introduce D4M and provide examples of D4M being used for analytics such as dimensional analysis and background model fitting. We will then discuss current areas of research on security and privacy as well as graph algorithms. Tutorial slides will be distributed to participants and brief demonstrations will be used to reinforce concepts.
The goals of the tutorial are 1) to provide participants with a theoretical foundation of big data; 2) to demonstrate how Accumulo can be used to solve real problems from diverse domains; and 3) describe future avenues of research. This tutorial provides a deep dive into the topics presented at the 2014 Accumulo Summit in the presentation entitled: “Addressing Big Data Challenges through Innovative Architecture, Databases and Software”.
Speakers
Vijay Gadepally
Technical Staff, Lincoln Laboratory, MIT
Lauren Edwards
Associate Technical Staff, Lincoln Laboratory, MIT
Jeremy Kepner
Senior Technical Staff, Lincoln Laboratory, MIT
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation.
Storm often coexists in Big Data architectures with Hadoop. We will talk about different approaches to this interoperability between the systems, their benefits & trade-offs, and a new open source library available for high throughput use.
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you'll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries; and hear about real-world applications.
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Intel® Software
Integrated into Intel® Advisor, Cache-aware Roofline Modeling (CARM) provides insight into how an application behaves by helping to determine a) how optimally it works on a given hardware, b) the main factors that limit performance, c) if the workload is memory or compute-bound, and d) the right strategy to improve application performance.
This presentation gives an overview of InAccel's Accelerated Machine Learning Suite that allows to speedup ML applications on the cloud without changing your code. The Accelerated ML suite is fully compatible with Apache Spark and allows the seamless utilization of FPGA in AWS in order to speedup your applications. More at https://www.inaccel.com/
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
Linaro is building an OpenStack based Developer Cloud. Here we present what was required to bring OpenStack to 64-bit ARM, the pitfalls, successes and lessons learnt; what’s missing and what’s next.
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
This presentation is a real-world case study about moving a large portfolio of batch analytical programs that process 30 billion or more transactions every day, from a proprietary MPP database appliance architecture to the Hadoop ecosystem in the cloud, leveraging Hive, Amazon EMR, and S3.
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
This talk describes the scale-out, consistent metadata architecture of Hopsworks and how we use it to support custom metadata and provenance for ML Pipelines with Hopsworks Feature Store, NDB, and ePipe . The talk is here: https://www.youtube.com/watch?v=oPp8PJ9QBnU&feature=emb_logo
How to create an enterprise data lake for enterprise-wide information storage and sharing? The data lake concept, architecture principles, support for data science and some use case review.
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
Event: TDWI Accelerate, Seattle, Oct 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Tags: R, Spark, SQL Server
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
Event: TDWI Accelerate Seattle, October 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Description: How to develop scalable and in-DB analytics using R in Spark and SQL-Server
Similar to Introduction to Apache Hivemall v0.5.0 (20)
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
1. Introduction to Apache Hivemall v0.5.0:
Machine Learning on Hive/Spark
Makoto YUI @myui
ApacheCon North America 2018
Takashi Yamamuro @maropu
@ApacheHivemall
1). Principal Engineer,
2). Research Engineer,
1
2. Plan of the talk
1. Introduction to Hivemall
2. Hivemall on Spark
ApacheCon North America 2018
A quick walk-through of feature, usages, what's
new in v0.5.0, and future roadmaps
New top-k join enhancement, and a feature plan
for Supporting spark 2.3 and feature selection
2
3. We released the first Apache release
v0.5.0 on Mar 3rd, 2018 !
hivemall.incubator.apache.org
ApacheCon North America 2018
We plan to start voting for the 2nd Apache release (v0.5.2) in
the next month (Oct 2018).
3
4. What’s new in v0.5.0?
Anomaly/Change Point
Detection
Topic Modeling
(Soft Clustering)
Algorithm:
LDA, pLSA
Algorithm:
ChangeFinder, SST
Hivmall on Spark
2.0/2.1/2.1
SparkSQL/Dataframe support,
Top-k data processing
ApacheCon North America 2018 4
5. What is Apache Hivemall
Scalable machine learning library built
as a collection of Hive UDFs
Multi/Cross
platform VersatileScalableEase-of-use
ApacheCon North America 2018 5
6. Hivemall is easy and scalable …
ML made easy for SQL developers
Born to be parallel and scalable
Ease-of-use
Scalable
100+ lines
of code
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
ApacheCon North America 2018 6
7. Hivemall is a multi/cross-platform ML library
HiveQL SparkSQL/Dataframe API Pig Latin
Hivemall is Multi/Cross platform ..
Multi/Cross
platform
prediction models built by Hive can be used from Spark, and conversely,
prediction models build by Spark can be used from Hive
ApacheCon North America 2018 7
8. Hadoop HDFS
MapReduce
(MRv1)
Hivemall
Apache YARN
Apache Tez
DAG processing
Machine Learning
Query Processing
Parallel Data
Processing Framework
Resource Management
Distributed File System
Cloud Storage
SparkSQL
Apache Spark
MESOS
Hive Pig
MLlib
Hivemall’s Technology Stack
Amazon S3
ApacheCon North America 2018 8
14. Versatile
Hivemall is a Versatile library ..
ü Not only for Machine Learning
ü provides a bunch of generic utility functions
Each organization has own sets of
UDFs for data preprocessing
Don’t Repeat Yourself!
Don’t Repeat Yourself!
ApacheCon North America 2018 14
15. Hivemall generic functions
Array and Map Bit and compress String and NLP
Brickhouse UDFs are merged in v0.5.2 release.
We welcome contributing your generic UDFs to Hivemall
Geo Spatial
Top-k processing
> BASE91
> UNBASE91
> NORMALIZE_UNICODE
> SPLIT_WORDS
> IS_STOPWORD
> TOKENIZE
> TOKENIZE_JA/CN
> TF/IDF
> SINGULARIZE
> TILE
> MAP_URL
> HAVERSINE_DISTANCE
ApacheCon North America 2018 15
JSON
> TO_JSON
> FROM_JSON
16. ApacheCon North America 2018
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT * FROM (
SELECT
*,
rank() over (partition by class order by score desc)
as rank
FROM table
) t
WHERE rank <= 2
RANK over() query does not finishes in 24 hours L
where 20 million MOOCs classes and avg 1,000 students in each classes
16
17. ApacheCon North America 2018
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT
each_top_k(
2, class, score,
class, student
) as (rank, score, class, student)
FROM (
SELECT * FROM table
DISTRIBUTE BY class SORT BY class
) t
EACH_TOP_K finishes in 2 hours J
17
20. List of Supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight
Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓Logistic Regression (SGD)
✓AdaGrad (logistic loss)
✓AdaDELTA (logistic loss)
✓PA Regression
✓AROW Regression
✓Factorization Machines
✓RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not
work
Logistic regression is good for getting a
probability of a positive class
Factorization Machines is good where
features are sparse and categorical ones
ApacheCon North America 2018 20
22. •Squared Loss
•Quantile Loss
•Epsilon Insensitive Loss
•Squared Epsilon Insensitive
Loss
•Huber Loss
Generic Classifier/Regressor
Available Loss functions
•HingeLoss
•LogLoss (synonym: logistic)
•SquaredHingeLoss
•ModifiedHuberLoss
• L1
• L2
• ElasticNet
• RDA
Other options
For Binary Classification:
For Regression:
• SGD
• AdaGrad
• AdaDelta
• ADAM
Optimizer
• Iteration support
• mini-batch
• Early stopping
Regularization
ApacheCon North America 2018 22
24. Training of RandomForest
Good news: Sparse Vector Input (Libsvm
format) is supported since v0.5.0 in
addition Dense Vector input.
ApacheCon North America 2018 24
28. SELECT train_xgboost_classifier(features, label) as (model_id, model)
FROM training_data
XGBoost support in Hivemall (beta version)
SELECT rowed, AVG(predicted) as predicted
FROM (
-- predict with each model
SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
-- join each test record with each model
FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
ApacheCon North America 2018 28
29. Supported Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash
(LSH variant)
✓ Similarity Search on Vector
Space
(Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines
(regression)
each_top_k function of Hivemall is useful for
recommending top-k items
ApacheCon North America 2018 29
30. Other Supported Algorithms
Feature Engineering
✓Feature Hashing
✓Feature Scaling
(normalization, z-score)
✓ Feature Binning
✓ TF-IDF vectorizer
✓ Polynomial Expansion
✓ Amplifier
NLP
✓Basic Englist text Tokenizer
✓English/Japanese/Chinese
Tokenizer
Evaluation metrics
✓AUC, nDCG, logloss, precision
recall@K, and etc
ApacheCon North America 2018 30
32. Feature Engineering – Feature Binning
Maps quantitative variables to fixed number of
bins based on quantiles/distribution
Map Ages into 3 bins
ApacheCon North America 2018 32
35. Other Supported Features
Anomaly Detection
✓Local Outlier Factor (LoF)
✓ChangeFinder
Clustering / Topic models
✓Online mini-batch LDA
✓Online mini-batch PLSA
Change Point Detection
✓ChangeFinder
✓Singular Spectrum
Transformation
ApacheCon North America 2018 35
36. Efficient algorithm for finding change point and outliers from
time-series data
J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE transactions on
Knowledge and Data Engineering, pp.482-492, 2006.
Anomaly/Change-point Detection by ChangeFinder
ApacheCon North America 2018 36
39. Efficient algorithm for finding change point and outliers from
timeseries data
Anomaly/Change-point Detection by ChangeFinder
J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE transactions on
Knowledge and Data Engineering, pp.482-492, 2006.
ApacheCon North America 2018 39
40. • T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point
Correlations", Proc. SDM, 2005T.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
Change-point detection by Singular Spectrum Transformation
ApacheCon North America 2018 40
44. ü Spark 2.3 support
ü Merged Brickhouse UDFs
ü Field-aware Factorization Machines
ü SLIM recommendation
What’s new in the coming v0.5.2
ApacheCon North America 2018
Xia Ning and George Karypis, SLIM: Sparse Linear Methods for Top-N Recommender Systems, Proc. ICDM, 2011.
Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin, "Field-aware Factorization Machines for CTR
Prediction", Proc. RecSys. 2016.
State-of-the-art method for CTR prediction, often used algorithm in Kaggle
Very promising algorithm for top-k recommendation
44
45. ü Word2Vec support
ü Multi-class Logistic Regression
ü More efficient XGBoost support
ü LightGBM support
ü Gradient Boosting
ü Kafka KSQL UDF porting
Future work for v0.6 and later
PR#91
PR#116
ApacheCon North America 2018 45
75. Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library
providing a collection of machine learning algorithms as Hive UDFs/UDTFs
The 2nd Apache release (v0.5.2) will appear soon!
We welcome your contributions to Apache Hivemall J
HiveQL SparkSQL/Dataframe API Pig Latin
ApacheCon North America 2018 75