D4M is a software tool that connects scientists with big data technologies such as Apache Accumulo. The D4M-Accumulo binding provides high-performance connectivity to Accumulo for quick analytic prototyping. Current research aims to implement GraphBLAS server-side iterators and operators on Accumulo tables to support high-performance graph analytics.
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...] (Accumulo Summit)
Talk Abstract
Aggregation has long been a use case for Accumulo Iterators. Iterators' ability to reduce data during compaction and scanning can greatly simplify an aggregation system built on Accumulo. This talk will first review how Accumulo's Iterators and Combiners work in the context of aggregating values. I'll then step back and look at the abstraction of aggregation functions as commutative operations and the several benefits that result from making this abstraction. We will see how it becomes no harder to introduce powerful operations such as cardinality estimation and approximate top-k than it is to sum integers. I will show how to integrate these ideas into Accumulo with an example schema and Iterator. Finally, a practical aggregation use case will be discussed to highlight the concepts from the talk.
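To make the abstraction concrete, here is a hypothetical Python sketch (not Accumulo's Java Iterator API or Koverse's framework): any aggregate that exposes a lift operation and a commutative, associative combine can be folded by a Combiner-style iterator, whether it sums integers or estimates cardinality.

```python
# Hypothetical sketch of "aggregation as a commutative operation"; not
# Accumulo's Java Iterator API. Any aggregate exposing lift() plus an
# associative, commutative combine() can be reduced during compactions
# and scans in whatever order the store hands values over.

class SumAggregator:
    def lift(self, value):
        return int(value)
    def combine(self, a, b):      # combine(a, b) == combine(b, a)
        return a + b

class CardinalityAggregator:
    """Exact distinct count via set union; a HyperLogLog sketch drops
    in here with the same lift/combine contract."""
    def lift(self, value):
        return {value}
    def combine(self, a, b):      # set union: commutative, associative
        return a | b

def reduce_values(agg, values):
    """What a Combiner-style iterator does: fold all versions of a
    key's value into a single aggregate."""
    result = None
    for v in values:
        lifted = agg.lift(v)
        result = lifted if result is None else agg.combine(result, lifted)
    return result

print(reduce_values(SumAggregator(), ["3", "4", "5"]))        # 12
print(len(reduce_values(CardinalityAggregator(), "aabbc")))   # 3
```

Swapping in a HyperLogLog or approximate top-k sketch is just another lift/combine pair, which is why the abstract claims those become no harder than summing integers.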
Speakers
Gadalia O'Bryan
Senior Solutions Architect, Koverse
Gadalia O'Bryan is a Sr. Solutions Architect at Koverse, where she leads customer projects and contributes to key feature and algorithm design, such as Koverse's Aggregation Framework. Prior to Koverse, Gadalia was a mathematician for the National Security Agency. She has an M.A. in mathematics from UCLA and has been working with Accumulo for the past 6 years.
Bill Slacum
Software Engineer, Koverse
Bill is an Accumulo committer and PMC member who has been working on large scale query and analytic frameworks since 2010. He holds BS's in computer science and financial economics from UMBC. Having never used his passport to leave the United States, he is currently a national man of mystery.
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t... (Spark Summit)
Clustering is often an essential first step in data mining, intended to reduce redundancy or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting potential group structures. However, parallelizing such an algorithm is challenging because it exhibits inherent data dependencies during construction of the hierarchical tree. In this paper, we design a parallel implementation of single-linkage hierarchical clustering by formulating it as a minimum spanning tree problem. We further show that Spark is a natural fit for parallelizing single-linkage clustering due to its natural expression of iterative processes. Our algorithm can be deployed easily in Amazon's cloud environment, and a thorough performance evaluation on Amazon EC2 verifies that the algorithm's scalability holds up as datasets grow.
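To make the MST formulation concrete, here is a small serial sketch using SciPy; the paper's contribution is distributing exactly this construction with Spark.

```python
# Serial sketch: single-linkage clustering via a minimum spanning tree.
# Cutting the k-1 heaviest MST edges yields k single-linkage clusters.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
dist = squareform(pdist(points))                  # pairwise distances
mst = minimum_spanning_tree(dist).toarray()

k = 2
edges = sorted(zip(*np.nonzero(mst)), key=lambda e: mst[e])
kept = edges[: len(points) - k]                   # drop the k-1 longest edges

# Union-find to read the clusters off the kept edges.
parent = list(range(len(points)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i
for i, j in kept:
    parent[find(i)] = find(j)
print([find(i) for i in range(len(points))])      # {0,1} and {2,3}
```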
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ... (Spark Summit)
Real-world graphs are seldom static. Applications that generate graph-structured data today do so continuously, giving rise to an underlying graph whose structure evolves over time. Mining these time-evolving graphs can be insightful, both from research and business perspectives. While several works have focused on individual aspects, there exists no general-purpose time-evolving graph processing engine.
We present Tegra, a time-evolving graph processing system built on a general-purpose dataflow framework. We introduce Timelapse, a flexible abstraction that enables efficient analytics on evolving graphs by allowing graph-parallel stages to iterate over the complete history of nodes. We use Timelapse to present two computational models: a temporal analysis model for performing computations on multiple snapshots of an evolving graph, and a generalized incremental computation model for efficiently updating the results of computations.
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi (Databricks)
Modern Data-Intensive Scalable Computing (DISC) systems such as Apache Spark do not support sophisticated cost-based query optimizers because they are specifically designed to process data that resides in external storage systems (e.g., HDFS), or they lack the necessary data statistics. Consequently, many crucial optimizations, such as join ordering and plan selection, are presently out of scope for these DISC system optimizers. Yet join order is one of the most important decisions a cost-based optimizer can make, because a wrong order can make query response time more than an order of magnitude slower than the better order.
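A back-of-the-envelope sketch of why order matters, using the textbook cardinality estimate; all row counts and distinct-value counts are invented for illustration.

```python
# Why join order matters, with the textbook cardinality estimate
# |A join B| ~= |A| * |B| / max(ndv(A.key), ndv(B.key)).

def join_size(rows_a, rows_b, ndv_a, ndv_b):
    return rows_a * rows_b / max(ndv_a, ndv_b)

customers = 150_000 / 5              # customers left after a segment filter
orders, lineitem = 1_500_000, 6_000_000

# Order 1: join the two largest tables first.
inter1 = join_size(orders, lineitem, 1_500_000, 1_500_000)   # 6.0M rows
final1 = join_size(inter1, customers, 150_000, 150_000)      # 1.2M rows

# Order 2: start from the small filtered table.
inter2 = join_size(customers, orders, 150_000, 150_000)      # 300K rows
final2 = join_size(inter2, lineitem, 1_500_000, 1_500_000)   # 1.2M rows

# Same final answer; ~20x less intermediate data with the better order.
print(f"{inter1:,.0f} vs {inter2:,.0f} intermediate rows")
```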
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo... (Databricks)
Apache Spark is rapidly becoming the de facto framework for big-data analytics. Spark's built-in, large-scale machine learning library (MLlib) uses traditional stochastic gradient descent (SGD) to solve standard ML algorithms. However, MLlib currently provides limited coverage of ML algorithms. Further, the convergence of the adopted SGD approach depends heavily on issues such as step-size selection and the conditioning of the problem, making it difficult for non-expert end users to adopt.
In this session, the speakers introduce a large-scale ML tool built on the Alternating Direction Method of Multipliers (ADMM) on Spark to solve a gamut of ML algorithms. The proposed approach decomposes most ML problems into smaller sub-problems suitable for distributed computation in Spark.
Learn how this toolkit provides a wider range of ML algorithms, better accuracy compared to MLlib, robust convergence criteria, and a simple Python API suitable for data scientists, making it easy for end users to develop advanced ML algorithms at scale without worrying about the underlying intricacies of the optimization solver. It's a useful addition to the data scientist's ML arsenal on Spark.
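For intuition about the splitting idea, here is a minimal serial numpy sketch of ADMM on the lasso problem, following the textbook Boyd et al. updates rather than the speakers' implementation.

```python
# Serial ADMM for the lasso: minimize (1/2)||Ax - b||^2 + lam*||x||_1.
# The session's contribution is running this decomposition on Spark.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, b, lam=1.0, rho=1.0, iters=200):
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA_rhoI = A.T @ A + rho * np.eye(n)  # factor once, reuse every iteration
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(AtA_rhoI, Atb + rho * (z - u))  # x-update
        z = soft_threshold(x + u, lam / rho)  # z-update: prox of the L1 term
        u = u + x - z                         # dual variable update
    return z

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
true_x = np.zeros(20)
true_x[:3] = [2.0, -1.5, 1.0]
b = A @ true_x + 0.01 * rng.normal(size=100)
print(np.round(admm_lasso(A, b), 2))  # sparse: three nonzeros recovered
```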
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa... (Spark Summit)
Recent workload trends indicate rapid growth in the deployment of machine learning, genomics, and scientific workloads using Apache Spark. However, efficiently running these applications on cloud computing infrastructure like Amazon EC2 is challenging, and we find that choosing the right hardware configuration can significantly improve performance and cost. The key to addressing this challenge is the ability to predict the performance of applications under various resource configurations so that we can automatically choose the optimal configuration. We present Ernest, a performance prediction framework for large-scale analytics. Ernest builds performance models based on the behavior of the job on small samples of data and then predicts its performance on larger datasets and cluster sizes. Our evaluation on Amazon EC2 using several workloads shows that our prediction error is low while incurring a training overhead of less than 5% for long-running jobs.
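A rough numpy/scipy sketch of the published modeling idea, with invented training points; the features mirror the interpretable terms described in the Ernest paper.

```python
# Sketch of an Ernest-style model: job time as a function of data scale
# s and machine count m, with interpretable terms fit by non-negative
# least squares. Training points below are invented for illustration.
import numpy as np
from scipy.optimize import nnls

def features(s, m):
    # serial overhead, parallel compute, tree aggregation, per-machine cost
    return [1.0, s / m, np.log(m), m]

# (scale, machines, seconds) observed on small sample runs
runs = [(0.1, 2, 14.1), (0.1, 4, 12.8), (0.2, 4, 15.3),
        (0.2, 8, 14.0), (0.4, 8, 16.5), (0.4, 16, 15.5)]
X = np.array([features(s, m) for s, m, _ in runs])
y = np.array([t for _, _, t in runs])
theta, _ = nnls(X, y)              # non-negativity keeps terms interpretable

# Extrapolate to the full dataset on a larger cluster.
print(np.dot(features(1.0, 64), theta))   # predicted seconds at scale
```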
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng (Databricks)
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you'll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries; and hear about real-world applications.
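A minimal usage sketch (assumes the graphframes package is available and `spark` is an active SparkSession):

```python
from graphframes import GraphFrame

v = spark.createDataFrame([("a",), ("b",), ("c",), ("d",)], ["id"])
e = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])
g = GraphFrame(v, e)

# Connected components, the algorithm whose improved implementation the
# talk covers; it requires a checkpoint directory.
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")
g.connectedComponents().show()     # "d" lands in its own component

# Motif finding, the pattern-matching API built on Spark SQL:
g.find("(x)-[]->(y); (y)-[]->(z)").show()
```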
Cost-Based Optimizer in Apache Spark 2.2, with Ron Hu, Sameer Agarwal, Wenchen Fan ... (Databricks)
Apache Spark 2.2 ships with a state-of-the-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, avg/max length) to improve the quality of query execution plans. Leveraging these reliable statistics helps Spark make better decisions when picking the optimal query plan. Examples of these optimizations include selecting the correct build side in a hash join, choosing the right join type (broadcast hash join vs. shuffled hash join), and adjusting a multi-way join order, among others. In this talk, we'll take a deep dive into Spark's cost-based optimizer and discuss how we collect and store these statistics, the query optimizations it enables, and its performance impact on TPC-DS benchmark queries.
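For readers who want to poke at this, a short sketch of the knobs involved; the configuration keys are as documented for Spark 2.2+, while the table and column names are placeholders:

```python
# Enable the cost-based optimizer and feed it statistics.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

# Table-level, then per-column statistics used for the estimates above.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, item_id")

# Collected statistics are visible in the table's extended description.
spark.sql("DESCRIBE EXTENDED sales").show(truncate=False)
```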
Large Scale Machine Learning with Apache Spark (Cloudera, Inc.)
Spark offers a number of advantages over its predecessor MapReduce that make it ideal for large-scale machine learning. For example, Spark includes MLlib, a library of machine learning algorithms for large data. The presentation will cover the state of MLlib and the details of some of the scalable algorithms it includes.
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ... (Spark Summit)
Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in diverse applications ranging from visualization and optimizing data encodings to estimating quantiles, data synthesis, and imputation. The T-Digest is a versatile sketching data structure: it operates on any numeric data, models tricky distribution tails with high fidelity, and, most crucially, works smoothly with aggregators and map-reduce.
T-Digest is a perfect fit for Apache Spark: it is single-pass, and intermediate results can be aggregated across partitions in batch jobs or across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimation, and data synthesis.
Attendees of this talk will leave with an understanding of data sketching with T-Digest sketches, and insights about how to apply T-Digest to their own data analysis applications.
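To see why the "aggregated across partitions" property works, here is a local Python sketch using the third-party tdigest package (an assumption; the implementation described in the talk is Scala): sketches built independently merge into one that still answers quantile queries, which is exactly what a Spark combine step needs.

```python
# Local demonstration of the mergeable-sketch property.
# Requires: pip install tdigest
import random
from tdigest import TDigest

random.seed(1)
data = [random.gauss(0, 1) for _ in range(100_000)]

# Pretend each half of the data lives on a different partition.
left, right = TDigest(), TDigest()
for x in data[:50_000]:
    left.update(x)
for x in data[50_000:]:
    right.update(x)

merged = left + right            # order-independent sketch merge
print(merged.percentile(50))     # ~0.0 (the median of N(0, 1))
print(merged.percentile(99))     # ~2.33
```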
These are the slides for the Productionizing your Streaming Jobs webinar on 5/26/2016.
Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases, and tips to avoid common pitfalls while implementing them (a sketch of one such pattern follows the list)
- Performance Optimization Techniques
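As a concrete instance of one such pattern, here is a minimal PySpark sketch of the classic one-connection-per-partition output pattern from the Spark Streaming programming guide; SinkConnection is a stand-in for a real client such as a Kafka producer or HBase client.

```python
# Amortize connection setup over a whole partition instead of opening
# one connection per record.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

class SinkConnection:
    def send(self, record):
        print(record)                  # placeholder for a real write
    def close(self):
        pass

def send_partition(records):
    conn = SinkConnection()            # in practice, borrow from a pool
    for record in records:
        conn.send(record)
    conn.close()

sc = SparkContext("local[2]", "pattern-demo")
ssc = StreamingContext(sc, 5)          # 5-second batches
lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))

ssc.start()
ssc.awaitTermination()
```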
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S... (DB Tsai)
Nonlinear methods are widely used because they deliver higher accuracy than linear methods; however, nonlinear methods are generally more expensive in model size, training time, and scoring. With proper feature engineering techniques like polynomial expansion, linear methods can be as competitive as nonlinear methods. In the process of mapping the data to a higher-dimensional space, linear methods become subject to overfitting and instability of coefficients, which can be addressed by penalization methods including Lasso and Elastic-Net. Finally, we'll show how to train linear models with Elastic-Net regularization using MLlib.
Several learning algorithms, such as kernel methods, decision trees, and random forests, are nonlinear approaches widely used for their better performance compared with linear methods. However, with feature engineering techniques like polynomial expansion, which maps the data into a higher-dimensional space, the performance of linear methods can be competitive with nonlinear methods. As a result, linear methods remain very useful, given that their training time is significantly faster and the model is just a small vector, which makes the prediction step very efficient and easy. However, by mapping the data into a higher-dimensional space, linear methods become subject to overfitting and instability of coefficients; these issues can be successfully addressed by penalization methods including Lasso and Elastic-Net. The Lasso method, with an L1 penalty, tends to shrink many coefficients exactly to zero while leaving a few others with comparatively little shrinkage. An L2 penalty tends to result in all-small but non-zero coefficients. Combining the L1 and L2 penalties is called the Elastic-Net method, which tends to give a result in between. In the first part of the talk, we'll give an overview of linear methods, including commonly used formulations and optimization techniques such as L-BFGS and OWL-QN. In the second part of the talk, we will discuss how to train linear models with Elastic-Net using our recent contribution to Spark MLlib. We'll also talk about how linear models are practically applied with big datasets, and how polynomial expansion can be used to dramatically increase performance.
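The end point of the talk looks roughly like this in pyspark.ml (a minimal sketch with invented data; the contribution itself is in Scala, and an active SparkSession `spark` is assumed):

```python
# Polynomial expansion followed by an Elastic-Net-regularized model.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.classification import LogisticRegression

training = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.1)),
     (0.0, Vectors.dense(2.0, 1.0)),
     (1.0, Vectors.dense(0.1, 1.3)),
     (0.0, Vectors.dense(1.8, 0.9))],
    ["label", "features"])

# Map features into a higher-dimensional space so the linear model can
# capture nonlinear structure, as the abstract describes.
poly = PolynomialExpansion(degree=2, inputCol="features",
                           outputCol="polyFeatures")
expanded = poly.transform(training)

# elasticNetParam = 0.0 is pure L2 (ridge), 1.0 is pure L1 (lasso).
lr = LogisticRegression(featuresCol="polyFeatures", regParam=0.1,
                        elasticNetParam=0.5, maxIter=100)
model = lr.fit(expanded)
print(model.coefficients)  # the L1 part drives some coefficients to zero
```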
DB Tsai is an Apache Spark committer and a Senior Research Engineer at Netflix. He has recently been working with the Apache Spark community to add several new algorithms, including linear regression and binary logistic regression with Elastic-Net (L1/L2) regularization, multinomial logistic regression, and an L-BFGS optimizer. Prior to joining Netflix, DB was a Lead Machine Learning Engineer at Alpine Data Labs, where he developed innovative large-scale distributed linear algorithms and contributed them back to the open source Apache Spark project.
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ... (Databricks)
Apache Spark 2.2 shipped with a state-of-the-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, avg/max length) to improve the quality of query execution plans. Skewed data distributions are inherent in many real-world applications. To deal with skewed distributions effectively, we added equi-height histograms to Apache Spark 2.3. Leveraging reliable statistics and histograms helps Spark make better decisions in picking the optimal query plan for real-world scenarios.
In this talk, we'll take a deep dive into how Spark's cost-based optimizer estimates the cardinality and output size of each database operator. Specifically, for skewed workloads such as TPC-DS, we will show how histograms change query plans and the performance gains that result.
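A sketch of the histogram knobs described here; the configuration keys are as documented for Spark 2.3, while the table and column names are placeholders:

```python
# Histograms are gathered during column-level ANALYZE once enabled.
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")
spark.conf.set("spark.sql.statistics.histogram.numBins", "254")  # the default

spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS ss_quantity")
# The equi-height histogram then sharpens selectivity estimates for
# predicates on skewed columns, e.g. WHERE ss_quantity < 10.
```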
Greg Hogan – To Petascale and Beyond: Apache Flink in the Clouds (Flink Forward)
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink's laboratory for building and tuning scalable graph algorithms and analytics. In this talk we'll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library, including scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We'll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we'll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta... (Spark Summit)
Netflix is the world's largest streaming service, with over 80 million members in more than 190 countries. Netflix uses machine learning to inform nearly every aspect of the product, from the recommendations you get, to the box art you see, to the decisions about which TV shows and movies are created.
Given this scale, we use Apache Spark as the engine of our recommendation pipeline. Apache Spark enables Netflix to use a single, unified framework/API for ETL, feature generation, model training, and validation. With the Pipelines framework in Spark ML, each step within the Netflix recommendation pipeline (e.g., label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators, and Evaluators, enabling modularity, composability, and testability. Thus, Netflix engineers can build feature engineering logic as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators, and with these building blocks we can more easily experiment with new pipelines and rapidly deploy them to production.
In this talk, we will discuss how Apache Spark is used as a distributed framework on which we build our own algorithms to generate personalized recommendations for each of our 80+ million subscribers, the specific techniques we use at Netflix to scale, and the various pitfalls we've found along the way.
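A minimal pyspark.ml sketch of the Transformer/Estimator pipeline pattern described above (column names and data are invented; assumes an active SparkSession `spark`):

```python
# Feature stages and a learner chained into one unit that is fit once
# and reused for scoring.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

train_df = spark.createDataFrame(
    [("US", 3.0, 1.0), ("FR", 1.0, 0.0), ("US", 4.5, 1.0), ("JP", 0.5, 0.0)],
    ["country", "watch_hours", "label"])

indexer = StringIndexer(inputCol="country", outputCol="country_idx")
assembler = VectorAssembler(inputCols=["country_idx", "watch_hours"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(train_df)          # each stage fits/transforms in order
model.transform(train_df).select("country", "probability").show()
```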
Generalized Linear Models in Spark MLlib and SparkR (Databricks)
Generalized linear models (GLMs) unify various statistical models such as linear regression and logistic regression through the specification of a model family and link function. They are widely used in modeling, inference, and prediction, with applications in numerous fields. In this talk, we will summarize recent community efforts in supporting GLMs in Spark MLlib and SparkR. We will review supported model families, link functions, and regularization types, as well as their use cases, e.g., logistic regression for classification and log-linear models for survival analysis. Then we discuss the choices of solvers and their pros and cons given training datasets of different sizes, and implementation details needed to match R's model output and summary statistics. We will also demonstrate the APIs in MLlib and SparkR, including R model formula support, which make building linear models a simple task in Spark. This is joint work with Eric Liang, Yanbo Liang, and other Spark contributors.
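A minimal sketch of the family/link specification in pyspark.ml (invented data; assumes an active SparkSession `spark`):

```python
# The model family and link function are specified directly, and the
# fitted model exposes R-style summary statistics.
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

train_df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.0)),
     (2.0, Vectors.dense(1.0, 0.0)),
     (5.0, Vectors.dense(2.0, 3.0)),
     (3.0, Vectors.dense(2.0, 1.0)),
     (0.0, Vectors.dense(0.0, 0.0))],
    ["label", "features"])

glm = GeneralizedLinearRegression(family="poisson", link="log")
model = glm.fit(train_df)
print(model.coefficients, model.intercept)
print(model.summary.coefficientStandardErrors)  # matches R's glm() summary
print(model.summary.pValues)
```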
Random Walks on Large Scale Graphs with Apache Spark with Min Shen (Databricks)
Random walks on graphs are a useful technique in machine learning, with applications in personalized PageRank, representation learning, and others. This session will describe a novel algorithm for enumerating walks on large-scale graphs that benefits from several unique abilities of Apache Spark.
The algorithm generates a recursive branching DAG of stages that separates out the “closed” and “open” walks. Spark’s shuffle file management system is ingeniously used to accumulate the walks while the computation is progressing. In-memory caching over multi-core executors enables moving the walks several “steps” forward before shuffling to the next stage.
See performance benchmarks, and hear about LinkedIn's experience with Spark in production clusters. The session will conclude with an observation of how Spark's unique and powerful constructs open new models of computation, not possible with the current state of the art, for developing high-performance, scalable algorithms in data science and machine learning.
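For contrast, here is a naive baseline sketch (emphatically not LinkedIn's algorithm): extending every walk one step per iteration by joining the walk frontier with the adjacency list, where each iteration pays a full shuffle, the cost the branching-DAG design above mitigates.

```python
# Naive distributed random walk: one join (shuffle) per step.
import random
from pyspark import SparkContext

sc = SparkContext("local[2]", "walk-demo")
edges = sc.parallelize([("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])])

# A walk is (frontier_vertex, path); start one walk per vertex.
walks = edges.keys().map(lambda v: (v, [v]))

def step(kv):
    frontier, (path, neighbors) = kv
    nxt = random.choice(neighbors)        # sample the next hop
    return (nxt, path + [nxt])

for _ in range(3):                        # walks of length 3
    walks = walks.join(edges).map(step)   # join == shuffle

print(walks.values().collect())           # e.g. [['a', 'b', 'c', 'a'], ...]
```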
This was presented for an O'Reilly Media webcast. http://www.oreilly.com/pub/e/3152?cmp=tw-na-webcast-product-webcast_an_introduction_to_apache_accumulo
This webcast will cover the basics of Apache Accumulo's architecture and how it works, along with examples of how it is used. We'll also talk about some interesting use cases, such as text indexing, fine-grained multi-level access controls, and storing large-scale graphs. We'll also briefly touch on what sets Accumulo apart from other similar and not-so-similar systems, and where we think the Accumulo project is headed in a technical direction.
A description of Accumulo from the Apache Accumulo website:
The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, ZooKeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Other notable improvements and features are outlined on the project site. Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design, including HBase, Hypertable, and Cassandra. Accumulo began its development in 2008 and joined the Apache community in 2011.
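As an illustration of the data model being described (a sketch, not a client API), an Accumulo entry is a sorted multi-part key mapped to a value, and per-cell visibility labels are evaluated against a scanner's authorizations:

```python
# Illustration of Accumulo's data model. Entries are five-part keys
# mapped to values, kept globally sorted; that ordering is what makes
# range scans and server-side iterators efficient.
from collections import namedtuple

Key = namedtuple("Key", "row cf cq visibility ts")

table = {
    Key("user123", "profile", "name",    "public",      100): b"Alice",
    Key("user123", "wallet",  "balance", "admin|audit", 100): b"42.00",
    Key("user456", "profile", "name",    "public",      100): b"Bob",
}

def visible(expr, auths):
    # Toy evaluator handling only OR ("|"); real visibility expressions
    # also support AND ("&") and parentheses.
    return any(label in auths for label in expr.split("|"))

def scan(row_prefix, auths):
    for key in sorted(table):  # tuples sort the way Accumulo sorts keys
        if key.row.startswith(row_prefix) and visible(key.visibility, auths):
            yield key.cq, table[key]

print(list(scan("user123", {"public"})))           # name only
print(list(scan("user123", {"public", "admin"})))  # name and balance
```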
Accumulo Summit 2015: Event-Driven Big Data with Accumulo - Leveraging Big Da... (Accumulo Summit)
Talk Abstract
Events define our world – designing a system that rapidly adapts to and incorporates many diverse events into relevant, dynamic models produces rich, timely situational analysis. Additionally, events happen at a defined time, allowing analysis to move backward and forward in time, even into imaginary time with “what if” events.
Accumulo allows the assembly of extremely large event sets to form a high-resolution, dynamic event fabric. Key Accumulo design constructs enable high-velocity, disparate events to form complex models. These include a high-performance, flexible event data model and vocabulary with efficient indexing and complex event contexts; dynamic event versioning; management of event race conditions; event security; incorporation of event confidence; and correction of event errors. Processing involves high-speed messaging and flexible rule tables. Complementing this is an elastic architecture that handles unpredictable event surges and timely analysis demands across multiple nodes.
Accumulo makes high event resolution possible; event-driven makes it immediately actionable. Our findings detail the ingestion and processing of billions of actual events that transform into dynamic decision models instantly available at big-data scale. Additionally, the changes over time of the events and decision models provide a rich strategic base for analytics. A working demonstration using Amazon EC2 VMs, Elastic MapReduce, and Accumulo stores summarizes the event-driven approach. VMs are available to all conference participants for future investigations.
Speaker
John Hebeler
Principal Engineer, Lockheed Martin
John Hebeler, Principal Engineer for Lockheed Martin, is the Technical Lead Developer on a major big-data analytic system based on Accumulo. He focuses on Big Data streaming architectures, diverse data integration, and the Semantic Web, has co-written Semantic Web Programming and a P2P networking book, holds two patents on distributed technologies, and presents at technical conferences. He is currently pursuing his PhD in Information Systems based upon Big Data Integration at the University of Maryland.
Almost every week, news of a proprietary or customer data breach makes headlines. While attackers have increased the sophistication of their tactics, organizations too have advanced in their ability to build a robust, data-driven defense. Join Hortonworks and Sqrrl to learn how a Modern Data Architecture with Hortonworks Data Platform (HDP) and Sqrrl Enterprise enables intuitive exploration, discovery, and pattern recognition over your big cybersecurity data.
In this webinar you will learn:
--What makes Apache Hadoop a perfect fit for accumulating cybersecurity data and diagnosing the latest attacks
--Effective ways to pinpoint and reason about correlated events within your data, and to assess your network security posture
--How a Modern Data Architecture that combines the power of Hadoop in Hortonworks Data Platform with the massive, secure, entity-centric data models of Sqrrl Enterprise can discover hidden patterns and detect anomalies within your data using linked data analysis
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
How to use Impala query plan and profile to fix performance issues (Cloudera, Inc.)
Apache Impala is a best-of-breed massively parallel processing SQL query engine and a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses, explains how Impala optimizes queries, and shows how to identify performance bottlenecks through the query plan and profile and how to drive Impala to its full potential.
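A small sketch of pulling a plan programmatically with impyla, Impala's Python DB-API client; the host, port, and query here are placeholders, and the same EXPLAIN and PROFILE output is available interactively from impala-shell:

```python
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# EXPLAIN shows the plan the cost model chose: join strategy, scan
# predicates, and per-node resource estimates.
cur.execute(
    "EXPLAIN SELECT c.name, SUM(o.total) "
    "FROM orders o JOIN customers c ON o.cust_id = c.id "
    "GROUP BY c.name")
for (line,) in cur.fetchall():  # each row is one line of the plan text
    print(line)
```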
Presented by Michelle Hirsch, Head of MATLAB Product Management, MathWorks, on 28th April at the joint languages meetup @Walmart in Bangalore.
Companies are scrambling to get insight from the massive quantities of data they collect but are struggling to find employees who combine deep expertise in computer science, statistics, and machine learning with the domain expertise to truly understand the data. In this talk, Dr. Hirsch discusses how MATLAB enables engineers and scientists to apply their domain expertise to big data analytics.
Highlights:
* Accessing data in large text files, databases, or from the Hadoop Distributed File System (HDFS)
* Using virtual “tall” arrays to process out-of-core data with natural mathematical syntax
* Developing machine learning models
* Integrating MATLAB analytics into production systems
About the speaker: Michelle Hirsch, Ph.D. is responsible for driving strategy and direction for MATLAB, the leading programming platform for engineers and scientists. Based outside of Boston, Massachusetts, Michelle is joining our meetup during a trip to meet with MATLAB users across India.
Supporting data: https://www.slideshare.net/CodeOps/flight-test-analysis-final
5th in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
PGQL: A Query Language for Graphs
Learn how to query graphs using PGQL, an expressive and intuitive graph query language that's a lot like SQL. With PGQL, it's easy to start writing graph analysis queries against the database in a very short time. Albert and Oskar show what you can do with PGQL, and how to write and execute PGQL code.
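An illustrative PGQL query appears below; the PGQL string is the point, while the surrounding pypgx calls are assumptions based on Oracle's Python PGX client and may differ by version.

```python
# The PGQL query is the point; the session/graph calls are assumed.
import pypgx

session = pypgx.get_session()                                  # assumed entry point
graph = session.read_graph_with_properties("bank_graph.json")  # placeholder graph

# SQL-like SELECT/WHERE/GROUP BY over a graph pattern MATCH:
result = graph.query_pgql("""
  SELECT a.name, COUNT(t) AS transfers
  FROM MATCH (a:Account) -[t:TRANSFER]-> (b:Account)
  WHERE b.flagged = true
  GROUP BY a.name
  ORDER BY transfers DESC
""")
result.print()
```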
Presented at IDEAS SoCal on Oct 20, 2018. I discuss the main approaches to deploying data science engines to production and provide sample code for the comprehensive approach of real-time scoring with MLeap and Spark ML.
Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. The core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency. Neptune supports the popular graph query languages Apache TinkerPop Gremlin, the W3C's SPARQL, and Neo4j's openCypher, enabling you to build queries that efficiently navigate highly connected datasets. Neptune powers graph use cases such as recommendation engines, fraud detection, knowledge graphs, drug discovery, and network security. Neptune is highly available, with read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across Availability Zones. Neptune provides data security features, with support for encryption at rest and in transit. Neptune is fully managed, so you no longer need to worry about database management tasks like hardware provisioning, software patching, setup, configuration, or backups.
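A minimal sketch of querying Neptune's Gremlin endpoint from Python with the gremlinpython driver; the cluster endpoint and the traversal itself are placeholders.

```python
from gremlin_python.driver import client

c = client.Client(
    "wss://my-neptune-cluster.us-east-1.neptune.amazonaws.com:8182/gremlin",
    "g")

# A recommendation-style traversal: products bought by people who
# bought what this user bought.
result = c.submit(
    "g.V('user-42').out('bought').in('bought').out('bought')"
    ".dedup().limit(10).valueMap()"
).all().result()
print(result)
c.close()
```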
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these by running multiple iterations in parallel.
Many of these techniques require a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computation. In particular, I will describe how you can use the sparklyr package to distribute data manipulations using dplyr syntax on a cluster of servers provisioned in the Azure cloud.
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
Presented at JAX London
MapReduce begat Hadoop begat Big Data. NoSQL moved us away from the stricture of monolithic storage architectures to fit-for-purpose designs. But, Houston, we still have a problem: architects are still designing systems as if it were the '70s. SOA went from buzzword to the bank with the emergence and evolution of the cloud and on-demand, right-now elasticity. Yet most systems are still designed to store-then-compute rather than to observe, orient, decide, and act on in-flight data.
The relationships between data sets matter. Discovering, analyzing, and learning those relationships is central to expanding our understanding, and is a critical step toward being able to predict and act upon the data. Unfortunately, these are not always simple or quick tasks.
To help the analyst we introduce RAPIDS, a collection of open-source libraries, incubated by NVIDIA and focused on accelerating the complete end-to-end data science ecosystem. Graph analytics is a critical piece of the data science ecosystem for processing linked data, and RAPIDS is pleased to offer cuGraph as our accelerated graph library.
Simply accelerating algorithms addresses only a portion of the problem. To address the full problem space, RAPIDS cuGraph strives to be feature-rich, easy to use, and intuitive. Rather than limiting the solution to a single graph technology, cuGraph supports property graphs, knowledge graphs, hypergraphs, bipartite graphs, and basic directed and undirected graphs.
A Python API allows the data to be manipulated as a DataFrame, similar to and compatible with pandas, with inputs and outputs shared across the full RAPIDS suite, for example with the RAPIDS machine learning package, cuML.
This talk will present an overview of RAPIDS and cuGraph, discuss and show examples of how to manipulate and analyze bipartite and property graphs, and show how data can be shared with machine learning algorithms. The talk will include some performance and scalability metrics, then conclude with a preview of upcoming features, such as graph query language support, and the general RAPIDS roadmap.
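A minimal cuGraph sketch of that DataFrame-centric API (requires a RAPIDS-enabled GPU environment; the edge list is invented):

```python
# Edges live in a GPU DataFrame, the algorithm runs on the GPU, and the
# result stays in the RAPIDS ecosystem.
import cudf
import cugraph

edges = cudf.DataFrame({"src": [0, 1, 2, 2], "dst": [1, 2, 0, 3]})

G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# PageRank on the GPU; the result is a cudf DataFrame that can feed
# cuML directly or convert to pandas for inspection.
scores = cugraph.pagerank(G)
print(scores.sort_values("pagerank", ascending=False).to_pandas())
```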
The New Frontiers of AI in RPA with UiPath Autopilot™ (UiPath Community)
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates artificial intelligence into the development and use of automations.
📕 Together we'll look at examples of using Autopilot across several tools in the UiPath suite:
Autopilot for Studio Web
Autopilot for Studio
Autopilot for Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk encourages a more independent approach to using PHP frameworks, helping developers move towards more flexible and future-proof PHP development.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]
1. D4M and Apache Accumulo
Vijay Gadepally, Lauren Edwards,
Dylan Hutchison, Jeremy Kepner
Accumulo Summit
College Park, MD
April 29, 2015
This work is sponsored by the Assistant Secretary of Defense for Research
and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions,
interpretations, recommendations and conclusions are those of the authors
and are not necessarily endorsed by the United States Government
2. Accumulo Summit
Giving away the punch line
• D4M is a popular open source software tool that
connects scientists with Big Data technologies
• D4M-Accumulo binding provides high performance
connectivity to Apache Accumulo for quick analytic
prototyping
• Graphulo: Implement GraphBLAS server-side
iterators and operators on Accumulo tables
4. Accumulo Summit
Common Big Data Challenge
[Figure: from 2000 to 2015 & beyond, a widening gap between rapidly growing data feeds (OSINT, C2, ground, maritime, space, cyber, air, HUMINT, weather, web) and the users who need them (operators, analysts, commanders)]
Rapidly increasing
- Data volume
- Data velocity
- Data variety
- Data veracity (security)
5. Accumulo Summit
Common Big Data Architecture
[Figure: the common big data architecture - users (operators, analysts, warfighters) consume analytics (A-E) built on databases fed by ingest & enrichment of diverse data sources (OSINT, C2, ground, maritime, space, cyber, air, HUMINT, weather, web), all running on shared computing (web, files, scheduler)]
6. Accumulo Summit
Common Big Data Architecture
- Data Volume: Cloud Computing -
[Figure: the common big data architecture diagram with the computing layer highlighted]
MIT SuperCloud merges four clouds: enterprise cloud, big data cloud, database cloud, and compute cloud
7. Accumulo Summit
Common Big Data Architecture
- Data Velocity: Accumulo Database -
[Figure: the common big data architecture diagram with the databases layer highlighted]
Lincoln benchmarking validated Accumulo performance
8. Accumulo Summit
Common Big Data Architecture
- Data Variety: D4M Schema -
[Figure: the common big data architecture diagram with the D4M schema (rows, columns, Σ, raw) highlighted]
D4M demonstrated a universal approach to diverse data: intel reports, DNA, health records, publication citations, web logs, social media, building alarms, cyber, ... all handled by a common 4-table schema
9. Accumulo Summit
Common Big Data Architecture
- Data Veracity: Security Tools-
[Figure: the common big data architecture diagram with the security tools highlighted]
Using cryptography to protect sensitive data:
- Verifiable Query Results
- Computing on Masked Data (CMD)
[Figure: CMD workflow - a plaintext query is encrypted into a masked query against the big data cloud; the masked analytic result is decrypted into a plaintext analytic result]
11. Accumulo Summit
High Level Language: D4M
http://d4m.mit.edu
[Figure: D4M, the Dynamic Distributed Dimensional Data Model, sits between a numerical computing environment and the Accumulo distributed database; queries return associative arrays]
A D4M query returns a sparse matrix or a graph…
…for statistical signal processing or graph analysis in MATLAB
D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
12. Accumulo Summit
What is D4M?
• The Dynamic Distributed Dimensional Data Model:
– Support for mathematical foundation – associative arrays
– Schema to represent most unstructured data as associative arrays
– Software tools to connect with a variety of databases such as Apache Accumulo, SciDB, MySQL, PostgreSQL, …
• Software tools currently implemented in MATLAB/Octave, and Julia (v1)
• Connect to databases via JDBC (relational), SHIM (SciDB), or custom Java API (Accumulo)
13. Accumulo Summit
Mathematical Foundation: Associative Arrays
• Key innovation: mathematical closure
– All associative array operations return associative arrays
• Enables composable mathematical operations:
A + B   A - B   A & B   A | B   A * B
• Enables composable query operations via array indexing:
A('alice bob ',:)   A('alice ',:)   A('al* ',:)
A('alice : bob ',:)   A(1:2,:)   A == 47.0
• Simple to implement in a library in programming environments with: 1st-class support of 2D arrays, operator overloading, sparse linear algebra
• Complex queries with ~50x less effort than Java/SQL
• Naturally leads to high performance parallel implementation
• Need a schema to convert arbitrary data to associative arrays
14. Accumulo Summit
D4M Data Schema
• A structure described in a language supported by the database management system
• Use the D4M schema to represent heterogeneous data types in a common data format
– Schema converts structured or unstructured raw text to a tuple representation supported by Accumulo
• Usually use a 4-table representation:
– the Edge Table, the Transpose Table, the Degree Table, the Raw Table
33659254179712 2013-05-20 21:21:42 20798128 kiefpief web 3b77caf94bfc81fe I am sending love to Oklahoma. And actually -- to everyone who may need it. You are loved. And you are not alone. Promise. #PrayforOklahoma
33660010027264 2013-05-20 21:54:56 35.99894978 -78.90660222 -8783842.7781526 4300476.86376416 22435220 RyanBLeslie Twitter for iPad 348803787 bced47a0c99c71d0 @HaydenBigCntry RT @jiminhofe: The devastation in Oklahoma is
…
D4M Schema
(33659254179712, time|2013-05-20 21:21:42, 1)
(33659254179712, user|kiefpief, 1)
(33659254179712, text, Sending love to OK #PrayforOklahoma)
(33659254179712, word|Sending, 1)
(33660010027264, time|2013-05-20 21:54:56, 1)
(33660010027264, lat|35.99894978, 1)
(33660010027264, lon|-78.90660222, 1)
(33660010027264, user|RyanBLeslie, 1)
(33660010027264, RT|@HaydenBigCntry, 1)
(33660010027264, word|Oklahoma, 1)
…
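To make the mapping concrete, here is a hedged sketch of inserting one exploded record with the D4M API (the table binding T and the choice of three triples are illustrative):

% One tweet exploded into (row, field|value, 1) triples.
r = '33659254179712,33659254179712,33659254179712,';
c = 'time|2013-05-20 21:21:42,user|kiefpief,word|Sending,';
v = '1,1,1,';
E = Assoc(r, c, v);   % edge-table view of the record
put(T, E);            % with a table pair T = DB('Tedge','TedgeT'), D4M also
                      % populates the transpose table for fast column queries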
17. Accumulo Summit
D4M Software Library
• Associative Array representation works very well as an interface among databases
• D4M currently implemented in languages with first-class support of sparse matrices:
– MATLAB
– GNU Octave
– Julia (in progress)
• Implemented in ~2000 lines of MATLAB code
Download D4M source from d4m.mit.edu:
– d4m_api.zip (matlab_src/, d4m_api_java.jar)
– libext.zip (dependency JARs)
18. Accumulo Summit
D4M: What a user sees
[Figure: software stack - (row, col, val) Matlab strings → d4m Matlab API → d4m_api_java Java API → Accumulo Java API → Accumulo table]
% D4M Associative Array API
row = 'r1,r2,'; col = 'c1,c1,'; val = '7,3,';
A = Assoc(row,col,val,@min);
% D4M Accumulo API
DB = DBserver('zoohost.edu:2181', 'Accumulo', 'instance', 'user', 'password');
T = DB('Table');  % Create table if it doesn't exist.
put(T,A);         % Put associative array A into T.
Aret = T(:,:);    % Scan all of T.
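A hedged usage note: because a scan also returns an Assoc, the write and read sides compose; for example (row and column keys taken from the snippet above):

Asub = T('r1,',:);          % fetch one row back as an Assoc
v    = Asub('r1,','c1,');   % then index it like any associative array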
19. Accumulo Summit
D4M: What a developer sees
Type             | Matlab/Julia File            | Java Class           | Use
Table management | DBcreate.m                   | D4mDbTableOperations | Create table
Table management | @DBserver/ls.m               | D4mDbInfo            | List tables
Table management | @DBtable/nnz.m               | D4mDbTableOperations | Number of entries in table, summed from the table's tablets
Table management | DBdelete.m                   | D4mDbTableOperations | Delete table
Write            | DBinsert.m                   | D4mDbInsert          | Insert
Scan             | @DBtable/DBtable.m           | D4mDataSearch        | Create query holder
Scan             | @DBtable/subsref.m           | D4mDataSearch        | Do query, possibly holding batches
Scan             | @DBtable/close.m             | D4mDataSearch        | Reset query
Delete           | @DBtable/deleteTriple.m      | AccumuloDelete       | Delete entries
Delete           | @DBtable/deleteAssoc.m       | AccumuloDelete       | Delete entries
Iterators        | @DBtable/ColCombiner.m       | D4mDbTableOperations | List table iterators
Iterators        | @DBtable/addColCombiner.m    | D4mDbTableOperations | Add all-scope table iterator
Iterators        | @DBtable/deleteColCombiner.m | D4mDbTableOperations | Remove iterator
Splits           | @DBtable/Splits.m            | D4mDbTableOperations | Return splits, number of entries in each tablet, tablet server addresses
Splits           | @DBtable/addSplits.m         | D4mDbTableOperations | Add new table split
Splits           | @DBtable/putSplits.m         | D4mDbTableOperations | Replace table splits, merging old splits
Splits           | @DBtable/mergeSplits.m       | D4mDbTableOperations | Remove splits by merging tablets
• Source code released and available!
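A hedged sketch of the iterator entries above (the argument order and the combiner type string are assumptions inferred from the file names; consult the released source for exact signatures):

addColCombiner(T, 'deg|,', 'sum');   % add a summing combiner on the degree columns
ColCombiner(T)                       % list the iterators currently on the table
deleteColCombiner(T, 'deg|,');       % remove it again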
20. Accumulo Summit
D4M Write
More details on Batched Insert – 500 kB by default
• putNumBytes() controls the number of entries to insert in one batch, on the MATLAB side
• Independent batches: each creates, flushes, and closes a separate BatchWriter
• Guarantees BatchWriters are correctly closed
• No need to maintain the BatchWriter lifecycle in MATLAB
• 30 ms maximum latency before flushing
• 50 write threads
• 1 MB maximum memory on the BatchWriter, plenty for the default batch size
[Figure: mapping an Assoc triple onto the Accumulo key/value - Row ID ← Assoc row; Column Family ← putColumnFamily(); Column Qualifier ← Assoc column; Visibility ← putSecurity(); Timestamp; Value ← Assoc value]
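A hedged sketch of how these calls fit together (the function names come from the diagram above; exact signatures and the label strings may differ across D4M versions):

% Configure where Assoc triples land in the Accumulo key.
T = putColumnFamily(T, 'd4m');  % column family used for new inserts
T = putSecurity(T, 'FOUO');     % visibility label applied to new entries
put(T, A);                      % rows, columns, and values map as diagrammed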
21. Accumulo Summit
D4M Scan Example
1. Translate Matlab queries into ranges for the BatchScanner
T(:,:)                     % Scan all
T('r1;r5;:;r7;', :)        % Scan given row ranges
T(:, 'c1;')                % Use fetchColumn(), or a row scan on the transpose table
T('r5;:;r9;', 'c1;:;c3;')  % Complicated; break into simpler queries
2. Hold state of Scanner iterator as state of MATLAB object
T_it = Iterator(T, 'elements', 1e5); % 100k entry batch size
A = T_it(:,:); % Initial query
while nnz(A) % While there is another batch
handleBatch(A);
A = T_it(); % Get next batch
end
22. Accumulo Summit
Parallel Accumulo Access
Sample script writing files to Accumulo in parallel:
T = DB('Tedge','TedgeT');
myFiles = global_ind(zeros(Nfile,1,map([Np 1],{},0:Np-1))); % pMATLAB map: partition file indices across Np processes
for i = myFiles
fname = ['data/' num2str(i)]; % Create filename.
load([fname '.A.mat']); % Load file data.
put(T,num2str(A)); % Insert to Accumulo.
end
Run on 4 local processors: eval(pRUN('Script',4,{}));
• D4M + pMATLAB gives rise to high performance
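A brief gloss on the pMATLAB idioms above (my note, not from the slide): map([Np 1],{},0:Np-1) declares an Np-by-1 block distribution over processes 0 through Np-1, and global_ind returns the indices owned by the local process, so each of the Np processes loads and inserts a disjoint subset of the files with no coordination required.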
23. Accumulo Summit
Accumulo Scaling on MIT SuperCloud
• Scales linearly with ingest processes, server nodes, and data size
[Figure: ingest rate versus number of ingest processes, with one curve per number of server nodes]
24. Accumulo Summit
115,000,000 inserts per second
• Using supercomputing techniques allows the peak insert rate to be achieved within seconds of launch
[Figure: insert rate over time for a 1M-edge Graph500 graph workload; 43B edges ingested in 5 minutes]
39. Accumulo Summit
Summary
• D4M is a popular software tool that connects
scientists with Big Data technologies
• D4M-Accumulo binding provides high performance
connectivity to Apache Accumulo for quick analytic
prototyping
• Current research expands this connection to support
high performance graph analytics
40. Accumulo Summit
• Graphulo: Implement GraphBLAS server-side iterators and operators on Accumulo tables
• Use case: Queued analytics = localized within a neighborhood
• Aim for Accumulo Contrib
• Released:
– Design Document
• Upcoming:
– Beta version of tools in late May/early June
• Future:
– Scalability
– Schemas
– More example algorithms
G R A P H U L O
http://graphulo.mit.edu
Graphulo: Contact Dylan Hutchison if you have any thoughts! dhutchis@mit.edu
41. Accumulo Summit
Acknowledgements
• Bill Arcand
• Bill Bergeron
• David Bestor
• Chansup Byun
• Matt Hubbell
• Jeremy Kepner
• Jake Bolewski
• Pete Michaleas
• Julie Mullen
• Andy Prout
• Albert Reuther
• Tony Rosa
• Charles Yee
• Dylan Hutchison
And many more …
We make use of the high-level language D4M to enable the construction of graph representations of large-scale data.
D4M is made up of three components: a mathematical foundation, the D4M schema, and software tools.
Associative array operations are composable, enabling complex queries to be constructed in a few lines of code.
Shorter code -> Easier to audit, better for security!
Introduction to schemas and how they map onto triple stores. The D4M schema allows one to use the mathematics of associative arrays to perform mathematical operations on big data sets.
D4M schema converts structured or unstructured raw data to the 3-tuple representation supported by Accumulo:
row is a unique identifier (often some variation of a time stamp)
column is a unique representation of the data
value is typically just ‘1’
The standard “exploded” D4M schema is used with many databases. The 4-table schema is introduced.
Outline slide.
@min is the collision function. From a user's perspective, one sees a software library.
From an Accumulo developer's point of view, one sees the connection between the library and Java calls to the Accumulo library.
Justification for the 50 write threads and 30 ms latency: these values were chosen after performance tuning.
The 1 MB maximum BatchWriter memory could create trouble if the chunk size were increased.
Set the Column Family and Visibility with separate calls.
Design choice: we could have used a handle object with an explicit destructor that closes the BatchWriter when the object is destroyed. Instead, writes are synchronous, which is simpler.
In a D4M scan, the underlying Java library can be invoked in several different ways.
In order to achieve high performance, one can combine D4M with pMATLAB.
Accumulo demonstrates linear scaling across data sizes and hardware
Achieved a peak performance of 115,000,000 inserts per second.
Outline slide.
Now, we will show a demonstration of D4M in action on a Twitter dataset. This shows how quickly these tools can be used to prototype algorithms.
Step 1: Set table bindings to Accumulo database
Now you can query tweets using the keyword of interest. Recall that this calls a scanner on the server side
To find common locations, we can make use of the tweet geohash
You can filter tweets based on location; in this case, the selected geohash range corresponds to northern California.
You can use the D4M schema txt table to get the full tweets.
Remove all common words – referred to as stop words
Find words that occur together in the same tweet
Get rid of self loops
Find common words used together in a tweet. Any surprises?
Plot onto a map. Of course, there are many different viz tools that can be used
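Pulling these narrated steps together, here is a condensed, hedged sketch of the demo pipeline (the table names follow the 4-table schema above; the geohash prefix, the stopWords list, and the diag helper on Assoc are illustrative assumptions, not the exact demo code):

T    = DB('Tedge','TedgeT');    % Step 1: bind to the exploded tweet tables
Ttxt = DB('Ttxt');              % raw-text table from the 4-table schema
A    = T(:, 'word|oklahoma,');  % keyword query; runs a server-side scan
Aloc = T(Row(A), StartsWith('loc|9q,'));   % filter rows by geohash prefix
Atxt = Ttxt(Row(Aloc), :);      % recover the full tweets for those rows
W    = T(Row(Aloc), StartsWith('word|,'));
W    = double(logical(W));      % convert string '1' values to numeric 1s
W    = W - W(:, stopWords);     % stopWords: assumed comma-separated word list
C    = sqIn(W);                 % W.'*W = words co-occurring in a tweet
C    = C - diag(C);             % drop self loops (diag on Assoc assumed)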
Outline slide.
Conclusion Slide.
Non-Lincoln Work. Contact Dylan Hutchison for more information!
Acknowledgements page. Many people who have made this possible.