The document discusses two Spark algorithms: outlier detection on categorical data and KNN join. It describes how the algorithms work, including mapping attributes to scores for outlier detection and using z-order curves to map points to a single dimension for KNN joins. It also provides performance results and best practices for implementing the algorithms in Spark and discusses applications in graph algorithms.
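The z-order mapping mentioned above can be illustrated with a minimal sketch. It assumes two-dimensional points with non-negative integer coordinates; the function names are hypothetical and this is not the implementation from the slides. Interleaving the bits of the coordinates gives a single integer key, and sorting by that key tends to keep spatially close points near each other, which is what makes a one-dimensional scan useful for approximate KNN candidates.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return code


def z_order_sort(points):
    """Sort 2-D integer points by their position along the z-order curve."""
    return sorted(points, key=lambda p: interleave_bits(p[0], p[1]))


points = [(3, 5), (0, 0), (1, 1), (2, 2), (7, 7)]
ordered = z_order_sort(points)
```

Real-valued data would first have to be scaled and quantised to integers, and a full KNN join usually repeats the scan over shifted copies of the data to reduce boundary effects; both steps are left out here.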
This contains the agenda of the Spark Meetup I organised in Bangalore on Friday, the 23rd of Jan 2014. It carries the slides for the talk I gave on distributed deep learning over Spark.
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ... - Spark Summit
Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in diverse applications, ranging from visualization and optimizing data encodings to estimating quantiles, data synthesis, and imputation. The T-Digest is a versatile sketching data structure. It operates on any numeric data, models tricky distribution tails with high fidelity, and, most crucially, works smoothly with aggregators and map-reduce.
T-Digest is a perfect fit for Apache Spark; it is single-pass and intermediate results can be aggregated across partitions in batch jobs or aggregated across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimations and data synthesis.
Attendees of this talk will leave with an understanding of data sketching with T-Digest sketches, and insights about how to apply T-Digest to their own data analysis applications.
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ... - Spark Summit
Real-world graphs are seldom static. Applications that generate graph-structured data today do so continuously, giving rise to an underlying graph whose structure evolves over time. Mining these time-evolving graphs can be insightful, both from research and business perspectives. While several works have focused on individual aspects, there exists no general-purpose time-evolving graph processing engine.
We present Tegra, a time-evolving graph processing system built on a general-purpose dataflow framework. We introduce Timelapse, a flexible abstraction that enables efficient analytics on evolving graphs by allowing graph-parallel stages to iterate over the complete history of nodes. We use Timelapse to present two computational models: a temporal analysis model for performing computations on multiple snapshots of an evolving graph, and a generalized incremental computation model for efficiently updating results of computations.
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t... - Spark Summit
Clustering is often an essential first step in data mining, intended to reduce redundancy or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting the potential group structures. However, parallelization of such an algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. In this paper, we design a parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for the parallelization of the single-linkage clustering algorithm due to its natural expression of iterative processes. Our algorithm can be deployed easily in Amazon’s cloud environment, and a thorough performance evaluation on Amazon’s EC2 verifies that our algorithm continues to scale as the datasets grow.
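The link between single-linkage clustering and minimum spanning trees can be sketched on one machine as follows. This is only an illustration of the formulation, not the parallel Spark algorithm from the talk; the function name and the sample points are hypothetical. Running Kruskal's algorithm and stopping once `k` components remain yields the same grouping as cutting the single-linkage dendrogram at `k` clusters.

```python
import math
from itertools import combinations


def single_linkage(points, k):
    """Cluster points into k groups by running Kruskal's MST algorithm and stopping early."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first (quadratic; fine for a small illustration).
    edges = sorted(
        (math.dist(points[a], points[b]), a, b)
        for a, b in combinations(range(n), 2)
    )

    merges = []          # MST edges in the order they were accepted
    clusters = n
    for dist, a, b in edges:
        if clusters == k:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            merges.append((a, b, dist))
            clusters -= 1

    labels = [find(i) for i in range(n)]
    return labels, merges


points = [(0, 0), (0, 1), (10, 0), (10, 1), (50, 50)]
labels, merges = single_linkage(points, k=3)
```

The accepted edges in `merges` are exactly the merge steps of the dendrogram, which is why the tree construction can be split into sub-problems whose partial spanning trees are combined later.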
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin... - DB Tsai
Apache Spark is a new cluster computing engine offering a number of advantages over its predecessor MapReduce. Apache Spark uses an in-memory cache to scale and parallelize iterative algorithms, which makes it ideal for large-scale machine learning. It is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. In this talk, DB will introduce Spark and show how to use Spark’s high-level API in Java, Scala or Python. Then, he will show how to use MLlib, a library of machine learning algorithms for big data included in Spark, to do classification, regression, clustering, and recommendation at large scale.
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo... - Databricks
Apache Spark is rapidly becoming the de facto framework for big-data analytics. Spark’s built-in, large-scale Machine Learning Library (MLlib) uses traditional stochastic gradient descent (SGD) to solve standard ML algorithms. However, MLlib currently provides limited coverage of ML algorithms. Further, the convergence of the adopted SGD approach is heavily dictated by issues such as step-size selection and the conditioning of the problem, making it difficult for non-expert end users to adopt.
In this session, the speakers introduce a large-scale ML tool built on the Alternating Direction Method of Multipliers (ADMM) on Spark to solve a gamut of ML algorithms. The proposed approach decomposes most ML problems into smaller sub-problems suitable for distributed computation in Spark.
Learn how this toolkit provides a wider range of ML algorithms, better accuracy compared to MLlib, robust convergence criteria and a simple Python API suitable for data scientists, making it easy for end users to develop advanced ML algorithms at scale without worrying about the underlying intricacies of the optimization solver. It is a useful addition to the data scientist’s ML toolkit on Spark.
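To make the decomposition idea concrete, here is a minimal sketch of global-consensus ADMM on a deliberately tiny problem: finding the value `z` that minimises the sum of squared distances to numbers spread over several partitions (the answer is the overall mean). The function name, the `rho` value and the data are assumptions chosen for illustration; the toolkit described in the talk is not shown here. Each partition solves its own small sub-problem, and only the local solutions and dual variables are averaged globally.

```python
def consensus_admm_mean(partitions, rho=1.0, iterations=200):
    """Minimise sum_i 0.5 * sum_{a in partition i} (x - a)^2 with global-consensus ADMM."""
    m = len(partitions)
    z = 0.0            # shared consensus variable
    u = [0.0] * m      # one scaled dual variable per partition

    for _ in range(iterations):
        # Local step: each partition solves its sub-problem in closed form,
        # using only its own data plus the current z and its own dual.
        x = [
            (sum(part) + rho * (z - u[i])) / (len(part) + rho)
            for i, part in enumerate(partitions)
        ]
        # Global step: average the local solutions (plus duals) into the consensus.
        z = sum(x[i] + u[i] for i in range(m)) / m
        # Dual step: accumulate each partition's disagreement with the consensus.
        u = [u[i] + x[i] - z for i in range(m)]

    return z


partitions = [[1.0, 2.0, 3.0], [10.0]]
estimate = consensus_admm_mean(partitions)
```

In a distributed setting the local step would run on the workers and only `x[i] + u[i]` would travel to the driver, which is the property that makes this family of methods a reasonable fit for Spark.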
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S... - DB Tsai
Nonlinear methods are widely used to produce higher performance compared with linear methods; however, nonlinear methods are generally more expensive in model size, training time, and scoring phase. With proper feature engineering techniques like polynomial expansion, the linear methods can be as competitive as those nonlinear methods. In the process of mapping the data to higher dimensional space, the linear methods will be subject to overfitting and instability of coefficients which can be addressed by penalization methods including Lasso and Elastic-Net. Finally, we'll show how to train linear models with Elastic-Net regularization using MLlib.
Several learning algorithms such as kernel methods, decision trees, and random forests are nonlinear approaches that are widely used because they often perform better than linear methods. However, with feature engineering techniques like polynomial expansion, which maps the data into a higher dimensional space, linear methods can be as competitive as those nonlinear methods. As a result, linear methods remain very useful: their training time is significantly shorter than that of nonlinear ones, and the model is just a small vector, which makes the prediction step efficient and easy. However, once the data is mapped into a higher dimensional space, linear methods are subject to overfitting and instability of coefficients; these issues can be addressed by penalization methods including Lasso and Elastic-Net. The Lasso method, with its L1 penalty, tends to shrink many coefficients exactly to zero while leaving a few others with comparatively little shrinkage. The L2 penalty tends to leave all coefficients small but non-zero. Combining the L1 and L2 penalties is called the Elastic-Net method, which tends to give a result in between. In the first part of the talk, we'll give an overview of linear methods, including commonly used formulations and optimization techniques such as L-BFGS and OWLQN. In the second part of the talk, we will discuss how to train linear models with Elastic-Net using our recent contribution to Spark MLlib. We'll also talk about how linear models are applied in practice to big datasets, and how polynomial expansion can be used to dramatically increase performance.
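The different behaviour of the L1 and L2 penalties can be seen in a small sketch. It assumes the common parameterisation in which a regularisation strength `reg_param` is split between the two penalties by a mixing weight `alpha` (1.0 is pure Lasso, 0.0 is pure L2); the function names are hypothetical. The shrinkage step shown is the proximal operator of the penalty, used here only for illustration; it is not the OWLQN optimizer mentioned in the abstract.

```python
def elastic_net_penalty(weights, reg_param, alpha):
    """reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) * 0.5 * l2_squared)


def shrink(w, reg_param, alpha, step=1.0):
    """Proximal step of the Elastic-Net penalty applied to a single coefficient."""
    threshold = step * reg_param * alpha              # L1 part: soft-thresholding
    magnitude = max(abs(w) - threshold, 0.0)
    sign = 1.0 if w > 0 else -1.0
    # L2 part: uniform scaling towards zero, never exactly zero on its own.
    return sign * magnitude / (1.0 + step * reg_param * (1.0 - alpha))


small, large = 0.05, -2.0
lasso_result = [shrink(w, reg_param=0.1, alpha=1.0) for w in (small, large)]
ridge_result = [shrink(w, reg_param=0.1, alpha=0.0) for w in (small, large)]
```

With the pure L1 setting the small coefficient is set exactly to zero while the large one loses only the threshold amount; with the pure L2 setting both coefficients are scaled down but stay non-zero. Intermediate values of `alpha` mix the two effects.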
DB Tsai is an Apache Spark committer and a Senior Research Engineer at Netflix. He has recently been working with the Apache Spark community to add several new algorithms, including Linear Regression and Binary Logistic Regression with ElasticNet (L1/L2) regularization, Multinomial Logistic Regression, and the L-BFGS optimizer. Prior to joining Netflix, DB was a Lead Machine Learning Engineer at Alpine Data Labs, where he developed innovative large-scale distributed linear algorithms and contributed them back to the open source Apache Spark project.
Large Scale Machine Learning with Apache Spark - Cloudera, Inc.
Spark offers a number of advantages over its predecessor MapReduce that make it ideal for large-scale machine learning. For example, Spark includes MLlib, a library of machine learning algorithms for large data. The presentation will cover the state of MLlib and the details of some of the scalable algorithms it includes.
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ... - Spark Summit
This talk tells the story of the implementation and optimization of a sparse logistic regression algorithm in Spark. I would like to share the lessons I learned and the steps I had to take to improve the speed of execution and convergence of my initial naive implementation. The message isn’t to convince the audience that logistic regression is great and my implementation is awesome; rather, the talk will give details about how it works under the hood, and general tips for implementing an iterative parallel machine learning algorithm in Spark. The talk is structured as a sequence of “lessons learned” that are shown in the form of code examples building on the initial naive implementation. The performance impact of each “lesson” on execution time and speed of convergence is measured on benchmark datasets. You will see how to formulate logistic regression in a parallel setting, how to avoid data shuffles, when to use a custom partitioner, how to use the ‘aggregate’ and ‘treeAggregate’ functions, how momentum can accelerate the convergence of gradient descent, and much more. I will assume a basic understanding of machine learning and some prior knowledge of Spark. The code examples are written in Scala, and the code will be made available for each step in the walkthrough.
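The parallel formulation and the momentum update described above can be sketched without a cluster. In this illustration plain Python lists stand in for partitions, `partition_gradient` plays the role of the per-partition function passed to an aggregate-style call, and `add_vectors` plays the role of the combine function; all names, the hyperparameters and the toy data are assumptions, and the talk's own Scala code is not reproduced here.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def partition_gradient(partition, w):
    """Per-partition work: sum the log-loss gradient over the rows held locally."""
    grad = [0.0] * len(w)
    for x, y in partition:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def add_vectors(a, b):
    """Combine step: merge two partial gradients element-wise."""
    return [ai + bi for ai, bi in zip(a, b)]


def train(partitions, dim, lr=0.5, momentum=0.9, epochs=200):
    n = sum(len(p) for p in partitions)
    w = [0.0] * dim
    velocity = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for part in partitions:      # on a cluster these would run in parallel
            grad = add_vectors(grad, partition_gradient(part, w))
        # Momentum: keep a decaying running direction instead of the raw gradient.
        velocity = [momentum * v - lr * g / n for v, g in zip(velocity, grad)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
    return w


# Each row is ((bias, feature), label), split over two "partitions".
partitions = [
    [((1.0, -2.0), 0), ((1.0, -1.0), 0)],
    [((1.0, 1.0), 1), ((1.0, 2.0), 1)],
]
weights = train(partitions, dim=2)
```

Only one gradient vector per partition crosses the network in this formulation, which is why no shuffle of the training rows is needed; a tree-shaped combine changes how the partial vectors are merged, not what is computed.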
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20) - Ankur Dave
GraphX is a graph processing framework built into Apache Spark. This talk introduces GraphX, describes key features of its API, and gives an update on its status.
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi - Databricks
Modern Data-Intensive Scalable Computing (DISC) systems such as Apache Spark do not support sophisticated cost-based query optimizers because they are specifically designed to process data that resides in external storage systems (e.g. HDFS), or they lack the necessary data statistics. Consequently, many crucial optimizations, such as join order and plan selection, are presently out-of-scope in these DISC system optimizers. Yet, join order is one of the most important decisions a cost-optimizer can make because wrong orders can result in a query response time that can become more than an order-of-magnitude slower compared to the better order.
Log Analytics in Datacenter with Apache Spark and Machine Learning - Piotr Tylenda
Presented during DataMass Summit 2017.
http://summit2017.datamass.io/
https://www.youtube.com/watch?v=eGJfhHPdhuo
Data center workloads produce a significant amount of log data, which has to be analyzed in order to discover potential issues. We present an automated text mining approach for workload monitoring and data analytics that combines machine learning and big data processing. This session provides an overview of a data pipeline based on key components such as Apache Kafka, Apache Spark, and a generalized version of the k-means algorithm.
Personal Research Overview presented at the KU-NAIST Research Meeting - Chawanat Nakasan
This is the overview of my research as I finish the doctoral degree. This presentation was made on 2018-02-15 as part of the Kasetsart University and Nara Institute of Science and Technology Research Meeting. The content concerns my research and possible future contributions that I can make towards the KU-NAIST joint research effort.
** This document has been edited from the time of presentation to remove sensitive and confidential material.
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy... - Rakuten Group, Inc.
Astra is a distributed SQL database for data analysis and prediction. We're aiming to achieve near real-time data analysis, and to deliver the components of a Data Lake as a Service which contains it. Another feature of Astra is its integration with machine learning to support many kinds of data analysis.
Design of 32 bit Parallel Prefix Adders - IOSR Journals
Abstract: In this paper, we propose 32-bit Kogge-Stone, Brent-Kung, and Ladner-Fischer parallel prefix adders. In general, N-bit adders such as Ripple Carry Adders (slow compared to other adders) and Carry Look Ahead adders (area-consuming) were used in earlier days. Now most industries use parallel prefix adders because of their advantages over other adders: they are faster and more area efficient. Parallel prefix addition is a technique for increasing the speed of addition in DSP processors. We simulate and synthesize different types of 32-bit prefix adders using the Xilinx ISE 10.1i tool. From the synthesis results, we note performance parameters such as the number of LUTs and delay, and we compare the three adders in terms of LUTs (representing area) and delay values.
Keywords: prefix adder, carry operator, Kogge-Stone, Brent-Kung, Ladner-Fischer
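The carry computation of a Kogge-Stone adder can be modelled in software as a rough sketch. The code below is a bit-level simulation, not the paper's HDL design; the function name and the default width are assumptions. Each loop iteration corresponds to one level of the prefix network, so a 32-bit addition needs five levels instead of a 32-step ripple.

```python
def kogge_stone_add(a: int, b: int, width: int = 32) -> int:
    """Add two unsigned integers modulo 2**width using a Kogge-Stone prefix network."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    g = a & b           # generate: bit i produces a carry on its own
    p = a ^ b           # propagate: bit i passes an incoming carry along
    half_sum = p        # the per-bit sum without carries, needed at the end

    dist = 1
    while dist < width:
        # One prefix level: merge every position with the group `dist` bits below it.
        g = (g | (p & (g << dist))) & mask
        p = (p & (p << dist)) & mask
        dist <<= 1

    carries = (g << 1) & mask   # carry into bit i is the group generate of bits 0..i-1
    return half_sum ^ carries


result = kogge_stone_add(123456789, 987654321)
```

The carry out of the top bit is discarded here, so the result wraps around like fixed-width hardware.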
Foundations of streaming SQL: stream & table theory - DataWorks Summit
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how can all of this work in a programmatic framework like Apache Beam? The presentation answers these questions and more as it walks you through key concepts underpinning data processing in general.
The presentation explores the relationship between the Beam model (as described in the paper “The Dataflow Model” and the “Streaming 101” and “Streaming 102” blog posts) and stream and table theory (as popularized by Martin Kleppmann and Jay Kreps, among others).
It turns out that stream and table theory does an illuminating job of describing the low-level concepts that underlie the Beam model.
The presentation explains what is required to provide robust stream processing support in SQL and discusses the concrete efforts that have been made in this area by the Apache Beam, Calcite, and Flink communities, as well as new ideas yet to come. You’ll leave with a much better understanding of the key concepts underpinning data processing—regardless of whether that data processing is batch or streaming or SQL or programmatic—as well as a concrete notion of what robust stream processing in SQL looks like.
Speaker
Anton Kedin, Google, Software Engineer
Implementation and Estimation of Delay, Power and Area for Parallel Prefix Ad... - IJMTST Journal
Parallel Prefix Adders have been established as the most efficient circuits for binary addition. The binary adder is the critical element in most digital circuit designs, including digital signal processors and microprocessor data path units. Because the final carry is generated ahead of the sum, extensive research has focused on reducing the circuit complexity and power consumption of the adder. In VLSI implementations, parallel-prefix adders are known to have the best performance. This paper investigates four types of carry-tree adders (the Kogge-Stone, sparse Kogge-Stone, spanning tree, and Brent-Kung adders) and compares them to the simple Ripple Carry Adder and Carry Skip Adder. These designs, of varied bit-widths, are simulated and implemented on a Xilinx Spartan 3E FPGA. These fast carry-tree adders support bit widths up to 256. We report on the area requirements and reduction in circuit complexity for a variety of classical parallel prefix adder structures.
Hybrid predictive modelling of geometry with limited data in cold spray addit... - Daiki Ikeuchi
Cold spray additive manufacturing is an emerging technology that offers unique advantages, including high production rate, unlimited product size and the ability to process oxygen-sensitive materials. However, dimensional control and accuracy in cold spray additive manufacturing are challenging, which limits its integration into commercial manufacturing systems. These problems originate from the poor understanding of the complex relationship between process parameters and the resulting fabricated geometry. This knowledge gap motivated the development of an accurate predictive model for the geometry of a cold spray track profile. Recently, machine learning has gained interest as a way to build predictive models of such complex additive manufacturing processes, owing to its superior nonlinear mapping capability, as seen in other manufacturing applications. Nevertheless, such a mapping capability can be realised only with a large amount of experimental data, which is often impractical to collect in additive manufacturing applications. This limited-data issue has motivated the exploration of data-efficient machine learning approaches suitable for complex process modelling. Therefore, the objective of this study was to investigate a data-efficient machine learning approach to geometry prediction in cold spray additive manufacturing. The proposed approach is a hybrid modelling framework that incorporates a conventional mathematical Gaussian model into the development and learning process of a data-driven model. We compared it against purely mathematical Gaussian and purely data-driven modelling results and showed that the proposed hybrid modelling approach provided improved predictive accuracy. The findings can contribute to the control and optimisation of the process for shorter production time and to the development of build strategies for better as-fabricated surface and dimensional quality control.
The approach in this study is also applicable to other deposition-based additive manufacturing technologies such as Wire and Arc Additive Manufacturing.
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra - Natalino Busa
We present a solution for streaming anomaly detection, named “Coral”, based on Spark, Akka and Cassandra. In the system presented, we use Spark to run the data analytics pipeline for anomaly detection. By running Spark on the latest events and data, we make sure that the model is always up-to-date and that the number of false positives is kept low, even under changing trends and conditions. Our machine learning pipeline uses Spark decision tree ensembles and k-means clustering. Once the model is trained by Spark, the model’s parameters are pushed to the Streaming Event Processing Layer, implemented in Akka. The Akka layer then scores thousands of events per second according to the last model provided by Spark. Spark and Akka communicate with each other using Cassandra as a low-latency data store. By doing so, we make sure that every element of this solution is resilient and distributed. Spark performs micro-batches to keep the model up-to-date while Akka detects new anomalies by using the latest Spark-generated data model. The project is currently hosted on GitHub. Have a look at: http://coral-streaming.github.io
Design and Estimation of delay, power and area for Parallel prefix addersIJERA Editor
In Very Large Scale Integration (VLSI) designs, Parallel prefix adders (PPA) have the better delay performance. This paper investigates four types of PPA’s (Kogge Stone Adder (KSA), Spanning Tree Adder (STA), Brent Kung Adder (BKA) and Sparse Kogge Stone Adder (SKA)). Additionally Ripple Carry Adder (RCA), Carry Look ahead Adder (CLA) and Carry Skip Adder (CSA) are also investigated. These adders are implemented in verilog Hardware Description Language (HDL) using Xilinx Integrated Software Environment (ISE) 13.2 Design Suite. These designs are implemented in Xilinx Spartan 6 Field Programmable Gate Arrays (FPGA). Delay and area are measured using XPower analyzer and all these adder’s delay, power and area are investigated and compared finally
New Approach for K-mean and K-medoids AlgorithmEditor IJCATR
K-means and K-medoids clustering algorithms are widely used for many practical applications. Original k
medoids algorithms select initial centroids and medoids randomly that affect the quality of the resulting clusters and sometimes it
generates unstable and empty clusters which are meaningless.
expensive and requires time proportional to the product of the number of data items, number of clusters and the number of iterations.
The new approach for the k mean algorithm eliminates the deficiency of exiting k mean. It first calculates the initial centro
requirements of users and then gives better, effective and stable cluster. It also takes less execution time because it eliminates
unnecessary distance computation by using previous iteration. The new approach for k
systematically based on initial centroids. It generates stable clusters to improve accuracy.
Design Of 64-Bit Parallel Prefix VLSI Adder For High Speed Arithmetic CircuitsIJRES Journal
Parallel prefix adder is a kind of process for speeding up the addition of the system of writing and calculating with numbers which use only two digits. Parallel prefix adders are also known as carry-tree adders and they are known to have the best performance in VLSI designs. Due to constraints on logic blog configurations a routing overhead, this performance advantage does not translate directly into FPGA implementations. Identifying the absolutely accurate area-delay tradeoff curve of the parallel prefix is an interesting problem that has received more attention in research because parallel prefix adder on the other hand represents a type of general adder structure that displays publically in flexible area-time tradeoffs for the design of adder. Many different types of parallel prefix adders are made to increase for optimizing area, fan out, speed and performance. For high speed performance tree like structure is must which helps in greater way. There are many different method used for designing parallel prefix adder based on their speed, size and performance. For area optimization we use Brent-Kung method. If our main purpose is to get the least timing then we have to use Kogg-Stone adder method.
2. Agenda
• Introduction to two core algorithms
– Outlier Detection on Categorical Data
– KNN-Join
• Application in graph algorithms
– Feedback Vertex Set of a Graph
– Geographical Information Systems
• Challenges we faced
• Best practices
Ashutosh & Kaushik, Spark-Meetup
Bangalore Jan-2015
3. Outliers
• It's good to be different, but not in data!
• An outlier suggests something is wrong: the point was generated by a different mechanism.
• How will my model generalize?
• Image ref: http://outskirtsbattledomewiki.com/index.php/13-general-obd-terms/96-outlier
4. Solutions
• Distance based solutions
– Mahalanobis distance
• Covariance matrix solution
• Single-class SVM
• Density based solutions
– Counting frequency
• What about categorical data?
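For reference, the Mahalanobis distance named above is usually written as follows, where x is a numeric data point and μ and Σ are the mean vector and covariance matrix of the data (standard symbols, not taken from the slides):

```latex
D_M(x) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)}
```

Both μ and Σ require numeric attributes, which is why purely categorical data needs a different kind of score.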
5. MR-AVF
• Attribute Value Frequency (AVF) assigns a score to each point in the dataset using the frequency of each of its attribute values.
• Easily parallelizable.
• Shown to perform favourably compared to other competitive but more complex outlier detection strategies.
• Usages
– Anomaly Detection
– Security
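As a sketch of the scoring rule described above, assuming each point x_i has m categorical attributes and f(x_{il}) denotes the number of points whose l-th attribute equals x_{il} (the symbols are assumptions, not taken from the slides), the score can be written as:

```latex
\mathrm{AVF}(x_i) = \sum_{l=1}^{m} f(x_{il})
```

For example, a row (A, B) in which A occurs twice in column 1 and B occurs twice in column 2 scores 2 + 2 = 4. Points with the lowest scores are the outlier candidates. Some formulations divide the sum by m, which rescales the scores without changing their order.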
7. Outliers on Categorical
• Attribute Value Frequency

Input:
Col 1   Col 2
A       B
A       C
C       B
D       E       (outlier)

Scored:
Col 1   Col 2   Score
A       B       4
A       C       3
C       B       3
D       E       2       (low score)
9. AVF – Frequency Calculations

Input RDD:
Col 1   Col 2
A       B
A       C
C       B
D       E

freq RDD:
(1,A) 2,  (2,B) 2,
(1,C) 1,  (2,C) 1,
(1,D) 1,  (2,E) 1

• Information of line numbers
• A unique identifier
14. AVF – Line Calculations

Input RDD:
Col 1   Col 2
A       B
A       C
C       B
D       E

data RDD (built with zipWithIndex; carries the column index as well as the row index):
(1,A) X1,  (2,B) X1,
(1,A) X2,  (2,C) X2,
(1,C) X3,  (2,B) X3,
(1,D) X4,  (2,E) X4
15. AVF – Join

Input RDD:
Col 1   Col 2
A       B
A       C
C       B
D       E

data RDD:
(1,A) X1,  (2,B) X1,
(1,A) X2,  (2,C) X2,
(1,C) X3,  (2,B) X3,
(1,D) X4,  (2,E) X4

freq RDD:
(1,A) 2,  (2,B) 2,
(1,C) 1,  (2,C) 1,
(1,D) 1,  (2,E) 1
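The frequency, line, and join steps can be sketched end to end in plain Python. This is a minimal sketch, not the authors' Spark code: the lists and dictionaries stand in for RDDs, the comments name the Spark operation each step loosely mirrors, and the function name `avf_scores` is hypothetical.

```python
from collections import defaultdict


def avf_scores(rows):
    """Return {row id: AVF score} for a list of categorical rows."""
    # data RDD analogue: ((column index, value), row id),
    # similar to a flatMap over rows numbered with zipWithIndex.
    data_pairs = [
        ((col, value), row_id)
        for row_id, row in enumerate(rows, start=1)
        for col, value in enumerate(row, start=1)
    ]

    # freq RDD analogue: count every (column index, value) key,
    # similar to map to (key, 1) followed by reduceByKey.
    freq = defaultdict(int)
    for key, _ in data_pairs:
        freq[key] += 1

    # join analogue: attach each key's frequency to its row id,
    # then sum the frequencies per row.
    scores = defaultdict(int)
    for key, row_id in data_pairs:
        scores[row_id] += freq[key]
    return dict(scores)


rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]
scores = avf_scores(rows)
print(scores)                       # {1: 4, 2: 3, 3: 3, 4: 2}
print(min(scores, key=scores.get))  # 4, the row (D, E)
```

The row with the lowest score, (D, E), is the outlier candidate, matching the scored table on slide 7.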
18. Performance on Spark
Performance on different data-points
438MB Memory, Intel core i3 machine
19. Performance on Spark
Performance on 43358 data-points with different partitionings of the file
438MB Memory, Intel core i3 machine
20. Best Practices
• Minimize the use of variables; everything should be immutable.
• Prefer more transformations and fewer actions.
• Minimize broadcasts.
• Do not update variables inside a filter.
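The last point can be illustrated in plain Python (not PySpark). In Spark, functions passed to transformations are shipped to executors, so a driver-side variable updated inside a filter is not reliably updated on the driver. The sketch below keeps the predicate pure and computes the count as a separate aggregation; the names `is_valid`, `events`, and `dropped` are hypothetical.

```python
from functools import reduce

events = [3, -1, 4, -1, 5]


def is_valid(x):
    """Pure predicate: depends only on its argument, mutates nothing."""
    return x >= 0


# Filtering produces a new collection; the input stays unchanged.
valid = list(filter(is_valid, events))

# Counting rejected items is a separate aggregation over the same input,
# instead of a counter incremented as a side effect inside the filter.
dropped = reduce(lambda acc, x: acc + (0 if is_valid(x) else 1), events, 0)

print(valid)    # [3, 4, 5]
print(dropped)  # 2
```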
21. KNN-Join
• Finds the K nearest neighbors in a data set for a given data point.
• Approximate KNN-Join helps generate results with on the order of log(n) page accesses.
• The idea uses Z-values to map points in a multi-dimensional space to a single dimension.
• It translates the KNN search for the query point into a search on the single-dimensional space.
• Usages
– Similarity search in huge datasets
– Smoothing of images
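A Z-value (Morton code) is formed by interleaving the bits of a point's coordinates. The sketch below assumes non-negative integer coordinates that fit in `bits` bits; the function name `z_value` and the choice of putting the first coordinate in the lowest bit of each group are assumptions, not details from the slides.

```python
def z_value(point, bits=4):
    """Interleave the bits of non-negative integer coordinates (Morton order)."""
    ndims = len(point)
    z = 0
    for i in range(bits):                 # bit position within each coordinate
        for d, coord in enumerate(point): # dimension index
            z |= ((coord >> i) & 1) << (i * ndims + d)
    return z


# On a 2 x 2 grid the Z-values trace the familiar "Z" shape.
grid = [(0, 0), (1, 0), (0, 1), (1, 1)]
print([z_value(p, bits=1) for p in grid])  # [0, 1, 2, 3]
print(z_value((3, 12, 7)))                 # 1453
```

Points that are close in space often, but not always, receive close Z-values, which is what makes a one-dimensional search a useful approximation.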
30. Z-KNN Results

Data Set:
2, 13, 7
4, 12, 7
3, 14, 6

Data Point:
3, 12, 7

Z-KNN Results (1 Nearest Neighbor):
4, 12, 7
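A minimal sketch of a Z-order neighbor search on the data set and data point above: sort the points by Z-value, take a window of candidates around the query's position, and re-rank the candidates by true distance. The names `z_value` and `z_knn` and the `window` parameter are hypothetical, and this single unshifted curve is a simplification; the kNN-join paper by Zhang et al. in the references additionally uses randomly shifted copies of the data to improve accuracy.

```python
from bisect import bisect_left


def z_value(point, bits=4):
    """Interleave coordinate bits into one integer (Morton order)."""
    ndims = len(point)
    z = 0
    for i in range(bits):
        for d, coord in enumerate(point):
            z |= ((coord >> i) & 1) << (i * ndims + d)
    return z


def z_knn(dataset, query, k=1, window=1, bits=4):
    """Approximate k nearest neighbors using a window on the Z-order."""
    ordered = sorted(dataset, key=lambda p: z_value(p, bits))
    keys = [z_value(p, bits) for p in ordered]
    pos = bisect_left(keys, z_value(query, bits))
    # Candidates: up to `window` points on each side of the query position.
    candidates = ordered[max(0, pos - window):pos + window]

    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, query))

    # Re-rank the candidates by true (squared Euclidean) distance.
    return sorted(candidates, key=sq_dist)[:k]


dataset = [(2, 13, 7), (4, 12, 7), (3, 14, 6)]
query = (3, 12, 7)
print(z_knn(dataset, query, k=1, window=3))  # [(4, 12, 7)]
print(z_knn(dataset, query, k=1, window=1))  # [(2, 13, 7)]
```

With a window wide enough to cover all three points the search returns (4, 12, 7), the neighbor shown on the slide. With a window of one, this particular bit ordering returns (2, 13, 7) instead, which illustrates that the result is approximate and depends on the candidate window.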
31. Performance on Spark
Performance on different Ks
438MB Memory, Intel core i3 machine
32. Performance on Spark
Performance on different data-points
with k = 30 and 30 iterations
438MB Memory, Intel core i3 machine
33. Best Practices
• More code review with Codacy (www.codacy.com)
• Integrated with GitHub
35. Application on GraphX
• Feedback Vertex Set of a Graph
• Geographical Information Systems
36. Future Works
• Social Content Matching (max flow algorithm) (alpha)
• KNN for float types (requires calculation of Morton order for floats)
• Matrix multiplication by the Strassen algorithm, using Morton order as locality search.
• Similarity between two documents, implementation of all sequence kernels.
• More outlier detection algorithms
37. Connect with us
Email
ashu.trv@gmail.com
kaushikranjan.619@gmail.com
LinkedIn
Ashutosh
https://www.linkedin.com/in/ashutoshtrivedi
Kaushik
https://www.linkedin.com/in/ranjankaushik
Fork our repository at
https://github.com/anantasty/SparkAlgorithms
38. References
• Follow us at
– https://github.com/codeAshu
– https://github.com/kaushikranjan
• A. Koufakou, J. Secretan, J. Reeder, K. Cardona, and M. Georgiopoulos. "Fast parallel outlier detection for categorical datasets using MapReduce." IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks (IJCNN), pp. 3298-3304, 2008. DOI: 10.1109/IJCNN.2008.4634266
• C. Zhang, F. Li, and J. Jestes. "Efficient parallel kNN joins for large data in MapReduce." Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012. DOI: 10.1145/2247596.2247602