This is a talk given by Badrish Chandramouli at Portland State University on May 30, 2017, and overviews his recent and ongoing research directions in the space of stream processing and big data analytics.
Sebastian Schelter – Distributed Machine Learning with the Samsara DSL (Flink Forward)
The document discusses Samsara, a domain-specific language for distributed machine learning. It provides an algebraic expression language for linear algebra operations and optimizes distributed computations. An example of linear regression on a cereals dataset demonstrates how Samsara can estimate regression coefficients in a distributed fashion. Key steps include loading data as a distributed row matrix, extracting feature and target matrices, computing the normal equations, and solving the linear system to estimate the coefficients.
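The summarized pipeline can be sketched locally with NumPy standing in for Samsara's distributed matrices (the cereals data is replaced by synthetic values, so the numbers below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the cereals dataset: 100 rows, 3 features.
X = rng.normal(size=(100, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.01, size=100)

# Normal equations: (X^T X) beta = X^T y
XtX = X.T @ X
Xty = X.T @ y
beta = np.linalg.solve(XtX, Xty)

print(beta)  # close to [2.0, -1.0, 0.5]
```

In Samsara the `X.T @ X` and `X.T @ y` products are the distributed steps; solving the small 3x3 system happens on the driver.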
CIFAR-10 for DAWNBench: Wide ResNets, Mixup Augmentation and "Super Convergence" (Thom Lane)
Summary of models and methods used for the DAWNBench CIFAR-10 challenge. Starting with a review of the high-level ResNet architecture, we cover Basic vs. Bottleneck blocks, pre-activation blocks, and Wide ResNets. After a brief mention of the PyramidNet, ResNeXt, and DenseNet models, we look at regularization techniques such as Mixup. We finish with a review of Cyclical Learning Rates and the phenomenon of "Super Convergence".
MXNet Gluon API was used for the implementations.
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics (Yahoo Developer Network)
1. Sketch algorithms provide approximate query results with sub-linear space and processing time, enabling analysis of big data that would otherwise require prohibitive resources.
2. Case studies show sketches reduce storage by over 90% and processing time by over 95% compared to exact algorithms, enabling real-time querying and rollups across multiple dimensions that were previously infeasible.
3. The DataSketches library provides open-source implementations of popular sketch algorithms like Theta, HLL, and quantiles sketches, with code samples and adapters for systems like Hive, Pig, and Druid.
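To make the sketching idea concrete, here is a toy K-Minimum-Values distinct-count sketch, the estimator underlying Theta sketches; this is a from-scratch illustration, not the DataSketches API:

```python
import hashlib

def h(x):
    """Hash a value to a float in [0, 1)."""
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class KMVSketch:
    """Toy K-Minimum-Values sketch: keep only the k smallest hash values
    ever seen; the density of those minima reveals the distinct count."""
    def __init__(self, k=256):
        self.k = k
        self.mins = []  # k smallest distinct hash values seen so far

    def update(self, x):
        v = h(x)
        if v not in self.mins:
            self.mins.append(v)
            self.mins.sort()
            del self.mins[self.k:]  # keep only the k smallest

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))     # fewer than k distinct: exact
        return (self.k - 1) / self.mins[-1]  # standard KMV estimator

sk = KMVSketch(k=256)
for i in range(10_000):
    sk.update(i % 5_000)        # stream with 5,000 distinct values
print(round(sk.estimate()))     # roughly 5,000, from only 256 stored values
```

The sketch stores 256 numbers regardless of stream size, which is where the >90% storage reductions in the case studies come from; accuracy scales as about 1/sqrt(k).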
The document discusses real-time big data management and Apache Flink. It provides an overview of Apache Flink, including its architecture, components, and APIs for batch and streaming data processing. It also provides examples of word count programs in Java, Scala, and Java 8 that demonstrate how to write Flink programs for batch and streaming data.
Highly Scalable Java Programming for Multi-Core System (James Gan)
This document discusses best practices for highly scalable Java programming on multi-core systems. It begins by outlining software challenges like parallelism, memory management, and storage management. It then introduces profiling tools like the Java Lock Monitor (JLM) and Multi-core SDK (MSDK) to analyze parallel applications. The document provides techniques like reducing lock scope and granularity, using lock splitting and striping, splitting hot points, and alternatives to exclusive locks. It also recommends reducing memory allocation and using immutable/thread-local data. The document concludes by discussing lock-free programming and its advantages for scalability over locking.
This document discusses the Java Virtual Machine (JVM) memory model and just-in-time (JIT) compilation. It explains that the JVM uses dynamic compilation via a JIT to optimize bytecode at runtime. The JIT profiles code and performs optimizations like inlining, loop unrolling, and escape analysis. It also discusses how the JVM memory model allows for instruction reordering and caching but ensures sequential consistency through happens-before rules and volatile variables. The document provides examples of anomalies that can occur without synchronization and how tools like synchronized, locks, and atomic operations can be used to prevent issues.
The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing (Anis Nasir)
This paper proposes a technique called Partial Key Grouping (PKG) to balance load in distributed stream processing engines. PKG splits each key between two candidate workers and routes tuples using the "power of two choices" rule. This achieves load balancing while keeping memory and aggregation overhead at O(1), unlike shuffle grouping, which has O(W) overhead. Experiments on real datasets with Apache Storm show PKG improves throughput by 60% and reduces latency by 45% compared to key grouping and shuffle grouping.
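The "power of two choices" routing rule can be simulated in a few lines; this is a toy single-process sketch, not the Apache Storm implementation, and the skew level and worker count are invented for illustration:

```python
import random

def key_grouping(key, w):
    """Plain key grouping: every tuple of a key goes to one worker."""
    return hash((17, key)) % w

def pkg_choices(key, w):
    """Two distinct candidate workers per key (power of two choices)."""
    c1 = hash((17, key)) % w
    c2 = (c1 + 1 + hash((31, key)) % (w - 1)) % w  # guaranteed != c1
    return c1, c2

random.seed(42)
workers = 8
# Skewed stream: key 0 is a heavy hitter carrying ~30% of all tuples.
stream = [0 if random.random() < 0.3 else random.randint(1, 10_000)
          for _ in range(100_000)]

kg_loads = [0] * workers
pkg_loads = [0] * workers
for k in stream:
    kg_loads[key_grouping(k, workers)] += 1
    c1, c2 = pkg_choices(k, workers)
    # Greedy rule: send the tuple to the less loaded of the two candidates.
    pkg_loads[c1 if pkg_loads[c1] <= pkg_loads[c2] else c2] += 1

avg = len(stream) / workers
print("key grouping max/avg:", max(kg_loads) / avg)
print("PKG max/avg:         ", max(pkg_loads) / avg)
```

Under key grouping the heavy hitter pins ~30% of the stream on one worker; PKG halves that key's load across its two candidates, so the maximum load drops sharply.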
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP (Tathagata Das)
This is the academic conference talk on Spark Streaming, where I introduce the concept of Discretized Streams and how it achieves large-scale, efficient, fault-tolerant streaming in a different way than traditional stream processing systems.
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk (Spark Summit)
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
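The core query pattern behind bitmap indexes, intersecting per-value bitsets with a single AND, can be shown with plain Python integers as (uncompressed) bitmaps; libraries such as Roaring add the compression the talk focuses on:

```python
# Each (column, value) pair gets a bitmap: bit i is set <=> row i has that value.
# Python's big ints serve as uncompressed bitsets for this sketch.
rows = [
    {"city": "NYC", "lang": "en"},
    {"city": "SF",  "lang": "en"},
    {"city": "NYC", "lang": "fr"},
    {"city": "NYC", "lang": "en"},
]

bitmaps = {}
for i, r in enumerate(rows):
    for col, val in r.items():
        key = (col, val)
        bitmaps[key] = bitmaps.get(key, 0) | (1 << i)

# Query: city = NYC AND lang = en  ->  one bitwise AND over two bitmaps.
hits = bitmaps[("city", "NYC")] & bitmaps[("lang", "en")]
matching = [i for i in range(len(rows)) if hits >> i & 1]
print(matching)  # [0, 3]
```

The AND touches whole machine words at a time, which is why bitmap indexes accelerate conjunctive filters in systems like Spark and Druid; compressed formats keep this speed while skipping runs of empty words.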
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Processing (Anis Nasir)
This document proposes two algorithms, D-Choices and W-Choices, to improve load balancing in distributed stream processing systems. The algorithms identify "heavy hitters" or frequent keys in the data stream and process them using more than two workers to better balance load. Evaluation shows the algorithms provide up to 150% higher throughput and 60% lower latency compared to traditional partitioning approaches.
Recent progress on distributing deep learning (Viet-Trung TRAN)
This document summarizes recent progress in distributed deep learning. It discusses the state of the art in neural networks and deep learning, as well as factors driving advances in deep learning like big data and increased computing power. It then covers approaches for scaling deep learning through model parallelism, data parallelism, and distributed training frameworks. Several deep learning applications developed in Vietnam are presented as examples, including optical character recognition and predictive text. The document concludes with principles for machine learning system design in distributed settings.
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da... (Cloudera, Inc.)
Processing large data requires new approaches to data mining: low, close-to-linear complexity and stream processing. In traditional data mining the practitioner is usually presented with a static dataset, perhaps with a timestamp attached, from which to infer a model for predicting future/held-out observations. In stream processing, the problem is often posed as extracting as much information as possible from the current data and converting it into an actionable model within a limited time window. In this talk I present an approach based on HBase counters for mining over streams of data, which allows for massively distributed processing and data mining. I will consider overall design goals as well as HBase schema design dilemmas that speed up the knowledge-extraction process. I will also demo efficient implementations of Naive Bayes, Nearest Neighbor, and Bayesian Learning on top of Bayesian Counters.
Deep Turnover Forecast.
"It is very difficult to predict - especially the future." [Niels Bohr]
At Decathlon we have developed a model to forecast turnover by store, department (or sport), and week for the next 52 weeks. This forecast is used by our department managers to steer their activity.
The model, inspired by DeepMind's WaveNet architecture, uses Deep Learning with a stack of several Dilated Causal Convolution layers.
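A dilated causal convolution, the building block mentioned, can be written out directly; this minimal NumPy sketch is illustrative only and unrelated to Decathlon's actual model:

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """1-D causal convolution: output t sees only inputs at t, t-d, t-2d, ...
    (zero-padded on the left so the output has the same length as x)."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

x = np.arange(8, dtype=float)          # 0..7
y = dilated_causal_conv(x, w=[1.0, -1.0], dilation=2)
# Each output is x[t] - x[t-2]; the first two entries only see zero padding.
print(y)  # [0. 1. 2. 2. 2. 2. 2. 2.]
```

Stacking such layers with dilations 1, 2, 4, 8, ... makes the receptive field grow exponentially with depth, which is what lets a WaveNet-style model condition a weekly forecast on a long history.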
Online learning, Vowpal Wabbit and Hadoop (Héloïse Nonne)
Online learning, Vowpal Wabbit and Hadoop
Online learning has recently attracted a lot of attention following some competitions, especially after Criteo released an 11 GB training set for a Kaggle contest.
Online learning makes it possible to process massive data: the learner consumes examples sequentially, using little memory and limited CPU resources. It is also particularly suited to handling time-evolving data.
Vowpal Wabbit has become quite popular: it is a handy, light, and efficient command-line tool for online learning on gigabytes of data, even on a standard laptop with standard memory. After a reminder of online learning principles, we present how to run Vowpal Wabbit on Hadoop in a distributed fashion.
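The online learning loop that tools like Vowpal Wabbit implement, one SGD step per example plus a hashing trick for features, can be sketched as follows (a toy logistic learner with invented feature names, not VW's actual algorithm):

```python
import math
import zlib

D = 2**20              # hashed feature space (the "hashing trick")
w = [0.0] * D
lr = 0.1

def slot(f):
    """Map a feature string to a fixed-size weight slot."""
    return zlib.crc32(f.encode()) % D

def predict(features):
    z = sum(w[slot(f)] for f in features)
    return 1.0 / (1.0 + math.exp(-max(min(z, 30.0), -30.0)))

def update(features, label):
    """One online SGD step on logistic loss; the example is then discarded."""
    g = predict(features) - label
    for f in features:
        w[slot(f)] -= lr * g

# A stream processed one example at a time, in constant memory.
stream = [(["user:1", "ad:sports"], 1), (["user:2", "ad:finance"], 0)] * 200
for features, label in stream:
    update(features, label)

print(predict(["user:1", "ad:sports"]))   # close to 1.0
print(predict(["user:2", "ad:finance"]))  # close to 0.0
```

Memory is fixed at `D` weights no matter how long the stream runs, which is why this style of learner handles gigabytes of data on a laptop.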
Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in Operational Data (Florian Lautenschlager)
Chronix is a time series database designed specifically for anomaly detection in operational data. It offers several advantages over general purpose time series databases:
1) Chronix uses domain-specific optimizations such as optional timestamp compression, custom data records, and compression techniques tailored to the repetitive patterns in operational data.
2) It provides a programming interface to pre-compute representations of time series data and add domain-specific columns to speed up anomaly detection queries.
3) Chronix supports exploratory and correlating analyses through its multi-dimensional storage and ability to query on any combination of attributes. It also offers high-level domain-specific analysis functions evaluated server-side.
Distributed implementation of an LSTM on Spark and TensorFlow (Emanuel Di Nardo)
Academic project based on developing an LSTM, distributing it on Spark, and using TensorFlow for numerical operations.
Source code: https://github.com/EmanuelOverflow/LSTM-TensorSpark
[241] Large-scale search with polysemous codes (NAVER D2)
This document discusses using polysemous codes to perform large-scale search over visual signatures. Polysemous codes allow product quantization codes to be interpreted as both compact binary codes for efficient Hamming distance search and codes that preserve distance information for accurate nearest neighbor search. The key ideas are to learn an index assignment that maps similar product quantization codes to binary codes with smaller Hamming distance, and to directly optimize this assignment to match the distances between codebook centroids. This allows using a single code representation for both fast Hamming search and precise distance search, without increasing memory requirements. The document provides examples of applying polysemous codes to build a large graph connecting images based on visual similarity.
High Performance Systems Without Tears - Scala Days Berlin 2018 (Zahari Dichev)
The document discusses techniques for improving performance in Scala applications by reducing object allocation and improving data locality. It describes how excessive object instantiation can hurt performance by increasing garbage collection work and introducing non-determinism. Extractor objects are presented as a tool for pattern matching that can improve brevity and expressiveness. Name-based extractors introduced in Scala 2.11 avoid object allocation. The talk also covers how caching hierarchies work to reduce memory access latency and the importance of data access patterns for effective cache utilization. Cache-oblivious algorithms are designed to optimize memory hierarchy usage without knowing cache details. Synchronization is noted to have performance costs as well in an example event log implementation.
BlinkDB and G-OLA: Supporting Continuous Answers with Error Bars in SparkSQL (Spark Summit)
1. The document discusses BlinkDB and G-OLA, which are systems for supporting approximate answers to queries in SparkSQL.
2. BlinkDB uses techniques like bootstrap and Poissonized resampling to provide quick continuous error estimates for queries.
3. G-OLA enables continuous query execution on samples of data using delta update queries, which incrementally update results as new data arrives to provide answers with bounded errors within interactive time frames.
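The bootstrap idea behind BlinkDB's error estimates can be illustrated with a plain percentile bootstrap (a generic sketch on synthetic data, not BlinkDB's Poissonized variant):

```python
import random

random.seed(0)
# Stand-in for a sample of a much larger table.
sample = [random.gauss(100.0, 15.0) for _ in range(500)]

def bootstrap_ci(data, stat, n_resamples=1000, alpha=0.05):
    """Approximate a (1 - alpha) confidence interval for `stat` by
    recomputing it on resamples drawn with replacement."""
    estimates = sorted(
        stat([random.choice(data) for _ in range(len(data))])
        for _ in range(n_resamples)
    )
    lo = estimates[int(n_resamples * alpha / 2)]
    hi = estimates[int(n_resamples * (1 - alpha / 2))]
    return lo, hi

mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(sample, mean)
print(f"mean ~ {mean(sample):.1f}, 95% CI ({lo:.1f}, {hi:.1f})")
```

The error bar comes from the spread of the resampled statistics alone, so it can be reported alongside any aggregate without knowing the full table, which is exactly the interactive-answer setting BlinkDB targets.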
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon University (MLconf)
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
The document discusses data stream classification and algorithms for handling data streams. It begins with an introduction to data stream characteristics and challenges. It then discusses approximation algorithms for data streams, including maintaining statistics over sliding windows. Classification algorithms for data streams discussed include Naive Bayes classifiers, perceptrons, and Hoeffding trees, which are decision trees adapted for data streams using the Hoeffding bound inequality to determine the optimal split attribute.
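The Hoeffding bound that drives the tree's split decision is a one-line formula: with probability 1 − δ, the observed mean of n samples of a variable with range R is within ε of the true mean. A quick sketch:

```python
import math

def hoeffding_epsilon(R, delta, n):
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# A Hoeffding tree splits once the observed gain gap between the two best
# attributes exceeds epsilon. Information gain with 2 classes has range R = 1.
for n in (100, 1_000, 10_000):
    print(n, hoeffding_epsilon(R=1.0, delta=1e-7, n=n))
```

As n grows, ε shrinks, so the tree can commit to the same split a batch learner would choose after seeing only a bounded prefix of the stream.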
An overview of streaming algorithms: what they are, the general principles behind them, and how they fit into a big data architecture, plus four specific examples of streaming algorithms and use cases.
All-Pairs Shortest Path (Fast Floyd-Warshall) Code (Ehsan Sharifi)
Shortest path algorithms are a family of algorithms designed to solve the shortest path problem. The shortest path problem is something most people have some intuitive familiarity with: given two points, A and B, what is the shortest path between them? In computer science, however, the all-pairs shortest path problem can take different forms, and different algorithms are needed to solve them all. All-pairs shortest path, as an extension of single-source shortest path, has been investigated since the 1960s and plays a crucial role in many applications, including network optimization and routing, traffic information systems, databases, compilers, garbage collection, interactive verification systems, robotics, dataflow analysis, and document formatting.
In this project, we implement and evaluate a fast multi-core version of the Floyd-Warshall algorithm.
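For reference, the baseline single-core Floyd-Warshall that such projects parallelize fits in a dozen lines:

```python
INF = float("inf")

def floyd_warshall(dist):
    """All-pairs shortest paths in O(V^3); `dist` is an adjacency matrix
    with INF for missing edges and 0 on the diagonal. Updates in place."""
    n = len(dist)
    for k in range(n):            # allow vertex k as an intermediate stop
        for i in range(n):
            dik = dist[i][k]
            for j in range(n):
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist

g = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
floyd_warshall(g)
print(g[0])  # shortest distances from vertex 0: [0, 3, 5, 6]
```

The two inner loops for a fixed k are independent across `i`, which is what makes the algorithm amenable to the multi-core parallelization the project evaluates.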
Data Stream Outlier Detection Algorithm (Hamza Aslam)
This document presents a new data stream outlier detection algorithm called SODRNN, which is based on reverse k nearest neighbors. It uses a sliding window model to detect anomalies in the current window by performing outlier queries. The algorithm consists of a Stream Manager procedure that efficiently updates the window with insertions and deletions by scanning the window only once. It also includes a Query Manager procedure that can detect concept drift. Experimental results on both synthetic and real datasets show that SODRNN is effective and efficient at detecting outliers in data streams.
TensorFrames: Google TensorFlow on Apache Spark (Databricks)
Presentation at Bay Area Spark Meetup by Databricks Software Engineer and Spark committer Tim Hunter.
This presentation covers how you can use TensorFrames with TensorFlow to do distributed computing on GPUs.
Impatience is a Virtue: Revisiting Disorder in High-Performance Log Analytics (Badrish Chandramouli)
There is a growing interest in processing real-time queries over out-of-order streams in this big data era. This paper presents a comprehensive solution to meet this requirement. Our solution is based on Impatience sort, an online sorting technique that is based on an old technique called Patience sort. Impatience sort is tailored for incrementally sorting streaming datasets that present themselves as almost sorted, usually due to network delays and machine failures. With several optimizations, our solution can adapt to both input streams and query logic. Further, we develop a new Impatience framework that leverages Impatience sort to reduce the latency and memory usage of query execution, and supports a range of user latency requirements, without compromising on query completeness and throughput, while leveraging existing efficient in-order streaming engines and operators. We evaluate our proposed solution in Trill, a high-performance streaming engine, and demonstrate that our techniques significantly improve sorting performance and reduce memory usage – in some cases, by over an order of magnitude.
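The run decomposition at the heart of Patience sort, which Impatience sort builds on, is easy to sketch: split the nearly-sorted input into ascending runs, then merge the runs (a from-scratch illustration, not the paper's optimized algorithm):

```python
import heapq

def patience_runs(stream):
    """Patience-style run decomposition: each element goes onto the first
    run whose tail is <= the element; otherwise it starts a new run.
    Almost-sorted input yields very few runs."""
    runs = []
    for x in stream:
        for run in runs:
            if run[-1] <= x:
                run.append(x)
                break
        else:
            runs.append([x])
    return runs

# Almost-sorted input, e.g. event timestamps disordered by network delay.
stream = [1, 2, 5, 3, 4, 7, 6, 9, 8, 10]
runs = patience_runs(stream)
ordered = list(heapq.merge(*runs))  # merge the sorted runs
print(len(runs), ordered)  # 2 runs; fully sorted output
```

Because a stream that is "almost sorted" produces only a handful of runs, sorting cost approaches a single merge pass, which is the property Impatience sort exploits for low-latency out-of-order processing.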
The Case for a Signal-Oriented Data Stream Management System (Reza Rahimi)
This document proposes a signal-oriented data stream management system called WaveScope. It discusses typical applications involving sensor networks, the data and programming model using a domain-specific language called WaveScript, and the system architecture involving query planning, optimization, and distributed execution. Key aspects include managing timing information across different timebases, optimizing queries using both database and signal processing techniques, and supporting archived historical data retrieval.
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSPTathagata Das
This is the academic conference talk on Spark Streaming, where I introduce the concept of Discretized Streams and how it achieves large scale, efficient fault-tolerance streaming in a different way than traditional stream processing systems.
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Pro...Anis Nasir
This document proposes two algorithms, D-Choices and W-Choices, to improve load balancing in distributed stream processing systems. The algorithms identify "heavy hitters" or frequent keys in the data stream and process them using more than two workers to better balance load. Evaluation shows the algorithms provide up to 150% higher throughput and 60% lower latency compared to traditional partitioning approaches.
Recent progress on distributing deep learningViet-Trung TRAN
This document summarizes recent progress in distributed deep learning. It discusses the state of the art in neural networks and deep learning, as well as factors driving advances in deep learning like big data and increased computing power. It then covers approaches for scaling deep learning through model parallelism, data parallelism, and distributed training frameworks. Several deep learning applications developed in Vietnam are presented as examples, including optical character recognition and predictive text. The document concludes with principles for machine learning system design in distributed settings.
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Cloudera, Inc.
Processing of large data requires new approaches to data mining: low, close to linear, complexity and stream processing. While in the traditional data mining the practitioner is usually presented with a static dataset, which might have just a timestamp attached to it, to infer a model for predicting future/takeout observations, in stream processing the problem is often posed as extracting as much information as possible on the current data to convert them to an actionable model within a limited time window. In this talk I present an approach based on HBase counters for mining over streams of data, which allows for massively distributed processing and data mining. I will consider overall design goals as well as HBase schema design dilemmas to speed up knowledge extraction process. I will also demo efficient implementations of Naive Bayes, Nearest Neighbor and Bayesian Learning on top of Bayesian Counters.
Deep Turnover Forecast.
"It is very difficult to predict - especially the future." [Neils Bohr]
At Decathlon we have developed a model to forecast turnover by store, department (or sport) and week for the next 52 weeks. This forecast is used by our department managers to pilot their activity.
The model, inspired by DeepMind's WaveNet architecture, uses Deep Learning with a stack of several Dilated Causal Convolution layers.
Online learning, Vowpal Wabbit and HadoopHéloïse Nonne
Online learning, Vowpal Wabbit and Hadoop
Online learning has recently caught a lot of attention, following some competitions, and especially after Criteo released 11GB for the training set of a Kaggle contest.
Online learning allows to process massive data as the learner processes data in a sequential way using up a low amount of memory and limited CPU ressources. It is also particularly suited for handling time-evolving date.
Vowpal Wabbit has become quite popular: it is a handy, light and efficient command line tool allowing to do online learning on GB of data, even on a standard laptop with standard memory. After a reminder of the online learning principles, we present how to run Vowpal Wabbit on Hadoop in a distributed fashion.
Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in ...Florian Lautenschlager
Chronix is a time series database designed specifically for anomaly detection in operational data. It offers several advantages over general purpose time series databases:
1) Chronix uses domain specific optimizations like optional timestamp compression, custom data records, and compression techniques tailored to the repetitive patterns in operational data.
2) It provides a programming interface to pre-compute representations of time series data and add domain-specific columns to speed up anomaly detection queries.
3) Chronix supports exploratory and correlating analyses through its multi-dimensional storage and ability to query on any combination of attributes. It also offers high-level domain-specific analysis functions evaluated server-side.
Distributed implementation of a lstm on spark and tensorflowEmanuel Di Nardo
Academic project based on developing a LSTM distributing it on Spark and using Tensorflow for numerical operations.
Source code: https://github.com/EmanuelOverflow/LSTM-TensorSpark
[241]large scale search with polysemous codesNAVER D2
This document discusses using polysemous codes to perform large-scale search over visual signatures. Polysemous codes allow product quantization codes to be interpreted as both compact binary codes for efficient Hamming distance search and codes that preserve distance information for accurate nearest neighbor search. The key ideas are to learn an index assignment that maps similar product quantization codes to binary codes with smaller Hamming distance, and to directly optimize this assignment to match the distances between codebook centroids. This allows using a single code representation for both fast Hamming search and precise distance search, without increasing memory requirements. The document provides examples of applying polysemous codes to build a large graph connecting images based on visual similarity.
High Performance Systems Without Tears - Scala Days Berlin 2018Zahari Dichev
The document discusses techniques for improving performance in Scala applications by reducing object allocation and improving data locality. It describes how excessive object instantiation can hurt performance by increasing garbage collection work and introducing non-determinism. Extractor objects are presented as a tool for pattern matching that can improve brevity and expressiveness. Name-based extractors introduced in Scala 2.11 avoid object allocation. The talk also covers how caching hierarchies work to reduce memory access latency and the importance of data access patterns for effective cache utilization. Cache-oblivious algorithms are designed to optimize memory hierarchy usage without knowing cache details. Synchronization is noted to have performance costs as well in an example event log implementation.
BlinkDB and G-OLA: Supporting Continuous Answers with Error Bars in SparkSQL-...Spark Summit
1. The document discusses BlinkDB and G-OLA, which are systems for supporting approximate answers to queries in SparkSQL.
2. BlinkDB uses techniques like bootstrap and Poissonized resampling to provide quick continuous error estimates for queries.
3. G-OLA enables continuous query execution on samples of data using delta update queries, which incrementally update results as new data arrives to provide answers with bounded errors within interactive time frames.
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...MLconf
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
The document discusses data stream classification and algorithms for handling data streams. It begins with an introduction to data stream characteristics and challenges. It then discusses approximation algorithms for data streams, including maintaining statistics over sliding windows. Classification algorithms for data streams discussed include Naive Bayes classifiers, perceptrons, and Hoeffding trees, which are decision trees adapted for data streams using the Hoeffding bound inequality to determine the optimal split attribute.
An overview of streaming algorithms: what they are, what the general principles regarding them are, and how they fit into a big data architecture. Also four specific examples of streaming algorithms and use-cases.
All Pairs-Shortest Path (Fast Floyd-Warshall) Code Ehsan Sharifi
Shortest path algorithms are a family of algorithms designed to solve the shortest path problem. The shortest path problem is something most people have some intuitive familiarity with: given two points, A and B, what is the shortest path between them? In computer science, however, the all shortest path problem can take different forms and so different algorithms are needed to be able to solve them all. All shortest path, as an extension of single shortest path, has been investigated since the 60s, and plays a crucial role in many applications, including network optimization and routing, traffic information systems, databases, compilers, garbage collection, interactive verification systems, robotics, dataflow analysis, and document formatting.
In this project, we implement and evaluate a fast multi-core version of the Floyd-Warshall algorithm.
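For context, here is a minimal single-threaded Floyd-Warshall sketch in Python; the project's multi-core implementation is not reproduced here, and the example graph is illustrative.

```python
INF = float("inf")

def floyd_warshall(w):
    """All-pairs shortest paths on an n x n weight matrix (INF = no edge)."""
    n = len(w)
    dist = [row[:] for row in w]        # don't mutate the input
    for k in range(n):                  # allow vertex k as an intermediate
        dk = dist[k]
        for i in range(n):
            dik = dist[i][k]
            if dik == INF:
                continue                # no path through k from i
            row = dist[i]
            for j in range(n):
                alt = dik + dk[j]
                if alt < row[j]:
                    row[j] = alt
    return dist

g = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
print(floyd_warshall(g))
```

Note that for a fixed k, the iterations of the i-loop are independent of each other, which is what makes a multi-core version (e.g., partitioning rows across threads within each k-iteration) straightforward.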
Data Stream Outlier Detection Algorithm – Hamza Aslam
This document presents a new data stream outlier detection algorithm called SODRNN, which is based on reverse k nearest neighbors. It uses a sliding window model to detect anomalies in the current window by performing outlier queries. The algorithm consists of a Stream Manager procedure that efficiently updates the window with insertions and deletions by scanning the window only once. It also includes a Query Manager procedure that can detect concept drift. Experimental results on both synthetic and real datasets show that SODRNN is effective and efficient at detecting outliers in data streams.
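SODRNN's exact procedures are not reproduced in the abstract; the toy sketch below (plain Python, plain distance-based k-NN rather than reverse k-NN, illustrative parameters) only illustrates the sliding-window outlier-query model it describes.

```python
from collections import deque

def knn_distance(point, window, k):
    """Distance from `point` to its k-th nearest neighbor in the window."""
    dists = sorted(abs(point - q) for q in window)
    return dists[k - 1]

def stream_outliers(stream, window_size=8, k=3, threshold=2.0):
    """Toy sliding-window outlier query: flag a point whose k-NN distance
    within the current window exceeds the threshold. (SODRNN itself uses
    reverse k-nearest neighbors and a single scan of the window per update;
    this sketch only illustrates the sliding-window model.)"""
    window = deque(maxlen=window_size)   # oldest point evicted on overflow
    flagged = []
    for x in stream:
        if len(window) >= k and knn_distance(x, window, k) > threshold:
            flagged.append(x)
        window.append(x)                 # insertion into the current window
    return flagged

print(stream_outliers([1.0, 1.2, 0.9, 1.1, 9.0, 1.0, 1.3, 0.8]))
```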
TensorFrames: Google TensorFlow on Apache Spark – Databricks
Presentation at Bay Area Spark Meetup by Databricks Software Engineer and Spark committer Tim Hunter.
This presentation covers how you can use TensorFrames with TensorFlow to do distributed computing on GPUs.
Impatience is a Virtue: Revisiting Disorder in High-Performance Log Analytics – Badrish Chandramouli
There is a growing interest in processing real-time queries over out-of-order streams in this big data era. This paper presents a comprehensive solution to meet this requirement. Our solution is based on Impatience sort, an online sorting technique that is based on an old technique called Patience sort. Impatience sort is tailored for incrementally sorting streaming datasets that present themselves as almost sorted, usually due to network delays and machine failures. With several optimizations, our solution can adapt to both input streams and query logic. Further, we develop a new Impatience framework that leverages Impatience sort to reduce the latency and memory usage of query execution, and supports a range of user latency requirements, without compromising on query completeness and throughput, while leveraging existing efficient in-order streaming engines and operators. We evaluate our proposed solution in Trill, a high-performance streaming engine, and demonstrate that our techniques significantly improve sorting performance and reduce memory usage – in some cases, by over an order of magnitude.
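As background, classic patience sort, the technique Impatience sort builds on, can be sketched in a few lines. This is a generic illustration, not the paper's implementation: elements are dealt into sorted runs, and almost-sorted input produces very few runs, which keeps the final merge cheap.

```python
import bisect
import heapq

def patience_runs(seq):
    """Deal the input into ascending runs: append each element to the
    leftmost run whose tail is <= it, else start a new run. Almost-sorted
    input produces very few runs, the property Impatience sort exploits
    for streams disordered by network delays and failures."""
    runs = []
    neg_tails = []                  # negated run tails, ascending for bisect
    for x in seq:
        i = bisect.bisect_left(neg_tails, -x)
        if i == len(runs):
            runs.append([x])        # no run can take x: start a new one
            neg_tails.append(-x)
        else:
            runs[i].append(x)
            neg_tails[i] = -x
    return runs

def patience_sort(seq):
    """Patience sort: deal into runs, then k-way merge them."""
    return list(heapq.merge(*patience_runs(seq)))

print(patience_sort([3, 1, 2, 5, 4]))
```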
The Case for a Signal-Oriented Data Stream Management System – Reza Rahimi
This document proposes a signal-oriented data stream management system called WaveScope. It discusses typical applications involving sensor networks, the data and programming model using a domain-specific language called WaveScript, and the system architecture involving query planning, optimization, and distributed execution. Key aspects include managing timing information across different timebases, optimizing queries using both database and signal processing techniques, and supporting archived historical data retrieval.
Hardware Acceleration for Machine Learning – CastLab, KAIST
This document provides an overview of a lecture on hardware acceleration for machine learning. The lecture will cover deep neural network models like convolutional neural networks and recurrent neural networks. It will also discuss various hardware accelerators developed for machine learning, including those designed for mobile/edge and cloud computing environments. The instructor's background and the agenda topics are also outlined.
Chronix is a domain specific time series database designed for anomaly detection in operational data. It is optimized for the needs of anomaly detection by supporting domain specific data types, analysis algorithms, data models, and query languages. It aims to address limitations of general purpose time series databases by exploiting characteristics of operational data through features like optional pre-computation of extras, timestamp compression, domain specific records and compression techniques, and multi-dimensional storage. An evaluation using data from five industry projects found that Chronix has significantly smaller memory and storage footprints and faster data retrieval and analysis times compared to other time series databases.
Chris Hillman – Beyond MapReduce: Scientific Data Processing in Real-time – Flink Forward
This document discusses processing scientific mass spectrometry data in real-time using parallel and distributed computing techniques. It describes how a mass spectrometry experiment produces terabytes of data that currently takes over 24 hours to fully process. The document proposes using MapReduce and Apache Flink to parallelize the data processing across clusters to help speed it up towards real-time analysis. Initial tests show Flink can process the data 2-3 times faster than traditional Hadoop MapReduce. Finally, it discusses simulating real-time streaming of the data using Kafka and Flink Streaming to enable processing results within 10 seconds of the experiment completing.
Tsinghua University: Two Exemplary Applications in China – DataStax Academy
In this talk, we will share our experience of applying Cassandra with two real customers in China. In the first use case, we deployed Cassandra at Sany Group, a leading machinery manufacturing company, to manage the sensor data generated by construction machinery. By designing a specific schema and optimizing the write process, we successfully managed over 1.5 billion historical data records and achieved an online write throughput of 10k write operations per second with 5 servers. MapReduce is also used on Cassandra for value-added services, e.g. operations management, machine failure prediction, and abnormal behavior mining. In the second use case, Cassandra is deployed at the China Meteorological Administration to manage meteorological data. We design a hybrid schema to support both slice queries and time-window-based queries efficiently. We also explored optimized compaction and deletion strategies for meteorological data in this case.
ApacheCon 2020: Use Cases and Optimizations of IoTDB – ZhangZhengming
This document summarizes a presentation about IoTDB, an open source time series database optimized for IoT data. It discusses IoTDB's architecture, use cases, optimizations, and common questions. Key points include that IoTDB uses a time-oriented storage engine and tree-structured schema to efficiently store and query IoT sensor data, and that optimizations like schema design, memory allocation, and handling out-of-order data can improve performance. Common issues addressed relate to version compatibility, system load, and error conditions.
Flink Forward Berlin 2017: Dongwon Kim – Predictive Maintenance with Apache F… – Flink Forward
SK telecom shares our experience of using Flink in building a solution for Predictive Maintenance (PdM). Our PdM solution named metatron PdM consists of (1) a Deep Neural Network (DNN)-based prediction model for precise prediction, and (2) a Flink-based runtime system which applies the model to a sliding window on sensor data streams. Efficient handling of multi-sensor streaming data for real-time prediction of equipment condition is a critical component of our product. In this talk, we first show why we choose Flink as a core engine for our streaming use case in which we generate real-time predictions using DNNs trained with Keras on top of TensorFlow and Theano. In addition, we present a comparative study of methods to exploit learning models on JVM such as directly using Python libraries on CPython embedded in JVM, using TensorFlow Java API (including Flink TensorFlow), and making RPC calls to TensorFlow Serving. We then explain how we implement the runtime system using Flink DataStream API, especially with event time, various window mechanisms, timestamp and watermark, custom source and sink, and checkpointing. Lastly, we present how we use the official Flink Docker image for solution delivery and the Flink metric system for monitoring and management of our solution. We hope our use case sets a good example of building a DNN-based streaming solution using Flink.
Convolutional Neural Networks for Speech-Controlled Prosthetic Hands – Mohsen Jafarzadeh
Speech recognition is one of the key topics in artificial intelligence, as it is one of the most common forms of communication in humans. Researchers have developed many speech-controlled prosthetic hands in the past decades, utilizing conventional speech recognition systems that use a combination of neural network and hidden Markov model. Recent advancements in general-purpose graphics processing units (GPGPUs) enable intelligent devices to run deep neural networks in real-time. Thus, state-of-the-art speech recognition systems have rapidly shifted from the paradigm of composite subsystems optimization to the paradigm of end-to-end optimization. However, a low-power embedded GPGPU cannot run these speech recognition systems in real-time. In this paper, we show the development of deep convolutional neural networks (CNN) for speech control of prosthetic hands that run in real-time on an NVIDIA Jetson TX2 developer kit. First, the device captures and converts speech into 2D features (such as a spectrogram). The CNN receives the 2D features and classifies the hand gestures. Finally, the hand gesture classes are sent to the prosthetic hand motion control system. The whole system is written in Python with Keras, a deep learning library that has a TensorFlow backend. Our experiments on the CNN demonstrate 91% accuracy and a 2 ms running time for producing hand-gesture classes (text output) from speech commands, which can be used to control prosthetic hands in real-time.
2019 First International Conference on Transdisciplinary AI (TransAI), Laguna Hills, California, USA, 2019, pp. 35-42
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
- Spark Streaming allows processing of live data streams using Spark's batch processing engine by dividing streams into micro-batches.
- A Spark Streaming application consists of input streams, transformations on those streams such as maps and filters, and output operations. The application runs continuously processing each micro-batch.
- Key aspects of operationalizing Spark Streaming jobs include checkpointing to ensure fault tolerance, optimizing throughput by increasing parallelism, and debugging using Spark UI.
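The micro-batch idea can be modeled without Spark at all. The following pure-Python sketch (hypothetical event shape and function names, not the Spark Streaming API) shows a stream divided into micro-batches, a filter/map transformation applied per batch, and an output operation per batch.

```python
def micro_batches(stream, batch_size):
    """Group an (in practice unbounded) stream into micro-batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def run_job(stream, batch_size):
    """Apply transformations batch-by-batch, like a DStream pipeline:
    filter click events, then project the URL field."""
    out = []
    for batch in micro_batches(stream, batch_size):
        urls = [e["url"] for e in batch if e["type"] == "click"]
        out.append(urls)            # the "output operation" per micro-batch
    return out

events = [
    {"type": "click", "url": "/a"},
    {"type": "view",  "url": "/b"},
    {"type": "click", "url": "/c"},
    {"type": "click", "url": "/d"},
    {"type": "view",  "url": "/e"},
]
print(run_job(events, batch_size=2))
```

In real Spark Streaming the batching is driven by a wall-clock interval rather than a count, and each micro-batch is executed by the regular Spark batch engine.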
This document provides an overview of using ClickHouse and Grafana for DNS analytics. Some key points:
- ClickHouse is a column-oriented database that is fast, scalable, and easy to use for analytics on large datasets like DNS logs.
- Grafana is used to visualize the DNS data by connecting it as a data source to ClickHouse.
- Examples show querying ClickHouse to analyze DNS data and identify top clients by ASN, response types, and flag combinations. Visualizations like histograms are also demonstrated.
- The installation process outlines adding the ClickHouse and Grafana repositories, installing the packages, and configuring the ClickHouse data source plugin for Grafana.
This document summarizes a research paper on Google's globally distributed database called Spanner. Spanner provides strong consistency and transactions across globally distributed data. It addresses the need for a scalable database to replace Google's sharded MySQL deployment. Spanner uses TrueTime to synchronize clocks across datacenters with bounded uncertainty. It assigns timestamps to transactions using this synchronized time to ensure consistency. Spanner supports different types of transactions like read-write, read-only, and snapshot reads through its consistency and concurrency control mechanisms. Evaluation results show Spanner can provide low latency, high throughput and availability even during leader failures.
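Spanner's commit-wait rule can be illustrated with a toy model; the EPSILON bound, function names, and busy-wait below are illustrative assumptions, not Spanner's implementation.

```python
import time

EPSILON = 0.002   # assumed clock-uncertainty bound (2 ms) for this toy model

def tt_now():
    """TrueTime-style interval [earliest, latest] containing the true time."""
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)

def commit():
    """Sketch of Spanner's commit-wait: choose s = TT.now().latest as the
    commit timestamp, then wait until TT.now().earliest > s before making
    the commit visible, so the timestamp is guaranteed to already be in
    the past at every replica."""
    s = tt_now()[1]
    while tt_now()[0] <= s:       # in Spanner this wait overlaps replication
        pass
    return s

s = commit()
assert tt_now()[0] > s            # s is now safely in the past
```

The wait lasts about twice the uncertainty bound, which is why Spanner invests in GPS and atomic clocks to keep that bound small.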
This document discusses end-to-end processing of 3.7 million telemetry events per second using a lambda architecture at Symantec. It provides an overview of Symantec's security data lake infrastructure, the telemetry data processing architecture using Kafka, Storm and HBase, tuning targets for the infrastructure components, and performance benchmarks for Kafka, Storm and Hive.
This document provides an overview of deep learning and its applications. It discusses how deep learning can be used for image classification and how neural networks learn hierarchical representations from data. The document highlights some of the challenges of deep learning, such as the large amounts of data and computation required. It also covers how deep learning models can be deployed in production using services like Amazon Web Services to ensure low latency, high availability, and continuous learning.
Real-Time Big Data with Storm, Kafka and GigaSpaces – Oleksii Diagiliev
This document discusses building a real-time analytics system like Google Analytics using Storm, Kafka, and GigaSpaces. It describes the key components needed: a spout to read page view data from Kafka, Trident bolts to calculate metrics like top URLs, active users, and geographic information, and a time series bolt to track page views over time. The architecture allows for highly scalable, low-latency analysis of streaming page view data in real-time.
Strata Singapore: Gearpump – Real-time DAG Processing with Akka at Scale – Sean Zhong
Gearpump is an Akka-based real-time streaming engine that uses actors to model everything. It offers excellent performance and flexibility: 18,000,000 messages/second with 8 ms latency on a cluster of 4 machines.
This document summarizes an update on OpenTSDB, an open source time series database. It discusses OpenTSDB's ability to store trillions of data points at scale using HBase, Cassandra, or Bigtable as backends. Use cases mentioned include systems monitoring, sensor data, and financial data. The document outlines writing and querying functionality and describes the data model and table schema. It also discusses new features in OpenTSDB 2.2 and 2.3 like downsampling, expressions, and data stores. Community projects using OpenTSDB are highlighted and the future of OpenTSDB is discussed.
Provenance for Data Munging Environments – Paul Groth
Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e. its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses the problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
Apache Hadoop has emerged as the storage and processing platform of choice for Big Data. In this tutorial, I will give an overview of Apache Hadoop and its ecosystem, with specific use cases. I will explain the MapReduce programming framework in detail, and outline how it interacts with Hadoop Distributed File System (HDFS). While Hadoop is written in Java, MapReduce applications can be written using a variety of languages using a framework called Hadoop Streaming. I will give several examples of MapReduce applications using Hadoop Streaming.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Securing your Kubernetes cluster: a step-by-step guide to success! – KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Removing Uninteresting Bytes in Software Fuzzing – Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher overall coverage. This work thus showcases how starting with lean, optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Generative AI Deep Dive: Advancing from Proof of Concept to Production – Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using … – James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Enhancing Adoption of Open Source Libraries: A Case Study on Albumentations.AI – Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Climate Impact of Software Testing at Nordic Testing Days – Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint: a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Pushing the limits of ePRTC: 100ns holdover for 100 days – Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe – Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you… – Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect their personal devices and information.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack – shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Slide 15 – Trill compiles queries into tight columnar code. For the user query

    str.Where(e => e.User % 100 < 5);

the application only calls Send(events) and later Receive(results). Internally, Trill generates a per-batch operator that scans the User column and marks rows failing the predicate in the batch's bitvector:

    On(Batch b) {
      for i = 0 to b.Size {
        if (!(b.c_User[i] % 100 < 5))
          set b.bitvector[i]
      }
      next-operator.On(b)
    }
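The columnar batch idea in the generated code above can be sketched in Python; the class and field names below are illustrative, not Trill's.

```python
class Batch:
    """A columnar batch: one array per field plus a bitvector marking
    filtered-out rows (1 = row is absent), mirroring the generated code."""
    def __init__(self, users):
        self.c_user = list(users)         # the "User" column
        self.size = len(users)
        self.bitvector = [0] * self.size

def where_user_mod(batch):
    # Equivalent of: str.Where(e => e.User % 100 < 5)
    for i in range(batch.size):
        if not (batch.c_user[i] % 100 < 5):
            batch.bitvector[i] = 1        # mark row as filtered out
    return batch

b = where_user_mod(Batch([3, 104, 250, 501, 99]))
survivors = [u for u, dead in zip(b.c_user, b.bitvector) if not dead]
print(survivors)
```

Marking rows in a bitvector instead of copying survivors keeps the loop branch-light and cache-friendly, which is the point of the columnar design.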
Slide 19 – Lots of “signals” in stream data; IoT workflows combine relational and signal logic. Example sensor stream (grouped by ID, then unioned):

    ID  Time     Value
    0   0:42:19  67
    1   0:42:22  80
    2   0:42:22  85
    0   0:42:23  69
    2   0:42:24  85

Typical tasks: remove noise, interpolate missing data, find periodicity, discard invalid data, correlate live data with history. Such pipelines mix relational operators (σ, ⋈) with DSP operators. Which tools should we use to build such apps?
Slide 20 – Two experts, two worlds:

Data processing expert
    Engines: stream engines, DBMSs, MPP systems
    Data model: (tempo-)relational
    Language: declarative (SQL, LINQ, functional)
    Scenarios: real-time, offline, progressive

Digital signal processing expert
    Engines: MATLAB, R
    Data model: array
    Language: imperative (array languages, C)
    Scenarios: mostly offline, some real-time

How to reconcile the two worlds? Our solution: high performance (two orders of magnitude faster), one query language, and familiar abstractions for both worlds.
Slide 22 – A first attempt: use a stream engine for relational queries and R for highly optimized DSP operations (the slide shows a small filter dataflow mapping inputs x0–x2 to outputs y0–y2). Problem: impedance mismatch between the stream processing system and R.
Slide 23 – TrillDSP:
- Unified query model over non-uniform and uniform signals
- Type-safe mix of stream and signal operators
- Array-based extensibility framework: the DSP operator writer sees arrays, and incremental computation is supported
- A “walled garden” on top of Trill: no changes to the data model, inheriting Trill's efficient processing capability (e.g., grouped computation)
Slide 25 – From streams to signals. An aggregate query transitions from the stream domain to the signal domain, e.g. counting qualifying input events over time:

    var signal = stream.Where(e => e.Value < 100).Count();

Stream operators can also be used to build signal operators; for example, adding two signals is a temporal join of two streams:

    left.Join(right, (l, r) => l + r)

All of these are type-safe operations.
Slide 26 – Sampling with interpolation turns a signal into a uniform signal. Input events may be misaligned with the sampling grid (here 30, 60, 90, …, 210) or missing altogether; output events are produced at each grid point, interpolating where necessary:

    var uniformSignal = signal.Sample(30, 0, ip => ip.Linear(60));

Here 30 is the sampling period, 0 the offset, and 60 the linear-interpolation window.
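A toy analogue of this sampling operator in Python; the function name and the (time, value) event representation are assumptions for illustration.

```python
def sample_linear(events, period, offset=0, window=None):
    """Resample a non-uniform signal onto a uniform grid by linear
    interpolation between neighboring events. A grid point is emitted
    only if its bracketing events are at most `window` apart
    (None = no limit). `events` is a time-sorted list of (time, value)."""
    out = []
    n = len(events)
    if n == 0:
        return out
    t = offset
    while t < events[0][0]:             # skip grid points before first event
        t += period
    j = 0
    while t <= events[-1][0]:
        while j + 1 < n and events[j + 1][0] <= t:
            j += 1                      # advance to the event at or before t
        if events[j][0] == t:
            out.append((t, events[j][1]))
        else:
            (t0, v0), (t1, v1) = events[j], events[j + 1]
            if window is None or t1 - t0 <= window:
                out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        t += period
    return out

# Misaligned, gappy readings sampled every 30 time units:
readings = [(5, 10.0), (35, 20.0), (95, 40.0)]
print(sample_linear(readings, 30, window=60))
```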
Slide 27 – Arrays are exposed only inside the windowing operator. A windowed FFT pipeline over a uniform signal (window size 512, hop 256), applying a function f in the frequency domain and unwindowing with a sum:

    var query = uniformSignal
        .Window(512, 256,
                w => w.FFT().Select(a => f(a)).IFFT(),
                a => a.Sum());

The DSP pipeline and its arrays are instantiated only once, giving better data management.
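The window/FFT/unwindow shape can be sketched in pure Python with a naive DFT (illustrative only; a real implementation would use an FFT library, and TrillDSP uses FFTW).

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def windowed_pipeline(signal, size, hop, spectral_fn):
    """Overlapping windows -> DFT -> per-bin function -> inverse DFT ->
    "unwindow" by summing overlapping samples (overlap-add): the shape of
    uniformSignal.Window(size, hop, w => w.FFT().Select(f).IFFT(), a => a.Sum())."""
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - size + 1, hop):
        win = signal[start:start + size]
        rec = idft([spectral_fn(k, X) for k, X in enumerate(dft(win))])
        for i, v in enumerate(rec):       # unwindow: sum overlapping pieces
            out[start + i] += v
    return out

# With an identity spectral function, interior samples are covered by
# size/hop = 2 windows and so come back doubled.
sig = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = windowed_pipeline(sig, size=4, hop=2, spectral_fn=lambda k, X: X)
```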
Slide 28 – DSP experts write array-to-array operators, including incremental DSP operators (e.g., a WindowHop → FFT → f → IFFT pipeline that reuses state across overlapping windows), and can leverage Trill's grouping power.
Slide 29 – Performance (per sensor: windowed FFT → function → inverse FFT → unwindow; 100 groups in the stream; datasets pre-loaded in memory; times normalized to TrillDSP on 16 cores). TrillDSP on a single core is up to two orders of magnitude faster than MATLAB, SparkR (16 cores), and SciDB-R (16 cores). The performance benefits come from efficient group processing and group-aware DSP windowing, circular arrays to manage overlapping windows, and use of the FFTW library.
Slide 35 – Levels of speculation for locating a field. At the structural index level, the field name “id” is first resolved to a logical position (“id” is the 3rd attribute) and then to a physical position (“id” is at the 20th byte). At the speculation level, the field name is mapped directly to a physical position (“id” is at the 20th byte).
Slide 41 – Sharded streams support operations for querying, data movement, and keying:

    Operation     Description
    Query         Applies an unmodified query on each (keyed) shard
    Broadcast     Duplicates each shard's contents on all shards
    Multicast     Copies tuples from each input shard to zero or more specific result shards
    ReShard       Load-balances across shards
    ReDistribute  Moves tuples so that the same key resides in the same result shard
    ReKey         Changes the key associated with each row in each shard
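A few of these sharding operations can be sketched over in-memory shards, with each shard a list of (key, value) rows; the function names and representation are illustrative, not an engine's actual API.

```python
def rekey(shards, key_fn):
    """ReKey: change the key associated with each (key, value) row."""
    return [[(key_fn(k, v), v) for k, v in shard] for shard in shards]

def redistribute(shards, num_result_shards):
    """ReDistribute: move rows so that all rows with the same key land in
    the same result shard (hash partitioning)."""
    result = [[] for _ in range(num_result_shards)]
    for shard in shards:
        for key, value in shard:
            result[hash(key) % num_result_shards].append((key, value))
    return result

def broadcast(shards):
    """Broadcast: duplicate every shard's contents on all shards."""
    everything = [row for shard in shards for row in shard]
    return [list(everything) for _ in shards]

shards = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
print(redistribute(shards, 2))
```

ReDistribute is what makes a subsequent per-shard Query on a keyed stream equivalent to a global grouped query, since all rows for a key are co-located.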