Kevin Ongkowijaya interned at StarHub from May to August 2016 and performed outstanding work on an end-to-end data analytics project. He completed the challenging project involving multiple phases and skill sets with little supervision. Throughout the internship, Kevin demonstrated a proactive attitude, strong sense of ownership, and effectively picked up new technical skills. He was a team player who communicated well and was willing to work extra hours. The manager was very satisfied with Kevin's performance and believes he will contribute strongly in the big data analytics industry given the right opportunity.
Calidad Seis Sigma con R: Aplicación a la docencia [Six Sigma Quality with R: An Application to Teaching] (Emilio L. Cano)
This document discusses using R software to support Six Sigma methodology. It introduces reproducible research approaches for statistical training, provides examples using Sweave documents to integrate R code and LaTeX, and outlines an EADAPU training program covering Six Sigma phases and tools. The document also describes using R for process mapping, loss function analysis, and measurement system analysis for quality improvement projects.
The Klima iOS app aims to get people into art galleries by providing useful and relatable information. It retrieves weather data through an API and uses the current conditions to suggest local art that features similar weather. The app then displays this art along with the current weather, gallery opening times, and website links. Technically, it uses the OpenWeatherMap API for weather data and the Nasjonal Museet API to execute queries to find matching art based on the weather conditions.
2015 FOSS4G Track: Visualization and Analysis of Spatiotemporal Data using Fr... (GIS in the Rockies)
The document discusses the National Renewable Energy Laboratory's (NREL) transition from proprietary to open source software for spatial analysis and application development. It describes NREL's geospatial data science team and their work analyzing renewable energy and energy efficiency data. The transition to open source provided benefits like flexibility, cost savings, and promoting integrated architecture, but also challenges like difficulty staffing roles that require expertise across multiple open source technologies. Examples of projects using open source include a renewable electricity futures study visualization and various spatial applications.
OGC SensorThings API Get Started Webinar Series #3 of 4. (Dec 10 2015)
Title: RESTful Pattern for IoT API
More to come:
#4: Connect Sensors and IoT Devices to SensorThings API (Dec 17th 2015)
Register for our webinar here: http://sensorup.com/#signup
Rhea: Adaptively Sampling Authoritative Content from Social Activity Streams (Panagiotis Liakos)
The document summarizes the Rhea algorithm for adaptively sampling authoritative content from social activity streams. Rhea forms a network of authoritative users as it processes the stream and samples only content from the top-K authoritative users based on an auth-value measure. It addresses challenges of maintaining user information efficiently, ranking users, and filtering irrelevant content. Experimental results on Twitter and StackOverflow data show Rhea outperforms white-list baselines in terms of precision, recall, and ranking accuracy of the sampled documents.
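The summary above leaves the auth-value measure unspecified, so the following is only an illustrative sketch of top-K authoritative sampling. The stream schema and the update rule (an author gains one point per reply received) are placeholders, not Rhea's actual design:

```python
import heapq
from collections import defaultdict

def sample_stream(stream, k):
    """Emit only items whose author is currently among the top-k users
    by auth-value. Placeholder update rule: an author gains one point
    whenever another user replies to them."""
    auth = defaultdict(float)
    sampled = []
    for author, replied_to in stream:  # hypothetical (author, replied_to) schema
        if replied_to is not None:
            auth[replied_to] += 1.0
        top_k = set(heapq.nlargest(k, auth, key=auth.get))
        if author in top_k:
            sampled.append((author, replied_to))
    return sampled
```

Recomputing the top-k per item like this is naive; doing such bookkeeping efficiently online is exactly the challenge the paper addresses.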
AN EMPIRICAL STUDY OF THE RELATION BETWEEN STRONG CHANGE COUPLING AND DEFECTS... (Igor Wiese)
This study investigated the relationship between strong change couplings and defects in the Apache Aries project. The researchers found that strong change couplings were moderately correlated with defects and that the majority were associated with at least one defect. Models using historical and social metrics achieved high accuracy in identifying strong change couplings. The best metrics included discussion length, number of committers, committer experience, number of defects, and the number of weeks between the first and last commit. The models correctly predicted 45.67% of strong change couplings linked to post-release defects. The researchers concluded that strong change couplings influence code quality and aim to further investigate their impact and how to monitor and track the "damage" they cause.
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic... (NAVER Engineering)
Presenter: Yeon-Chang Lee (Ph.D. candidate, Hanyang University)
Date: February 2018
We investigate how to address the shortcomings of popular One-Class Collaborative Filtering (OCCF) methods in handling challenging "sparse" datasets in the one-class setting (e.g., clicked or bookmarked items), and propose a novel graph-theoretic OCCF approach, named gOCCF, that exploits both positive preferences (derived from rated items) and negative preferences (derived from unrated items). Capturing both positive and negative preferences as a bipartite graph, we further apply graph shattering theory to determine the right amount of negative preferences to use. We then develop a suite of novel graph-based OCCF methods based on random walk with restart and belief propagation. Through extensive experiments on 3 real-life datasets, we show that gOCCF effectively addresses the sparsity challenge and significantly outperforms all 8 competing methods in accuracy on very sparse datasets, while providing accuracy comparable to the best-performing OCCF methods on less sparse datasets.
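Of the two methods the abstract names, random walk with restart is the simpler to illustrate. Below is a minimal power-iteration sketch on a toy bipartite user-item graph; the node names are invented and this is not the authors' implementation:

```python
from collections import defaultdict

def rwr_scores(edges, start, restart=0.15, iters=100):
    """Random walk with restart via plain power iteration on an
    undirected bipartite user-item graph. Returns each node's
    visiting probability; item scores can be read off directly."""
    adj = defaultdict(list)
    for u, i in edges:
        adj[u].append(i)
        adj[i].append(u)
    p = {n: 1.0 if n == start else 0.0 for n in adj}
    for _ in range(iters):
        nxt = dict.fromkeys(adj, 0.0)
        for n, mass in p.items():
            share = (1.0 - restart) * mass / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        nxt[start] += restart  # teleport back to the target user
        p = nxt
    return p
```

Because the update is an affine contraction with factor (1 - restart), the iteration converges regardless of the bipartite graph's periodicity; items close to the target user end up with the highest scores.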
Event Detection and Characterization in Dynamic Graphs (Shebuti Rayana)
The document presents a framework for event detection and characterization in dynamic graphs. It proposes an ensemble approach that uses multiple algorithms for event detection, including eigen-behavior based detection, a probabilistic approach, and SPIRIT. The algorithms produce different scores and rankings that are merged through consensus methods. The approach is evaluated on two datasets: a cyber network dataset with ground truths and a New York Times corpus without ground truths. Major events are successfully detected in both datasets.
The document discusses the application of machine learning techniques to 21cm cosmology studies. It describes how artificial neural networks (ANNs) can be used as emulators to rapidly predict 21cm power spectra from cosmological parameters, bypassing the need for computationally expensive simulations. This allows ANNs to be combined with Markov chain Monte Carlo methods to efficiently estimate parameter posteriors. ANNs can also be applied to directly estimate parameters from 21cm power spectra or lightcones. The document outlines some open questions around fully characterizing uncertainties and obtaining rigorous posteriors when using ANN-based approaches in 21cm cosmology.
Self-managed and automatically reconfigurable stream processing (Vasia Kalavri)
With its superior state management and savepoint mechanism, Apache Flink is unique among modern stream processors in supporting minimal-effort job reconfiguration. Savepoints are extensively used to enable dynamic scaling, bug fixing, upgrades, and numerous other reconfiguration use cases, all while preserving exactly-once semantics. However, when it comes to dynamic scaling, the burden of reconfiguration decisions (when and how much to scale) currently falls on the user.
In this talk, I share our recent work at ETH Zurich on support for self-managed and automatically reconfigurable stream processing. I present SnailTrail (NSDI '18), an online critical-path analysis module that detects bottlenecks and provides insights into streaming application performance, and DS2 (OSDI '18), an automatic scaling controller that identifies optimal backpressure-free configurations and operates reactively online. Both SnailTrail and DS2 are integrated with Apache Flink and publicly available. I conclude with evaluation results, ongoing work, and future challenges in this area.
Event Detection and Characterization in Dynamic Graphs (Shebuti Rayana)
This document discusses event detection and characterization in dynamic graphs. It presents an ensemble approach that uses multiple algorithms for event detection, including eigen-behavior based detection (EBED), a probabilistic approach (PTSAD), and SPIRIT. The approaches are combined using consensus methods like rank merging and scoring to provide a "better" result than individual algorithms. The framework is evaluated on two datasets: a cyber network flow dataset and the New York Times news corpus, detecting events like elections, disasters and attacks. The document concludes by encouraging judging based on questions rather than answers.
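The rank-merging consensus described above can be realized in several ways; averaging each item's rank across detectors is one of the simplest (a sketch under that assumption, not necessarily the deck's exact consensus function):

```python
def consensus_rank(rankings):
    """Merge several detectors' rankings of the same items by average
    rank (lower mean rank = more anomalous); ties broken by name."""
    items = set(rankings[0])
    mean_rank = {it: sum(r.index(it) for r in rankings) / len(rankings)
                 for it in items}
    return sorted(items, key=lambda it: (mean_rank[it], it))
```

An item flagged near the top by every detector keeps a low mean rank, while one flagged by only a single detector is pushed down, which is the intuition behind the ensemble beating its individual members.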
1. The document discusses recommender systems for processing data streams in real time. It introduces different types of recommender systems including unpersonalized, collaborative filtering, and content-based filtering approaches.
2. It then discusses challenges specific to recommending news, such as the large volume of new articles published daily and changing relevance of content over time.
3. Finally, it addresses big data issues related to news recommendation and introduces a system that provides researchers access to large news datasets to test different recommendation approaches.
Tutorial: Context In Recommender Systems (YONG ZHENG)
This document provides an overview of a tutorial on context-aware recommender systems. The tutorial will cover traditional recommendation techniques, context-aware recommendation which incorporates additional contextual information such as time and location, and context suggestion. It includes an agenda with topics, background information on recommender systems and evaluation metrics, and descriptions of techniques for context-aware recommendation including context filtering and modeling.
Fast Feature Selection for Learning to Rank - ACM International Conference on... (Andrea Gigli)
My talk on fast feature selection filter algorithms at the ACM International Conference on the Theory of Information Retrieval (ICTIR 2016) held in Newark, DE, US
On Unified Stream Reasoning - The RDF Stream Processing realm (Daniele Dell'Aglio)
Slides from my talk at WU Vienna on 18 February 2016. I discuss the problem of unifying existing solutions for processing semantic streams, with a particular focus on those that perform continuous query answering over RDF streams.
Keynote of HOP-Rec @ RecSys 2018
Presenter: Jheng-Hong Yang
These slides are complementary material for the short paper HOP-Rec (RecSys 2018). They explain the intuition and some of the abstract ideas behind the descriptions and mathematical symbols through plots and figures.
Forecasting time series powerful and simple (Ivo Andreev)
A time series is a sequence of data points ordered in time. Time series forecasting has two main purposes: to understand the mechanisms behind rises and falls, and to predict future values. It often involves analyzing trends, cyclical events, and seasonality, and is uniquely important in economics and business. Because of temporal dependencies on previous data points, the quality of predictions can be evaluated only in the future, and there are many model types for approximation. In this session we talk about challenges, ways of improvement, and a technology stack including ML.NET, ARIMA, Python, Azure ML, regression, and FB Prophet.
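As a concrete reference point (not taken from the talk), the seasonal-naive baseline is the yardstick that fancier models like ARIMA or Prophet are usually expected to beat:

```python
def seasonal_naive(history, season, horizon):
    """Seasonal-naive baseline: forecast each future point with the
    value observed exactly one full season earlier."""
    return [history[-season + (h % season)] for h in range(horizon)]
```

For monthly data with yearly seasonality (season=12), next March is simply predicted to look like last March; if a sophisticated model cannot beat this, its extra complexity is not paying off.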
Incremental View Maintenance for openCypher Queries (Gábor Szárnyas)
Presented at the Fourth openCypher Implementers Meeting
Numerous graph use cases require continuous evaluation of queries over a constantly changing data set, e.g. fraud detection in financial systems, recommendations, and checking integrity constraints. For relational systems, incremental view maintenance has been researched for three decades, resulting in a wide body of literature. The property graph data model and the openCypher language, however, are recent developments, and therefore lack established techniques to perform efficient view maintenance. In this talk, we give an overview of the view maintenance problem for property graphs, discuss why it is particularly difficult and present an approach that tackles a meaningful subset of the language.
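The incremental idea can be illustrated on a toy pattern: maintaining the count of two-edge paths under edge insertions by applying only the delta, instead of re-running the pattern query from scratch. This is a stand-in example, not the talk's actual technique:

```python
from collections import defaultdict

class TwoHopCounter:
    """Incrementally maintains the number of two-edge paths a->b->c
    while edges are inserted, applying only the delta each time
    instead of re-evaluating the pattern over the whole graph."""
    def __init__(self):
        self.out = defaultdict(set)   # successors of each node
        self.inn = defaultdict(set)   # predecessors of each node
        self.count = 0
    def insert(self, a, b):
        if b in self.out[a]:
            return  # edge already present, view unchanged
        # the new edge a->b completes paths x->a->b and a->b->y
        self.count += len(self.inn[a]) + len(self.out[b])
        self.out[a].add(b)
        self.inn[b].add(a)
```

Each insertion costs O(1) dictionary work rather than a full re-evaluation; the hard part for a real openCypher engine is generating such delta rules for arbitrary patterns with filters and aggregations.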
Mba om 14_statistical_qualitycontrolmethods (Niranjana K.R.)
This document provides an overview of statistical quality control techniques including:
- Describing categories of statistical quality control and how to measure quality characteristics.
- Explaining sources of variation, process capability, and how to set control limits for control charts.
- Detailing different types of control charts for variables and attributes including x-bar, R, p, and c charts.
- Defining three sigma and six sigma process capability and how they relate to acceptable defect levels.
- Discussing challenges in measuring quality in service organizations and potential metrics that could be monitored.
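The three-sigma idea behind these control charts can be sketched for an x-bar chart. Note this simplified version estimates sigma directly from the subgroup means rather than via the textbook A2 * R-bar constants:

```python
from statistics import mean, stdev

def xbar_limits(subgroups):
    """Three-sigma control limits for an x-bar chart. Simplified:
    sigma of the subgroup means is estimated directly from the data
    rather than via the standard A2 * R-bar factors."""
    means = [mean(g) for g in subgroups]
    center = mean(means)           # grand mean (center line)
    sigma = stdev(means)           # spread of subgroup means
    return center - 3 * sigma, center, center + 3 * sigma
```

A subgroup mean falling outside the returned limits signals a likely assignable cause; under three-sigma limits a stable process triggers such a false alarm only about 0.27% of the time.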
This document discusses randomized data structures and algorithms. It begins by motivating randomized data structures as a way to transform average case runtimes into expected runtimes that are not dependent on specific inputs. It then provides examples of randomized data structures like treaps and randomized skip lists that provide efficient operations like insertion, deletion, and search in expected logarithmic time. It also discusses how randomization can be applied in algorithms like primality testing.
This document discusses randomized data structures and algorithms. It begins by motivating randomized data structures by noting that some data structures like binary search trees have average case performance but worst case inputs. Randomizing the data structure removes dependency on inputs and provides expected case performance. The document then discusses treaps and randomized skip lists as examples of randomized data structures that provide efficient expected case performance for operations like insertion, deletion, and search. It also covers topics like randomized number generation, primality testing, and how randomization can transform average case runtimes into expected case runtimes.
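The randomized primality testing mentioned above is usually Miller-Rabin, where each random base catches a composite with probability at least 3/4, so a handful of rounds makes an error vanishingly unlikely; a standard sketch:

```python
import random

def is_probable_prime(n, rounds=20):
    """Miller-Rabin probabilistic primality test: a composite number
    survives each random-base round with probability at most 1/4."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):   # screen out small factors first
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:                 # write n-1 = d * 2^r with d odd
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)              # modular exponentiation
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False              # a witnesses that n is composite
    return True
```

This is the classic example of randomization the slides allude to: no fixed input can be "bad" for the algorithm, because the bases are chosen at random at run time.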
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Natural Language Processing (NLP), RAG and its applications .pptx (fkyes25)
1. In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
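The retrieval step that RAG adds before generation can be caricatured with simple word-overlap ranking. Real systems use dense embeddings and approximate nearest-neighbor search; this toy only shows where retrieved context enters the pipeline:

```python
def retrieve(query, docs, k=2):
    """Toy retrieval step of a RAG pipeline: rank documents by word
    overlap with the query and return the top-k, which a generator
    would then receive as extra context alongside the question."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return ranked[:k]
```

The generator then conditions on both the question and the returned passages, which is how RAG injects up-to-date external knowledge without storing it all in model parameters.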
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today's world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; (4) they are portable: while the SQL logic is implemented in only one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
More Related Content
Similar to Less is More: Building Selective Anomaly Ensembles with Application to Event Detection in Temporal Graphs
Event Detection and Characterization in Dynamic GraphsShebuti Rayana
The document presents a framework for event detection and characterization in dynamic graphs. It proposes an ensemble approach that uses multiple algorithms for event detection, including eigen-behavior based detection, a probabilistic approach, and SPIRIT. The algorithms produce different scores and rankings that are merged through consensus methods. The approach is evaluated on two datasets: a cyber network dataset with ground truths and a New York Times corpus without ground truths. Major events are successfully detected in both datasets.
The document discusses the application of machine learning techniques to 21cm cosmology studies. It describes how artificial neural networks (ANNs) can be used as emulators to rapidly predict 21cm power spectra from cosmological parameters, bypassing the need for computationally expensive simulations. This allows ANNs to be combined with Markov chain Monte Carlo methods to efficiently estimate parameter posteriors. ANNs can also be applied to directly estimate parameters from 21cm power spectra or lightcones. The document outlines some open questions around fully characterizing uncertainties and obtaining rigorous posteriors when using ANN-based approaches in 21cm cosmology.
Self-managed and automatically reconfigurable stream processingVasia Kalavri
With its superior state management and savepoint mechanism, Apache Flink is unique among modern stream processors in supporting minimal-effort job reconfiguration. Savepoints are being extensively used to enable dynamic scaling, bug fixing, upgrades, and numerous other reconfiguration use-cases, all while preserving exactly-once semantics. However, when it comes to dynamic scaling, the burden of reconfiguration decisions -when and how much to scale- is currently placed on the user.
In this talk, I share our recent work at ETH Zurich on providing support for self-managed and automatically reconfigurable stream processing. I present SnailTrail (NSDI’18), an online critical path analysis module that detects bottlenecks and provides insights on streaming application performance, and DS2 (OSDI’18), an automatic scaling controller which identifies optimal backpressure-free configurations and operates reactively online. Both SnailTrail and DS2 are integrated with Apache Flink and publicly available. I conclude with evaluation results, ongoing work, and and future challenges in this area.
With its superior state management and savepoint mechanism, Apache Flink is unique among modern stream processors in supporting minimal-effort job reconfiguration. Savepoints are being extensively used to enable dynamic scaling, bug fixing, upgrades, and numerous other reconfiguration use-cases, all while preserving exactly-once semantics. However, when it comes to dynamic scaling, the burden of reconfiguration decisions -when and how much to scale- is currently placed on the user.
In this talk, I will share our recent work at ETH Zurich on providing support for self-managed and automatically reconfigurable stream processing. I will present SnailTrail (NSDI’18), an online critical path analysis module that detects bottlenecks and provides insights on streaming application performance, and DS2 (OSDI’18), an automatic scaling controller which identifies optimal backpressure-free configurations and operates reactively online. Both SnailTrail and DS2 are integrated with Apache Flink and publicly available. I will conclude with evaluation results, ongoing work, and and future challenges in this area.
Event Detection and Characterization in Dynamic GraphsShebuti Rayana
This document discusses event detection and characterization in dynamic graphs. It presents an ensemble approach that uses multiple algorithms for event detection, including eigen-behavior based detection (EBED), a probabilistic approach (PTSAD), and SPIRIT. The approaches are combined using consensus methods like rank merging and scoring to provide a "better" result than individual algorithms. The framework is evaluated on two datasets: a cyber network flow dataset and the New York Times news corpus, detecting events like elections, disasters and attacks. The document concludes by encouraging judging based on questions rather than answers.
1. The document discusses recommender systems for processing data streams in real time. It introduces different types of recommender systems including unpersonalized, collaborative filtering, and content-based filtering approaches.
2. It then discusses challenges specific to recommending news, such as the large volume of new articles published daily and changing relevance of content over time.
3. Finally, it addresses big data issues related to news recommendation and introduces a system that provides researchers access to large news datasets to test different recommendation approaches.
Tutorial: Context In Recommender SystemsYONG ZHENG
This document provides an overview of a tutorial on context-aware recommender systems. The tutorial will cover traditional recommendation techniques, context-aware recommendation which incorporates additional contextual information such as time and location, and context suggestion. It includes an agenda with topics, background information on recommender systems and evaluation metrics, and descriptions of techniques for context-aware recommendation including context filtering and modeling.
Fast Feature Selection for Learning to Rank - ACM International Conference on...Andrea Gigli
My talk on fast feature selection filter algorithms at the ACM International Conference on the Theory of Information Retrieval (ICTIR 2016) held in Newark, DE, US
On Unified Stream Reasoning - The RDF Stream Processing realmDaniele Dell'Aglio
The presentation of my talk at WU Vienna on 18/2/2016. I discuss the problem of unifying existing solutions to process semantic streams - with a particular focus on the ones that perform continuous query answering over RDF streams
Keynote of HOP-Rec @ RecSys 2018
Presenter: Jheng-Hong Yang
These slides aim to be a complementary material for the short paper: HOP-Rec @ RecSys18. It explains the intuition and some abstract idea behind the descriptions and mathematical symbols by illustrating some plots and figures.
Forecasting time series powerful and simpleIvo Andreev
Time series are a sequence of data points positioned in order of time. Time series forecasting has two main purposes - to understand the mechanisms that lead to rise or fall, and to predict future values. Very often it analyses trends, cyclical events, seasonality and has unique importance in Economics and Business. The quality of predictions can be evaluated only in future due to temporal dependencies on previous data points and there are many model types for approximation. In this session we are going to talk about challenges, ways of improvement and technology stack like ML.NET, ARIMA, Python, Azure ML, Regression and FB Prophet
Incremental View Maintenance for openCypher QueriesGábor Szárnyas
Presented at the Fourth openCypher Implementers Meeting
Numerous graph use cases require continuous evaluation of queries over a constantly changing data set, e.g. fraud detection in financial systems, recommendations, and checking integrity constraints. For relational systems, incremental view maintenance has been researched for three decades, resulting in a wide body of literature. The property graph data model and the openCypher language, however, are recent developments, and therefore lack established techniques to perform efficient view maintenance. In this talk, we give an overview of the view maintenance problem for property graphs, discuss why it is particularly difficult and present an approach that tackles a meaningful subset of the language.
Mba om 14_statistical_qualitycontrolmethodsNiranjana K.R.
This document provides an overview of statistical quality control techniques including:
- Describing categories of statistical quality control and how to measure quality characteristics.
- Explaining sources of variation, process capability, and how to set control limits for control charts.
- Detailing different types of control charts for variables and attributes including x-bar, R, p, and c charts.
- Defining three sigma and six sigma process capability and how they relate to acceptable defect levels.
- Discussing challenges in measuring quality in service organizations and potential metrics that could be monitored.
This document discusses randomized data structures and algorithms. It begins by motivating randomized data structures by noting that some data structures like binary search trees have good average case performance but bad worst case inputs. Randomizing the data structure removes the dependency on inputs and provides expected case performance. The document then discusses treaps and randomized skip lists as examples of randomized data structures that provide efficient expected case performance for operations like insertion, deletion, and search. It also covers randomized number generation, primality testing, and how randomization can transform average case runtimes into expected case runtimes.
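For instance, the randomized primality testing mentioned above is commonly realized as the Miller-Rabin test, where each additional random base shrinks the error probability; this sketch is illustrative, not taken from the summarized document:

```python
import random

def is_probably_prime(n, rounds=20):
    """Miller-Rabin probabilistic primality test.

    Always returns True for primes; returns True for a composite
    with probability at most 4**(-rounds).
    """
    if n < 2:
        return False
    for p in (2, 3, 5, 7):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True
```

This is the standard example of trading certainty for speed: the answer is only probably correct, but the error bound is controlled by the number of rounds rather than by the input.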
Similar to Less is More: Building Selective Anomaly Ensembles with Application to Event Detection in Temporal Graphs
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Natural Language Processing (NLP), RAG and its Applications (fkyes25)
1. In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Enhanced Enterprise Intelligence with your personal AI Data Copilot (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... (Aggregage)
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Global Situational Awareness of A.I. and Where It's Headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
2. Rayana & Akoglu, Less is More: Building Selective Anomaly Ensembles
Event detection example: network intrusion. [Score-vs-time plot: anomaly score (0 to 1) over time ticks 0 to 20, spiking at time tick 7, the time point t of the intrusion.]
3. Event detection example: emerging topic in social media, the Nepal Earth Quake 2015: tweets and retweets with #Nepal, #NepalEarthQuake, #NepalEarthQuakeRelief, … [Score-vs-time plot: anomaly score (0 to 1) over time ticks 0 to 20, spiking on 25th April 2015.]
4. Problem statement: given a sequence of graphs {G1, G2, …, Gt, …, GT}, find time points t′ at which Gt′ changes significantly from Gt′−1, based on similarity/distance scores between consecutive snapshots over time.
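This problem statement can be made concrete with a toy distance score: Jaccard distance between consecutive snapshots' edge sets. The talk's detectors are far more sophisticated, so treat this purely as an illustration of "score how much Gt differs from Gt-1":

```python
def change_scores(graphs):
    """Score how much each snapshot differs from its predecessor.

    graphs: list of edge sets; score = Jaccard distance between
    consecutive snapshots' edge sets (1.0 = completely different).
    """
    scores = [0.0]  # first snapshot has no predecessor
    for prev, curr in zip(graphs, graphs[1:]):
        union = prev | curr
        sim = len(prev & curr) / len(union) if union else 1.0
        scores.append(1.0 - sim)
    return scores

snapshots = [
    {(1, 2), (2, 3)},
    {(1, 2), (2, 3)},          # unchanged
    {(4, 5), (5, 6), (6, 7)},  # event: the graph changes drastically
]
scores = change_scores(snapshots)
```

Time points whose score stands far above the rest would be reported as events.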
5. Numerous algorithms exist for event detection, but there is no "winner" algorithm across datasets.
Idea: an ensemble approach.
- Combine the strengths of accurate detectors.
- Alleviate the weaknesses of inaccurate detectors.
- Improved accuracy, reduced noise.
- More robust performance, better than individual base detectors.
T. G. Dietterich. Ensemble methods in machine learning. Springer, 2000.
J. Ghosh and A. Acharya. Cluster ensembles: Theory and applications. 2013.
6. Idea: ensemble approach. Challenge: building anomaly ensembles is a fully unsupervised task.
- No labels to guide for detector accuracy.
- No objective function inherent to the task.
- Combining all the results may deteriorate the overall ensemble accuracy [Rayana & Akoglu '14], since some detectors may be inaccurate.
We build SELECTive anomaly ensembles that identify (in)accurate detectors in unsupervised fashion.
7. Event Detection [overview diagram].
8. Event detection on Cybernet data (feature: degree). [Pipeline diagram: node feature time series feed the base detectors (Eigen-behaviors, Parametric modeling, SPIRIT, Subspace Method, Moving Average), whose per-tick outputs (Z-score, 1 − normalized sum of p-values, projection, SPE, aggregated p-value) are plotted over time ticks.]
10. Graphs over time are converted to node feature time series (a nodes × features (egonet) × time tensor).
Base detectors:
- Anomalous Subspace (ASED) [Lakhina et al. '04]
- SPIRIT [Papadimitriou et al. '05]
- Eigen-behavior based (EBED) [Akoglu et al. '10]
- Parametric modeling (PTSAD) [Rayana & Akoglu '14]: candidate models Poisson, ZIP, Bernoulli+ZTP, Markov+ZTP; model selection via likelihood ratio test
- Moving average (MAED)
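The slides name moving average (MAED) as a base detector without giving details; one plausible minimal version, which is this sketch's assumption rather than the paper's exact definition, scores each tick by its deviation from a trailing mean:

```python
def maed_scores(series, window=3):
    """Moving-average based event scores for one feature time series.

    Score at tick t = |x_t - mean of the previous `window` values|,
    a simple stand-in for the MAED base detector.
    """
    scores = []
    for t, x in enumerate(series):
        past = series[max(0, t - window):t]
        baseline = sum(past) / len(past) if past else x
        scores.append(abs(x - baseline))
    return scores

scores = maed_scores([10, 10, 10, 10, 50, 10])
```

Here the jump at tick 4 dominates the score list, which is exactly the per-tick signal the consensus stage would later combine across detectors.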
11. Base detector SELECTion over the five detectors (ASED, SPIRIT, EBED, PTSAD, MAED), followed by consensus SELECTion and the final ensemble.
Rank-based consensus: Inverse Rank; Kemeny-Young [Kemeny '59]; Robust Rank Aggregation [Kolde+ '12].
Score-based consensus: Unification [Zimek+ '11] (avg & max); Mixture Model [Gao+ '06] (avg & max).
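Of the listed consensus functions, Inverse Rank is simple enough to sketch: an item ranked r-th by a detector contributes 1/r to its consensus score. This is the common formulation; the talk does not spell out its exact variant, so treat the details as assumptions:

```python
def inverse_rank_consensus(rank_lists):
    """Inverse-rank aggregation: an item ranked r-th in a list gets
    weight 1/r; the consensus score is the sum over all lists."""
    scores = {}
    for ranking in rank_lists:
        for r, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / r
    return sorted(scores, key=scores.get, reverse=True)

# Three detectors rank four time ticks from most to least anomalous.
consensus = inverse_rank_consensus([
    ["t7", "t3", "t1", "t9"],
    ["t7", "t1", "t3", "t9"],
    ["t3", "t7", "t9", "t1"],
])
```

Because early ranks carry most of the weight, items that several detectors place near the top dominate the consensus even if one detector disagrees.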
12. Vertical SELECTion (SELECT-V) exploits correlation among the rank lists.
Horizontal SELECTion (SELECT-H) exploits element-wise order statistics to filter out inaccurate detectors.
14. Build a pseudo ground truth (target): the average of the detector result lists P1 … P5. In this example, P3 is most correlated to the target.
15. Initialize the ensemble with P3; its consensus p is the average of the selected lists.
16. Among the remaining lists, P1 is most correlated to p. If corr(avg(E, P1), target) > corr(p, target), accept P1 into the ensemble; else discard P1.
17. Repeat, updating p after each accepted list, until the candidate list is empty.
18. Final ensemble in the example: P2, P3, P4, P5, with P1 discarded.
19. SELECT-H. Each score list S1 … Sm is converted to a binary list M1 … Mm via mixture modeling (1 = outlier, 0 = inlier); majority voting across M1 … Mm yields a pseudo-outlier set O.
Order statistics then choose the accurate lists: for each pseudo outlier, collect its normalized ranks across the m lists, r = [r(1), …, r(m)] with r(1) ≤ … ≤ r(m). Under the uniform null, the probability that r̂(l) ≤ r(l) is the probability that at least l ranks drawn uniformly from [0, 1] fall in [0, r(l)].
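The uniform-null probability used here is a binomial tail: with m independent Uniform(0, 1) ranks, the chance that at least l of them fall in [0, r] is the sum over k ≥ l of C(m, k) r^k (1 − r)^(m − k). A sketch (the function name is mine):

```python
from math import comb

def at_least_l_small_ranks(m, l, r):
    """P(at least l of m i.i.d. Uniform(0,1) ranks fall in [0, r]).

    This is the binomial tail that scores how surprising the l-th
    smallest normalized rank r = r(l) of a pseudo outlier is under
    the uniform null: a small probability means the lists agree on
    the item more than chance would allow.
    """
    return sum(comb(m, k) * r**k * (1 - r)**(m - k) for k in range(l, m + 1))

p = at_least_l_small_ranks(m=10, l=5, r=0.1)
```

For example, 5 of 10 detectors ranking an item in the top 10% is very unlikely under the null (p well below 1%), so lists contributing to such agreement are judged accurate.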
20. Example with 20 detectors: the last 5 are likely inaccurate.
21. Baseline ensembles:
- Full Ensemble (Full) [Rayana & Akoglu '14]: assemble all the detector/consensus results.
- Diversity-based Ensemble (DivE) [Schubert et al. 2012]: select diverse (less correlated) detector/consensus results to assemble.
22. Data sets (name, duration, #nodes, #edges, snapshot rate):
1. EnronInc: 4 years, ~80K nodes, ~350K edges, 1 day
2. RealityMining: 50 weeks, ~18K nodes, ~33K edges, 1 week
3. TwitterSecurity: 4 months, ~130K nodes, ~441K edges, 1 day
4. TwitterWCup: 1 month, ~54K nodes, ~274K edges, 5 mins
5. NYTNews: 7.5 years, ~320K nodes, ~2980K edges, 1 week
Ground truth is available for datasets 1-4; NYTNews is evaluated qualitatively.
35. Case study on NYTNews: Columbia Disaster and the 9/11 attack. [Entity co-mention graphs at time ticks 89 and 90, featuring New York City, World Trade Center, Washington (DC), Afghanistan, Bin Laden, Osama, Al Qaeda, Manhattan (NY), Bush, George W, White House, and Congress.]
36. Summary: a new anomaly ensemble.
- SELECTive: discards inaccurate detectors, in unsupervised fashion.
- Heterogeneous: different detectors and different consensus functions.
- 2-phase: no bias towards detectors or consensus.
SELECT outperforms Full (no selection) and DivE (diversity-based ensemble), both of which are hurt by inaccurate detectors, on 5 large datasets (4 with ground truth).
37. Event Detection. Contact: srayana@cs.stonybrook.edu, http://www.cs.stonybrook.edu/~datalab/
Editor's Notes
My work focuses on discovering patterns and detecting anomalies in real-world data, using graph analytics techniques, and developing effective and efficient tools to do so.