Real-time topic detection on Twitter streams is challenging due to the massive volume and speed of tweets. The authors propose a streaming non-negative matrix factorization (NMF) algorithm that detects topics in real time from Twitter data. The algorithm uses stochastic gradient descent to update topic factors incrementally as tweets arrive, and it uses statistical testing to automatically detect and filter "hijacked" topics that do not follow the expected word frequency distribution. Experiments on a large Twitter dataset show that the streaming NMF can process tweets at the full speed of the Japanese Twitter firehose while detecting meaningful topics more accurately than alternative methods.
Online learning, Vowpal Wabbit and Hadoop (Héloïse Nonne)
Online learning has recently attracted a lot of attention, following some competitions, and especially after Criteo released an 11GB training set for a Kaggle contest.
Online learning makes it possible to process massive data: the learner processes data sequentially, using a low amount of memory and limited CPU resources. It is also particularly suited to handling time-evolving data.
Vowpal Wabbit has become quite popular: it is a handy, light, and efficient command-line tool that allows online learning on gigabytes of data, even on a standard laptop with standard memory. After a reminder of the principles of online learning, we present how to run Vowpal Wabbit on Hadoop in a distributed fashion.
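To make the idea concrete, here is a minimal Python sketch of online learning with SGD on logistic loss: one pass over the data, one update per example, constant memory. This illustrates the principle only; it is not Vowpal Wabbit's actual update rule (VW adds feature hashing, adaptive learning rates, and much more).

```python
import math

def online_logistic(stream, dim, eta=0.1):
    """stream yields (x, y) where x is a sparse list of (index, value)
    pairs and y is 0 or 1; dim is the feature-vector size."""
    w = [0.0] * dim                       # fixed-size model: O(dim) memory
    for x, y in stream:
        z = sum(w[i] * v for i, v in x)
        z = max(-35.0, min(35.0, z))      # guard against overflow in exp
        p = 1.0 / (1.0 + math.exp(-z))    # predict before updating
        g = p - y                         # gradient of log loss w.r.t. z
        for i, v in x:
            w[i] -= eta * g * v           # update, then discard the example
    return w
```

Each example is seen once and thrown away, which is what lets an online learner handle datasets far larger than memory.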
Data Science at the Command Line
Don't forget this good old tool, which is handy, interactive, and efficient, before rushing to Hadoop, Spark, Storm, etc.
You can do streaming, join CSV files, and even do machine learning at the command line... and have a lot of fun.
Date: March 4, 2016
Venue: Trondheim, Norway. Doctoral Seminar at NTNU
Please cite, link to or credit this presentation when using it or part of it in your work.
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle (MLconf)
Andrew recently joined Lucidworks to head up their Advisory practice, and is a Committer and PMC member on the Apache Mahout project.
Abstract summary
Apache Mahout: Distributed Matrix Math for Machine Learning:
Machine learning and statistics tools like R and Scikit-learn are declarative, flexible, and extensible, but they scale poorly. “Big Data” tools such as Apache Spark, Apache Flink, and H2O distribute well, but have rudimentary functionality for machine learning and are not easily extensible. In this talk we present Apache Mahout, which provides a Scala-based, R-like DSL for doing linear algebra on distributed systems, letting practitioners quickly implement algorithms on distributed matrices. We will highlight new features in version 0.13 including the hybrid CPU/GPU-optimized engine, and a new framework for user-contributed methods and algorithms similar to R’s CRAN.
We will cover some history of Mahout, introduce the R-Like Scala DSL, provide an overview of how Mahout is able to operate on matrices distributed across multiple computers, and how it takes advantage of GPUs on each computer in a cluster creating a hybrid distributed/GPU-accelerated environment; then demonstrate the kinds of normally complex or unfeasible problems users can easily solve with Mahout; show an integration which allows Mahout to leverage the visualization packages of projects such as R, Python, and D3; and lastly explain how to develop algorithms and submit them to the Mahout project for other users to use.
Online learning with Structured Streaming, Spark Summit Brussels 2016 (Ram Sriharsha)
Structured Streaming is a new API in Spark 2.0 that simplifies the end-to-end development of continuous applications. One such continuous application is online model updates: online models are incrementally updated with new data and can be continuously queried while being updated. As a result, they can be fast to train and can leverage new data faster than offline algorithms. In this talk, we give a brief introduction to the area of online learning and describe how online model updates can be built using Structured Streaming APIs. The end result is a robust pipeline for updating models that is scalable, fast, and fault tolerant.
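As a rough illustration of the pattern (our sketch, not the talk's code), a PySpark `foreachBatch` sink can apply an incremental update per micro-batch. Note that `foreachBatch` arrived in later Spark releases than the 2.0 API the talk covers, and the running-mean "model" below merely stands in for a real online learner.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("online-updates").getOrCreate()
events = spark.readStream.format("rate").load()  # toy source: timestamp, value

state = {"count": 0, "mean": 0.0}  # running "model parameters" on the driver

def update_model(batch_df, epoch_id):
    # Incremental mean update; a real pipeline would call an online
    # learner's partial-fit step here instead.
    for row in batch_df.select("value").collect():
        state["count"] += 1
        state["mean"] += (row["value"] - state["mean"]) / state["count"]

query = events.writeStream.foreachBatch(update_model).start()
# query.awaitTermination()  # commented out so the sketch is non-blocking
```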
How to access & analyse Twitter big data. Full working example using Storm and RedStorm in Ruby & JRuby. Code on github https://github.com/colinsurprenant/tweitgeist and live demo http://tweitgeist.needium.com/
This presentation gives an overview of the recent MIT paper on the Sparse Fourier Transform, with which data can be processed 10 to 100 times faster than with the traditional Fast Fourier Transform. This is possible because the FFT has a complexity of O(n log n), whereas the sparse FT has a potentially lower complexity of O(k log n) when the spectrum is sparse.
Tutorial: The Role of Event-Time Analysis Order in Data Streaming (Vincenzo Gulisano)
Slides for our tutorial, titled "The Role of Event-Time Analysis Order in Data Streaming", presented at the 14th ACM International Conference on Distributed and Event-Based Systems (DEBS). We have recorded the tutorial, and you can find the videos at the following links:
Part 1: https://youtu.be/SW_WS6ULsdY
Part 2: https://youtu.be/bq3ECNvPwOU
You can find these slides, as well as the code examples, at https://github.com/vincenzo-gulisano/debs2020_tutorial_event_time and on SlideShare.
The data streaming processing paradigm and its use in modern fog architectures (Vincenzo Gulisano)
Invited lecture at the University of Trieste.
The lecture (briefly) covers the data streaming processing paradigm, research challenges related to distributed, parallel, and deterministic streaming analysis, and the research of the DCS (Distributed Computing and Systems) group at Chalmers University of Technology.
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16 (MLconf)
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, TensorFlow, and Theano, the code you write has more to do with getting these systems to work at all than with your algorithm itself. But it doesn't have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
ODSC 2019: Sessionisation via stochastic periods for root event identification (Kuldeep Jiwani)
In today's world, the majority of information is generated by self-sustaining systems: bots, crawlers, servers, various online services, and so on. This information flows along the axis of time and is generated by these actors under some complex logic: for example, a stream of buy/sell order requests from an order gateway in the financial world, a stream of web requests from a monitoring or crawling service on the web, or a hacker's bot sitting on the internet and attacking various computers. Although we may not be able to know the motive or intention behind these data sources, via unsupervised techniques we can try to infer patterns or correlate events based on their repeated occurrences along the axis of time. Associating a chain of events in time order helps in root event analysis. In certain cases, time-ordered correlation and root event identification are enough to automatically identify the signatures of various malicious actors and take corrective actions to stop cyber attacks, malicious social campaigns, etc.
Sessionisation is one such unsupervised technique: it tries to find the signal in a stream of timestamped events. In an ideal world this would reduce to finding the periods of a mixture of sinusoidal waves. In the real world it is a much more complex activity, as even the systematic events generated by machines over the internet behave erratically, so the notion of a signal's period also changes: we can no longer associate it with a single number; it has to be treated as a random variable, with an expected value and associated variance. Hence we need to model "stochastic periods" and learn their probability distributions in an unsupervised manner.
The main focus of this talk is to showcase applied data science techniques for discovering stochastic periods. There are many ways to obtain periods from data, so the journey begins with a walkthrough of existing techniques like the FFT (Fast Fourier Transform), then moves on to Gaussian Mixture Models. After highlighting the shortcomings of these techniques, we succinctly explain one of the most general non-parametric Bayesian approaches to the problem. Without going too deep into the complex math, we then return to applied data science and discuss a much simpler technique that can solve the same problem when certain assumptions are satisfied.
We will also demonstrate some time-based patterns we discovered while working on a security analytics use case that uses sessionisation, based on an open-source malware attack dataset that is publicly available.
Key concepts explained in talk: Sessionisation, Bayesian techniques of Machine Learning, Gaussian Mixture Models, Kernel density estimation, FFT, stochastic periods, probabilistic modelling, Bayesian non-parametric methods
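As a taste of the starting point of that journey, here is a minimal sketch (ours, not the speaker's) of FFT-based period detection on an event stream: bin timestamps into a count series and take the dominant spectral peak. The bin width is an assumption, and genuinely stochastic periods need the fuller machinery described above.

```python
import numpy as np

def dominant_period(timestamps, bin_seconds=60.0):
    """Estimate the strongest period (in seconds) of an event stream."""
    t = np.asarray(timestamps, dtype=float)
    bins = np.arange(t.min(), t.max() + bin_seconds, bin_seconds)
    counts, _ = np.histogram(t, bins=bins)
    counts = counts - counts.mean()          # remove the DC component
    spectrum = np.abs(np.fft.rfft(counts))
    freqs = np.fft.rfftfreq(len(counts), d=bin_seconds)
    k = spectrum[1:].argmax() + 1            # skip the zero frequency
    return 1.0 / freqs[k]
```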
Now in touch: the linear control system functions in MATLAB.
CONTENTS:
INTRODUCTION OF MATLAB
CONTROL SYSTEM TOOLBOX
TRANSFER FUNCTION
Poles & Zeroes
Multiplication Of Transfer Functions
Closed-loop Transfer Function
TIME RESPONSE OF A CONTROL SYSTEM
Impulse
Step
Ramp
STATE SPACE REPRESENTATION
State space to transfer function
Transfer function to state space
Crash course on data streaming, with examples using Apache Flink (Vincenzo Gulisano)
These are the slides I used for a four-hour crash course on data streaming. They contain both theory/research aspects and examples based on Apache Flink (DataStream API).
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME (Hongjoo Lee)
A 45-minute talk about collecting home network performance measures, analyzing and forecasting time series data, and building an anomaly detection system.
In this talk, we go through the whole process of data mining and knowledge discovery. First we write a script that runs a speed test periodically and logs the metric. Then we parse the log data, convert it into a time series, and visualize the data over a certain period.
Next we conduct some data analysis: finding trends, forecasting, and detecting anomalous data. Several statistical and deep learning techniques are used for the analysis: ARIMA (Autoregressive Integrated Moving Average) and LSTM (Long Short-Term Memory).
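For the ARIMA part, a minimal statsmodels sketch (ours, not the talk's code) might look like this; the (1, 1, 1) order and the forecast horizon are illustrative assumptions.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def forecast_speed(series: pd.Series, steps: int = 24):
    """series: periodic speed-test measurements indexed by timestamp.
    Returns the point forecast and its confidence intervals."""
    model = ARIMA(series, order=(1, 1, 1)).fit()
    forecast = model.get_forecast(steps=steps)
    return forecast.predicted_mean, forecast.conf_int()
```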
Event description:
Why Tangent Works chooses Julia: The Two Language Problem
TIM: Automatic Model Building for Energy Industry
Julia and its major differences to other technical computing languages (R, Matlab, ...)
- Why is vectorized code fast?
- Why is it not as fast as it could be?
Speaker:
Ján Dolinský, Tangent Works (www.tangent.works)
Language of the event: Julia, Slovak & English
------------------------------------
PyData Bratislava [Python Data Enthusiasts and Users, Data Scientists & Statisticians of all levels from Slovakia]
------------------------------------
--
This meetup group is for Data Scientists, Statisticians, Economists and Data Enthusiasts using Python for data analysis and data visualization. The goals are to provide Python enthusiasts a place to share ideas and learn from each other about how best to apply the language and tools to ever-evolving challenges in the vast realm of data management, processing, analytics, and visualization.
--
PyData is a group for users and developers of data analysis tools to share ideas and learn from each other. We gather to discuss how best to apply Python tools, as well as those using R and Julia, to meet the evolving challenges in data management, processing, analytics, and visualization. PyData groups, events, and conferences aim to provide a venue for users across all the various domains of data analysis to share their experiences and their techniques. PyData is organized by NumFOCUS.org, a 501(c)3 non-profit in the United States.
------------------------------------
Our Facebook group here: https://www.facebook.com/groups/1813599648877946/
Our Twitter account here: https://twitter.com/PyDataBA
Our LinkedIn group here: https://www.linkedin.com/groups/13506080
All materials from previous meetups on GitHub here: https://github.com/GapData/PyDataBratislava
Recordings of previous meetups on our YouTube here: https://www.youtube.com/watch?v=XYpKpmapqjI&list=PLISV6olKXnd9pE-KPtPgwwLe6qPXvb9K7
------------------------------------
Organizers:
GapData Institute (https://www.gapdata.org/) (GDI) is a nonprofit, nonpartisan research institution harnessing the power of data and the wisdom of economics for the public good.
|| Data. Think. Change. ||
--
NumFOCUS (http://www.numfocus.org/) is a 501(c)(3) nonprofit that supports and promotes world-class, innovative, open source scientific computing. The mission of NumFOCUS is to promote sustainable high-level programming languages, open code development, and reproducible scientific research. We accomplish this mission through our educational programs and events as well as through fiscal sponsorship of open source data science projects. We aim to increase collaboration and communication within the scientific computing community.
Simple representations for learning: factorizations and similarities Gael Varoquaux
Real-life data seldom comes in the ideal form for statistical learning. This talk focuses on high-dimensional problems for signals and discrete entities: when dealing with many correlated signals or entities, it is useful to extract representations that capture these correlations.
Matrix factorization models provide simple but powerful representations. They are used for recommender systems across discrete entities such as users and products, or to learn good dictionaries to represent images. However, they entail large computing costs on very high-dimensional data: databases with many products or high-resolution images. I will present an algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing. I will show how it interpolates between one-hot encoding and techniques used in character-level natural language processing.
[1] A. Mensch, J. Mairal, B. Thirion, G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing 66(1): 113-128.
[2] P. Cerda, G. Varoquaux, B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning (2018): 1-18.
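A minimal sketch of the similarity-encoding idea from [2], using n-gram Jaccard similarity against a set of prototype strings; the choice of prototypes and of this particular similarity is illustrative, not the paper's exact estimator.

```python
import numpy as np

def ngrams(s, n=3):
    s = " " + s.lower() + " "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity_encode(values, prototypes, n=3):
    """Encode each dirty string as its similarity to each prototype."""
    protos = [ngrams(p, n) for p in prototypes]
    out = np.zeros((len(values), len(protos)))
    for i, v in enumerate(values):
        g = ngrams(v, n)
        for j, p in enumerate(protos):
            out[i, j] = len(g & p) / max(len(g | p), 1)  # Jaccard similarity
    return out
```

With prototypes equal to the clean category names, exact matches give rows close to one-hot vectors, while misspelled or variant entries land nearby, which is the interpolation between one-hot and character-level encodings described above.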
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15 (MLconf)
Is Machine Learning Code for 100 Rows or a Billion the Same?: We have built an automatically distributed, implicitly parallel data science platform for running large-scale machine learning applications. By abstracting away the computer science required to scale machine learning models, the Ufora platform lets data scientists focus on building data science models in simple scripting code, without having to worry about building large-scale distributed systems, their race conditions, fault-tolerance, etc. This automatic approach requires solving some interesting challenges, like optimal data layout for different ML models. For example, when a data scientist says "do a linear regression on this 100GB dataset", Ufora needs to figure out how to automatically distribute and lay out that data across a cluster of machines in order to minimize travel over the wire. Running a GBM against the same dataset might require a completely different layout of that data. This talk will cover how the platform works, in terms of data and thread distribution, how it generates parallel processes out of single-threaded programs, and more.
With the rapid growth in users on social networks, there is a corresponding increase in user-generated content, in turn resulting in information overload. On Twitter, for example, users tend to receive uninteresting information because their interests do not fully overlap with those of the people whom they follow. In this paper we present a Semantic Web approach to filter public tweets matching interests from personalized user profiles. Our approach includes automatic generation of multi-domain and personalized user profiles, filtering of the Twitter stream based on the generated profiles, and real-time delivery. Given that users' interests and personalization needs change with time, we also discuss how our application can adapt to these changes.
RFID in Robot-Assisted Indoor Navigation for the Visually Impaired (Vladimir Kulyukin)
We describe how Radio Frequency Identification (RFID) can be used in robot-assisted indoor navigation for the visually impaired. We present a robotic guide for the visually impaired that was deployed and tested, both with and without visually impaired participants, in two indoor environments. We describe how we modified the standard potential fields algorithms to achieve navigation at moderate walking speeds and to avoid oscillation in narrow spaces. The experiments illustrate that passive RFID tags deployed in the environment can act as reliable stimuli that trigger local navigation behaviors to achieve global navigation objectives.
ShopTalk: Independent Blind Shopping Through Verbal Route Directions and Barc... (Vladimir Kulyukin)
Independent shopping in modern grocery stores that carry thousands of products is a great challenge for people with visual impairments. ShopTalk is a proof-of-concept wearable system designed to assist visually impaired shoppers with finding shelved products in grocery stores. Using synthetic verbal route directions and descriptions of the store layout, ShopTalk leverages the everyday orientation and mobility skills of independent visually impaired travelers to direct them to aisles with target products. Inside aisles, an off-the-shelf barcode scanner is used in conjunction with a software data structure, called a barcode connectivity matrix, to locate target products on shelves. Two experiments were performed at a real-world supermarket. A successful earlier single-subject experiment is summarized and a new experiment involving ten visually impaired participants is presented. In both experiments, ShopTalk was successfully used to guide visually impaired shoppers to multiple products located in aisles on shelves. ShopTalk is a feasible system for guiding visually impaired shoppers who are skilled, independent travelers. Its design does not require any hardware instrumentation of the store and leads to low installation and maintenance costs.
A Software Tool for Rapid Acquisition of Streetwise Geo-Referenced Maps (Vladimir Kulyukin)
RapGeoRef, a software tool for rapid acquisition of streetwise geo-referenced maps, is presented. An evaluation study of RapGeoRef was performed on the USU campus with four students. The participants were told about the purpose of the tool and shown a demo. They were then asked to construct a geo-referenced database for an area of the USU campus. Upon completion, they were given the NASA TLX questionnaire to assess the subjective workload. All participants were able to complete the task in one day. The analysis of the NASA TLX questionnaire revealed that the temporal demand was much more prominent to the participants than either mental or physical effort.
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix... (Two Sigma)
The authors present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions.
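For flavor, here is a minimal sketch in the spirit of TRIÈST's improved insertion-only variant: a fixed-size edge reservoir plus a rescaled triangle counter. It is simplified from the paper (no deletions, no per-vertex local counts) and should not be taken as the authors' implementation.

```python
import random
from collections import defaultdict

def triest_sketch(edge_stream, M=10000, seed=0):
    """One-pass estimate of the global triangle count of an edge stream,
    using a reservoir of at most M edges."""
    rng = random.Random(seed)
    sample, neighbors = set(), defaultdict(set)
    tau, t = 0.0, 0
    for u, v in edge_stream:
        u, v = min(u, v), max(u, v)
        t += 1
        # triangles this edge closes among currently sampled edges,
        # rescaled by the probability both wedge edges were sampled
        scale = max(1.0, (t - 1) * (t - 2) / (M * (M - 1)))
        tau += scale * len(neighbors[u] & neighbors[v])
        if t <= M:
            keep = True
        else:
            keep = rng.random() < M / t
            if keep:                      # evict a uniform random edge
                x, y = rng.choice(tuple(sample))
                sample.discard((x, y))
                neighbors[x].discard(y)
                neighbors[y].discard(x)
        if keep:
            sample.add((u, v))
            neighbors[u].add(v)
            neighbors[v].add(u)
    return tau
```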
Signals and Systems is an introduction to analog and digital signal processing, a topic that forms an integral part of engineering systems in many diverse areas, including seismic data processing, communications, speech processing, image processing, defense electronics, consumer electronics, and consumer products.
Digital Signal Processing [ECEG-3171], Ch1_L03 (Rediet Moges)
This Digital Signal Processing lecture material is the property of the author (Rediet M.). It is not for publication, nor is it to be sold or reproduced.
#Africa #Ethiopia
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Notes on primitives for graph algorithms, like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation. Experiments covered:
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
1. Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Kohei Hayashi (National Institute of Informatics)
August 24, 2016
Joint work with:
• Takanori Maehara (Shizuoka Univ)
• Masashi Toyoda (Univ Tokyo)
• Ken-ichi Kawarabayashi (National Institute of Informatics)
2. Twitter: The Rapid Stream of Texts
An SNS with short messages (tweets)
Volume: 41M users
Diversity: covers any topic: news, politics, TV, ...
Speed: 270K tweets/min
A promising data source for topic detection
• May discover breaking news and events even faster than news media
4. Noise Filtering
Many spam tweets are generated by non-humans
• e.g., "tweet buttons"
These exaggerate co-occurrence and "hijack" important topics
5. Contributions
A streaming topic detection algorithm based on non-negative matrix factorization (NMF):
1. Highly scalable: able to deal with a 20M×1M sparse matrix/sec
2. Automatic topic hijacking detection & elimination
Technical points:
1. Reformulate NMF in a stochastic manner
• Stochastic gradient descent (SGD) updates in O(NNZ) time
2. Use of statistical testing
• Assume normal topics follow a power law
7. Topic Detection by NMF
[Diagram: tweets are aggregated into a user × word co-occurrence frequency table X, which NMF factorizes into topics: user loadings U and word loadings V.]
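As a concrete illustration of this step, here is a minimal sketch (not the authors' code) of building the user-word frequency table as a sparse matrix; the index dictionaries `user_index` and `word_index` are illustrative assumptions.

```python
from collections import defaultdict
from scipy.sparse import csr_matrix

def build_cooccurrence(tweets, user_index, word_index):
    """tweets: iterable of (user, [words]); returns the I x J count matrix X."""
    counts = defaultdict(int)
    for user, words in tweets:
        if user not in user_index:
            continue
        for w in words:
            if w in word_index:
                counts[(user_index[user], word_index[w])] += 1
    rows, cols, vals = zip(*((i, j, c) for (i, j), c in counts.items()))
    return csr_matrix((vals, (rows, cols)),
                      shape=(len(user_index), len(word_index)))
```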
8. Problem

$$\min_{U \in \mathbb{R}_+^{I \times R},\; V \in \mathbb{R}_+^{J \times R}} f_\lambda(X; U, V), \qquad f_\lambda(X; U, V) = \frac{1}{2}\,\lVert X - UV^\top \rVert_F^2 + \frac{\lambda}{2}\left(\lVert U \rVert_F^2 + \lVert V \rVert_F^2\right)$$

Batch algorithm: repeat until convergence:
1. $U \leftarrow [\,U - \eta\, \nabla_U f_\lambda(X; U, V)\,]_+$
2. $V \leftarrow [\,V - \eta\, \nabla_V f_\lambda(X; U, V)\,]_+$
Guaranteed to converge to stationary points.
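A minimal numpy sketch of this batch projected-gradient scheme, assuming a dense X for clarity (the paper's setting is large and sparse); the step size and iteration count are illustrative.

```python
import numpy as np

def batch_nmf(X, R, lam=0.1, eta=1e-3, iters=500, seed=0):
    """Projected gradient descent on f_lambda; returns nonnegative U, V."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    U, V = rng.random((I, R)), rng.random((J, R))
    for _ in range(iters):
        # gradient of f_lambda w.r.t. U is -(X - U V^T) V + lam U;
        # [.]_+ is the projection onto the nonnegative orthant
        U = np.maximum(U - eta * (-(X - U @ V.T) @ V + lam * U), 0.0)
        V = np.maximum(V - eta * (-(X - U @ V.T).T @ U + lam * V), 0.0)
    return U, V
```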
9. In Twitter, we observe $X^{(t)}$ at each time $t$
• Consider keeping track of $\bar{X}^{(t)} = \frac{1}{t}\sum_{s=1}^{t} X^{(s)}$
Constructing $\bar{X}^{(t)}$ and solving $f_\lambda(\bar{X}^{(t)}; U, V)$: time consuming
Key idea: decompose $f_\lambda$ over $t$:
$$\lVert \bar{X}^{(t)} - UV^\top \rVert_F^2 = \frac{1}{t}\sum_{s=1}^{t} \lVert X^{(s)} - UV^\top \rVert_F^2 + \text{const}$$
Stochastic formulation: if $X^{(t)}$ is an i.i.d. random variable,
$$\lVert \mathbb{E}[X^{(t)}] - UV^\top \rVert_F^2 = \mathbb{E}\,\lVert X^{(t)} - UV^\top \rVert_F^2 - \mathrm{var}[X^{(t)}]$$
so the variance term is constant in $U, V$ and now we can use SGD.
10. Streaming Algorithm
For $t = 1, \ldots, T$:
1. $A_U \leftarrow \nabla_U \nabla_U f_\lambda$, $A_V \leftarrow \nabla_V \nabla_V f_\lambda$ (metrics)
2. $U^{(t)} \leftarrow [\,U^{(t-1)} - \eta_t\, \nabla_U f_\lambda(X^{(t)}; U, V^{(t-1)})\, A_U^{-1}\,]_+$
3. $V^{(t)} \leftarrow [\,V^{(t-1)} - \eta_t\, \nabla_V f_\lambda(X^{(t)}; U^{(t)}, V)\, A_V^{-1}\,]_+$
Guaranteed to converge to the stationary points of $f_\lambda(\bar{X}^{(t)}; U, V)$ for suitable $\{\eta_t\}$ and i.i.d. $\{X^{(t)}\}$
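A minimal sketch of one such update. Reading the metrics as the second derivatives of $f_\lambda$, i.e. $A_U = V^\top V + \lambda I$ and $A_V = U^\top U + \lambda I$, is our interpretation of the slide; see the released code (github.com/hayasick/streamingNMF) for the authors' exact choice.

```python
import numpy as np

def streaming_step(Xt, U, V, eta, lam=0.1):
    """One metric-scaled SGD step on the mini-batch matrix Xt (dense here
    for clarity; the real system works on sparse co-occurrence counts)."""
    R = U.shape[1]
    A_U = V.T @ V + lam * np.eye(R)                 # assumed metric for U
    grad_U = -(Xt - U @ V.T) @ V + lam * U
    U = np.maximum(U - eta * grad_U @ np.linalg.inv(A_U), 0.0)
    A_V = U.T @ U + lam * np.eye(R)                 # assumed metric for V
    grad_V = -(Xt - U @ V.T).T @ U + lam * V
    V = np.maximum(V - eta * grad_V @ np.linalg.inv(A_V), 0.0)
    return U, V
```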
11. Comparing with the Batch NMF ...
Much faster:
• Updates U and V only once!
• Update cost depends on $\mathrm{NNZ}(X^{(t)}) \ll \mathrm{NNZ}(\bar{X}^{(t)})$
Memory efficient:
• Able to discard $X^{(1)}, \ldots, X^{(t-1)}$
Smoothing effect:
$$U^{(t)} = [\,(1-\eta_t)\,U^{(t-1)} + \eta_t\, X^{(t)} V^{(t-1)} A_U^{-1}\,]_+ = [\,(1-\eta_t)\,U^{(t-1)} + \eta_t\, \mathrm{argmin}_U f_\lambda(X^{(t)}; U, V^{(t-1)})\,]_+$$
• The $\eta_t$-weighted average between the previous solution and the one-step NMF solution of $X^{(t)}$
• Mitigates the sparsity of $X^{(t)}$
13. Problem Setting
Goal: find hijacked topics
Idea: check the distributions of all topics
Hypothesis: the word distribution of a hijacked topic should differ from the word distribution of a normal topic
• NMF estimates topic-specific word distributions as V:
$$X_{ij} \propto p(\mathrm{user}_i, \mathrm{word}_j) \propto \sum_r p(\mathrm{topic}_r)\, p(\mathrm{user}_i \mid \mathrm{topic}_r)\, p(\mathrm{word}_j \mid \mathrm{topic}_r) \propto \sum_r u_{ir} v_{jr}$$
14. Defining Normal/Hijacked Topics
Normal topics:
• Many users are involved
⇒ mixing many different vocabularies
⇒ heavy-tailed (Zipf's law)
Definition (normal topic): topic $r$ is normal if
$$p(\mathrm{rank}(\mathrm{word}) \mid \mathrm{topic}_r) = \mathrm{power}(\alpha)$$
15. Defining Normal/Hijacked Topics (Cont'd)
Hijacked topics:
• The same word set is used repeatedly
⇒ uniform probabilities, (almost) no tail
Definition (hijacked topic): topic $r$ is hijacked if
$$p(\mathrm{rank}(\mathrm{word}) \mid \mathrm{topic}_r) = \mathrm{step}(\mathrm{rank}_{\min})$$
16. Log-likelihood Ratio Test
$$L(\mathrm{rank}_{\min}) = \sum_j \log \frac{\mathrm{step}(\mathrm{rank}_j \mid \mathrm{rank}_{\min})}{\mathrm{power}(\mathrm{rank}_j \mid \hat{\alpha})}$$
Theorem (asymptotic normality [Vuong '89]): let $N$ be the number of observed words. Then $L(\mathrm{rank}_{\min})/\sqrt{N}$ converges in distribution to $\mathcal{N}(0, \sigma^2)$, where
$$\sigma^2 = \frac{1}{N}\sum_{j=1}^{N} \left( \log \frac{\mathrm{step}(\mathrm{rank}_j \mid \mathrm{rank}_{\min})}{\mathrm{power}(\mathrm{rank}_j \mid \hat{\alpha})} \right)^2 - \left( \frac{1}{N} L(\mathrm{rank}_{\min}) \right)^2.$$
For $r = 1, \ldots, R$:
• Estimate $\hat{\alpha} = \mathrm{argmax}_\alpha \log \mathrm{power}(\mathrm{rank}(u_k) \mid \alpha)$
• For $\mathrm{rank}_{\min} = 1, \ldots, 140$: compute $L(\mathrm{rank}_{\min})$
• Topic $r$ is hijacked if the p-value < 0.05
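A minimal sketch of this test, under our reading of the slides: power($\alpha$) is a Zipf-type distribution over word ranks and step($\mathrm{rank}_{\min}$) is uniform over the top $\mathrm{rank}_{\min}$ ranks with a tiny floor beyond. The exact normalizations used in the paper are assumptions here.

```python
import numpy as np
from scipy import stats

def hijack_pvalue(ranks, rank_min, alpha_hat, n_ranks):
    """One-sided p-value: small means the step (hijacked) model fits
    the observed word ranks significantly better than the power law."""
    ranks = np.asarray(ranks, dtype=float)
    # power(rank | alpha): p(r) proportional to r^(-alpha) over n_ranks
    z = (np.arange(1, n_ranks + 1) ** -alpha_hat).sum()
    log_power = -alpha_hat * np.log(ranks) - np.log(z)
    # step(rank | rank_min): uniform over ranks 1..rank_min, floor beyond
    log_step = np.where(ranks <= rank_min, -np.log(rank_min), -1e9)
    terms = log_step - log_power
    N = len(ranks)
    L = terms.sum()
    sigma2 = np.mean(terms ** 2) - (L / N) ** 2
    stat = L / np.sqrt(N * max(sigma2, 1e-12))  # asymptotically N(0, 1)
    return 1.0 - stats.norm.cdf(stat)
```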
18. Data
Japanese Twitter stream:
• April 15-16, 2013
• 417K users
• 1.98M words
• 15.3M tweets
• 69.4M co-occurrences
• Generated one $X^{(t)}$ per 10K co-occurrences
19. Runtime

NMF                     1%       10%      100%
Batch                   27.0m    1.9h     16.4h
Online [Cao+ 07]        5.6h     9.2h     17.8h
Dynamic [Saha+ 12]      16.7h    24h      24h
Streaming [this work]   4.0m     21.7m    3.6h
w/ Filter [this work]   5.3m     24.1m    3.8h

Streaming NMF: 5-250× faster!
• 67K tweets/min
• The real-time speed of all Japanese tweets! (the 2nd most popular language on Twitter)
Filtering cost is negligible.
20. Perplexity
• Similarity between a topic distribution and a target distribution
• Used Y! headlines¹ for the target term distribution
• Lower is better

NMF                     1%        10%       100%
Batch                   9.01E+09  6.32E+06  4.51E+04
Online [Cao+ 07]        1.06E+04  3.25E+04  1.83E+04
Dynamic [Saha+ 12]      6.53E+04  N/A       N/A
Streaming [this work]   5.65E+07  3.25E+04  8.71E+03
w/ Filter [this work]   5.25E+09  2.40E+04  7.90E+03

Streaming NMF: best on 10% and 100% data
Topic hijacking filtering improves perplexity
¹ http://news.yahoo.co.jp/list
21. Summary
Proposed a streaming algorithm for Twitter topic detection:
• Works in real time (would handle all Japanese tweets, in theory)
• Filters spam topics automatically
Code: github.com/hayasick/streamingNMF
Thank you!
22. Integrated Twitter Topic Detection System
For $t = 1, \ldots, T$:
1. Generate $X^{(t)}$ from tweets $\notin$ Blacklist: $O(N_t)$
2. Update $U, V$ by SGD: $O(N_t R^2 + R^3)$
3. At some interval, detect topic hijacking and update the Blacklist: $O(J)$
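Tying the pieces together, here is a minimal sketch of this loop using the illustrative functions from the earlier sketches (`streaming_step`, `hijack_pvalue`); the fixed $\hat{\alpha}$ and the blacklisting policy are our assumptions, not the paper's exact system.

```python
import numpy as np

def run(stream, U, V, eta0=0.5, check_every=100, p_threshold=0.05):
    """stream yields sparse user-word count matrices X^(t) (scipy CSR)."""
    hijacked = set()
    for t, Xt in enumerate(stream, start=1):
        U, V = streaming_step(Xt.toarray(), U, V, eta=eta0 / np.sqrt(t))
        if t % check_every == 0:          # step 3, run at some interval
            for r in range(V.shape[1]):
                if r in hijacked:
                    continue
                # rank of each word under topic r (1 = heaviest weight)
                rank_of = np.argsort(-V[:, r]).argsort() + 1
                ranks = rank_of[V[:, r] > 0]
                if ranks.size == 0:
                    continue
                alpha_hat = 1.5  # placeholder; slide 16 fits alpha by ML
                if min(hijack_pvalue(ranks, m, alpha_hat, V.shape[0])
                       for m in range(1, 141)) < p_threshold:
                    hijacked.add(r)  # sources of topic r would be blacklisted
    return U, V, hijacked
```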