This document discusses examples of working with streaming data. It provides five examples:
1) Using news articles and stock data to predict stock market changes based on events.
2) An interactive music conducting system using motion sensors.
3) Analyzing social media streams to identify users' locations.
4) Identifying real-world events from social media streams.
5) Detecting mental health disorders by analyzing patterns in social media users' posts and interactions. The document outlines frameworks and methodologies for each example application of streaming data analysis.
Examples of working with streaming data
1. Examples of Working with Streaming Data
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
2. Hello
陳宜欣 Yi-Shin Chen
Currently
Associate professor at NTHU CS
Director of IDEA Lab
Education
Ph.D. in Computer Science, USC, USA
M.B.A. in Information Management, NCU, TW
B.B.A. in Information Management, NCU, TW
Courses
Introduction to Database Systems
Advanced Database Systems
Data Mining: Concepts, Techniques, and Applications
5. Streaming Data
Continuous flow
E.g., stock volume, sensor data, social streams
Infinite length
Impractical to store and use all historical data
Concept drift
Not wise to use all historical data
6. Continuous Queries
[Diagram labels: Stream DB; Acquisition Process; Raw data & Transformation of Raw Stream; Continuous Query Process; Crowd Wisdom; Rules/Patterns; Continuously Provide Feedback]
Three major approaches for continuous queries
•Fast on-line classification/clustering
•Sliding window
•Range aggregation
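As a rough illustration of the sliding-window and range-aggregation ideas listed above, the sketch below keeps only the most recent readings of a stream and reports their running mean. The class name `SlidingWindowAverage`, the window size, and the sample values are assumptions made for this example; the slides do not specify an implementation.

```python
from collections import deque


class SlidingWindowAverage:
    """Keep only the most recent `size` readings and report their mean."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        self.window = deque(maxlen=size)
        self.total = 0.0

    def push(self, value):
        # If the window is full, the oldest reading is about to be evicted,
        # so remove its contribution from the running total first.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)


# Hypothetical stream of readings (for example, stock volume per tick).
stream = [10, 12, 11, 50, 13]
agg = SlidingWindowAverage(size=3)
averages = [agg.push(v) for v in stream]
```

Because older readings fall out of the window, the aggregate never needs the full history, which matches the point that storing all historical data is impractical for an unbounded stream.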
8. Framework of Off-line Training Module
[Diagram labels: Acquisition Process; Crowd Wisdom; Rules/Patterns]
9. Alignment
Each industry (Finance, Textile, Car, ...) has its own companies and related words.
$belong_n = [P_{finance}, P_{textile}, \ldots, P_{car}]$
Example news text (translated from Chinese): "The Luxgen Neora concept car, which made its debut at the Shanghai Auto Show in April 2011, is the first concept car launched by the domestic brand Luxgen since it was founded……"
$belong_n = [0, 0, \ldots, 3]$
$P_{finance} = 0$, $P_{textile} = 0$, $P_{car} = 3$
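One way to read the alignment step is as counting, for each industry, how many words in a news article match that industry's companies and related words. The sketch below follows that reading. The lexicons in `INDUSTRY_TERMS`, the function `belong_vector`, and the English sample sentence are all hypothetical; the slides do not list the actual word lists or say exactly which matches produce the value 3.

```python
import re

# Hypothetical lexicons; the slides only state that each industry has
# "companies" and "related words".
INDUSTRY_TERMS = {
    "finance": {"bank", "loan"},
    "textile": {"fabric", "yarn"},
    "car": {"luxgen", "neora"},
}


def belong_vector(text, industries=("finance", "textile", "car")):
    """Count lexicon matches per industry, in a fixed industry order."""
    tokens = re.findall(r"\w+", text.lower())
    return [
        sum(1 for token in tokens if token in INDUSTRY_TERMS[name])
        for name in industries
    ]


article = (
    "The Luxgen Neora concept car, which debuted at the Shanghai Auto Show "
    "in April 2011, is the first concept car launched by Luxgen."
)
vector = belong_vector(article)  # [0, 0, 3] with the toy lexicons above
```

The article would then be aligned with the industry holding the largest count, here the car industry.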
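10. Itemset Production
Itemsets per transaction (translated from Chinese):
Japan+earthquake | Japan+disaster relief
Japan+earthquake | Japan+flooding
Japan+earthquake | Japan+impact
Japan+earthquake | Japan+estimate
Japan+earthquake | Japan+destruction
Japan+purchase | Japan+travel
…
The confidence of Japan+earthquake:
The number of transactions in which Japan+earthquake appears: $u_s$
The number of transactions in which Japan appears: $n_p$
$confidence = \frac{u_s}{n_p} = \frac{5}{6}$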
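The confidence ratio above can be checked with a few lines of code. In this sketch each transaction is modelled as a set of words, which is an assumption about how the slide's itemsets were derived; the function name `confidence` and the English word labels are likewise illustrative.

```python
def confidence(transactions, base, pair):
    """Share of transactions containing `base` that also contain every word in `pair`."""
    n_p = sum(1 for t in transactions if base in t)
    u_s = sum(1 for t in transactions if set(pair) <= t)
    return u_s / n_p if n_p else 0.0


# Six toy transactions mirroring the six rows on the slide (translated).
transactions = [
    {"Japan", "earthquake", "disaster relief"},
    {"Japan", "earthquake", "flooding"},
    {"Japan", "earthquake", "impact"},
    {"Japan", "earthquake", "estimate"},
    {"Japan", "earthquake", "destruction"},
    {"Japan", "purchase", "travel"},
]
conf = confidence(transactions, "Japan", ("Japan", "earthquake"))  # 5 / 6
```

"Japan" appears in all six transactions and "Japan+earthquake" in five of them, which gives the 5/6 shown on the slide.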
11. Representative Itemset Selection
Select itemsets with high confidence as candidates for the representative itemset.
$weight = x \cdot tfidf_1 + y \cdot tfidf_2 + z \cdot confidence$
Word values (translated from Chinese):
Japan 0.22 | earthquake 0.25 | estimate 0.03 | nuclear 0.18 | leak 0.2 | crisis 0.10 | occur 0.001
Itemset confidence:
Japan+earthquake 0.833 | Japan+estimate 1 | nuclear+leak 0.667 | crisis+occur 0.667
Concept: Japan, earthquake, nuclear, leak
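The weighting formula can be tried on the numbers from the slide. The sketch below assumes the per-word values are TF-IDF scores, and the coefficients `X`, `Y`, and `Z` are picked purely for illustration, since the slides do not give the values of x, y, and z.

```python
# Per-word values from the slide, assumed here to be TF-IDF scores.
tfidf = {
    "Japan": 0.22, "earthquake": 0.25, "estimate": 0.03, "nuclear": 0.18,
    "leak": 0.20, "crisis": 0.10, "occur": 0.001,
}
# Itemset confidences from the slide.
conf = {
    ("Japan", "earthquake"): 0.833,
    ("Japan", "estimate"): 1.0,
    ("nuclear", "leak"): 0.667,
    ("crisis", "occur"): 0.667,
}

# Hypothetical coefficients; the slides leave x, y, z unspecified.
X, Y, Z = 0.45, 0.45, 0.10


def weight(itemset):
    """weight = x * tfidf_1 + y * tfidf_2 + z * confidence"""
    first, second = itemset
    return X * tfidf[first] + Y * tfidf[second] + Z * conf[itemset]


ranked = sorted(conf, key=weight, reverse=True)
top_two = ranked[:2]
```

With these assumed coefficients the two highest-weighted itemsets are Japan+earthquake and nuclear+leak, which together contain the four words of the concept shown on the slide. Other coefficient choices can rank the itemsets differently.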
12. Concept Verification
By considering:
The daily frequency of concept $C_j$
The concept index $CI_j$ of $C_j$
Regression model based on price within sliding windows
If the p-value rejects $H_0$, the concept $C_j$ is considered an influential event
13. On-line Prediction Module
Regression prediction
Uses the most frequent event.
Adjust regression prediction
Also includes events other than the most frequent one.
Pheromone prediction
Also includes past influence.
[Diagram label: Continuous Query Process]
14. Experimental Data
Stock data
Industry index from TWSE.
2012-01-01 to 2012-05-11
News data
News crawled from websites.
Yahoo!, udn, Libertytimes, PCHome, etc.
13 websites in total.
2012-01-01 to 2012-05-11
More than 150,000 news articles.
All the news is in Traditional Chinese.
15. Experimental Setup
Four methods to predict the market:
Pheromone prediction model
Adjust regression prediction model
Regression prediction model
Blind test
Prediction policy: fall, rise, or NSM (no significant move)
16. Performance
Accuracy of four methods:
Method | Average accuracy
Pheromone | 0.5784574
Adjust regression | 0.5323214
Regression | 0.5134457
Blind test | 0.3045479
17. Performance
Does it work on the whole market?
We were also interested in using events to predict the whole market by aggregating all industries into one.
Type | Accuracy
Pheromone | 0.6315789
Adjust regression | 0.6896511
Regression | 0.5714285
19. Motivation
Diversify human-computer interaction technology with multimedia
Music education
Music experiment
Amateur and professional conductors
Composers
Personal amusement
20. Devices
Build an interactive conducting system using motion
Microsoft Kinect
3D Depth Sensors
22. Conducting Data (Data Streams)
Cartesian coordinates (x, y, z)
30 frames per second at 320x240 resolution
delay ≈ 33 ms (1/30 second)
Human eyes can process 10 to 12 frames per second [2]
delay ≈ 100 ms (1/10 second)
[Figure: sensor coordinate axes, +X/-X, +Y/-Y, and Z along the sensor direction]
23. Framework
[Flowchart, listed by its labels]
Initial System: PlayStatus = False
Conducting Data Received
Is PlayStatus true?
No: Start Gesture Recognition; Is Start true? Yes: PlayStatus = True
Yes: Stop Gesture Recognition; Is Stop true? Yes: PlayStatus = False
Beat Pattern Recognition
Whole Measure
Volume; Identify Instrument Emphasis
Relative height of hand; Tilt; Z-Mapping
Volume Adjustment According to Instrument Emphasis
Tempo Adjustment According to Instrument Emphasis
[Diagram labels: Acquisition Process; Crowd Wisdom; Rules/Patterns; Offline Analysis; Continuous Query Process]
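The start/stop part of the framework can be read as a small state machine around the PlayStatus flag. The sketch below is one plausible reading of the flowchart labels, not the system's actual code: the function `step`, its boolean inputs, and the action names are hypothetical, and real gesture recognition from Kinect data is replaced by precomputed flags.

```python
def step(play_status, start_gesture, stop_gesture):
    """Process one batch of conducting data; return (new_play_status, action)."""
    if not play_status:
        # Not playing yet: only a start gesture changes the state.
        if start_gesture:
            return True, "start"
        return False, "idle"
    # Already playing: a stop gesture ends playback.
    if stop_gesture:
        return False, "stop"
    # Otherwise continue with beat pattern recognition and the
    # volume and tempo adjustments that follow it.
    return True, "recognize_beat_pattern"


# Hypothetical sequence of (start_gesture, stop_gesture) detections.
events = [(False, False), (True, False), (False, False), (False, True), (False, False)]
status = False  # Initial System: PlayStatus = False
trace = []
for start, stop in events:
    status, action = step(status, start, stop)
    trace.append(action)
```

Keeping the flag as the only state makes each incoming frame cheap to process, which matters when frames arrive roughly every 33 ms.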
24. Experiments
Evaluation
Beat pattern and measure recognition
Volume control and instrument emphasis recognition
Response time
Experimental Setup
Participants: 1 professional, 8 with no experience
Practice: 30 minutes
25. Beat Pattern and Measure Recognition Evaluation
[Bar chart: recognition rate (recall and precision, axis 0 to 1) for the Professional and No Experience groups; data labels: 0.7826, 0.8648, 0.8438, 0.8821]
28. Goal
Identify the location of a particular Twitter user at a given time
Using exclusively the content of his/her tweets
29. Major Challenges
Twitter Challenges
Tweets are noisy
Extensive use of non-standard vocabulary
Bots and spammers
Geo-locational Challenges
Users might have several associated locations
Toponyms
Scarce information
False profile information
31. Experimental Setup
Original dataset: 1.53 M Twitter users and 13 M tweets
3,314 Twitter users and 2.2 M tweets
104,054 geo-tagged tweets
Although we collected and processed the data carefully, it still needed to be validated
• Use of local experts
– People familiar with the geography of the country
Stage | Tweets
Original Tweets | 329,814
Subject Identification | 57,153
Location Discovery Tweets | 18,662
Toponyms Removal | 9,093
Timeline Sorting | 6,928
Final Results | 2,165
35. Introduction
Analyzing social streams can benefit
Emergency control
Crowd opinion analysis
Unreported events detection
Motivation: event identification from social streams
37. Methodology – Keyword Selection
Well-noticed criterion
Compared to the past, if a word is suddenly mentioned by many users, it is well-noticed
Time Frame – a unit of time period
Sliding Window – a certain number of past time frames
[Timeline figure: time frames tf0, tf1, tf2, tf3, tf4]
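The well-noticed criterion can be sketched as a comparison between the number of users mentioning a word in the current time frame and its average over the sliding window of past frames. The thresholds `ratio` and `min_users`, the function name `well_noticed`, and the toy data are assumptions for this example; the slides do not state the exact test that was used.

```python
from collections import deque


def well_noticed(frames, window_size=3, ratio=3.0, min_users=5):
    """frames: one dict per time frame, mapping word -> set of user ids.

    Returns, for each frame, the set of words whose user count jumps
    well above their average over the preceding `window_size` frames.
    """
    history = deque(maxlen=window_size)
    results = []
    for frame in frames:
        noticed = set()
        for word, users in frame.items():
            current = len(users)
            past = [len(old.get(word, ())) for old in history]
            baseline = sum(past) / len(past) if past else 0.0
            # Require some history, a minimum audience, and a clear jump.
            if history and current >= min_users and current >= ratio * max(baseline, 1.0):
                noticed.add(word)
        results.append(noticed)
        history.append(frame)
    return results


# Toy stream: "coffee" is steadily popular, "boston" suddenly spikes in tf3.
boston_counts = [1, 1, 2, 9]
frames = [{"boston": set(range(n)), "coffee": set(range(6))} for n in boston_counts]
flagged = well_noticed(frames)
```

Counting distinct users rather than raw mentions keeps a single very active account from triggering the criterion, which fits the slide's wording "mentioned by many users".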
38. Methodology – Event Candidate Recognition
Idea: group one keyword with its most relevant keywords into one event candidate
[Keyword graph: boston, explosion, confirm, prayer, bombing, boston-marathon, threat, iraq, jfk, hospital, victim, afghanistan, bomb, america]
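A simple way to illustrate the grouping idea is to treat co-occurrence within the same post as the relevance measure and keep the top few co-occurring words. That choice of measure, the function `event_candidate`, and the sample posts are assumptions for this sketch; the slides do not define how relevance was computed.

```python
from collections import Counter


def event_candidate(posts, keyword, top_k=2):
    """Group `keyword` with the `top_k` words that most often co-occur with it."""
    co_occurrence = Counter()
    for words in posts:
        if keyword in words:
            # Count each co-occurring word once per post.
            co_occurrence.update(w for w in set(words) if w != keyword)
    return [keyword] + [w for w, _ in co_occurrence.most_common(top_k)]


# Hypothetical tokenized posts.
posts = [
    ["boston", "explosion", "bombing"],
    ["boston", "explosion", "victim"],
    ["boston", "explosion", "bombing", "hospital"],
    ["iraq", "bomb", "threat"],
    ["boston", "explosion"],
]
candidate = event_candidate(posts, "boston")
```

Here "explosion" co-occurs with "boston" in four posts and "bombing" in two, so they form the event candidate together with the keyword.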
39. Methodology – Evolving Social Graph Analysis
Information decay:
Vertex weight, edge weight
Decay mechanism
Concept-Based Evolving Graph Sequences (cEGS): a sequence of directed graphs that demonstrate information propagation
[Figure: graph snapshots at tf1, tf2, tf3]
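To make the decay idea concrete, the sketch below applies a multiplicative decay to vertex weights at every time frame and adds the new mentions on top. Exponential decay, the `decay` and `drop_below` parameters, and the function `decay_step` are assumptions chosen for illustration; the slides name a decay mechanism without describing it. The same update could be applied to edge weights.

```python
def decay_step(weights, mentions, decay=0.5, drop_below=0.3):
    """Advance one time frame: decay old weights, add new mentions, prune weak vertices."""
    updated = {}
    for key in set(weights) | set(mentions):
        weight = weights.get(key, 0.0) * decay + mentions.get(key, 0)
        if weight >= drop_below:
            updated[key] = weight
    return updated


# Hypothetical mention counts per keyword for time frames tf1, tf2, tf3.
frames = [{"boston": 4, "jfk": 1}, {"boston": 2}, {}]
weights = {}
snapshots = []
for mentions in frames:
    weights = decay_step(weights, mentions)
    snapshots.append(dict(weights))
```

A keyword that stops being mentioned fades out after a few frames, so each graph snapshot reflects recent information rather than the entire history.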
40. Experiment
Testing
Events identified in November 2013
Evaluated by 7 human experts
Average precision 86.64%
[Bar chart: precision (0 to 1) by date, Nov_2 through Nov_30]
42. Introduction
18.1% of people in the United States suffer from a mental disorder (*)
Using social networks to research mental disorders
(*) National Institute of Mental Health:
http://www.nimh.nih.gov/health/statistics/prevalence/index.shtml
43. Background
Bipolar Disorder:
*Unstable and impulsive emotions
Cycling between manic and depressive episodes
Borderline Personality Disorder:
*Unstable and impulsive emotions
Impaired social interactions
53. Basic Guidelines
Identify the commonalities and differences between the experimental and control groups
Features:
Word/pattern frequency
Emotion related data (e.g., flipping rates, occurrence rates)
Social interaction (e.g., retweet, reply)
Lifestyle (e.g., online time, stay-up or not)
Age and gender
54. Apply Classifiers (Online)
Using the extracted features
Various classifiers
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Random forest
[Diagram label: Continuous Query Process]
Similarly, to measure the effectiveness of our method, the results of the Hometown dataset were split into “Factual” and “Empty | Fictional”.
-The first category refers to profiles in which the user has explicitly stated a valid location. The second category contains profiles whose location is listed as empty, fictional, or overbroad.
-WMAE: Workers MAE
-Tw MAE: Tweet MAE
-Workers would usually agree on the city, but not on the area, as a result of their perception.
In general, the error distance remained low, including for reallocated tweets.
Tweet MAE remains low compared to the area of the United States (3.1 million square miles).