SlideShare a Scribd company logo
Data Stream Analytics
Sumit Misra
RS Software (India) Limited
References:
1. Mining of Massive Data Sets, 2014 : Leskovec, Rajaraman, Ullman
2. Data Streams – Models and Algorithms, 2007 : Charu C Aggarwal
Content
• Business Need
• Models & Mining
• Standing Queries
• Conclusion
June 2018
Sumit Misra 2
BUSINESS NEED OF STREAM
ANALYTICS
How important is the space
June 2018
Sumit Misra 3
Platform Business : Viral Effect
June 2018
Application in Cloud
Browser Tab App
Mobile
App
Billions of Users
Big Data
Sumit Misra 4
1.32 billion active
Facebook users
per day
Source: https://www.wordstream.com/blog/ws/2017/11/07/facebook-statistics
4 M / min
UPI Transactions
June 2018
Sumit Misra 5
Other than Facebook no application worldwide
has seen such exponential adoption in early years
of launch
Complexity
June 2018
Sumit Misra 6
Prescriptive Analytics
Predictive Analytics
Descriptive Analytics
Offline Hybrid
Low Medium
Medium High
High
Very
High
Online
High
Very
High
Super
High
Map
June 2018
Sumit Misra 7
Prescriptive
Predictive
Descriptive
Offline Hybrid Online
Will X buy the car
What items are
bought together
Detect Plagiarism
Trending of
tweets
Page Ranking
Clustering
Stock Price
Prediction
Personalized
Ads / News
Entity
Resolution
Filter Spam
emails
Fraud detection
of ecommerce
purchases
Amazon, Netflix
Recommends
Google predicts
the query string
and provides
alternatives
Business Questions : Customer Focus
• Google : How many new users have logged
since last month
• Uber: What should be the dynamic price that
would not put off customer
June 2018
Sumit Misra 8
DATA STREAM : MODEL & MINING
There is no one answer or easy answer to easy questions in data streams
June 2018
Sumit Misra 9
Considerations and Approach
• Constraint
– We cannot store all the data from the past
– The stream objects can only be scanned once
– Usually we need to answer it in seconds as it provides
an opportunity for the business
• General Approach
– Obtain a close approximate answer within a short
time
June 2018
Sumit Misra 10
Stream Processor
Stream Model
Standing
Queries
Limited
Working
Storage
Archival
Storage
Ad-hoc Queries
Output Streams
Input Streams
1,5,2,7,4,5,2,8,9
t,p,n,h,a,d,o,l,p,v,e,
1,1,0,1,0,0,1,0,1,0
What is the highest
number so far?
How many new logins in past week?
Time
June 2018
Sumit Misra 11
Mining for …
June 2018
Sumit Misra 12
CLASSIFICATION
GOOD 15004 / MIN
FRAUD 04 / MIN
GOOD 16098 / MIN
FRAUD 06 / MIN
GOOD 6011 / MIN
FRAUD 01 / MIN
6am – 2pm 2pm – 10pm 10pm – 6am
Mining for …
June 2018
Sumit Misra 13
FREQUENT ITEM SET
A 6
G 7
W 5
A, G 2
G, W 3
A, W 2
A, G, W 2
A 7
G 6
W 2
A, G 4
G, W 1
A, W 0
A, G, W 0
A 1
G 1
W 1
A, G 1
G, W 1
A, W 1
A, G, W 1
Mining for …
June 2018
Sumit Misra 14
CLUSTERS
Time Windows
June 2018
Sumit Misra 15
LANDMARK
SLIDING
Synopsis
June 2018
Sumit Misra 16
LANDMARK
Count 74
Mean (µ) 8.5
Std Dev (σ) 2.6
Max 30
Min 3
µ/σ 3.3
Count 243
Mean (µ) 9.2
Std Dev (σ) 2.1
Max 32
Min 2
µ/σ 4.4
Count 198
Mean (µ) 6.1
Std Dev (σ) 4.1
Max 46
Min 1
µ/σ 1.5
10pm – 6am 6am – 2pm 2pm – 10pm
Burst of Data
June 2018
Sumit Misra 17
Sudden online purchase due
to heavy discount
Sudden cash dispensing
fearing bank strike
How to reduce load?
How to gracefully degrade?
Source: http://intelligence.slice.com/blog/2017/online-candy-sales-hop-up-71-percent-for-easter-but-sales-trail-behind-the-holiday-season
Reduce Loads : Sampling
• How do you sample if you can store only
1/10th of POS data, but
1. Want to process purchases that contains Item A
2. Want to process purchases by users that contain
2 or more of Item A
• If you pick 1 in 10 sales items (query) it will
work for 1 but not for 2.
• For 2, you need to pick 1 in 10 users.
June 2018
HASH
Sumit Misra 18
Reduce Loads : Sampling
• What if the rate of arrivals increase and you
cannot keep 1/10th?
• Strategy: Find a sample size which will ensure
[|pe-pa|> ] < δ
– pa : Actual Fraction
– pe : Estimated Fraction
–  : Relative Error
– δ: Error Bound
June 2018
Hoeffding
Inequality
Sumit Misra 19
Hoeffding Inequality
Probability [|pe-pa|> ] < δ
If there are N sample tokens and each token Xi can take
value from interval [ai, bi]
If Y = Σ Xi, where Xi is value of a sample token, for all α > 0
Pr[|Y – E(Y)| > Nα] < (2.exp (-
2(Nα)2
Σ(Ri)2
))
Where Ri = (bi – ai)
March 2016
Sumit Misra 20
If R=1, i.e. value interval is [0,1]
If (Nα =β)
Pr[|Y-E(Y)>β] < 2*exp(-2*β2)
Fraud 1.7%, Sample N=100
Pr[|Y-E(Y)| > Nα] < 0.6%
Research and Innovations
• Innovative synopsis data structures (list, trees,
graphs, clusters, …), optimized for
– Latency of search response
– Time complexity of update
– Space complexity of storage
June 2018
Sumit Misra 21
Classification / Filtering
• How do you filter out an email id as a good one and
not a spam?
– Typical email address is 20 bytes
– Strategy if we have 1 GB memory = 8 billion bits
– Map hash of good addresses to a bit location
June 2018
Sumit Misra 22
email
Hash to 8 billion memory address
and mark 1 on location of hash
49..01 0
49..02 1
49..03 0
49..04 0
49..05 0
49..06 1
49..07 0
Pre-process
email
Hash to memory address and
check if it has a 1 or 0
Refer / Read
Assuming very large spam percentage, this will eliminate about 80% of spams
Bloom Filter
(1970)
Classification / Filtering
• False Positive is act of wrongly classifying spam as
non-spam
• In the example:
– Using one hash function FP probability is 11.75%
– Using two has functions FP probability is 4.89%
• Intuition:
– This technique may also be used to count shopping
cart that has a set of items (A, B, C)
June 2018
Sumit Misra 23
Bloom Filter
(1970)
Frequent Recent Items
• For a large universal set (N), want to find top 5
most frequent recent items (Swiggy Orders, Uber
Requests, TV channels viewed)
– Have a set of counters where the item-ids hash
– Counters store counts which exponentially (α: [0..1])
decays with time (Δ)
– Report items mapped to top 5 counts
June 2018
Sumit Misra 24
1+CαΔ
Hash (Item ID)
4621 4622 4623 4624 4625 4626
***
***
Frequent Recent Items
• For a large universal set (N), want to find top 5
most frequent recent items (Swiggy Orders, Uber
Requests, TV channels viewed)
– Have a set of counters where the item-ids hash
– Counters store counts which exponentially (α: [0..1])
decays with time (Δ)
– Report items mapped to top 5 counts
• Metrics
– Latency of search of top 5 : O(1)
– Time complexity of update : O(|w|) + O(N.log(N))
– Space complexity : O(N)
June 2018
Sumit Misra 25
Pruning can reduce
this part
Measuring Data Uniformity : Estimating Moments
• Moments for Stream : 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
– Occurrences: 1(x4), 2(x3), 3(x4), 4(x1)
– 0th Moment = 40+30+40+10 = 4 (Distinct Items)
– 1st Moment = 41+31+41+11 = 12 (Steam Length)
– 2nd Moment = 42+32+42+12 = 42 (Surprise Number)
26
Sumit Misra June 2018
Distinct
Items
Stream
Length
Surprise
Number
1 :: 3 :: 10
Measuring Data Uniformity : Estimating Moments
• Surprise Number?
– Stream Length is 100 and has 11 Distinct Items
– Range of 2nd Moment
• 10x92+1X102 = 910 ---------- Evenly Distributed
• 10x12+902 = 8110 --------- Unevenly Distributed
27
Sumit Misra June 2018
Even Uneven
STANDING QUERIES
How to answer to known queries while handling data streams
June 2018
Sumit Misra 28
Standing Queries
• Approximate Answers
– Distinct Items = Flajolet-Martin Algorithm
– Counting 1s in last N bits of bit-stream = Datar-
Gionis-Indyk-Motwani Algorithm
– 2nd order Moments = Alon-Matias-Szegedy
Algorithm
June 2018
Sumit Misra 29
Distinct Items : Removing the repetitions
June 2018
Source: www.github.com
Sumit Misra 30
Distinct Items
• Lets work out an algorithm
– Stream: 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1...
– Hash Function: f(x) = (3x + 1) mod 5
– Stream post hash: 4, 5, 2, 4, 2, 5, 3, 5, 4, 2, 5, 4
– Convert to binary and count # trailing 0 --- R
– It is: 2, 0, 1, 2, 1, 0, 0, 0, 2, 1, 0, 2
• Max R = 2
• Probabilistic Estimate of Distinct Element M = 2R
= 22 = 4
Flajolet-Martin
Algorithm (1983)
June 2018
Sumit Misra 31
How many buyers are from the age
group of 18-28 years
June 2018
Source: www.dailyrecord.co.uk
Sumit Misra 32
Counting 1’s in a Window
• Window divided into buckets represented by
– Timestamp of right end
– Size : Number of 1’s in a bucket (pow. of 2)
• Estimating 1’s in last k=10 bits
– Exact Count = 5
– Estimate = Sum size of all but last and add ½ of the last bucket
– = (1) + (1) + (2) + (42) = 6
0 1 1 1 0 1 1 0 0 1 0 1 1 0
1 1 0 1 1 0 0 0 1
0
1
*
*
Timestamps
Bucket (41,8) (49,4) (55,4) (59,2) (61,1) (62,1)
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
k = 10 bits
June 2018
How many 1’s in last 10 (k) bits?
Sumit Misra 33
DGIM (2002)
Estimating Moments …
• Alon-Matias-Szegedy (AMS) Algorithm
– Example Stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
– n=15; a(x5), b(x4), c(x3), d(x3); 2nd Moment = 59
– Estimate = n * (2 * X.value − 1)
– Pick random positions X1(3rd), X2(8th), X3(13th)
– X1.element = “c”, X1.value = 3 (# of “c” in the set from 3rd place)
– Estimate for X1= 15*(2*3-1) = 75
– Estimates for X2 = 15*(2*2-1) = 45, (# “d” beyond 8th place = 2)
– Estimate for X3 = 45, (# “a” beyond 13th place = 2)
– Average for X1, X2, X3 = 165/3 = 55 (close to 59)
34
Sumit Misra March 2016
… Estimating Moments …
• Dealing with Infinite streams
– Challenge is not the “n” as we store one variable
per randomly chosen position
– Challenge is selecting the position
35
At start After sometime
After significant time
Sumit Misra March 2016
… Estimating Moments …
• Strategy for position selection assuming we
have space to store “s” variables and we have
seen “n” elements
– First “s” positions are chosen
– When (n+1)th token arrives, randomly select
positions to keep
– If (n+1) is selected discard from existing n position
randomly and insert new element with value 1
36
Sumit Misra March 2016
Finding needle in haystack … with
perseverance 
May 2018
Source: https://plus.google.com/+GaiaCostantino
Sumit Misra 37
… in form of equation …
FINDING NEEDLE IN HAYSTACK
• Very large number of tuples [𝑛 → ∞]
• Very low probability of finding what you want
PERSEVERANCE
• Equals large number of trials
PROBABILITY OF MISSING =
June 2018
Sumit Misra 38
lim
𝑛→∞
1 −
1
𝑛
𝑛
= 𝑒−1
= 0.368 = 36.8%
lim
𝑛→∞
1 −
1
𝑛
𝑛
CONCLUSION
To sum it up …
June 2018
Sumit Misra 39
Conclusion
• Mining Data Stream has moved from research
area to every day use for commercial purposes
• With single scan constraint algorithms suitable
for static data mining cannot be applied as-is
• Finding distinct items, count of items in last n-
occurrences, uniformity of item values, etc.
represents a few use-cases
• This remains an active area of research as it can
be applied across disciplines and hence needs
due attention
June 2018
Sumit Misra 40
Questions
Answers
Ebola Virus: Predict spread using
mobile phone data from West Africa
June 2018
Mobile phone data from West Africa is being used to map population movements
and predict how the Ebola virus might spread
Source: http://www.bbc.com/news/business-29617831
Sumit Misra 42
Mar 23, 2014 (WHO reported after 29 deaths, started Dec ‘13)
Thanks
sumit@rssoftware.co.in
+91.9830336049

More Related Content

Similar to Data-Stream-Analytics.pptx

Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
Gael Varoquaux
 
flat_presentation_time_evolving_OD_matrix_estimation
flat_presentation_time_evolving_OD_matrix_estimationflat_presentation_time_evolving_OD_matrix_estimation
flat_presentation_time_evolving_OD_matrix_estimationLuís Moreira-Matias
 
Data Science applications in business
Data Science applications in businessData Science applications in business
Data Science applications in business
Vladyslav Yakovenko
 
Human activity spatio-temporal indicators using mobile phone data
Human activity spatio-temporal indicators using mobile phone dataHuman activity spatio-temporal indicators using mobile phone data
Human activity spatio-temporal indicators using mobile phone data
University of Salerno
 
Streaming Analytics: It's Not the Same Game
Streaming Analytics: It's Not the Same GameStreaming Analytics: It's Not the Same Game
Streaming Analytics: It's Not the Same Game
Numenta
 
Semantic-based Process Analysis
Semantic-based Process AnalysisSemantic-based Process Analysis
Semantic-based Process Analysis
Mauro Dragoni
 
Ke yi small summaries for big data
Ke yi small summaries for big dataKe yi small summaries for big data
Ke yi small summaries for big data
jins0618
 
RSC: Mining and Modeling Temporal Activity in Social Media
RSC: Mining and Modeling Temporal Activity in Social MediaRSC: Mining and Modeling Temporal Activity in Social Media
RSC: Mining and Modeling Temporal Activity in Social Media
Alceu Ferraz Costa
 
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...
Ji Hyung Moon
 
Multi state churn analysis with a subscription product
Multi state churn analysis with a subscription productMulti state churn analysis with a subscription product
Multi state churn analysis with a subscription product
Vienna Data Science Group
 
AINL 2016: Strijov
AINL 2016: StrijovAINL 2016: Strijov
AINL 2016: Strijov
Lidia Pivovarova
 
2 funda.ppt
2 funda.ppt2 funda.ppt
2 funda.ppt
02LabiqaIslam
 
Managing your black Friday logs - CloudConf.IT
Managing your black Friday logs - CloudConf.ITManaging your black Friday logs - CloudConf.IT
Managing your black Friday logs - CloudConf.IT
David Pilato
 
Introduction to R for data science
Introduction to R for data scienceIntroduction to R for data science
Introduction to R for data science
Long Nguyen
 
Towards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of JuliaTowards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of Julia
Maurizio Tomasi
 
FFWD - Fast Forward With Degradation
FFWD - Fast Forward With DegradationFFWD - Fast Forward With Degradation
FFWD - Fast Forward With Degradation
Rolando Brondolin
 
Introduction to computing Processing and performance.pdf
Introduction to computing Processing and performance.pdfIntroduction to computing Processing and performance.pdf
Introduction to computing Processing and performance.pdf
TulasiramKandula1
 
DBMS ArchitectureQuery ExecutorBuffer ManagerStora
DBMS ArchitectureQuery ExecutorBuffer ManagerStoraDBMS ArchitectureQuery ExecutorBuffer ManagerStora
DBMS ArchitectureQuery ExecutorBuffer ManagerStora
LinaCovington707
 

Similar to Data-Stream-Analytics.pptx (20)

Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
 
flat_presentation_time_evolving_OD_matrix_estimation
flat_presentation_time_evolving_OD_matrix_estimationflat_presentation_time_evolving_OD_matrix_estimation
flat_presentation_time_evolving_OD_matrix_estimation
 
9cmplx
9cmplx9cmplx
9cmplx
 
Data Science applications in business
Data Science applications in businessData Science applications in business
Data Science applications in business
 
Human activity spatio-temporal indicators using mobile phone data
Human activity spatio-temporal indicators using mobile phone dataHuman activity spatio-temporal indicators using mobile phone data
Human activity spatio-temporal indicators using mobile phone data
 
Streaming Analytics: It's Not the Same Game
Streaming Analytics: It's Not the Same GameStreaming Analytics: It's Not the Same Game
Streaming Analytics: It's Not the Same Game
 
Semantic-based Process Analysis
Semantic-based Process AnalysisSemantic-based Process Analysis
Semantic-based Process Analysis
 
Ke yi small summaries for big data
Ke yi small summaries for big dataKe yi small summaries for big data
Ke yi small summaries for big data
 
RSC: Mining and Modeling Temporal Activity in Social Media
RSC: Mining and Modeling Temporal Activity in Social MediaRSC: Mining and Modeling Temporal Activity in Social Media
RSC: Mining and Modeling Temporal Activity in Social Media
 
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...
(KO) 온라인 뉴스 댓글 플랫폼을 흐리는 어뷰저 분석기 / (EN) Online ...
 
Multi state churn analysis with a subscription product
Multi state churn analysis with a subscription productMulti state churn analysis with a subscription product
Multi state churn analysis with a subscription product
 
AINL 2016: Strijov
AINL 2016: StrijovAINL 2016: Strijov
AINL 2016: Strijov
 
2 funda.ppt
2 funda.ppt2 funda.ppt
2 funda.ppt
 
Managing your black Friday logs - CloudConf.IT
Managing your black Friday logs - CloudConf.ITManaging your black Friday logs - CloudConf.IT
Managing your black Friday logs - CloudConf.IT
 
Introduction to R for data science
Introduction to R for data scienceIntroduction to R for data science
Introduction to R for data science
 
Towards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of JuliaTowards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of Julia
 
Leip103
Leip103Leip103
Leip103
 
FFWD - Fast Forward With Degradation
FFWD - Fast Forward With DegradationFFWD - Fast Forward With Degradation
FFWD - Fast Forward With Degradation
 
Introduction to computing Processing and performance.pdf
Introduction to computing Processing and performance.pdfIntroduction to computing Processing and performance.pdf
Introduction to computing Processing and performance.pdf
 
DBMS ArchitectureQuery ExecutorBuffer ManagerStora
DBMS ArchitectureQuery ExecutorBuffer ManagerStoraDBMS ArchitectureQuery ExecutorBuffer ManagerStora
DBMS ArchitectureQuery ExecutorBuffer ManagerStora
 

Recently uploaded

How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Atul Kumar Singh
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdfAdversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Po-Chuan Chen
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
TechSoup
 

Recently uploaded (20)

How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdfAdversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 

Data-Stream-Analytics.pptx

  • 1. Data Stream Analytics Sumit Misra RS Software (India) Limited References: 1. Mining of Massive Data Sets, 2014 : Leskovec, Rajaraman, Ullman 2. Data Streams – Models and Algorithms, 2007 : Charu C Aggarwal
  • 2. Content • Business Need • Models & Mining • Standing Queries • Conclusion June 2018 Sumit Misra 2
  • 3. BUSINESS NEED OF STREAM ANALYTICS How important is the space June 2018 Sumit Misra 3
  • 4. Platform Business : Viral Effect June 2018 Application in Cloud Browser Tab App Mobile App Billions of Users Big Data Sumit Misra 4 1.32 billion active Facebook users per day Source: https://www.wordstream.com/blog/ws/2017/11/07/facebook-statistics 4 M / min
  • 5. UPI Transactions June 2018 Sumit Misra 5 Other than Facebook no application worldwide has seen such exponential adoption in early years of launch
  • 6. Complexity June 2018 Sumit Misra 6 Prescriptive Analytics Predictive Analytics Descriptive Analytics Offline Hybrid Low Medium Medium High High Very High Online High Very High Super High
  • 7. Map June 2018 Sumit Misra 7 Prescriptive Predictive Descriptive Offline Hybrid Online Will X buy the car What items are bought together Detect Plagiarism Trending of tweets Page Ranking Clustering Stock Price Prediction Personalized Ads / News Entity Resolution Filter Spam emails Fraud detection of ecommerce purchases Amazon, Netflix Recommends Google predicts the query string and provides alternatives
  • 8. Business Questions : Customer Focus • Google : How many new users have logged since last month • Uber: What should be the dynamic price that would not put off customer June 2018 Sumit Misra 8
  • 9. DATA STREAM : MODEL & MINING There is no one answer or easy answer to easy questions in data streams June 2018 Sumit Misra 9
  • 10. Considerations and Approach • Constraint – We cannot store all the data from the past – The stream objects can only be scanned once – Usually we need to answer it in seconds as it provides an opportunity for the business • General Approach – Obtain a close approximate answer within a short time June 2018 Sumit Misra 10
  • 11. Stream Processor Stream Model Standing Queries Limited Working Storage Archival Storage Ad-hoc Queries Output Streams Input Streams 1,5,2,7,4,5,2,8,9 t,p,n,h,a,d,o,l,p,v,e, 1,1,0,1,0,0,1,0,1,0 What is the highest number so far? How many new logins in past week? Time June 2018 Sumit Misra 11
  • 12. Mining for … June 2018 Sumit Misra 12 CLASSIFICATION GOOD 15004 / MIN FRAUD 04 / MIN GOOD 16098 / MIN FRAUD 06 / MIN GOOD 6011 / MIN FRAUD 01 / MIN 6am – 2pm 2pm – 10pm 10pm – 6am
  • 13. Mining for … June 2018 Sumit Misra 13 FREQUENT ITEM SET A 6 G 7 W 5 A, G 2 G, W 3 A, W 2 A, G, W 2 A 7 G 6 W 2 A, G 4 G, W 1 A, W 0 A, G, W 0 A 1 G 1 W 1 A, G 1 G, W 1 A, W 1 A, G, W 1
  • 14. Mining for … June 2018 Sumit Misra 14 CLUSTERS
  • 15. Time Windows June 2018 Sumit Misra 15 LANDMARK SLIDING
  • 16. Synopsis June 2018 Sumit Misra 16 LANDMARK Count 74 Mean (µ) 8.5 Std Dev (σ) 2.6 Max 30 Min 3 µ/σ 3.3 Count 243 Mean (µ) 9.2 Std Dev (σ) 2.1 Max 32 Min 2 µ/σ 4.4 Count 198 Mean (µ) 6.1 Std Dev (σ) 4.1 Max 46 Min 1 µ/σ 1.5 10pm – 6am 6am – 2pm 2pm – 10pm
  • 17. Burst of Data June 2018 Sumit Misra 17 Sudden online purchase due to heavy discount Sudden cash dispensing fearing bank strike How to reduce load? How to gracefully degrade? Source: http://intelligence.slice.com/blog/2017/online-candy-sales-hop-up-71-percent-for-easter-but-sales-trail-behind-the-holiday-season
  • 18. Reduce Loads : Sampling • How do you sample if you can store only 1/10th of POS data, but 1. Want to process purchases that contains Item A 2. Want to process purchases by users that contain 2 or more of Item A • If you pick 1 in 10 sales items (query) it will work for 1 but not for 2. • For 2, you need to pick 1 in 10 users. June 2018 HASH Sumit Misra 18
  • 19. Reduce Loads : Sampling • What if the rate of arrivals increase and you cannot keep 1/10th? • Strategy: Find a sample size which will ensure [|pe-pa|> ] < δ – pa : Actual Fraction – pe : Estimated Fraction –  : Relative Error – δ: Error Bound June 2018 Hoeffding Inequality Sumit Misra 19
  • 20. Hoeffding Inequality Probability [|pe-pa|> ] < δ If there are N sample tokens and each token Xi can take value from interval [ai, bi] If Y = Σ Xi, where Xi is value of a sample token, for all α > 0 Pr[|Y – E(Y)| > Nα] < (2.exp (- 2(Nα)2 Σ(Ri)2 )) Where Ri = (bi – ai) March 2016 Sumit Misra 20 If R=1, i.e. value interval is [0,1] If (Nα =β) Pr[|Y-E(Y)>β] < 2*exp(-2*β2) Fraud 1.7%, Sample N=100 Pr[|Y-E(Y)| > Nα] < 0.6%
  • 21. Research and Innovations • Innovative synopsis data structures (list, trees, graphs, clusters, …), optimized for – Latency of search response – Time complexity of update – Space complexity of storage June 2018 Sumit Misra 21
  • 22. Classification / Filtering • How do you filter out an email id as a good one and not a spam? – Typical email address is 20 bytes – Strategy if we have 1 GB memory = 8 billion bits – Map hash of good addresses to a bit location June 2018 Sumit Misra 22 email Hash to 8 billion memory address and mark 1 on location of hash 49..01 0 49..02 1 49..03 0 49..04 0 49..05 0 49..06 1 49..07 0 Pre-process email Hash to memory address and check if it has a 1 or 0 Refer / Read Assuming very large spam percentage, this will eliminate about 80% of spams Bloom Filter (1970)
  • 23. Classification / Filtering • False Positive is act of wrongly classifying spam as non-spam • In the example: – Using one hash function FP probability is 11.75% – Using two has functions FP probability is 4.89% • Intuition: – This technique may also be used to count shopping cart that has a set of items (A, B, C) June 2018 Sumit Misra 23 Bloom Filter (1970)
  • 24. Frequent Recent Items • For a large universal set (N), want to find top 5 most frequent recent items (Swiggy Orders, Uber Requests, TV channels viewed) – Have a set of counters where the item-ids hash – Counters store counts which exponentially (α: [0..1]) decays with time (Δ) – Report items mapped to top 5 counts June 2018 Sumit Misra 24 1+CαΔ Hash (Item ID) 4621 4622 4623 4624 4625 4626 *** ***
  • 25. Frequent Recent Items • For a large universal set (N), want to find top 5 most frequent recent items (Swiggy Orders, Uber Requests, TV channels viewed) – Have a set of counters where the item-ids hash – Counters store counts which exponentially (α: [0..1]) decays with time (Δ) – Report items mapped to top 5 counts • Metrics – Latency of search of top 5 : O(1) – Time complexity of update : O(|w|) + O(N.log(N)) – Space complexity : O(N) June 2018 Sumit Misra 25 Pruning can reduce this part
  • 26. Measuring Data Uniformity : Estimating Moments • Moments for Stream : 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1 – Occurrences: 1(x4), 2(x3), 3(x4), 4(x1) – 0th Moment = 40+30+40+10 = 4 (Distinct Items) – 1st Moment = 41+31+41+11 = 12 (Steam Length) – 2nd Moment = 42+32+42+12 = 42 (Surprise Number) 26 Sumit Misra June 2018 Distinct Items Stream Length Surprise Number 1 :: 3 :: 10
  • 27. Measuring Data Uniformity : Estimating Moments • Surprise Number? – Stream Length is 100 and has 11 Distinct Items – Range of 2nd Moment • 10x92+1X102 = 910 ---------- Evenly Distributed • 10x12+902 = 8110 --------- Unevenly Distributed 27 Sumit Misra June 2018 Even Uneven
  • 28. STANDING QUERIES How to answer to known queries while handling data streams June 2018 Sumit Misra 28
  • 29. Standing Queries • Approximate Answers – Distinct Items = Flajolet-Martin Algorithm – Counting 1s in last N bits of bit-stream = Datar- Gionis-Indyk-Motwani Algorithm – 2nd order Moments = Alon-Matias-Szegedy Algorithm June 2018 Sumit Misra 29
  • 30. Distinct Items : Removing the repetitions June 2018 Source: www.github.com Sumit Misra 30
  • 31. Distinct Items • Lets work out an algorithm – Stream: 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1... – Hash Function: f(x) = (3x + 1) mod 5 – Stream post hash: 4, 5, 2, 4, 2, 5, 3, 5, 4, 2, 5, 4 – Convert to binary and count # trailing 0 --- R – It is: 2, 0, 1, 2, 1, 0, 0, 0, 2, 1, 0, 2 • Max R = 2 • Probabilistic Estimate of Distinct Element M = 2R = 22 = 4 Flajolet-Martin Algorithm (1983) June 2018 Sumit Misra 31
  • 32. How many buyers are from the age group of 18-28 years June 2018 Source: www.dailyrecord.co.uk Sumit Misra 32
  • 33. Counting 1’s in a Window • Window divided into buckets represented by – Timestamp of right end – Size : Number of 1’s in a bucket (pow. of 2) • Estimating 1’s in last k=10 bits – Exact Count = 5 – Estimate = Sum size of all but last and add ½ of the last bucket – = (1) + (1) + (2) + (42) = 6 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 1 0 1 * * Timestamps Bucket (41,8) (49,4) (55,4) (59,2) (61,1) (62,1) 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 k = 10 bits June 2018 How many 1’s in last 10 (k) bits? Sumit Misra 33 DGIM (2002)
  • 34. Estimating Moments … • Alon-Matias-Szegedy (AMS) Algorithm – Example Stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b – n=15; a(x5), b(x4), c(x3), d(x3); 2nd Moment = 59 – Estimate = n * (2 * X.value − 1) – Pick random positions X1(3rd), X2(8th), X3(13th) – X1.element = “c”, X1.value = 3 (# of “c” in the set from 3rd place) – Estimate for X1= 15*(2*3-1) = 75 – Estimates for X2 = 15*(2*2-1) = 45, (# “d” beyond 8th place = 2) – Estimate for X3 = 45, (# “a” beyond 13th place = 2) – Average for X1, X2, X3 = 165/3 = 55 (close to 59) 34 Sumit Misra March 2016
  • 35. … Estimating Moments … • Dealing with Infinite streams – Challenge is not the “n” as we store one variable per randomly chosen position – Challenge is selecting the position 35 At start After sometime After significant time Sumit Misra March 2016
  • 36. … Estimating Moments … • Strategy for position selection assuming we have space to store “s” variables and we have seen “n” elements – First “s” positions are chosen – When (n+1)th token arrives, randomly select positions to keep – If (n+1) is selected discard from existing n position randomly and insert new element with value 1 36 Sumit Misra March 2016
  • 37. Finding needle in haystack … with perseverance  May 2018 Source: https://plus.google.com/+GaiaCostantino Sumit Misra 37
  • 38. … in form of equation … FINDING NEEDLE IN HAYSTACK • Very large number of tuples [𝑛 → ∞] • Very low probability of finding what you want PERSEVERANCE • Equals large number of trials PROBABILITY OF MISSING = June 2018 Sumit Misra 38 lim 𝑛→∞ 1 − 1 𝑛 𝑛 = 𝑒−1 = 0.368 = 36.8% lim 𝑛→∞ 1 − 1 𝑛 𝑛
  • 39. CONCLUSION To sum it up … June 2018 Sumit Misra 39
  • 40. Conclusion • Mining Data Stream has moved from research area to every day use for commercial purposes • With single scan constraint algorithms suitable for static data mining cannot be applied as-is • Finding distinct items, count of items in last n- occurrences, uniformity of item values, etc. represents a few use-cases • This remains an active area of research as it can be applied across disciplines and hence needs due attention June 2018 Sumit Misra 40
  • 42. Ebola Virus: Predict spread using mobile phone data from West Africa June 2018 Mobile phone data from West Africa is being used to map population movements and predict how the Ebola virus might spread Source: http://www.bbc.com/news/business-29617831 Sumit Misra 42 Mar 23, 2014 (WHO reported after 29 deaths, started Dec ‘13)

Editor's Notes

  1. Application of Big Data in Business Solutions Abstract: The proliferation of platform economy and e/m-commerce has resulted in availability of very large corpus of data. Big Data is now being used to gain insight from these data corpus; machine learning is used to build predictive models from these data streams and adjust the models at high frequency and finally detecting outliers to utilize it for either leveraging a business opportunity or containing a risk. This session would discuss various aspects of the usage of the big data technologies that are being used for business solutions.
  2. This is mining data stream