This document discusses data stream analytics and mining. It covers the business need for stream analytics to analyze high-volume, continuous data streams. It describes models and mining techniques for data streams, including standing queries to continuously query streams. Key challenges in data streams are addressed, such as limited memory and single-pass processing. Specific algorithms are summarized, such as for estimating distinct items, counts, and moments over data streams. The conclusion emphasizes that data stream mining has become important for many applications due to its ability to analyze continuous, high-volume data streams with limited memory and single-pass constraints.
Atomic algorithm and the servers' s use to find the Hamiltonian cyclesIJERA Editor
Inspired by the movement of the particles in the atom, I demonstrated in [5] the existence of a polynomial
algorithm of the order
O(n
3
)
for finding Hamiltonian cycles in a graph with basis
E= {x0,...
, xn− 1
}
. In this
article I will give an improvement in space and in time of the algorithm says: we know that there exist several
methods to find the Hamiltonian cycles such as the Monte Carlo method, Dynamic programming, or DNA
computing. Unfortunately they are either expensive or slow to execute it. Hence the idea to use multiple servers
to solve this problem : Each point
xi
in the graph will be considered as a server, and each server
xi
will
communicate with each other server
x j
with which it is connected . And finally the server
x0
will receive
and display the Hamiltonian cycles if they exist
ODSC 2019: Sessionisation via stochastic periods for root event identificationKuldeep Jiwani
In todays world majority of information is generated by self sustaining systems like various kinds of bots, crawlers, servers, various online services, etc. This information is flowing on the axis of time and is generated by these actors under some complex logic. For example, a stream of buy/sell order requests by an Order Gateway in financial world, or a stream of web requests by a monitoring / crawling service in the web world, or may be a hacker's bot sitting on internet and attacking various computers. Although we may not be able to know the motive or intention behind these data sources. But via some unsupervised techniques we can try to infer the pattern or correlate the events based on their multiple occurrences on the axis of time. Associating a chain of events in order of time helps in doing a root event analysis. In certain cases a time ordered correlation and root event identification is good enough to automatically identify signatures of various malicious actors and take appropriate corrective actions to stop cyber attacks, stop malicious social campaigns, etc.
Sessionisation is one such unsupervised technique that tries to find the signal in a stream of events associated with a timestamp. In the ideal world it would resolve to finding periods with a mixture of sinusoidal waves. But for the real world this is a much complex activity, as even the systematic events generated by machines over the internet behave in a much erratic manner. So the notion of a period for a signal also changes in the real world. We can no longer associate it with a number, it has to be treated as a random variable, with expected values and associated variance. Hence we need to model "Stochastic periods" and learn their probability distributions in an unsupervised manner.
The main focus of this talk will be to showcase applied data science techniques to discover stochastic periods. There are many ways to obtain periods in data, so the journey would begin by a walk through of existing techniques like FFT (Fast Fourier Transform) then discuss about Gaussian Mixture Models. After highlighting the short comings of these techniques we will succinctly explain one of the most general non-parametric Bayesian approaches to solve this problem. Without going too deep in the complex math, we will get back to applied data science and discuss a much simpler technique that can solve the same problem if certain assumptions are satisfied.
In this talk we will demonstrate some time based pattern we discovered while working on a security analytics use case that uses Sessionisation. In the talk we will demonstrate such patterns based on an open source malware attack datasets that is available publicly.
Key concepts explained in talk: Sessionisation, Bayesian techniques of Machine Learning, Gaussian Mixture Models, Kernel density estimation, FFT, stochastic periods, probabilistic modelling, Bayesian non-parametric methods
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinGuido Schmutz
Time series data is everywhere: IoT, sensor data or financial transactions. The industry has moved to databases like Cassandra to handle the high velocity and high volume of data that is now common place. In this talk I will present how we have used Cassandra to store time series data. I will highlight both the Cassandra data model as well as the architecture we put in place for collecting and ingesting data into Cassandra, using Apache Kafka and Apache Storm.
Atomic algorithm and the servers' s use to find the Hamiltonian cyclesIJERA Editor
Inspired by the movement of the particles in the atom, I demonstrated in [5] the existence of a polynomial
algorithm of the order
O(n
3
)
for finding Hamiltonian cycles in a graph with basis
E= {x0,...
, xn− 1
}
. In this
article I will give an improvement in space and in time of the algorithm says: we know that there exist several
methods to find the Hamiltonian cycles such as the Monte Carlo method, Dynamic programming, or DNA
computing. Unfortunately they are either expensive or slow to execute it. Hence the idea to use multiple servers
to solve this problem : Each point
xi
in the graph will be considered as a server, and each server
xi
will
communicate with each other server
x j
with which it is connected . And finally the server
x0
will receive
and display the Hamiltonian cycles if they exist
ODSC 2019: Sessionisation via stochastic periods for root event identificationKuldeep Jiwani
In todays world majority of information is generated by self sustaining systems like various kinds of bots, crawlers, servers, various online services, etc. This information is flowing on the axis of time and is generated by these actors under some complex logic. For example, a stream of buy/sell order requests by an Order Gateway in financial world, or a stream of web requests by a monitoring / crawling service in the web world, or may be a hacker's bot sitting on internet and attacking various computers. Although we may not be able to know the motive or intention behind these data sources. But via some unsupervised techniques we can try to infer the pattern or correlate the events based on their multiple occurrences on the axis of time. Associating a chain of events in order of time helps in doing a root event analysis. In certain cases a time ordered correlation and root event identification is good enough to automatically identify signatures of various malicious actors and take appropriate corrective actions to stop cyber attacks, stop malicious social campaigns, etc.
Sessionisation is one such unsupervised technique that tries to find the signal in a stream of events associated with a timestamp. In the ideal world it would resolve to finding periods with a mixture of sinusoidal waves. But for the real world this is a much complex activity, as even the systematic events generated by machines over the internet behave in a much erratic manner. So the notion of a period for a signal also changes in the real world. We can no longer associate it with a number, it has to be treated as a random variable, with expected values and associated variance. Hence we need to model "Stochastic periods" and learn their probability distributions in an unsupervised manner.
The main focus of this talk will be to showcase applied data science techniques to discover stochastic periods. There are many ways to obtain periods in data, so the journey would begin by a walk through of existing techniques like FFT (Fast Fourier Transform) then discuss about Gaussian Mixture Models. After highlighting the short comings of these techniques we will succinctly explain one of the most general non-parametric Bayesian approaches to solve this problem. Without going too deep in the complex math, we will get back to applied data science and discuss a much simpler technique that can solve the same problem if certain assumptions are satisfied.
In this talk we will demonstrate some time based pattern we discovered while working on a security analytics use case that uses Sessionisation. In the talk we will demonstrate such patterns based on an open source malware attack datasets that is available publicly.
Key concepts explained in talk: Sessionisation, Bayesian techniques of Machine Learning, Gaussian Mixture Models, Kernel density estimation, FFT, stochastic periods, probabilistic modelling, Bayesian non-parametric methods
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinGuido Schmutz
Time series data is everywhere: IoT, sensor data or financial transactions. The industry has moved to databases like Cassandra to handle the high velocity and high volume of data that is now common place. In this talk I will present how we have used Cassandra to store time series data. I will highlight both the Cassandra data model as well as the architecture we put in place for collecting and ingesting data into Cassandra, using Apache Kafka and Apache Storm.
Simple representations for learning: factorizations and similarities Gael Varoquaux
Real-life data seldom comes in the ideal form for statistical learning.
This talk focuses on high-dimensional problems for signals and
discrete entities: when dealing with many, correlated, signals or
entities, it is useful to extract representations that capture these
correlations.
Matrix factorization models provide simple but powerful representations. They are used for recommender systems across discrete entities such as users and products, or to learn good dictionaries to represent images. However they entail large computing costs on very high-dimensional data, databases with many products or high-resolution images. I will present an
algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing. I will show how it interpolates between one-hot encoding and techniques used in character-level natural language processing.
[1] Stochastic subsampling for factorizing huge matrices, A Mensch, J Mairal, B Thirion, G Varoquaux, IEEE Transactions on Signal Processing 66 (1), 113-128
[2] Similarity encoding for learning with dirty categorical variables. P Cerda, G Varoquaux, B Kégl Machine Learning (2018): 1-18
Human activity spatio-temporal indicators using mobile phone dataUniversity of Salerno
In the context of Smart Cities, monitoring the dynamic of the presence of people is a crucial aspect for the well-being of an urban area. We use mobile phone data as a proxy for the total number of people (Carpita & Simonetto 2014), with the specific aim of computing spatio-temporal region specific indicators. Telecom Italia Mobile (TIM), which is the largest operator in Italy, thanks to a research agreement with the Statistical Office of the Municipality of Brescia, provided to us about two years (April 2014 to June 2016) of High-Frequency Daily Mobile Phone Density Profiles (DMPDPs) in the form of a regular grid polygon each 15 minutes. Densities have to be rescaled in order to express the total amount of people rather than just TIM users. Separately
for selected regions in the province of Brescia, characterized by being either working or residential areas, we group similar DMPDPs and we characterize groups by their spatial and temporal components. In doing so, we propose a mixed-approach procedure.
The widespread adoption of Information Technology systems and their
capability to trace data about process executions has made available Information
Technology data for the analysis of process executions. Meanwhile, at business
level, static and procedural knowledge, which can be exploited to analyze and rea-
son on data, is often available. In this paper we aim at providing an approach that,
combining static and procedural aspects, business and data levels and exploiting
semantic-based techniques allows business analysts to infer knowledge and use it
to analyze system executions. The proposed solution has been implemented using
current scalable Semantic Web technologies, that offer the possibility to keep the
advantages of semantic-based reasoning with non-trivial quantities of data.
RSC: Mining and Modeling Temporal Activity in Social MediaAlceu Ferraz Costa
Presentation of the KDD 2015 paper describing the RSC model:
RSC: Mining and Modeling Temporal Activity in Social Media
Alceu Ferraz Costa, Yuto Yamaguchi, Agma Juci Machado Traina, Caetano Traina Jr., and Christos Faloutsos
The 21st SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2015
(KO) <<2019 데이터야놀자>> 에서 발표한 내용입니다.
사회적으로 문제가 되고 있는 어뷰징 분석과 이를 방지할 수 있는 방법에 대해 정리해보았습니다.
(EN) Presented in Datayanolja 2019 (Domestic data conference).
Dealing with one of the big social issues: manipulation of public opinions in online news platform.
Main contributions are two-fold: 1) identifying manipulators 2) suggesting possible 3 solutions
* notice: The material is written in Korean.
The consumer product landscape, particularly among e-commerce firms, includes a bevy of subscription-based business models. Internet and mobile phone subscriptions are now commonplace and joining the ranks are dietary supplements, meals, clothing, cosmetics and personal grooming products.
Standard metrics to diagnose a healthy consumer-brand relationship typically include customer purchase frequency and ultimately, retention of the customer demonstrated by regular purchases. If a brand notices that a customer isn’t purchasing, it may consider targeting the customer with discount offers or deploying a tailored messaging campaign in the hope that the customer will return and not “churn”.The churn diagnosis, however, becomes more complicated for subscription-based products, many of which offer multiple delivery frequencies and the ability to pause a subscription. Brands with subscription-based products need to have some reliable measure of churn propensity so they can further isolate the factors that lead to churn and preemptively identify at-risk customers.
Managing your black Friday logs - CloudConf.ITDavid Pilato
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical in Black Fridays.
Towards new solutions for scientific computing: the case of JuliaMaurizio Tomasi
The year 2018 marks the consolidation of Julia (https://julialang.org/), a programming language designed for scientific computing, as the first stable version (1.0) has been released, in August 2018. Among its main features, expressiveness and high execution speeds are the most prominent: the performance of Julia code is similar to
statically compiled languages, yet Julia provides a nice interactive shell and fully supports Jupyter; moreover, it can transparently call external codes written in C, Fortran, and even Python and R without the need of wrappers. The usage of Julia in the astronomical community is growing, and a GitHub oganization named JuliaAstro takes care of coordinating the development of packages. In this ADASS talk we present the features and shortcomings of this language, and discuss its application in astronomy and astrophysics.
DBMS Architecture
Query Executor
Buffer Manager
Storage Manager
Storage
Transaction Manager
Logging &
Recovery
Lock Manager
Buffers Lock Tables
Main Memory
User/Web Forms/Applications/DBA
query transaction
Query Optimizer
Query Rewriter
Query Parser
Files & Access Methods
Past lectures
Today’s
lecture
Many query plans to execute a SQL query
3
S UTR
SR
T
U
SR
T
U
• Even more plans: multiple algorithms to execute each operation
SR
T
U
Sort-merge
Sort-merge
hash join
Table-scan
index-scan
Table-scanindex-scan
• Compute the join of R(A,B) S(B,C) T(C,D) U(D,E)
Explain command in PostgreSQL
4
SELECT C.STATE, SUM(O.NETAMOUNT), SUM(O.TOTALAMOUNT)
FROM CUSTOMERS C
JOIN CUST_HIST CH ON C.CUSTOMERID = CH.CUSTOMERID
JOIN ORDERS O ON CH.ORDERID = O.ORDERID
GROUP BY C.STATE
Communication between operators:
iterator model
• Each physical operator implements three functions:
– Open: initializes the data structures.
– GetNext: returns the next tuple in the result.
– Close: ends the operation and frees the resources.
• It enables pipelining
• Other option: compute the result of the operator in full
and store it in disk or memory:
– inefficient.
5
Query optimization: picking the fastest plan
• Optimal approach plan
– enumerate each possible plan
– measure its performance by running it
– pick the fastest one
– What’s wrong?
• Rule-based optimization
– Use a set of pre-defined rules to generate a fast plan
• e.g., If there is an index over a table, use it for scan and join.
6
Review Definitions
• Statistics on table R:
– T(R): Number of tuples in R
– B(R): Number of blocks in R
• B(R) = T(R ) / block size
– V(R,A): Number of distinct values of attribute A in R
7
Plans to select tuples from R: σA=a(R)
• We have an unclustered index on R
• Plans:
– (Unclustered) indexed-based scan
– Table-scan (sequential access)
• Statistics on R
– B(R)=5000, T(R)=200,000
– V(R,A) = 2, one value appears in 95% of tuples.
• Unclustered indexed scan vs. table-scan ?
8
Presenter
Presentation Notes
Unclustered indexed scan vs. table-scan ?
table-scan is the winner!
Query optimization methods
• Rule-based optimizer fails
– It uses static rules
– The rules do not consider the distribution of the data.
• Cost-based optimization
– predict the cost of each plan
– search the plan space to find the fastest one
– do it efficiently
• Optimization itself should be fast!
9
Cost-based optimization
• Plan space
– which plans to consider?
– it is time consuming to explore all alternatives.
• Cost estimator
– how to estimate the cost of each plan without executing it?
– we would like to have accurate estimation
• Search algorithm
– how to search the plan space fast?
– we would like to avoid checking inefficient plans
10
Space of query plans: basic operators
• Selection
– algorithms: sequential, index-based
– Ordering
• Projection
– Ordering
• Join
– algorithms: nested loop, sort-merge, hash
– orderi ...
Simple representations for learning: factorizations and similarities Gael Varoquaux
Real-life data seldom comes in the ideal form for statistical learning.
This talk focuses on high-dimensional problems for signals and
discrete entities: when dealing with many, correlated, signals or
entities, it is useful to extract representations that capture these
correlations.
Matrix factorization models provide simple but powerful representations. They are used for recommender systems across discrete entities such as users and products, or to learn good dictionaries to represent images. However they entail large computing costs on very high-dimensional data, databases with many products or high-resolution images. I will present an
algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing. I will show how it interpolates between one-hot encoding and techniques used in character-level natural language processing.
[1] Stochastic subsampling for factorizing huge matrices, A Mensch, J Mairal, B Thirion, G Varoquaux, IEEE Transactions on Signal Processing 66 (1), 113-128
[2] Similarity encoding for learning with dirty categorical variables. P Cerda, G Varoquaux, B Kégl Machine Learning (2018): 1-18
Human activity spatio-temporal indicators using mobile phone dataUniversity of Salerno
In the context of Smart Cities, monitoring the dynamic of the presence of people is a crucial aspect for the well-being of an urban area. We use mobile phone data as a proxy for the total number of people (Carpita & Simonetto 2014), with the specific aim of computing spatio-temporal region specific indicators. Telecom Italia Mobile (TIM), which is the largest operator in Italy, thanks to a research agreement with the Statistical Office of the Municipality of Brescia, provided to us about two years (April 2014 to June 2016) of High-Frequency Daily Mobile Phone Density Profiles (DMPDPs) in the form of a regular grid polygon each 15 minutes. Densities have to be rescaled in order to express the total amount of people rather than just TIM users. Separately
for selected regions in the province of Brescia, characterized by being either working or residential areas, we group similar DMPDPs and we characterize groups by their spatial and temporal components. In doing so, we propose a mixed-approach procedure.
The widespread adoption of Information Technology systems and their
capability to trace data about process executions has made available Information
Technology data for the analysis of process executions. Meanwhile, at business
level, static and procedural knowledge, which can be exploited to analyze and rea-
son on data, is often available. In this paper we aim at providing an approach that,
combining static and procedural aspects, business and data levels and exploiting
semantic-based techniques allows business analysts to infer knowledge and use it
to analyze system executions. The proposed solution has been implemented using
current scalable Semantic Web technologies, that offer the possibility to keep the
advantages of semantic-based reasoning with non-trivial quantities of data.
RSC: Mining and Modeling Temporal Activity in Social MediaAlceu Ferraz Costa
Presentation of the KDD 2015 paper describing the RSC model:
RSC: Mining and Modeling Temporal Activity in Social Media
Alceu Ferraz Costa, Yuto Yamaguchi, Agma Juci Machado Traina, Caetano Traina Jr., and Christos Faloutsos
The 21st SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2015
(KO) <<2019 데이터야놀자>> 에서 발표한 내용입니다.
사회적으로 문제가 되고 있는 어뷰징 분석과 이를 방지할 수 있는 방법에 대해 정리해보았습니다.
(EN) Presented in Datayanolja 2019 (Domestic data conference).
Dealing with one of the big social issues: manipulation of public opinions in online news platform.
Main contributions are two-fold: 1) identifying manipulators 2) suggesting possible 3 solutions
* notice: The material is written in Korean.
The consumer product landscape, particularly among e-commerce firms, includes a bevy of subscription-based business models. Internet and mobile phone subscriptions are now commonplace and joining the ranks are dietary supplements, meals, clothing, cosmetics and personal grooming products.
Standard metrics to diagnose a healthy consumer-brand relationship typically include customer purchase frequency and ultimately, retention of the customer demonstrated by regular purchases. If a brand notices that a customer isn’t purchasing, it may consider targeting the customer with discount offers or deploying a tailored messaging campaign in the hope that the customer will return and not “churn”.The churn diagnosis, however, becomes more complicated for subscription-based products, many of which offer multiple delivery frequencies and the ability to pause a subscription. Brands with subscription-based products need to have some reliable measure of churn propensity so they can further isolate the factors that lead to churn and preemptively identify at-risk customers.
Managing your black Friday logs - CloudConf.ITDavid Pilato
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical in Black Fridays.
Towards new solutions for scientific computing: the case of JuliaMaurizio Tomasi
The year 2018 marks the consolidation of Julia (https://julialang.org/), a programming language designed for scientific computing, as the first stable version (1.0) has been released, in August 2018. Among its main features, expressiveness and high execution speeds are the most prominent: the performance of Julia code is similar to
statically compiled languages, yet Julia provides a nice interactive shell and fully supports Jupyter; moreover, it can transparently call external codes written in C, Fortran, and even Python and R without the need of wrappers. The usage of Julia in the astronomical community is growing, and a GitHub oganization named JuliaAstro takes care of coordinating the development of packages. In this ADASS talk we present the features and shortcomings of this language, and discuss its application in astronomy and astrophysics.
DBMS Architecture
Query Executor
Buffer Manager
Storage Manager
Storage
Transaction Manager
Logging &
Recovery
Lock Manager
Buffers Lock Tables
Main Memory
User/Web Forms/Applications/DBA
query transaction
Query Optimizer
Query Rewriter
Query Parser
Files & Access Methods
Past lectures
Today’s
lecture
Many query plans to execute a SQL query
3
S UTR
SR
T
U
SR
T
U
• Even more plans: multiple algorithms to execute each operation
SR
T
U
Sort-merge
Sort-merge
hash join
Table-scan
index-scan
Table-scanindex-scan
• Compute the join of R(A,B) S(B,C) T(C,D) U(D,E)
Explain command in PostgreSQL
4
SELECT C.STATE, SUM(O.NETAMOUNT), SUM(O.TOTALAMOUNT)
FROM CUSTOMERS C
JOIN CUST_HIST CH ON C.CUSTOMERID = CH.CUSTOMERID
JOIN ORDERS O ON CH.ORDERID = O.ORDERID
GROUP BY C.STATE
Communication between operators:
iterator model
• Each physical operator implements three functions:
– Open: initializes the data structures.
– GetNext: returns the next tuple in the result.
– Close: ends the operation and frees the resources.
• It enables pipelining
• Other option: compute the result of the operator in full
and store it in disk or memory:
– inefficient.
5
Query optimization: picking the fastest plan
• Optimal approach plan
– enumerate each possible plan
– measure its performance by running it
– pick the fastest one
– What’s wrong?
• Rule-based optimization
– Use a set of pre-defined rules to generate a fast plan
• e.g., If there is an index over a table, use it for scan and join.
6
Review Definitions
• Statistics on table R:
– T(R): Number of tuples in R
– B(R): Number of blocks in R
• B(R) = T(R ) / block size
– V(R,A): Number of distinct values of attribute A in R
7
Plans to select tuples from R: σA=a(R)
• We have an unclustered index on R
• Plans:
– (Unclustered) indexed-based scan
– Table-scan (sequential access)
• Statistics on R
– B(R)=5000, T(R)=200,000
– V(R,A) = 2, one value appears in 95% of tuples.
• Unclustered indexed scan vs. table-scan ?
8
Presenter
Presentation Notes
Unclustered indexed scan vs. table-scan ?
table-scan is the winner!
Query optimization methods
• Rule-based optimizer fails
– It uses static rules
– The rules do not consider the distribution of the data.
• Cost-based optimization
– predict the cost of each plan
– search the plan space to find the fastest one
– do it efficiently
• Optimization itself should be fast!
9
Cost-based optimization
• Plan space
– which plans to consider?
– it is time consuming to explore all alternatives.
• Cost estimator
– how to estimate the cost of each plan without executing it?
– we would like to have accurate estimation
• Search algorithm
– how to search the plan space fast?
– we would like to avoid checking inefficient plans
10
Space of query plans: basic operators
• Selection
– algorithms: sequential, index-based
– Ordering
• Projection
– Ordering
• Join
– algorithms: nested loop, sort-merge, hash
– orderi ...
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
Introduction to AI for Nonprofits with Tapp NetworkTechSoup
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
3. BUSINESS NEED OF STREAM
ANALYTICS
How important is the space
June 2018
Sumit Misra 3
4. Platform Business : Viral Effect
June 2018
Application in Cloud
Browser Tab App
Mobile
App
Billions of Users
Big Data
Sumit Misra 4
1.32 billion active
Facebook users
per day
Source: https://www.wordstream.com/blog/ws/2017/11/07/facebook-statistics
4 M / min
5. UPI Transactions
June 2018
Sumit Misra 5
Other than Facebook no application worldwide
has seen such exponential adoption in early years
of launch
6. Complexity
June 2018
Sumit Misra 6
Prescriptive Analytics
Predictive Analytics
Descriptive Analytics
Offline Hybrid
Low Medium
Medium High
High
Very
High
Online
High
Very
High
Super
High
7. Map
June 2018
Sumit Misra 7
Prescriptive
Predictive
Descriptive
Offline Hybrid Online
Will X buy the car
What items are
bought together
Detect Plagiarism
Trending of
tweets
Page Ranking
Clustering
Stock Price
Prediction
Personalized
Ads / News
Entity
Resolution
Filter Spam
emails
Fraud detection
of ecommerce
purchases
Amazon, Netflix
Recommends
Google predicts
the query string
and provides
alternatives
8. Business Questions : Customer Focus
• Google : How many new users have logged
since last month
• Uber: What should be the dynamic price that
would not put off customer
June 2018
Sumit Misra 8
9. DATA STREAM : MODEL & MINING
There is no one answer or easy answer to easy questions in data streams
June 2018
Sumit Misra 9
10. Considerations and Approach
• Constraint
– We cannot store all the data from the past
– The stream objects can only be scanned once
– Usually we need to answer it in seconds as it provides
an opportunity for the business
• General Approach
– Obtain a close approximate answer within a short
time
June 2018
Sumit Misra 10
12. Mining for …
June 2018
Sumit Misra 12
CLASSIFICATION
GOOD 15004 / MIN
FRAUD 04 / MIN
GOOD 16098 / MIN
FRAUD 06 / MIN
GOOD 6011 / MIN
FRAUD 01 / MIN
6am – 2pm 2pm – 10pm 10pm – 6am
13. Mining for …
June 2018
Sumit Misra 13
FREQUENT ITEM SET
A 6
G 7
W 5
A, G 2
G, W 3
A, W 2
A, G, W 2
A 7
G 6
W 2
A, G 4
G, W 1
A, W 0
A, G, W 0
A 1
G 1
W 1
A, G 1
G, W 1
A, W 1
A, G, W 1
16. Synopsis
June 2018
Sumit Misra 16
LANDMARK
Count 74
Mean (µ) 8.5
Std Dev (σ) 2.6
Max 30
Min 3
µ/σ 3.3
Count 243
Mean (µ) 9.2
Std Dev (σ) 2.1
Max 32
Min 2
µ/σ 4.4
Count 198
Mean (µ) 6.1
Std Dev (σ) 4.1
Max 46
Min 1
µ/σ 1.5
10pm – 6am 6am – 2pm 2pm – 10pm
17. Burst of Data
June 2018
Sumit Misra 17
Sudden online purchase due
to heavy discount
Sudden cash dispensing
fearing bank strike
How to reduce load?
How to gracefully degrade?
Source: http://intelligence.slice.com/blog/2017/online-candy-sales-hop-up-71-percent-for-easter-but-sales-trail-behind-the-holiday-season
18. Reduce Loads : Sampling
• How do you sample if you can store only
1/10th of POS data, but
1. Want to process purchases that contains Item A
2. Want to process purchases by users that contain
2 or more of Item A
• If you pick 1 in 10 sales items (query) it will
work for 1 but not for 2.
• For 2, you need to pick 1 in 10 users.
June 2018
HASH
Sumit Misra 18
19. Reduce Loads : Sampling
• What if the rate of arrivals increase and you
cannot keep 1/10th?
• Strategy: Find a sample size which will ensure
[|pe-pa|> ] < δ
– pa : Actual Fraction
– pe : Estimated Fraction
– : Relative Error
– δ: Error Bound
June 2018
Hoeffding
Inequality
Sumit Misra 19
20. Hoeffding Inequality
Probability [|pe-pa|> ] < δ
If there are N sample tokens and each token Xi can take
value from interval [ai, bi]
If Y = Σ Xi, where Xi is value of a sample token, for all α > 0
Pr[|Y – E(Y)| > Nα] < (2.exp (-
2(Nα)2
Σ(Ri)2
))
Where Ri = (bi – ai)
March 2016
Sumit Misra 20
If R=1, i.e. value interval is [0,1]
If (Nα =β)
Pr[|Y-E(Y)>β] < 2*exp(-2*β2)
Fraud 1.7%, Sample N=100
Pr[|Y-E(Y)| > Nα] < 0.6%
21. Research and Innovations
• Innovative synopsis data structures (list, trees,
graphs, clusters, …), optimized for
– Latency of search response
– Time complexity of update
– Space complexity of storage
June 2018
Sumit Misra 21
22. Classification / Filtering
• How do you filter out an email id as a good one and
not a spam?
– Typical email address is 20 bytes
– Strategy if we have 1 GB memory = 8 billion bits
– Map hash of good addresses to a bit location
June 2018
Sumit Misra 22
email
Hash to 8 billion memory address
and mark 1 on location of hash
49..01 0
49..02 1
49..03 0
49..04 0
49..05 0
49..06 1
49..07 0
Pre-process
email
Hash to memory address and
check if it has a 1 or 0
Refer / Read
Assuming very large spam percentage, this will eliminate about 80% of spams
Bloom Filter
(1970)
23. Classification / Filtering
• False Positive is act of wrongly classifying spam as
non-spam
• In the example:
– Using one hash function FP probability is 11.75%
– Using two has functions FP probability is 4.89%
• Intuition:
– This technique may also be used to count shopping
cart that has a set of items (A, B, C)
June 2018
Sumit Misra 23
Bloom Filter
(1970)
24. Frequent Recent Items
• For a large universal set (N), want to find top 5
most frequent recent items (Swiggy Orders, Uber
Requests, TV channels viewed)
– Have a set of counters where the item-ids hash
– Counters store counts which exponentially (α: [0..1])
decays with time (Δ)
– Report items mapped to top 5 counts
June 2018
Sumit Misra 24
1+CαΔ
Hash (Item ID)
4621 4622 4623 4624 4625 4626
***
***
25. Frequent Recent Items
• For a large universal set (N), want to find top 5
most frequent recent items (Swiggy Orders, Uber
Requests, TV channels viewed)
– Have a set of counters where the item-ids hash
– Counters store counts which exponentially (α: [0..1])
decays with time (Δ)
– Report items mapped to top 5 counts
• Metrics
– Latency of search of top 5 : O(1)
– Time complexity of update : O(|w|) + O(N.log(N))
– Space complexity : O(N)
June 2018
Sumit Misra 25
Pruning can reduce
this part
27. Measuring Data Uniformity : Estimating Moments
• Surprise Number?
– Stream Length is 100 and has 11 Distinct Items
– Range of 2nd Moment
• 10x92+1X102 = 910 ---------- Evenly Distributed
• 10x12+902 = 8110 --------- Unevenly Distributed
27
Sumit Misra June 2018
Even Uneven
28. STANDING QUERIES
How to answer to known queries while handling data streams
June 2018
Sumit Misra 28
29. Standing Queries
• Approximate Answers
– Distinct Items = Flajolet-Martin Algorithm
– Counting 1s in last N bits of bit-stream = Datar-
Gionis-Indyk-Motwani Algorithm
– 2nd order Moments = Alon-Matias-Szegedy
Algorithm
June 2018
Sumit Misra 29
30. Distinct Items : Removing the repetitions
June 2018
Source: www.github.com
Sumit Misra 30
31. Distinct Items
• Lets work out an algorithm
– Stream: 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1...
– Hash Function: f(x) = (3x + 1) mod 5
– Stream post hash: 4, 5, 2, 4, 2, 5, 3, 5, 4, 2, 5, 4
– Convert to binary and count # trailing 0 --- R
– It is: 2, 0, 1, 2, 1, 0, 0, 0, 2, 1, 0, 2
• Max R = 2
• Probabilistic Estimate of Distinct Element M = 2R
= 22 = 4
Flajolet-Martin
Algorithm (1983)
June 2018
Sumit Misra 31
32. How many buyers are from the age
group of 18-28 years
June 2018
Source: www.dailyrecord.co.uk
Sumit Misra 32
33. Counting 1’s in a Window
• Window divided into buckets represented by
– Timestamp of right end
– Size : Number of 1’s in a bucket (pow. of 2)
• Estimating 1’s in last k=10 bits
– Exact Count = 5
– Estimate = Sum size of all but last and add ½ of the last bucket
– = (1) + (1) + (2) + (42) = 6
0 1 1 1 0 1 1 0 0 1 0 1 1 0
1 1 0 1 1 0 0 0 1
0
1
*
*
Timestamps
Bucket (41,8) (49,4) (55,4) (59,2) (61,1) (62,1)
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
43
42
41
40
k = 10 bits
June 2018
How many 1’s in last 10 (k) bits?
Sumit Misra 33
DGIM (2002)
34. Estimating Moments …
• Alon-Matias-Szegedy (AMS) Algorithm
– Example Stream: a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
– n=15; a(x5), b(x4), c(x3), d(x3); 2nd Moment = 59
– Estimate = n * (2 * X.value − 1)
– Pick random positions X1(3rd), X2(8th), X3(13th)
– X1.element = “c”, X1.value = 3 (# of “c” in the set from 3rd place)
– Estimate for X1= 15*(2*3-1) = 75
– Estimates for X2 = 15*(2*2-1) = 45, (# “d” beyond 8th place = 2)
– Estimate for X3 = 45, (# “a” beyond 13th place = 2)
– Average for X1, X2, X3 = 165/3 = 55 (close to 59)
34
Sumit Misra March 2016
35. … Estimating Moments …
• Dealing with Infinite streams
– Challenge is not the “n” as we store one variable
per randomly chosen position
– Challenge is selecting the position
35
At start After sometime
After significant time
Sumit Misra March 2016
36. … Estimating Moments …
• Strategy for position selection assuming we
have space to store “s” variables and we have
seen “n” elements
– First “s” positions are chosen
– When (n+1)th token arrives, randomly select
positions to keep
– If (n+1) is selected discard from existing n position
randomly and insert new element with value 1
36
Sumit Misra March 2016
37. Finding needle in haystack … with
perseverance
May 2018
Source: https://plus.google.com/+GaiaCostantino
Sumit Misra 37
38. … in form of equation …
FINDING NEEDLE IN HAYSTACK
• Very large number of tuples [𝑛 → ∞]
• Very low probability of finding what you want
PERSEVERANCE
• Equals large number of trials
PROBABILITY OF MISSING =
June 2018
Sumit Misra 38
lim
𝑛→∞
1 −
1
𝑛
𝑛
= 𝑒−1
= 0.368 = 36.8%
lim
𝑛→∞
1 −
1
𝑛
𝑛
40. Conclusion
• Mining Data Stream has moved from research
area to every day use for commercial purposes
• With single scan constraint algorithms suitable
for static data mining cannot be applied as-is
• Finding distinct items, count of items in last n-
occurrences, uniformity of item values, etc.
represents a few use-cases
• This remains an active area of research as it can
be applied across disciplines and hence needs
due attention
June 2018
Sumit Misra 40
42. Ebola Virus: Predict spread using
mobile phone data from West Africa
June 2018
Mobile phone data from West Africa is being used to map population movements
and predict how the Ebola virus might spread
Source: http://www.bbc.com/news/business-29617831
Sumit Misra 42
Mar 23, 2014 (WHO reported after 29 deaths, started Dec ‘13)
Application of Big Data in Business Solutions
Abstract: The proliferation of platform economy and e/m-commerce has resulted in availability of very large corpus of data. Big Data is now being used to gain insight from these data corpus; machine learning is used to build predictive models from these data streams and adjust the models at high frequency and finally detecting outliers to utilize it for either leveraging a business opportunity or containing a risk. This session would discuss various aspects of the usage of the big data technologies that are being used for business solutions.