The PCY, A-Priori, and SON algorithms are implemented with Hadoop in pseudo-distributed mode on the Ta-Feng Grocery dataset to find frequent itemsets, and the underlying association rules are then derived from the discovered frequent itemsets. Written by Chengeng Ma.
Hadoop implementation for algorithms apriori, pcy, son
1. Hadoop implementation for algorithms (A-Priori, PCY & SON) working on Frequent Itemsets
Chengeng Ma
Stony Brook University
2016/03/26
“Diapers and Beer”
2. 1. Fundamentals about frequent itemsets
• If you are already familiar with frequent itemsets, please go directly to page 13. The following pages are fundamental knowledge and methods about this topic; you can find more details in the book Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman.
3. Frequent itemsets are about things frequently bought together
• For large supermarkets, e.g., Walmart, Stop & Shop, Carrefour, …, finding out which items are usually bought together not only lets them serve their customers better, but is also very important for their selling strategies.
• For example, people buy hot dogs & mustard together. A supermarket can then offer a sale on hot dogs but raise the price of mustard.
4. Besides marketing strategies, it is also widely used in other domains, e.g.,
• Related concepts: finding a set of words that frequently appears together in many documents, blogs, or Tweets;
• Plagiarism checking: finding a set of papers that share many sentences;
• Biomarkers: finding a set of biomarkers that frequently appear together for a specific disease.
5. Why and who studies frequent itemsets?
• Online shops like Amazon, eBay, …, can recommend items to all their customers, as long as the computation resources allow.
• They not only can learn which items are popular,
• but can also study customers’ personal interests and recommend individually.
• So they want to find similar customers and similar products.
• This opportunity of online shops is called the long tail effect.
6. Why and who studies frequent itemsets?
• The brick-and-mortar retailers, however, have limited resources.
• When they advertise a sale, they spend money and shelf space on the advertised item.
• They cannot make recommendations to their customers individually, so they have to focus on the most popular goods.
• So they want to focus on frequently bought items and frequent itemsets.
7. What is the difficulty if the work is just counting?
• Suppose there are N different products. If you want to find frequent single items, you just need a hash table of N key-value pairs.
• If you want to find frequent pairs, the size of the hash table becomes N(N−1)/2.
• If you want to find frequent length-k itemsets, you need C(N, k) key-value pairs to store your counts.
• When k gets large, this grows roughly as N^k / k!, which quickly becomes unmanageable (see the quick check below).
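A quick back-of-the-envelope check of this blow-up, using Python's math.comb and the Ta-Feng catalog size quoted later in the deck (N = 23,812 products); the numbers are only meant to illustrate the scale:

```python
# Number of candidate itemsets of length k for a catalog of N products.
# N = 23,812 is the Ta-Feng catalog size mentioned later in this deck.
from math import comb

N = 23_812
for k in (1, 2, 3):
    print(f"k={k}: C({N},{k}) = {comb(N, k):,}")
# k=1 needs ~2.4e4 counters, k=2 ~2.8e8, and k=3 already ~2.2e12 --
# far too many to keep in a single hash table.
```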
8. Who is the angel to save us?
• Monotonicity of itemsets: if a set I is frequent, then all subsets of I are frequent.
• If a set J is not frequent, then all sets that contain J cannot be frequent.
• Ck: set of candidate itemsets of length k
• Lk: set of frequent itemsets of length k
• In the real world, reading all the single items usually does not stress main memory too much. Even a giant supermarket corporation does not sell more than about one million items.
• Since the candidates for frequent triples (C3) are based on frequent pairs (L2) instead of all possible pairs (C2), the memory stress for finding frequent triples and longer itemsets is not that serious.
• So the largest memory stress happens when finding frequent pairs.
9. Classic A-Priori: takes 2 passes over the whole data
• The 1st pass holds a hash table for the counts of all the single items.
• After the 1st pass, the frequent items are found;
• now you can replace the hash table with a BitSet, setting the frequent items to 1 and leaving the others at 0,
• or you can store only the indexes of the frequent items and ignore the others.
• The 2nd pass takes the 1st pass’ results as an input, besides reading through the whole data a second time.
• During the 2nd pass, you start from an empty hash table whose key is a pair of items and whose value is its count.
• For each transaction basket, exclude the infrequent items;
• within the frequent items, form all possible pairs, and add 1 to the count of each pair.
• After this pass, filter out the infrequent pairs (see the sketch below).
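A minimal single-machine sketch of this two-pass idea for frequent pairs, written in Python for illustration (the project's actual implementation is the Hadoop version described later); the baskets in the usage example are made up:

```python
# Two-pass A-Priori for frequent pairs, in memory.
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """baskets: list of item lists (iterated twice); s: support threshold."""
    # Pass 1: count single items and keep the frequent ones
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose two items are both frequent
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {p: c for p, c in pair_counts.items() if c >= s}

# Toy usage (made-up baskets):
baskets = [["hot dogs", "mustard", "beer"],
           ["hot dogs", "mustard"],
           ["beer", "diapers"]]
print(apriori_pairs(baskets, s=2))   # {('hot dogs', 'mustard'): 2}
```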
10. A-Priori’s shortcomings
• 1. During the 1st pass over the data, a lot of memory goes unused.
• 2. The candidate frequent pairs (C2) can still be numerous if the dataset is too large.
• 3. Counting is a process that can be map-reduced; you do not need a hash table that really stores everything during the 2nd pass.
Some improvements exist: PCY, multi-hash, multi-stage, …
11. PCY (the algorithm of Park, Chen and Yu) makes use of the unused memory during the 1st pass
• During the 1st pass, we create 2 empty hash tables: the 1st is for counting single items, the 2nd is for hashing pairs.
• When processing each transaction basket, you not only count the singletons,
• but also generate all the pairs within the basket, and hash each pair to a bucket of the 2nd hash table by adding 1 to that bucket.
• So the 1st hash table is simple; it must have N keys and values, like: hashTable1st(item ID) = count
• For the 2nd hash table, its size (how many buckets) depends on the specific problem.
• Its key is just a bucket position and has no other meaning; its value represents how many pairs are hashed onto that bucket, like: hashTable2nd(bucket pos) = count
12. The 2nd pass of the PCY algorithm
• After the 1st pass, in the 1st hash table you can find the frequent singletons,
• and in the 2nd hash table you can find the frequent buckets.
• Replace the 2 hash tables with 2 BitSets, or just discard the infrequent items and bucket positions and save the remainder into 2 files separately.
• The 2nd pass reads in the 2 files generated from the 1st pass, in addition to reading through the whole data.
• During the 2nd pass, for each transaction basket, exclude items that are infrequent;
• within the frequent items, generate all possible pairs;
• for each pair, hash it to a bucket (same hash function as the 1st pass);
• if that bucket is frequent according to the 1st pass, then add 1 to the count of that pair.
• After this pass, filter out the infrequent pairs (a compact sketch of both passes follows below).
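A compact single-machine sketch of the two PCY passes just described, assuming baskets of integer item IDs (the project re-indexes products into [0, 23811]) and using the same pair hash that the deck defines later; it is illustrative, not the project's Hadoop code:

```python
# Two-pass PCY for frequent pairs, in memory, on integer item IDs.
from collections import Counter
from itertools import combinations

def pcy_pairs(baskets, s, num_buckets):
    def bucket(x, y):
        # pair hash used later in this deck, modulo the bucket count
        return ((x + y) * (x + y + 1) // 2 + max(x, y)) % num_buckets

    # Pass 1: count singletons and hash every pair into a bucket
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for x, y in combinations(items, 2):
            bucket_counts[bucket(x, y)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= s}
    frequent_buckets = {b for b, c in enumerate(bucket_counts) if c >= s}

    # Pass 2: count a pair only if both items are frequent AND its
    # bucket was frequent in pass 1
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for x, y in combinations(kept, 2):
            if bucket(x, y) in frequent_buckets:
                pair_counts[(x, y)] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```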
13. Why does hashing work?
• If a bucket is infrequent, then none of the pairs hashed into that bucket can be frequent, because the sum of their counts is less than s (the threshold).
• If a bucket is frequent, then the sum of the counts of the pairs hashed into that bucket is at least s. Some pairs inside that bucket may be frequent, some may not, or all of them may be infrequent individually.
• Note we use a deterministic hash function, not random hashing.
• If you use multiple hash functions in the 1st pass, it’s called the multi-hash algorithm.
• If you add one more pass and use another hash function in the 2nd pass, it’s called the multi-stage algorithm.
14. • If you add one more pass and use another hash function in the 2nd pass, it’s called the multi-stage algorithm.
• If you use multiple hash functions in the 1st pass, it’s called the multi-hash algorithm.
15. How many buckets do you need for hashing?
• Suppose in total there are T transactions, each containing on average A items, the threshold for becoming frequent is s, and the number of buckets you create for PCY hashing is B.
• Then there are approximately T·A(A−1)/2 possible pairs.
• Each bucket gets about TA²/(2B) counts on average.
• If TA²/(2B) is around s, then only a few buckets are infrequent, and we do not gain much performance from hashing.
• But if TA²/(2B) is much smaller than s (say, less than 0.1·s), then only a few buckets can become frequent, so we can filter out a lot of infrequent candidates before we really count them.
16. 2. Details of my class project: dataset
• Ta-Feng Grocery Dataset, covering all transactions from Nov 2000 to Feb 2001.
• Ta-Feng is a membership retailer warehouse in Taiwan which sells mostly food-based products but also office supplies and furniture (Hsu et al., 2004).
• The dataset contains 119,578 transactions, which sum up to 817,741 single-item selling records, with 32,266 different customers and 23,812 unique products.
• Its size is about 52.7 MB. But as a training exercise, I will use Hadoop to find the frequent pairs.
• It can be downloaded at http://recsyswiki.com/wiki/Grocery_shopping_datasets
17. Preparation work
• The product ID is originally 13 digits; we re-index it within [0, 23811].
• Usually the threshold s is set to 1% of the number of transactions.
• However, this dataset is not large, so we cannot set the threshold too high. In the end 0.25% of the number of transactions is used as the threshold, which is 300.
• Only 403 frequent singletons result (from 23,812 unique products).
• For PCY, since we require TA²/(2B) ≪ s, we need B ≫ TA²/(2s) = 119578·7²/(2·300) ≈ 9765. We set B to a prime number, 999,983 (= 10^6 − 17).
• For the hash function, I use the following:
• H(x, y) = { (x + y)·(x + y + 1)/2 + max(x, y) } mod B
(a small sketch of this hash follows below)
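A tiny Python sketch of this pair hash (a Cantor-pairing-style function taken modulo B), just to make the formula concrete; the function name is mine, not from the deck:

```python
# Pair hash from this slide: Cantor-pairing-style, modulo the bucket count.
B = 999_983  # prime close to 10**6, as chosen in this project

def pcy_hash(x: int, y: int, buckets: int = B) -> int:
    """Hash an item pair (x, y) to a bucket index in [0, buckets)."""
    return ((x + y) * (x + y + 1) // 2 + max(x, y)) % buckets

# The same pair always lands in the same bucket regardless of item order.
assert pcy_hash(4831, 4839) == pcy_hash(4839, 4831)
```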
18. MapReduced A-Priori: Hadoop is good at counting
• Suppose you already have the frequent singletons, read from the Distributed Cache and stored in a BitSet as 1/0.
• Mapper input: {i, [M_i1, M_i2, …, M_iSi]}, one transaction, where i is the customer ID and S_i items are bought.
• Within [M_i1, M_i2, …, M_iSi], exclude infrequent items; only frequent items remain as [F_i1, F_i2, …, F_ifi], where f_i ≤ S_i.
• for x in [F_i1, F_i2, …, F_ifi]:
    for y in [F_i1, F_i2, …, F_ifi]:
      if x < y: output {(x, y), 1}
• Reducer input: {(x, y), [a, b, c, …]};
• take the sum of the list as t;
• if t ≥ s: output {(x, y), t}.
• The combiner is the same as the reducer, except it only takes the sum and does not filter infrequent pairs.
• The original A-Priori needs a data structure to store all candidate pairs in memory to do the counting.
• Here we deal with each transaction individually. Hadoop gathers each pair’s count for us without taking large memory (a streaming-style sketch follows below).
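To make the mapper/reducer shapes concrete, here is a Hadoop Streaming-style sketch in Python. The project itself is presumably a Java MapReduce job; the input format (one transaction per line as "customer_id&lt;TAB&gt;item item …") and the cached file name frequent_items.txt are assumptions for illustration only.

```python
# apriori_pairs_mapper.py -- streaming-style sketch of the pair-counting mapper
import sys
from itertools import combinations

# Hypothetical pass-1 output shipped via the distributed cache:
# one frequent item ID per line.
with open("frequent_items.txt") as f:
    frequent = {int(line.split()[0]) for line in f if line.strip()}

for line in sys.stdin:
    _, items = line.rstrip("\n").split("\t", 1)
    basket = sorted({int(t) for t in items.split() if int(t) in frequent})
    for x, y in combinations(basket, 2):   # x < y by construction
        print(f"{x},{y}\t1")
```

```python
# apriori_pairs_reducer.py -- sums the 1s per pair and keeps pairs with
# count >= s (s = 300 in this project). A combiner would be identical
# except it would not apply the threshold.
import sys

S = 300
current, total = None, 0
for line in sys.stdin:
    key, val = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None and total >= S:
            print(f"{current}\t{total}")
        current, total = key, 0
    total += int(val)
if current is not None and total >= S:
    print(f"{current}\t{total}")
```

These two scripts could be wired together with the standard hadoop-streaming jar (-input/-output/-mapper/-reducer/-files), or tested locally with a shell pipe plus sort.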
19. MapReduced PCY, 1st pass
• Mapper input: {i, [M_i1, M_i2, …, M_iSi]}, one transaction, where i is the customer ID and S_i items are bought.
• for item in [M_i1, M_i2, …, M_iSi]:
    output {(item, “single”), 1}
• for x in [M_i1, M_i2, …, M_iSi]:
    for y in [M_i1, M_i2, …, M_iSi]:
      if x < y: output {(hash(x, y), “hash”), 1}
• Reducer input: {(item, “single”), [a, b, c, …]} or {(hashValue, “hash”), [d, e, f, …]}, but never both for one key;
• take the sum of the list as t;
• if t ≥ s: output {key, t}
(where the key can be (item, “single”) or (hashValue, “hash”)).
The combiner is the same as the reducer, except it only takes the sum and does not filter infrequent keys (see the sketch below).
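A matching streaming-style sketch of this first PCY pass (again Python for illustration, with the same assumed input format; the reducer is the same sum-and-threshold pattern shown above):

```python
# pcy_pass1_mapper.py -- emits a 1 for every single item and a 1 for the
# bucket each pair hashes into, mirroring the (item,"single") and
# (hashValue,"hash") keys on this slide.
import sys
from itertools import combinations

B = 999_983  # number of buckets, as chosen in this project

def pcy_hash(x, y):
    return ((x + y) * (x + y + 1) // 2 + max(x, y)) % B

for line in sys.stdin:
    _, items = line.rstrip("\n").split("\t", 1)
    basket = sorted({int(t) for t in items.split()})
    for item in basket:
        print(f"{item},single\t1")
    for x, y in combinations(basket, 2):
        print(f"{pcy_hash(x, y)},hash\t1")
```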
20. MapReduced PCY, 2nd pass: counting
• Suppose you already have the frequent singletons and frequent buckets, read from the Distributed Cache and stored in 2 BitSets as 1/0.
• Mapper input: {i, [M_i1, M_i2, …, M_iSi]}, one transaction, where i is the customer ID and S_i items are bought.
• Within [M_i1, M_i2, …, M_iSi], exclude infrequent items; only frequent items remain as [F_i1, F_i2, …, F_ifi], where f_i ≤ S_i.
• for x in [F_i1, F_i2, …, F_ifi]:
    for y in [F_i1, F_i2, …, F_ifi]:
      if (x < y) and hash(x, y) ∈ frequent buckets: output {(x, y), 1}
• Reducer input: {(x, y), [a, b, c, …]};
• take the sum of the list as t;
• if t ≥ s: output {(x, y), t}.
• The combiner is the same as the reducer, except it only takes the sum and does not filter infrequent pairs.
• As we said, Hadoop gathers each pair’s count for us without stressing memory. You don’t need to store everything in memory (see the sketch below).
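And a sketch of this second-pass mapper, which keeps a pair only if both items are frequent and its bucket was frequent in pass 1; the cached file names are hypothetical stand-ins for the pass-1 output:

```python
# pcy_pass2_mapper.py -- pair counting restricted to frequent items and
# frequent buckets from pass 1.
import sys
from itertools import combinations

B = 999_983

def pcy_hash(x, y):
    return ((x + y) * (x + y + 1) // 2 + max(x, y)) % B

with open("frequent_items.txt") as f:
    frequent_items = {int(l.split()[0]) for l in f if l.strip()}
with open("frequent_buckets.txt") as f:
    frequent_buckets = {int(l.split()[0]) for l in f if l.strip()}

for line in sys.stdin:
    _, items = line.rstrip("\n").split("\t", 1)
    basket = sorted({int(t) for t in items.split() if int(t) in frequent_items})
    for x, y in combinations(basket, 2):
        if pcy_hash(x, y) in frequent_buckets:
            print(f"{x},{y}\t1")
```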
21. Higher-order A-Priori
• Now suppose you have the set of frequent length-k itemsets L(k); how do you find L(k+1)?
• Create an empty hash table C, whose key is a length-(k+1) itemset and whose value is a count.
• Within L(k), use a double loop to form all possible itemset pairs;
• for each pair (x, y), take the union z = x ∪ y;
• if the length of z is k+1, then add 1 to the count of z, i.e., C[z] += 1.
• Finally, if the count of z is less than k+1, then z does not have k+1 frequent length-k subsets; delete z from C.
• The remaining itemsets in C are the candidate itemsets.
• Now go through the whole dataset, count each candidate itemset, and filter out the infrequent candidates (a sketch of the candidate-generation step follows below).
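A small Python sketch of this candidate-generation rule as stated on the slide (keep a union z of size k+1 only if it was produced by at least k+1 pairs); the function and variable names are mine:

```python
# Candidate generation L(k) -> C(k+1), following the slide's rule:
# union frequent k-itemsets pairwise, keep unions of length k+1, and
# drop any union formed by fewer than k+1 pairs.
from collections import Counter
from itertools import combinations

def candidates_next(Lk):
    """Lk: list of frozensets, all of the same length k."""
    k = len(next(iter(Lk)))
    union_counts = Counter()
    for a, b in combinations(Lk, 2):
        z = a | b
        if len(z) == k + 1:
            union_counts[z] += 1
    return [z for z, c in union_counts.items() if c >= k + 1]

# Toy example with frequent pairs (k = 2): only {1, 2, 3} has all three
# of its length-2 subsets frequent, so it is the only candidate triple.
L2 = [frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (2, 4)]]
print(candidates_next(L2))   # [frozenset({1, 2, 3})]
```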
22. Frequent pairs and triples and their counts
• The number of frequent itemsets drops exponentially as its length (k) grows.
• In 18 of the 21 frequent pairs, the two product IDs are close to each other, like (4831, 4839), (6041, 6068), ….
23. Association rules: Confidence(I → J) = support(I ∪ J) / support(I) = P(J | I)
• Given a frequent itemset, how do you find the confidence of all related association rules?
• There are better ways than this; however, since our results stop at k = 3, we simply enumerate all of them (2^k possibilities).
• Suppose I ∪ J = {a, b, c}. We use a binary number to encode whether each item appears in J (1) or not (0).
• Then I = (I ∪ J) − J.
• J = {}, 000, 0
• J = {a}, 001, 1
• J = {b}, 010, 2
• J = {a, b}, 011, 3
• J = {c}, 100, 4
• J = {a, c}, 101, 5
• J = {b, c}, 110, 6
• J = {a, b, c}, 111, 7
(Binary digits: each digit marks whether a specific item appears in J. The decimal numbers make it easy to enumerate the subsets; a small sketch follows below.)
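A Python sketch of this bitmask enumeration; the support dictionary and its counts are made-up placeholders, and only the pair (4831, 4839) is borrowed from the deck's results:

```python
# Enumerate all rules I -> J from one frequent itemset and compute their
# confidences from a support table (hypothetical dict of frozenset -> count).
def rules_from_itemset(itemset, support, min_conf=0.5):
    items = sorted(itemset)
    k = len(items)
    rules = []
    for mask in range(1, 2 ** k - 1):          # skip J = {} and J = everything
        J = frozenset(items[i] for i in range(k) if mask & (1 << i))
        I = frozenset(items) - J
        conf = support[frozenset(items)] / support[I]
        if conf >= min_conf:
            rules.append((I, J, conf))
    return rules

# Toy usage with made-up support counts:
support = {frozenset({4831}): 500, frozenset({4839}): 450,
           frozenset({4831, 4839}): 320}
print(rules_from_itemset({4831, 4839}, support))
```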
24. The associations found (confidence threshold = 0.5)
• The indexes within each association rule are quite close to each other, like 19405 → 19405, (4832, 4839) → 4831, ….
• We guess that the supermarket, when designing shelf positions or creating IDs for their products, deliberately placed items that are frequently bought together close to each other.
25. SON (the algorithm of Savasere, Omiecinski, and Navathe)
This is a simple way of parallelizing:
• Each computing unit gets a portion of the data that can fit in its memory (it needs a data structure to store the local data),
• then uses A-Priori, PCY, or another method to find the locally frequent itemsets, which form the candidate set.
• During this process, the local threshold is adjusted to p·s (p is the fraction of the data that unit gets).
• During the 2nd pass, you only count the candidate itemsets,
• then filter out the infrequent candidates.
• If an itemset is not locally frequent on any computing unit, then it cannot be frequent globally (Σᵢ pᵢ = 1).
• This algorithm can handle itemsets of any length in its parallel phase.
• But it needs to hold the raw data in local memory (more expensive than storing counts; practical only when you have a lot of cluster nodes). A compact sketch of the two stages follows below.
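A minimal in-memory Python sketch of the two SON stages for pairs, simulating what the two MapReduce jobs would do; it is not the project's Hadoop code, and the chunking of the data is assumed:

```python
# SON for frequent pairs: stage 1 finds locally frequent pairs per chunk
# (with a scaled-down threshold), stage 2 counts the union of candidates
# over the whole data and applies the global threshold s.
from collections import Counter
from itertools import combinations

def local_frequent_pairs(chunk, local_threshold):
    counts = Counter()
    for basket in chunk:
        counts.update(combinations(sorted(set(basket)), 2))
    return {p for p, c in counts.items() if c >= local_threshold}

def son_frequent_pairs(chunks, s, total_baskets):
    """chunks: list of lists of baskets (each chunk fits in memory)."""
    # Stage 1: union of locally frequent pairs = global candidate set
    candidates = set()
    for chunk in chunks:
        p = len(chunk) / total_baskets
        candidates |= local_frequent_pairs(chunk, p * s)
    # Stage 2: count only the candidates over the full data
    counts = Counter()
    for chunk in chunks:
        for basket in chunk:
            for pair in combinations(sorted(set(basket)), 2):
                if pair in candidates:
                    counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= s}
```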
26. References
• 1. Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman.
• 2. Chun-Nan Hsu, Hao-Hsiang Chung and Han-Shen Huang, Mining Skewed and Sparse Transaction Data for Personalized Shopping Recommendation, Machine Learning, 2004, Volume 57, Number 1-2, Page 35.