Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by Kyle Polich of DataScience

Applications of theApriori Algorithm
on Open Data

Who am I?
2
• I'm Kyle Polich
• I work at DataScience
• I hostThe Data Skeptic Podcast
• I’m excited to share some ideas about data
mining framed around the Apriori Algorithm
• And examples on open data you can
reproduce

Outline
3
• What is Association Mining?
• The Apriori Algorithm
• Examples
• Big Data
• Criticisms
• Tips andTricks

General Concept
4
• Unsupervised Learning
• Association rule learning (A and B) (A and B and C)
• If N items, than 2N-1 itemsets (powerset w/o empty)
• Common itemsets are made up of common
sub-itemsets
• Iteratively build candidates based on frequency

Isn’t this a dead algorithm?
5
?!

6
Well, the apriori algorithm might be outdated
but a) this page is about that algorithm! and
b) not necessary to state,
but it is the first significant algorithm, and
the basic idea is used again and again in
several succeeding algorithms
so it is important to understand it.Exa 18:33,
16 May 2007 (UTC)
Excerpt fromWikipedia talk page
By user 81.104.165.184

7

8
C4.5
Apriori algorithm
Hyperloglog

9
Google Scholar tracks 18,286
citations
TODO: visualize this as a time series

10
1. Easy to learn in a 30 minute session
2. Always start simple, and grow in complexity
3. Simple, but still powerful
4. Practical to implement
5. Runs well at scale
6. Good study of algorithmic design
7. I believe it’s a useful algorithm

Origin / Creators
11
Fast Algorithms for Mining Association Rules
Rakesh Agrawal & Ramakrishnan Srikant
IBMAlmaden Research Center
20th InternationalConference onVery Large Data Bases
Santiago, Chile - September 1994
http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf

Key Concept: Associative Rules
12
• “Peanut Butter” AND “Jelly”
• “Sausage”AND “mustard” AND “deli roll”
• “Good schools” AND “easy parking” AND
“walk to restaurants”

Metrics
22
Support
% of cases containing itemset
R and Machine Learning (5)
Benjamin Uminsky
Gian Gonzanga
Jim Mcguire
Kyle Polich
Szilard Pafka
Everyone (35)
Aaron Wepler, Abhi Nemani, Adam Mollenkopf, Alan Gates, Amelia
Mcnamara, Arvind Prabhakar, Ashish Singh, Benjamin Uminsky, Bikas Saha,
Brian Kursar, Chris Fregly, Felix Chern, Gian GonzangatH, Hyunsik Choi, Jeff
Morris, Jim Mcguire, John De Goes, Jonathan Gray, Josiah Carlson, Karen
Lopez, Khanderao Kand, Kyle Polich, Michael Limcaco, Michael Stack,
Rachel Pedreschi, Raj Babu, Romain Rigaux, Sabri Sansoy, Szilard Pafka,Tim
Ellis,Tim Fulmer, Ulas Bardak,Vinayak Borkar, Will Ochandarena, ZainAsgar
5 / 35 = .14286

Metrics
23
Confidence
% of cases containing itemset
R (6)
Amelia Mcnamara, Benjamin Uminsky, Gian Gonzanga, Jim
Mcguire, Kyle Polich, Szilard Pafka
Machine Learning (7)
Benjamin Uminsky, Brian Kursar, Gian Gonzanga, Jim
Mcguire, Kyle Polich, Szilard Pafka, Ulas Bardak
R -> Machine Learning
5 / 7 = .71286

CodeWalkthrough
24
Let minimum support = .19
name count support
Algorithms 7 0.2
Machine Learning 7 0.2
Software Engineering 7 0.2
Software Development 9 0.257143
Distributed Systems 11 0.314286
Java 12 0.342857
Big Data 13 0.371429
Hadoop 14 0.4

CodeWalkthrough
25
Let minimum support = .19; k=2
name count support
Algorithms 7 0.2
Java 12 0.342857
Big Data 13 0.371429
Hadoop 14 0.4

CodeWalkthrough
26
name count support
Algorithms 7 0.2
Java 12 0.342857
Big Data 13 0.371429
Hadoop 14 0.4
Algorithms Hadoop
Software
Development Distributed Systems
Hadoop
Distributed
Systems Big Data Distributed Systems
Java Hadoop
Software
Engineering Distributed Systems
Software
Development Hadoop Distributed Systems Machine Learning
Hadoop Big Data
Software
Development Java
Hadoop
Software
Engineering Java Big Data
Hadoop
Machine
Learning Java Software Engineering
Algorithms
Distributed
Systems Java Machine Learning
Java Algorithms
Software
Development Big Data
Software
Development Algorithms
Software
Development Software Engineering
Algorithms Big Data
Software
Development Machine Learning
Algorithms
Software
Engineering
Software
Engineering Big Data
Algorithms
Machine
Learning Big Data Machine Learning
Java
Distributed
Systems
Software
Engineering Machine Learning

CodeWalkthrough
27
name count support
Algorithms 7 0.2
Java 12 0.342857
Big Data 13 0.371429
Hadoop 14 0.4
Algorithms Hadoop 3
Software
Development Distributed Systems 4
Hadoop
Distributed
Systems 10 Big Data Distributed Systems 7
Java Hadoop 8
Software
Engineering Distributed Systems 3
Software
Development Hadoop 4 Distributed Systems Machine Learning 0
Hadoop Big Data 8
Software
Development Java 4
Hadoop
Software
Engineering 2 Java Big Data 5
Hadoop
Machine
Learning 1 Java Software Engineering 3
Algorithms
Distributed
Systems 4 Java Machine Learning 1
Java Algorithms 4
Software
Development Big Data 4
Software
Development Algorithms 3
Software
Development Software Engineering 5
Algorithms Big Data 2
Software
Development Machine Learning 0
Algorithms
Software
Engineering 3
Software
Engineering Big Data 2
Algorithms
Machine
Learning 2 Big Data Machine Learning 2
Java
Distributed
Systems 8
Software
Engineering Machine Learning 0

CodeWalkthrough
28
name count support
Algorithms 7 0.2
Java 12 0.342857
Big Data 13 0.371429
Hadoop 14 0.4
Algorithms Hadoop 3
Software
Development Distributed Systems 4
Hadoop
Distributed
Systems 10 Big Data Distributed Systems 7
Java Hadoop 8
Software
Engineering Distributed Systems 3
Software
Development Hadoop 4 Distributed Systems Machine Learning 0
Hadoop Big Data 8
Software
Development Java 4
Hadoop
Software
Engineering 2 Java Big Data 5
Hadoop
Machine
Learning 1 Java Software Engineering 3
Algorithms
Distributed
Systems 4 Java Machine Learning 1
Java Algorithms 4
Software
Development Big Data 4
Software
Development Algorithms 3
Software
Development Software Engineering 5
Algorithms Big Data 2
Software
Development Machine Learning 0
Algorithms
Software
Engineering 3
Software
Engineering Big Data 2
Algorithms
Machine
Learning 2 Big Data Machine Learning 2
Java
Distributed
Systems 8
Software
Engineering Machine Learning 0

CodeWalkthrough
29
name count support
Hadoop, Distributed Systems 10 .35
Java, Hadoop 8 0.22857
Hadoop, Big Data 8 0.22857
Java, Distributed Systems 8 0.22857
Big Data, Distributed Systems 7 0.2
Hadoop Distributed Systems Java 7 0.2
Hadoop Distributed Systems Big Data 7 0.2

CodeWalkthrough
30
name count support
Hadoop, Distributed Systems, Java 7 0.2
Hadoop, Distributed Systems, Big Data 7 0.2
Hadoop
Distributed Systems
Java
Big Data
1. Alan Gates
2. Ashish Singh
3. Jonathan Gray
4. Michael Stack
5. Vinayak Borkar

CodeWalkthrough
31
Hadoop
Distributed Systems
Java
Big Data
1. Alan Gates
2. Ashish Singh
3. Jonathan Gray
4. Michael Stack
5. Vinayak Borkar

CodeWalkthrough
32
Hadoop 0.4
Algorithms 0.2
Distributed Systems 0.314286
Java 0.342857
Software Development 0.257143
Big Data 0.371429
Software Engineering 0.2
Machine Learning 0.2
['Big Data', 'Hadoop'] 0.228571
['Distributed Systems', 'Hadoop'] 0.285714
['Distributed Systems', 'Java'] 0.228571
['Hadoop', 'Java'] 0.228571
['Big Data', 'Distributed Systems'] 0.2
['Big Data', 'Distributed Systems', 'Hadoop'] 0.2
['Distributed Systems', 'Hadoop', 'Java'] 0.2

Computational Commentary
33
• Outer loop should
(presumably) be a small
number of iterations
• Be careful selecting your
minimum!
• Maybe put a max iterations?

34
• |t| is constant, and large;
this step must be carefully
considered!

35
• This can be the “map” step
• Pseudo code a bit unclear
here
• Could be highly optimized
• Can run in O(n) time with
pre-built hash tables

36
• The “reduce” step
• Fast step in practice, but can
also be optimized

Performance and Sensitivity
on Big Data Day LA 2015 Speakers dataset
37

Recipes - Single Itemsets
40
garlic onion parsley
all purpose flour salt vanilla extract
canola oil chicken broth onion
all-purpose flour almond extract brown sugar
baking powder butter softened cinnamon
all-purpose flour baking powder sugar
brown sugar milk sugar
cilantro olive oil red onion
all purpose flour butter softened sugar
bay leaves oregano parmesan cheese
ginger soba noodles toasted pine nuts

Los Angeles 311 Data
41
Blocked Driveways Bulky Item Pick-up
Holiday Trash Collection Internal Affairs Group - LAPD
Report Broken Parking Meters Abandoned Vehicles
Complaint - LAPD (How to Make
a Complaint) Bulky Item Pick-up
Animal Service Centers Report streetlight outages
Police Auctions Blocked Driveways
Sprinklers Running at Parks Bulky Item Pick-up
Graffiti Removal - Community
Beautification
877 ASK-LAPD - Non-emergency
Police Service
LADWP Central Operator Constituent Service Office of the Mayor

Frequent itemset mining in games
42
• Anders Drachen has written about Apriori applications in gaming
• http://bit.ly/1Fi8vHu

Block World
43
• TODO: Add this one

Recommender System Example
44
• TODO: add this one

Online Feature Discovery in
Relational Reinforcement Learning (2006)
45
Presented at the ICML Workshop on Open Problems in Statistical Relational Learning,
Pittsburgh, PA, 2006
Scott Sanner, University ofToronto
• Reinforcement learning
• Used to identify for focusing on frequently visited areas of the state
space when doing structure learning

A Novel Modified Apriori Approach for
Web Document Clustering (2015)
46
Computational Intelligence in Data Mining-Volume 3, 159-171, 2015
Roul,Varshneya, Kalra, Sahay
• Keywords / ngrams as items; documents as itemsets
• Centroid describes topic / theme of pages
• Decrease candidate itemsets during candidate generation
• Only consider itemsets in a specific iteration
• Some code optimizations around unnecessary steps

Apache Hive Implementation
48
CREATE EXTERNAL TABLE apriori_transactions
(transaction string, item string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'
LOCATION '/mnt/hive/sandbox/apriori/data';
CREATE EXTERNAL TABLE apriori_itemsets
(itemset string, cardinality int, occurances int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'
LOCATION '/mnt/hive/sandbox/apriori/itemsets';
SELECT itemset, occurances
FROM apriori_itemsets
WHERE cardinality = ?

Apache Hive Implementation
49
• TODO: provide the full example

Repeated database table scans
51
• Distributed solutions can solve this on large
datasets
• In-memory analysis can solve for small

Fails to observe rare but important matches
52
• Described as “weak” associative rules
• Example fromThe Elements of Statistical
Learning by Hastie,Tibshirani, and Friedman
is “caviar” and “wine”
• Adaptations of the algorithm could address
this

Lacks Personalization
53
• True, but this is not an objective

Great for Ensembling
55
• Quick and dirty unsupervised analysis
• Get initial glimpse into a new dataset
• Feed results into other approaches

Optimize forYour Use Case
56
• TODO: Hive trick
• Find efficient data structure to capture your
transactions

Market Basket / Affinity Analysis
57
Purpose
• Identify cross-selling / up-selling opportunities
• Shelf / aisle placement optimization
The Apriori Algorithm…
• provides an easy, fast, first look
• is useful in creating a feature label variable
called “has common itemset”
• turns out great results in ensemble
approaches

58
The Apriori Algorithm is worth your time.
• Informative when studied
• Unsupervised, great starting point
• Extendable
• Great as an ensemble approach
CONCLUSION

Thank you.
@DataSkeptic http://linkd.in/1IkLy8N
kyle@datascience.com

Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by Kyle Polich of DataScience

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by Kyle Polich of DataScience

Similar to Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by Kyle Polich of DataScience (20)

More from Data Con LA

More from Data Con LA (20)

Recently uploaded

Recently uploaded (20)

Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by Kyle Polich of DataScience

Editor's Notes