Applications of the Apriori Algorithm
on Open Data
Who am I?
2
• I'm Kyle Polich
• I work at DataScience
• I host The Data Skeptic Podcast
• I’m excited to share some ideas about data
mining framed around the Apriori Algorithm
• And examples on open data you can
reproduce
Outline
3
• What is Association Mining?
• The Apriori Algorithm
• Examples
• Big Data
• Criticisms
• Tips and Tricks
General Concept
4
• Unsupervised Learning
• Association rule learning: (A and B), (A and B and C)
• If N items, then 2^N - 1 itemsets (power set without the empty set)
• Common itemsets are made up of common
sub-itemsets
• Iteratively build candidates based on frequency
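A minimal sketch of the two ideas on this slide: the itemset count grows as 2^N - 1, and an itemset can never be more common than any of its sub-itemsets. The item names and transactions are hypothetical toy data, not taken from the talk.

```python
from itertools import combinations


def all_itemsets(items):
    """Yield every non-empty subset of `items` (power set minus the empty set)."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


# Hypothetical toy transactions, for illustration only.
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B"}),
]

itemsets = list(all_itemsets({"A", "B", "C"}))
num_itemsets = len(itemsets)  # 2**3 - 1 for three items

# A superset is never more frequent than any of its subsets.
closure_holds = all(
    support_count(big, transactions) <= support_count(small, transactions)
    for big in itemsets
    for small in itemsets
    if small < big
)
```

This is why building candidates level by level works: once a small itemset falls below the threshold, nothing built on top of it needs to be counted.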
Isn’t this a dead algorithm?
5
?!
Isn’t this a dead algorithm?
6
Well, the apriori algorithm might be outdated
but a) this page is about that algorithm! and
b) not necessary to state,
but it is the first significant algorithm, and
the basic idea is used again and again in
several succeeding algorithms
so it is important to understand it. Exa 18:33,
16 May 2007 (UTC)
Excerpt from Wikipedia talk page
By user 81.104.165.184
Isn’t this a dead algorithm?
8
C4.5
Apriori algorithm
Hyperloglog
Isn’t this a dead algorithm?
9
Google Scholar tracks 18,286
citations
TODO: visualize this as a time series
Isn’t this a dead algorithm?
10
1. Easy to learn in a 30-minute session
2. Always start simple, and grow in complexity
3. Simple, but still powerful
4. Practical to implement
5. Runs well at scale
6. Good study of algorithmic design
7. I believe it’s a useful algorithm
Origin / Creators
11
Fast Algorithms for Mining Association Rules
Rakesh Agrawal & Ramakrishnan Srikant
IBM Almaden Research Center
20th International Conference on Very Large Data Bases
Santiago, Chile - September 1994
http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf
Key Concept: Associative Rules
12
• “Peanut Butter” AND “Jelly”
• “Sausage” AND “mustard” AND “deli roll”
• “Good schools” AND “easy parking” AND
“walk to restaurants”
Pseudocode
13
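The pseudocode figures are not captured in this transcript, so here is a rough Python sketch of the level-wise idea instead. It is not a transcription of the paper's pseudocode: the join step is simplified to "union of two frequent (k-1)-itemsets that has size k", the function name `apriori` and the basket data are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support.

    Level-wise sketch: frequent (k-1)-itemsets are joined into k-candidates,
    candidates with an infrequent (k-1)-subset are pruned, and the survivors
    are counted against the transactions.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_only(candidates):
        # Count each candidate, then keep those at or above the threshold.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    singles = {frozenset([item]) for t in transactions for item in t}
    current = frequent_only(singles)
    result = dict(current)
    k = 1
    while current:
        k += 1
        prev = set(current)
        # Join step (simplified): unions of two frequent itemsets of size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = frequent_only(candidates)
        result.update(current)
    return result


# Hypothetical toy baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
frequent = apriori(baskets, min_support=0.5)
```

The outer `while` loop ends as soon as a level produces no frequent itemsets, which is the stopping condition the later commentary slides refer to.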
Toy Example
21
Metrics
22
Support
% of cases containing itemset
R and Machine Learning (5)
Benjamin Uminsky
Gian Gonzanga
Jim Mcguire
Kyle Polich
Szilard Pafka
Everyone (35)
Aaron Wepler, Abhi Nemani, Adam Mollenkopf, Alan Gates, Amelia
Mcnamara, Arvind Prabhakar, Ashish Singh, Benjamin Uminsky, Bikas Saha,
Brian Kursar, Chris Fregly, Felix Chern, Gian Gonzanga, Hyunsik Choi, Jeff
Morris, Jim Mcguire, John De Goes, Jonathan Gray, Josiah Carlson, Karen
Lopez, Khanderao Kand, Kyle Polich, Michael Limcaco, Michael Stack,
Rachel Pedreschi, Raj Babu, Romain Rigaux, Sabri Sansoy, Szilard Pafka, Tim
Ellis, Tim Fulmer, Ulas Bardak, Vinayak Borkar, Will Ochandarena, Zain Asgar
5 / 35 = .14286
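The same arithmetic as a small sketch. Only the counts come from the slide (5 of 35 speakers list both topics); the other 30 speakers are reduced to empty placeholder sets, and the function name `support` is an assumption.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


# 5 speakers list both topics; the remaining 30 are placeholders.
transactions = (
    [frozenset({"R", "Machine Learning"})] * 5
    + [frozenset()] * 30
)

r_and_ml = support({"R", "Machine Learning"}, transactions)  # 5 / 35
```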
Metrics
23
Confidence
% of cases containing the antecedent that also contain the consequent
R (6)
Amelia Mcnamara, Benjamin Uminsky, Gian Gonzanga, Jim
Mcguire, Kyle Polich, Szilard Pafka
Machine Learning (7)
Benjamin Uminsky, Brian Kursar, Gian Gonzanga, Jim
Mcguire, Kyle Polich, Szilard Pafka, Ulas Bardak
R -> Machine Learning
5 / 6 = .83333
Machine Learning -> R
5 / 7 = .71429
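A short sketch of confidence under the standard definition, confidence(A -> B) = count(A and B) / count(A), so the direction of the rule decides the denominator. The counts come from this slide (6 speakers list R, 7 list Machine Learning, 5 list both, 35 in total); the function name and the placeholder sets are assumptions.

```python
def confidence(antecedent, consequent, transactions):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_antecedent


transactions = (
    [frozenset({"R", "Machine Learning"})] * 5   # both topics
    + [frozenset({"R"})] * 1                      # R only
    + [frozenset({"Machine Learning"})] * 2       # Machine Learning only
    + [frozenset()] * 27                          # neither (placeholders)
)

conf_r_to_ml = confidence({"R"}, {"Machine Learning"}, transactions)  # 5 / 6
conf_ml_to_r = confidence({"Machine Learning"}, {"R"}, transactions)  # 5 / 7
```

Dividing by the 6 R speakers gives R -> Machine Learning; dividing by the 7 Machine Learning speakers gives Machine Learning -> R.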
Code Walkthrough
24
Let minimum support = .19
name                  count  support
Algorithms            7      0.2
Machine Learning      7      0.2
Software Engineering  7      0.2
Software Development  9      0.257143
Distributed Systems   11     0.314286
Java                  12     0.342857
Big Data              13     0.371429
Hadoop                14     0.4
Code Walkthrough
25
Let minimum support = .19; k=2
name                                        count
Hadoop, Distributed Systems                 10
Java, Hadoop                                8
Hadoop, Big Data                            8
Java, Distributed Systems                   8
Big Data, Distributed Systems               7
Java, Big Data                              5
Software Development, Software Engineering  5
Software Development, Distributed Systems   4
Software Development, Hadoop                4
Software Development, Java                  4
Algorithms, Distributed Systems             4
Java, Algorithms                            4
Software Development, Big Data              4
Algorithms, Hadoop                          3
Software Engineering, Distributed Systems   3
Java, Software Engineering                  3
Software Development, Algorithms            3
Algorithms, Software Engineering            3
Hadoop, Software Engineering                2
Algorithms, Big Data                        2
Software Engineering, Big Data              2
Algorithms, Machine Learning                2
Big Data, Machine Learning                  2
Hadoop, Machine Learning                    1
Java, Machine Learning                      1
Distributed Systems, Machine Learning       0
Software Development, Machine Learning      0
Software Engineering, Machine Learning      0
Code Walkthrough
29
Let minimum support = .19; k=3
name                                   count  support
Hadoop, Distributed Systems            10     0.28571
Java, Hadoop                           8      0.22857
Hadoop, Big Data                       8      0.22857
Java, Distributed Systems              8      0.22857
Big Data, Distributed Systems          7      0.2
Hadoop, Distributed Systems, Java      7      0.2
Hadoop, Distributed Systems, Big Data  7      0.2
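The step from pairs to triples can be sketched as follows. The pair counts are copied from the walkthrough (only a subset of the 28 pairs is included to keep it short); the variable names are assumptions, and the join is the simplified "union of two frequent pairs" form rather than the paper's exact procedure.

```python
from itertools import combinations

N_SPEAKERS = 35
MIN_SUPPORT = 0.19

# Pair counts from the walkthrough (subset of the 28 candidate pairs).
pair_counts = {
    frozenset({"Hadoop", "Distributed Systems"}): 10,
    frozenset({"Java", "Hadoop"}): 8,
    frozenset({"Hadoop", "Big Data"}): 8,
    frozenset({"Java", "Distributed Systems"}): 8,
    frozenset({"Big Data", "Distributed Systems"}): 7,
    frozenset({"Java", "Big Data"}): 5,
    frozenset({"Software Development", "Software Engineering"}): 5,
    frozenset({"Algorithms", "Hadoop"}): 3,
}

# Keep pairs whose support (count / 35) meets the minimum.
frequent_pairs = {
    pair for pair, count in pair_counts.items()
    if count / N_SPEAKERS >= MIN_SUPPORT
}

# Join two frequent pairs into a triple, then prune any triple that
# contains a pair which is not frequent.
candidate_triples = set()
for a, b in combinations(frequent_pairs, 2):
    union = a | b
    if len(union) == 3 and all(
        frozenset(sub) in frequent_pairs for sub in combinations(union, 2)
    ):
        candidate_triples.add(union)
```

With a threshold of .19, a pair needs at least 7 of 35 speakers, so (Java, Big Data) at 5 is dropped and every triple containing it is pruned without being counted.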
Code Walkthrough
30
Let minimum support = .19; k=3
name                                   count  support
Hadoop, Distributed Systems, Java      7      0.2
Hadoop, Distributed Systems, Big Data  7      0.2
Code Walkthrough
31
Let minimum support = .19; k=4
Candidate: Hadoop, Distributed Systems, Java, Big Data
Speakers listing all four:
1. Alan Gates
2. Ashish Singh
3. Jonathan Gray
4. Michael Stack
5. Vinayak Borkar
5 / 35 = .14286, below the minimum support, so no 4-itemset is frequent
Code Walkthrough
32
Hadoop 0.4
Algorithms 0.2
Distributed Systems 0.314286
Java 0.342857
Software Development 0.257143
Big Data 0.371429
Software Engineering 0.2
Machine Learning 0.2
['Big Data', 'Hadoop'] 0.228571
['Distributed Systems', 'Hadoop'] 0.285714
['Distributed Systems', 'Java'] 0.228571
['Hadoop', 'Java'] 0.228571
['Big Data', 'Distributed Systems'] 0.2
['Big Data', 'Distributed Systems', 'Hadoop'] 0.2
['Distributed Systems', 'Hadoop', 'Java'] 0.2
Computational Commentary
33
• Outer loop should
(presumably) be a small
number of iterations
• Be careful selecting your
minimum support!
• Maybe cap the number of iterations?
Computational Commentary
34
• |t| is constant, and large;
this step must be carefully
considered!
Computational Commentary
35
• This can be the “map” step
• Pseudo code a bit unclear
here
• Could be highly optimized
• Can run in O(n) time with
pre-built hash tables
Computational Commentary
36
• The “reduce” step
• Fast step in practice, but can
also be optimized
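One way to read the "map" and "reduce" comments, as a single-machine sketch. Candidates live in a hash set so each membership check is a constant-time lookup; the function names and toy data are assumptions, not the code used in the talk. Enumerating the k-subsets of a transaction is only cheap when transactions are small.

```python
from collections import Counter
from itertools import combinations


def map_transaction(transaction, candidates, k):
    """'Map' step: emit each candidate k-itemset found in one transaction."""
    for combo in combinations(sorted(transaction), k):
        itemset = frozenset(combo)
        if itemset in candidates:  # hash-set lookup
            yield itemset


def reduce_counts(emitted, n_transactions, min_support):
    """'Reduce' step: sum the emitted itemsets and keep the frequent ones."""
    counts = Counter(emitted)
    return {
        itemset: count / n_transactions
        for itemset, count in counts.items()
        if count / n_transactions >= min_support
    }


# Hypothetical toy data, for illustration only.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
candidates = {frozenset({"a", "b"}), frozenset({"b", "c"})}

emitted = [s for t in transactions for s in map_transaction(t, candidates, 2)]
frequent = reduce_counts(emitted, len(transactions), min_support=0.5)
```

Each transaction is mapped independently of the others, which is what makes this step easy to distribute.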
Performance and Sensitivity
on Big Data Day LA 2015 Speakers dataset
37
38
Examples.
Recipes - Single Itemsets
39
Recipes - Single Itemsets
40
garlic, onion, parsley
all purpose flour, salt, vanilla extract
canola oil, chicken broth, onion
all-purpose flour, almond extract, brown sugar
baking powder, butter softened, cinnamon
all-purpose flour, baking powder, sugar
brown sugar, milk, sugar
cilantro, olive oil, red onion
all purpose flour, butter softened, sugar
bay leaves, oregano, parmesan cheese
ginger, soba noodles, toasted pine nuts
Los Angeles 311 Data
41
Blocked Driveways | Bulky Item Pick-up
Holiday Trash Collection | Internal Affairs Group - LAPD
Report Broken Parking Meters | Abandoned Vehicles
Complaint - LAPD (How to Make a Complaint) | Bulky Item Pick-up
Animal Service Centers | Report streetlight outages
Police Auctions | Blocked Driveways
Sprinklers Running at Parks | Bulky Item Pick-up
Graffiti Removal - Community Beautification | 877 ASK-LAPD - Non-emergency Police Service
LADWP Central Operator | Constituent Service Office of the Mayor
Frequent itemset mining in games
42
• Anders Drachen has written about Apriori applications in gaming
• http://bit.ly/1Fi8vHu
Block World
43
• TODO: Add this one
Recommender System Example
44
• TODO: add this one
Online Feature Discovery in
Relational Reinforcement Learning (2006)
45
Presented at the ICML Workshop on Open Problems in Statistical Relational Learning,
Pittsburgh, PA, 2006
Scott Sanner, University of Toronto
• Reinforcement learning
• Used to identify frequently visited areas of the state space to focus
on when doing structure learning
A Novel Modified Apriori Approach for
Web Document Clustering (2015)
46
Computational Intelligence in Data Mining-Volume 3, 159-171, 2015
Roul, Varshneya, Kalra, Sahay
• Keywords / ngrams as items; documents as itemsets
• Centroid describes topic / theme of pages
• Decrease candidate itemsets during candidate generation
• Only consider itemsets in a specific iteration
• Some code optimizations around unnecessary steps
47
Big Data.
Apache Hive Implementation
48
-- Input: one row per (transaction, item) pair, tab-delimited
CREATE EXTERNAL TABLE apriori_transactions
(transaction string, item string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/mnt/hive/sandbox/apriori/data';
-- Output: one row per itemset with its size and occurrence count
CREATE EXTERNAL TABLE apriori_itemsets
(itemset string, cardinality int, occurances int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/mnt/hive/sandbox/apriori/itemsets';
-- Itemsets of a given size; ? stands for the current k
SELECT itemset, occurances
FROM apriori_itemsets
WHERE cardinality = ?
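To make the two-table layout concrete, here is a small Python sketch of the same shape of data: long-format (transaction, item) rows grouped into baskets, then rolled up into (itemset, cardinality, occurances) rows. It is not the Hive job itself. Encoding an itemset as a comma-joined string of sorted items is an assumption, since the slide does not say how the `itemset` column is filled, and the function names and rows are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations


def to_baskets(rows):
    """Group (transaction, item) rows into one set of items per transaction."""
    baskets = defaultdict(set)
    for transaction, item in rows:
        baskets[transaction].add(item)
    return dict(baskets)


def itemset_rows(baskets, cardinality):
    """Build (itemset, cardinality, occurances) rows for one itemset size."""
    counts = Counter(
        ",".join(combo)  # assumed encoding of the itemset column
        for items in baskets.values()
        for combo in combinations(sorted(items), cardinality)
    )
    return [(itemset, cardinality, n) for itemset, n in sorted(counts.items())]


# Hypothetical rows in the apriori_transactions layout.
rows = [
    ("t1", "bread"), ("t1", "milk"),
    ("t2", "bread"),
    ("t3", "bread"), ("t3", "milk"),
]

baskets = to_baskets(rows)
singles = itemset_rows(baskets, 1)
pairs = itemset_rows(baskets, 2)
```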
Apache Hive Implementation
49
• TODO: provide the full example
50
Criticism.
Repeated database table scans
51
• Distributed solutions can solve this on large
datasets
• In-memory analysis can solve this on small
datasets
Fails to observe rare but important matches
52
• Described as “weak” associative rules
• Example from The Elements of Statistical
Learning by Hastie, Tibshirani, and Friedman
is “caviar” and “wine”
• Adaptations of the algorithm could address
this
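A toy illustration of the criticism: a rule can be very reliable and still disappear because its itemset is rare. The transaction counts below are invented for illustration and are not from the book or the talk.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions with `antecedent` that also contain `consequent`."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical data: 100 baskets, only 5 of which contain caviar.
transactions = (
    [frozenset({"bread", "milk"})] * 95
    + [frozenset({"caviar", "wine"})] * 4
    + [frozenset({"caviar"})] * 1
)

MIN_SUPPORT = 0.1
pair_support = support({"caviar", "wine"}, transactions)       # 4 / 100
rule_confidence = confidence({"caviar"}, {"wine"}, transactions)  # 4 / 5
survives = pair_support >= MIN_SUPPORT
```

Four out of five caviar buyers also buy wine, but the pair never clears the support threshold, so Apriori as described never reports it.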
Lacks Personalization
53
• True, but this is not an objective
54
Tips and
Tricks.
Great for Ensembling
55
• Quick and dirty unsupervised analysis
• Get initial glimpse into a new dataset
• Feed results into other approaches
Optimize for Your Use Case
56
• TODO: Hive trick
• Find efficient data structure to capture your
transactions
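One possible data structure, offered as an assumption rather than the approach used in the talk: encode each transaction as an integer bitmask so that "does this transaction contain the itemset?" becomes a single bitwise AND. The function names and toy data are hypothetical.

```python
def build_index(transactions):
    """Assign one bit per distinct item and encode each transaction as an int."""
    items = sorted({item for t in transactions for item in t})
    bit = {item: 1 << i for i, item in enumerate(items)}
    # Items within a transaction are distinct, so summing bits equals OR-ing them.
    masks = [sum(bit[item] for item in t) for t in transactions]
    return bit, masks


def count_support(itemset, bit, masks):
    """Count transactions whose mask contains every bit of the itemset."""
    target = sum(bit[item] for item in set(itemset))
    return sum(1 for m in masks if (m & target) == target)


# Hypothetical toy data, for illustration only.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
bit, masks = build_index(transactions)

ab_count = count_support({"a", "b"}, bit, masks)
abc_count = count_support({"a", "b", "c"}, bit, masks)
```

Whether this beats plain sets depends on the number of distinct items and how sparse the transactions are, so it is worth measuring on your own data.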
Market Basket / Affinity Analysis
57
Purpose
• Identify cross-selling / up-selling opportunities
• Shelf / aisle placement optimization
The Apriori Algorithm…
• provides an easy, fast, first look
• is useful in creating a feature label variable
called “has common itemset”
• turns out great results in ensemble
approaches
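The "has common itemset" feature can be sketched as a simple binary flag. The two frequent itemsets are the triples found in the speaker walkthrough; the function name and the example records are assumptions.

```python
def has_common_itemset(record_items, frequent_itemsets):
    """Return 1 if the record contains at least one frequent itemset, else 0."""
    items = frozenset(record_items)
    return int(any(itemset <= items for itemset in frequent_itemsets))


# Frequent triples from the speaker walkthrough.
frequent_itemsets = [
    frozenset({"Hadoop", "Distributed Systems", "Java"}),
    frozenset({"Hadoop", "Distributed Systems", "Big Data"}),
]

# Hypothetical records to label.
flag_a = has_common_itemset(
    {"Hadoop", "Distributed Systems", "Java", "Python"}, frequent_itemsets
)
flag_b = has_common_itemset({"R", "Machine Learning"}, frequent_itemsets)
```

The resulting 0/1 column can be handed to a downstream model alongside other features, which is one way to use Apriori inside an ensemble.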
58
The Apriori Algorithm is worth your time.
• Informative when studied
• Unsupervised, great starting point
• Extendable
• Great as an ensemble approach
CONCLUSION
Thank you.
@DataSkeptic http://linkd.in/1IkLy8N
kyle@datascience.com

 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Recently uploaded

RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 

Recently uploaded (20)

RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 

Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by Kyle Polich of DataScience

  • 1. Applications of the Apriori Algorithm on Open Data
  • 2. Who am I? 2 • I'm Kyle Polich • I work at DataScience • I host The Data Skeptic Podcast • I’m excited to share some ideas about data mining framed around the Apriori Algorithm • And examples on open data you can reproduce
  • 3. Outline 3 • What is Association Mining? • The Apriori Algorithm • Examples • Big Data • Criticisms • Tips and Tricks
  • 4. General Concept 4 • Unsupervised Learning • Association rule learning (A and B) (A and B and C) • If N items, then 2^N - 1 itemsets (power set w/o the empty set) • Common itemsets are made up of common sub-itemsets • Iteratively build candidates based on frequency
  • 5. Isn’t this a dead algorithm? 5 ?!
  • 6. Isn’t this a dead algorithm? 6 Well, the apriori algorithm might be outdated but a) this page is about that algorithm! and b) not necessary to state, but it is the first significant algorithm, and the basic idea is used again and again in several succeeding algorithms so it is important to understand it. Exa 18:33, 16 May 2007 (UTC) Excerpt from Wikipedia talk page By user 81.104.165.184
  • 7. Isn’t this a dead algorithm? 7
  • 8. Isn’t this a dead algorithm? 8 C4.5 Apriori algorithm Hyperloglog
  • 9. Isn’t this a dead algorithm? 9 Google Scholar tracks 18,286 citations TODO: visualize this as a time series
  • 10. Isn’t this a dead algorithm? 10 1. Easy to learn in a 30 minute session 2. Always start simple, and grow in complexity 3. Simple, but still powerful 4. Practical to implement 5. Runs well at scale 6. Good study of algorithmic design 7. I believe it’s a useful algorithm
  • 11. Origin / Creators 11 Fast Algorithms for Mining Association Rules Rakesh Agrawal & Ramakrishnan Srikant IBM Almaden Research Center 20th International Conference on Very Large Data Bases Santiago, Chile - September 1994 http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf
  • 12. Key Concept: Associative Rules 12 • “Peanut Butter” AND “Jelly” • “Sausage” AND “mustard” AND “deli roll” • “Good schools” AND “easy parking” AND “walk to restaurants”
  • 22. Metrics 22 Support % of cases containing itemset R and Machine Learning (5) Benjamin Uminsky, Gian Gonzanga, Jim Mcguire, Kyle Polich, Szilard Pafka Everyone (35) Aaron Wepler, Abhi Nemani, Adam Mollenkopf, Alan Gates, Amelia Mcnamara, Arvind Prabhakar, Ashish Singh, Benjamin Uminsky, Bikas Saha, Brian Kursar, Chris Fregly, Felix Chern, Gian Gonzanga, Hyunsik Choi, Jeff Morris, Jim Mcguire, John De Goes, Jonathan Gray, Josiah Carlson, Karen Lopez, Khanderao Kand, Kyle Polich, Michael Limcaco, Michael Stack, Rachel Pedreschi, Raj Babu, Romain Rigaux, Sabri Sansoy, Szilard Pafka, Tim Ellis, Tim Fulmer, Ulas Bardak, Vinayak Borkar, Will Ochandarena, Zain Asgar 5 / 35 = .14286
  • 23. Metrics 23 Confidence % of cases containing the rule's left-hand side that also contain its right-hand side R (6) Amelia Mcnamara, Benjamin Uminsky, Gian Gonzanga, Jim Mcguire, Kyle Polich, Szilard Pafka Machine Learning (7) Benjamin Uminsky, Brian Kursar, Gian Gonzanga, Jim Mcguire, Kyle Polich, Szilard Pafka, Ulas Bardak Machine Learning -> R 5 / 7 = .714286 (R -> Machine Learning would be 5 / 6 = .833333)
  • 24. CodeWalkthrough 24 Let minimum support = .19 name count support Algorithms 7 0.2 Machine Learning 7 0.2 Software Engineering 7 0.2 Software Development 9 0.257143 Distributed Systems 11 0.314286 Java 12 0.342857 Big Data 13 0.371429 Hadoop 14 0.4
  • 25. CodeWalkthrough 25 Let minimum support = .19; k=2 name count support Algorithms 7 0.2 Machine Learning 7 0.2 Software Engineering 7 0.2 Software Development 9 0.257143 Distributed Systems 11 0.314286 Java 12 0.342857 Big Data 13 0.371429 Hadoop 14 0.4
  • 26. CodeWalkthrough 26 Let minimum support = .19; k=2 name count support Algorithms 7 0.2 Machine Learning 7 0.2 Software Engineering 7 0.2 Software Development 9 0.257143 Distributed Systems 11 0.314286 Java 12 0.342857 Big Data 13 0.371429 Hadoop 14 0.4 Algorithms Hadoop Software Development Distributed Systems Hadoop Distributed Systems Big Data Distributed Systems Java Hadoop Software Engineering Distributed Systems Software Development Hadoop Distributed Systems Machine Learning Hadoop Big Data Software Development Java Hadoop Software Engineering Java Big Data Hadoop Machine Learning Java Software Engineering Algorithms Distributed Systems Java Machine Learning Java Algorithms Software Development Big Data Software Development Algorithms Software Development Software Engineering Algorithms Big Data Software Development Machine Learning Algorithms Software Engineering Software Engineering Big Data Algorithms Machine Learning Big Data Machine Learning Java Distributed Systems Software Engineering Machine Learning
  • 27. CodeWalkthrough 27 Let minimum support = .19; k=2 name count support Algorithms 7 0.2 Machine Learning 7 0.2 Software Engineering 7 0.2 Software Development 9 0.257143 Distributed Systems 11 0.314286 Java 12 0.342857 Big Data 13 0.371429 Hadoop 14 0.4 Algorithms Hadoop 3 Software Development Distributed Systems 4 Hadoop Distributed Systems 10 Big Data Distributed Systems 7 Java Hadoop 8 Software Engineering Distributed Systems 3 Software Development Hadoop 4 Distributed Systems Machine Learning 0 Hadoop Big Data 8 Software Development Java 4 Hadoop Software Engineering 2 Java Big Data 5 Hadoop Machine Learning 1 Java Software Engineering 3 Algorithms Distributed Systems 4 Java Machine Learning 1 Java Algorithms 4 Software Development Big Data 4 Software Development Algorithms 3 Software Development Software Engineering 5 Algorithms Big Data 2 Software Development Machine Learning 0 Algorithms Software Engineering 3 Software Engineering Big Data 2 Algorithms Machine Learning 2 Big Data Machine Learning 2 Java Distributed Systems 8 Software Engineering Machine Learning 0
  • 28. CodeWalkthrough 28 Let minimum support = .19; k=2 name count support Algorithms 7 0.2 Machine Learning 7 0.2 Software Engineering 7 0.2 Software Development 9 0.257143 Distributed Systems 11 0.314286 Java 12 0.342857 Big Data 13 0.371429 Hadoop 14 0.4 Algorithms Hadoop 3 Software Development Distributed Systems 4 Hadoop Distributed Systems 10 Big Data Distributed Systems 7 Java Hadoop 8 Software Engineering Distributed Systems 3 Software Development Hadoop 4 Distributed Systems Machine Learning 0 Hadoop Big Data 8 Software Development Java 4 Hadoop Software Engineering 2 Java Big Data 5 Hadoop Machine Learning 1 Java Software Engineering 3 Algorithms Distributed Systems 4 Java Machine Learning 1 Java Algorithms 4 Software Development Big Data 4 Software Development Algorithms 3 Software Development Software Engineering 5 Algorithms Big Data 2 Software Development Machine Learning 0 Algorithms Software Engineering 3 Software Engineering Big Data 2 Algorithms Machine Learning 2 Big Data Machine Learning 2 Java Distributed Systems 8 Software Engineering Machine Learning 0
  • 29. CodeWalkthrough 29 Let minimum support = .19; k=3 name count support Hadoop, Distributed Systems 10 0.285714 Java, Hadoop 8 0.228571 Hadoop, Big Data 8 0.228571 Java, Distributed Systems 8 0.228571 Big Data, Distributed Systems 7 0.2 Hadoop, Distributed Systems, Java 7 0.2 Hadoop, Distributed Systems, Big Data 7 0.2
  • 30. CodeWalkthrough 30 Let minimum support = .19; k=3 name count support Hadoop, Distributed Systems, Java 7 0.2 Hadoop, Distributed Systems, Big Data 7 0.2 Hadoop Distributed Systems Java Big Data 1. Alan Gates 2. Ashish Singh 3. Jonathan Gray 4. Michael Stack 5. Vinayak Borkar
  • 31. CodeWalkthrough 31 Let minimum support = .19; k=4 Hadoop Distributed Systems Java Big Data 1. Alan Gates 2. Ashish Singh 3. Jonathan Gray 4. Michael Stack 5. Vinayak Borkar
  • 32. CodeWalkthrough 32 Hadoop 0.4 Algorithms 0.2 Distributed Systems 0.314286 Java 0.342857 Software Development 0.257143 Big Data 0.371429 Software Engineering 0.2 Machine Learning 0.2 ['Big Data', 'Hadoop'] 0.228571 ['Distributed Systems', 'Hadoop'] 0.285714 ['Distributed Systems', 'Java'] 0.228571 ['Hadoop', 'Java'] 0.228571 ['Big Data', 'Distributed Systems'] 0.2 ['Big Data', 'Distributed Systems', 'Hadoop'] 0.2 ['Distributed Systems', 'Hadoop', 'Java'] 0.2
  • 33. Computational Commentary 33 • Outer loop should (presumably) be a small number of iterations • Be careful selecting your minimum! • Maybe put a max iterations?
  • 34. Computational Commentary 34 • |t| is constant, and large; this step must be carefully considered!
  • 35. Computational Commentary 35 • This can be the “map” step • Pseudo code a bit unclear here • Could be highly optimized • Can run in O(n) time with pre-built hash tables
  • 36. Computational Commentary 36 • The “reduce” step • Fast step in practice, but can also be optimized
  • 37. Performance and Sensitivity on Big Data Day LA 2015 Speakers dataset 37
  • 39. Recipes - Single Itemsets 39
  • 40. Recipes - Single Itemsets 40 garlic onion parsley all purpose flour salt vanilla extract canola oil chicken broth onion all-purpose flour almond extract brown sugar baking powder butter softened cinnamon all-purpose flour baking powder sugar brown sugar milk sugar cilantro olive oil red onion all purpose flour butter softened sugar bay leaves oregano parmesan cheese ginger soba noodles toasted pine nuts
  • 41. Los Angeles 311 Data 41 Blocked Driveways Bulky Item Pick-up Holiday Trash Collection Internal Affairs Group - LAPD Report Broken Parking Meters Abandoned Vehicles Complaint - LAPD (How to Make a Complaint) Bulky Item Pick-up Animal Service Centers Report streetlight outages Police Auctions Blocked Driveways Sprinklers Running at Parks Bulky Item Pick-up Graffiti Removal - Community Beautification 877 ASK-LAPD - Non-emergency Police Service LADWP Central Operator Constituent Service Office of the Mayor
  • 42. Frequent itemset mining in games 42 • Anders Drachen has written about Apriori applications in gaming • http://bit.ly/1Fi8vHu
  • 43. Block World 43 • TODO: Add this one
  • 44. Recommender System Example 44 • TODO: add this one
  • 45. Online Feature Discovery in Relational Reinforcement Learning (2006) 45 Presented at the ICML Workshop on Open Problems in Statistical Relational Learning, Pittsburgh, PA, 2006 Scott Sanner, University of Toronto • Reinforcement learning • Used to identify frequently visited areas of the state space to focus on when doing structure learning
  • 46. A Novel Modified Apriori Approach for Web Document Clustering (2015) 46 Computational Intelligence in Data Mining-Volume 3, 159-171, 2015 Roul, Varshneya, Kalra, Sahay • Keywords / ngrams as items; documents as itemsets • Centroid describes topic / theme of pages • Decrease candidate itemsets during candidate generation • Only consider itemsets in a specific iteration • Some code optimizations around unnecessary steps
  • 48. Apache Hive Implementation 48 CREATE EXTERNAL TABLE apriori_transactions (transaction string, item string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/mnt/hive/sandbox/apriori/data'; CREATE EXTERNAL TABLE apriori_itemsets (itemset string, cardinality int, occurances int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/mnt/hive/sandbox/apriori/itemsets'; SELECT itemset, occurances FROM apriori_itemsets WHERE cardinality = ?
  • 49. Apache Hive Implementation 49 • TODO: provide the full example
  • 51. Repeated database table scans 51 • Distributed solutions can solve this on large datasets • In-memory analysis can solve for small
  • 52. Fails to observe rare but important matches 52 • Described as “weak” associative rules • Example from The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman is “caviar” and “wine” • Adaptations of the algorithm could address this
  • 53. Lacks Personalization 53 • True, but this is not an objective
  • 55. Great for Ensembling 55 • Quick and dirty unsupervised analysis • Get initial glimpse into a new dataset • Feed results into other approaches
  • 56. Optimize for Your Use Case 56 • TODO: Hive trick • Find efficient data structure to capture your transactions
  • 57. Market Basket / Affinity Analysis 57 Purpose • Identify cross-selling / up-selling opportunities • Shelf / aisle placement optimization The Apriori Algorithm… • provides an easy, fast, first look • is useful in creating a feature label variable called “has common itemset” • turns out great results in ensemble approaches
  • 58. 58 The Apriori Algorithm is worth your time. • Informative when studied • Unsupervised, great starting point • Extendable • Great as an ensemble approach CONCLUSION
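
The Metrics and Code Walkthrough slides above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the talk: the function names `apriori` and `confidence` and the toy skill lists are assumptions made for the example, and the join and prune steps follow the general description of the algorithm rather than any particular implementation.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # k = 1: keep the single items that meet the minimum support.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join: merge pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # Count: scan the transactions and keep candidates that pass.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of antecedent -> consequent from a support table."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Hypothetical toy data: one set of skills per speaker.
transactions = [
    {"Hadoop", "Java", "Big Data"},
    {"Hadoop", "Java"},
    {"Hadoop", "Big Data"},
    {"Java", "R"},
    {"R", "Machine Learning"},
]

if __name__ == "__main__":
    frequent = apriori(transactions, min_support=0.4)
    for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), s)
    print(confidence(frequent, {"Hadoop"}, {"Java"}))
```

The loop ends as soon as an iteration yields no frequent itemsets, which mirrors the walkthrough stopping at k=4. Each call to `support` rescans every transaction, which is the repeated-scan cost raised in the Criticisms slides.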

Editor's Notes

  1. K-means. Cutting down all comparisons
  2. Talk page
  3. Google Trend shows reasonable interest, even today
  4. Holding better than C4.5, more interesting than hyperloglog
  5. 2 – point in right direction 6 – we need to study more, digital red lining
  6. I will go step by step through this, the subtleties are important
  7. Gets all potential itemsets based on the previous iteration. Assume itemsets made up of common item subsets
  8. Originally database. I use in-memory hash tables
  9. Very expensive looping over T – database scan
  10. Pulled speakers skills from linkedin
  11. R and Machine Learning
  12. Initialize all 1-element itemsets – too many to show here; set .19 as the support parameter
  13. Set k=2, check L1, start
  14. Apriori-gen step generates all possible rules based on the previous rules. Given what is in upper right, all pairs
  15. Here are all the counts
  16. Filter out those below our minimum support
  17. Do the next iteration of k
  18. Only 5 people have the available combination of popular skills. Not enough for minimum support…
  19. Thus, loop is done
  20. Our final results
  21. Few iterations
  22. t \in T is a database call in the original iteration; fine because you should have a small number of iterations
  23. I pre-calculate a hash table mapping 1-itemsets to a hash of the transactions that contain them. Thus n = k
  24. I pre-calculate a hash table mapping 1-itemsets to a hash of the transactions that contain them. Thus n = k
  25. Trade off, not smooth because small data
  26. You’ll notice my dataset isn’t perfectly clean. I could have cleaned more, but I like to leave some dirt to measure the resilience and to measure the iterative improvement.
  27. You’ll notice my dataset isn’t perfectly clean. I could have cleaned more, but I like to leave some dirt to measure the resilience and to measure the iterative improvement. Also, some of these are interesting, some are not.
  28. Comment on their work with only one trip to the database
  29. Also, Tristan’s suggestion
  30. Also, Tristan’s suggestion
  31. Also, Tristan’s suggestion
  32. Most baskets are lognormal – how do you get to the interesting stuff? Focus on ensembling
  33. Simple is not the same thing as bad
  34. Next time example unsupervised
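
Notes 23 and 24 mention pre-calculating a hash table from 1-itemsets to the transactions that contain them. One way this might look is sketched below. The helper names `build_index` and `support_from_index` and the sample data are hypothetical, and the speaker's actual data structure may differ.

```python
from collections import defaultdict


def build_index(transactions):
    """Map each item to the set of transaction ids that contain it."""
    index = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            index[item].add(tid)
    return index


def support_from_index(index, itemset, n_transactions):
    """Support of an itemset via set intersection, with no transaction scan."""
    tid_sets = [index.get(item, set()) for item in itemset]
    if not tid_sets:
        return 0.0
    # Transactions containing every item are the intersection of the id sets.
    common = set.intersection(*tid_sets)
    return len(common) / n_transactions


# Hypothetical sample data.
sample = [
    {"Hadoop", "Java"},
    {"Hadoop", "Big Data"},
    {"Hadoop", "Java", "Big Data"},
    {"R"},
]
sample_index = build_index(sample)

if __name__ == "__main__":
    print(support_from_index(sample_index, {"Hadoop", "Java"}, len(sample)))
```

With the index built once, counting a k-item candidate costs k hash lookups and a set intersection instead of a pass over every transaction, which is one reading of the "n = k" remark in the notes.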