Massively Distributed Environments and Closed
Itemset Mining: the DCIM Approach
Mehdi Zitouni & Reza Akbarinia & Sadok Ben Yahia & Florent Masseglia
Mehdi.Zitouni@inria.fr
CAiSE 2017, June 16th 2017, Essen, Germany
Plan
1 • Knowledge discovery in big data
2 • DCIM approach for CFI mining in big data
3 • Experimental results
4 • Conclusion
Big data mining
• Advances in hardware and software technologies : Internet, social
networks, smart phones, etc.
• Big data mining : multiple forms of knowledge
• Pattern recognition, statistics, databases, linguistics and visualization
Knowledge discovery
Big data mining
• A class of useful patterns : frequent itemsets.
• Frequency of elements in a database : behavior of employees in companies, behavior of customers in stores, etc.
• When the data volume grows, the number of frequent itemsets grows too !
• Closed frequent itemsets (CFIs) : a condensed representation of the frequent patterns that gives the same results
Preliminary Notions : CFI
• Itemset support : the number of transactions containing the itemset
• Frequent itemset : an itemset whose support is ≥ a threshold σ specified by the user
• Closed frequent itemset (CFI) : an itemset that is frequent and closed (no proper superset has the same support count); the CFIs form a condensed representation of the frequent itemsets
Example : having σ = 2
• A, B, C, E : items
• ABC, BCE, … : itemsets
T id Itemset
1 A C
2 B C E
3 A B C E
4 B E
5 A B C E
[Figure: lattice of itemsets annotated with their supports in the table above: ABC, ABE, ACE and ABCE each have support 2; BCE has support 3]
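The definitions of support, frequent itemset and closed frequent itemset can be checked on the five example transactions with a small brute-force sketch. This is not the DCIM algorithm, only a direct translation of the definitions; the names `support` and `closed_frequent_itemsets` are hypothetical, and enumerating every item combination is only practical for toy data.

```python
from itertools import combinations

# The five transactions of the example table (sigma = 2 on the slide).
TRANSACTIONS = [
    {"A", "C"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def closed_frequent_itemsets(transactions, sigma):
    """Brute force: enumerate all itemsets, keep the frequent and closed ones."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(set(combo), transactions)
            if s >= sigma:
                frequent[frozenset(combo)] = s
    closed = {}
    for itemset, s in frequent.items():
        # Closed: no proper superset has the same support. Such a superset
        # would itself be frequent, so searching `frequent` is enough.
        if not any(itemset < other and s == s2 for other, s2 in frequent.items()):
            closed[itemset] = s
    return closed


cfis = closed_frequent_itemsets(TRANSACTIONS, sigma=2)
```

On this data the sketch keeps five itemsets: C, AC, BE, BCE and ABCE.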
Preliminary Notions : MapReduce
• Distributed data processing platform by Google [1],
• Available as open source in Apache Hadoop.
• Programming model based on key-value pairs : map and reduce functions
[1] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
Example : Word Count
Input : A, A, B, B, B, C, C, B, A
Splits : (A, A, B) (B, B, C) (C, B, A)
Map phase : (A,1) (A,1) (B,1) | (B,1) (B,1) (C,1) | (C,1) (B,1) (A,1)
Shuffle : A → (1, 1, 1) ; B → (1, 1, 1, 1) ; C → (1, 1)
Reduce phase : (A, 3) (B, 4) (C, 2)
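The word-count example can be simulated in a few lines of plain Python. This is a single-process sketch of the map, shuffle and reduce steps, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(split):
    """Emit one (word, 1) pair per word of an input split."""
    return [(word, 1) for word in split]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key."""
    return {key: sum(values) for key, values in grouped.items()}


# The three input splits of the slide.
splits = [["A", "A", "B"], ["B", "B", "C"], ["C", "B", "A"]]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(pairs))
```

For the nine input letters of the slide this counts three A's, four B's and two C's.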
DCIM algorithm
• Three steps :
1. Splitting : splits the dataset into multiple successive parts
2. Job 1 : Frequency counting : a first pass over the dataset counts the support of each item and prunes the non-frequent ones
3. Job 2 : CFI Mining : mines the CFIs using a prime-number-based approach
• Prime-number-based approach : a data model that avoids string operations, which are very costly in terms of communication and execution time.
Example : membership test
X ; 2
Y ; 3
Z ; 5
Is X ⊂ X Y ?
X Y ; 2 × 3 = 6
If (6 % 2) == 0
Then X ⊂ X Y is true
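The membership test above amounts to one modulo operation. A minimal sketch, assuming each item maps to a distinct prime as in the example (`PRIME_OF`, `encode` and `is_subset` are hypothetical names):

```python
from math import prod

# Item-to-prime mapping from the example: X = 2, Y = 3, Z = 5.
PRIME_OF = {"X": 2, "Y": 3, "Z": 5}


def encode(itemset):
    """Encode an itemset as the product of the primes of its items."""
    return prod(PRIME_OF[item] for item in itemset)


def is_subset(code_a, code_b):
    """Itemset a is included in itemset b iff code_a divides code_b."""
    return code_b % code_a == 0


xy = encode({"X", "Y"})               # 2 * 3 = 6
x_in_xy = is_subset(encode({"X"}), xy)
z_in_xy = is_subset(encode({"Z"}), xy)
```

The test is exact because prime factorization is unique and each item contributes its prime once. In a fixed-width integer language such as Java, long transactions would overflow the product, so an arbitrary-precision type would probably be needed; Python integers do not have that limit.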
DCIM algorithm : Frequency counting
Example : having σ = 2
T id Itemset
1 A D C
2 B C E
3 A B C E
4 B E
5 A B C E
Support of each item :
Items    Support
A        3
B        4
C        4
D        1
E        4
Frequent items in descending order of support, each assigned a prime number (D is pruned) :
Items    Support    Prime
B        4          2
C        4          3
E        4          5
A        3          7
Encoded transactions (T id = multiplication of the prime numbers) :
Itemset    Prime numbers    T id
C A        3, 7             21
B C E      2, 3, 5          30
B C E A    2, 3, 5, 7       210
B E        2, 5             10
B C E A    2, 3, 5, 7       210
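The frequency-counting step can be sketched on a single machine as follows. This is not the authors' MapReduce job; `first_primes`, `assign_primes` and `encode` are hypothetical helpers, and breaking support ties alphabetically is an assumption that happens to reproduce the order B, C, E, A of the slide.

```python
from collections import Counter
from math import prod

# Example transactions of the slide (sigma = 2); D occurs only once.
TRANSACTIONS = [
    ["A", "D", "C"],
    ["B", "C", "E"],
    ["A", "B", "C", "E"],
    ["B", "E"],
    ["A", "B", "C", "E"],
]


def first_primes(n):
    """Return the first n primes by trial division (enough for a toy example)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def assign_primes(transactions, sigma):
    """Count item supports, prune infrequent items, map the rest to primes."""
    support = Counter(item for t in transactions for item in set(t))
    frequent = [item for item, s in support.items() if s >= sigma]
    # Descending support; ties broken alphabetically (an assumption).
    frequent.sort(key=lambda item: (-support[item], item))
    return dict(zip(frequent, first_primes(len(frequent))))


def encode(transaction, prime_of):
    """Multiply the primes of the frequent items; pruned items are skipped."""
    return prod(prime_of[item] for item in transaction if item in prime_of)


prime_of = assign_primes(TRANSACTIONS, sigma=2)
codes = [encode(t, prime_of) for t in TRANSACTIONS]
```

Giving the smallest primes to the most frequent items keeps the transaction products small, which is presumably why the slide sorts by descending support.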
DCIM algorithm : CFI Mining “Map Phase”
• Sets of minimized contexts, denoted as conditional-contexts.
• Conditional-context ? In the example, the A-conditional-context keeps the transactions that contain A, with A removed.
Example : having σ = 2
Encoded dataset :
Itemset    Prime numbers    T id
C A        3, 7             21
B C E      2, 3, 5          30
B C E A    2, 3, 5, 7       210
B E        2, 5             10
B C E A    2, 3, 5, 7       210
A-conditional-context :
Itemset    Prime numbers    T id
C          3                3
B C E      2, 3, 5          30
B C E      2, 3, 5          30
AB-conditional-context (from the A-conditional-context, keep the transactions containing B and remove «B») :
Itemset    Prime numbers    T id
C E        3, 5             15
C E        3, 5             15
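On the prime-encoded transactions, building a conditional-context reduces to a divisibility filter followed by an integer division. A minimal sketch of that arithmetic (the function name `conditional_context` is an assumption, and this is not the distributed implementation):

```python
def conditional_context(codes, key):
    """Keep the codes divisible by `key` and divide `key` out of each of them."""
    return [code // key for code in codes if code % key == 0]


# Encoded dataset of the example: CA, BCE, BCEA, BE, BCEA.
dataset = [21, 30, 210, 10, 210]

# A has prime 7: transactions containing A, with A removed.
a_context = conditional_context(dataset, 7)

# B has prime 2: from the A-conditional-context, keep those with B and remove it.
ab_context = conditional_context(a_context, 2)
```

Here `a_context` is `[3, 30, 30]` (C, BCE, BCE) and `ab_context` is `[15, 15]` (CE, CE), which are the values associated with {A} = 7 and {AB} = 14 in the reduce phase.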
DCIM algorithm : CFI Mining “Map Phase”
Map inputs : T id     Processing              Map outputs
{C A} = 21            21 = 3 × 7              {A} = 7 : {C} = 3
{B C E} = 30          30 = 2 × 3 × 5          {E} = 5 : {BC} = 6
                      6 = 2 × 3               {C} = 3 : {B} = 2
{B C E A} = 210       210 = 2 × 3 × 5 × 7     {A} = 7 : {BCE} = 30
                      30 = 2 × 3 × 5          {E} = 5 : {BC} = 6
                      6 = 2 × 3               {C} = 3 : {B} = 2
{B E} = 10            10 = 2 × 5              {E} = 5 : {B} = 2
{B C E A} = 210       210 = 2 × 3 × 5 × 7     {A} = 7 : {BCE} = 30
                      30 = 2 × 3 × 5          {E} = 5 : {BC} = 6
                      6 = 2 × 3               {C} = 3 : {B} = 2
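The map outputs listed above follow a simple pattern: for every item of a transaction except the first one in the support order, the mapper emits the item's prime as key and the product of the preceding primes as value. The sketch below reproduces that pattern; `map_transaction` is a hypothetical name and the actual DCIM mapper may be organized differently.

```python
from math import prod

# Primes in the item order of the example (descending support):
# B = 2, C = 3, E = 5, A = 7.
ORDERED_PRIMES = [2, 3, 5, 7]


def map_transaction(code):
    """Emit (item prime, product of the preceding primes) pairs for one T id."""
    present = [p for p in ORDERED_PRIMES if code % p == 0]
    pairs = []
    # Walk from the last item down to the second one; the first item has no
    # preceding items, so nothing is emitted for it.
    for i in range(len(present) - 1, 0, -1):
        pairs.append((present[i], prod(present[:i])))
    return pairs


outputs = {code: map_transaction(code) for code in (21, 30, 210, 10)}
```

For example, `map_transaction(210)` gives `[(7, 30), (5, 6), (3, 2)]`, matching the three lines emitted for {B C E A}.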
DCIM algorithm : CFI Mining “Reduce Phase”
• Closure check : no superset of the itemset in question has the same support count; done with GCD calculations
Example :
A-Conditional-context : {7}
Values : 3, 30, 30
GCD = 3
Output : { 3 × 7 = 21 } → A C
E-Conditional-context : {5}
Values : 6, 6, 2, 6
GCD = 2
Output : { 5 × 2 = 10 } → B E
DCIM algorithm : CFI Mining “Reduce Phase”
Map Outputs
{A} = 7 : {C} = 3
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
{A} = 7 : {BCE} = 30
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
{E} = 5 : {B} = 2
{A} = 7 : {BCE} = 30
{E} = 5 : {BC} = 6
{C} = 3 : {B} = 2
Reduce inputs → CFI mining → Reduce outputs
{A} = 7 : {3, 30, 30} → GCD(3, 30, 30) = 3 = C → 3 × 7 = 21 : {AC} is CFI
{AB} = 14 : {15, 15} → GCD(15, 15) = 15 = CE → 15 × 14 = 210 : {ABCE} is CFI
{AE} ? → {AE} ⊂ {ABCE} : STOP !
{E} = 5 : {6, 6, 2, 6} → GCD(6, 6, 2, 6) = 2 = B → 2 × 5 = 10 : {BE} is CFI
{EC} = 15 : {2, 2, 2} → GCD(2, 2, 2) = 2 = B → 2 × 15 = 30 : {BCE} is CFI
{C} = 3 : {7, 2, 2, 2} → GCD(7, 2, 2, 2) = 1 → 3 = C : {C} is CFI
CFIs = {AC, ABCE, BE, BCE, C}
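The GCD step works because the GCD of the values in a conditional-context is the product of the primes shared by all of them, that is, the items present in every transaction that contains the key. Multiplying the key by that GCD therefore gives the closure of the key. A minimal sketch using the reduce inputs of the slide (`closure_code` is a hypothetical name; this is not the authors' reducer):

```python
from functools import reduce
from math import gcd


def closure_code(key, context):
    """Multiply the key by the GCD of its conditional-context values."""
    return key * reduce(gcd, context)


# Reduce inputs as listed on the slide: key code -> conditional-context values.
reduce_inputs = {
    7: [3, 30, 30],    # {A}
    14: [15, 15],      # {AB}
    5: [6, 6, 2, 6],   # {E}
    15: [2, 2, 2],     # {EC}
    3: [7, 2, 2, 2],   # {C}
}

closed = {key: closure_code(key, values) for key, values in reduce_inputs.items()}
```

The resulting codes are 21 (AC), 210 (ABCE), 10 (BE), 30 (BCE) and 3 (C).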
Experimental Results : Datasets
• Wikipedia Articles
    • Each line mimics a research article
    • 7,892,123 transactions with 6,853,616 items
    • Maximal length of a transaction is 153,953
• ClueWeb
    • One billion web pages in ten languages
    • 53,268,952 transactions with 11,153,752 items
    • Maximal length of a transaction is 689,153
Experimental Results : Setup and implementation
• One of the clusters of Grid5000
    • 32 nodes running Hadoop 2.6.0
    • 96 GB of RAM
    • 2.9 to 3.9 GHz processors
    • Java (openjdk-7-jdk)
• Compared to a basic implementation of the CLOSET algorithm in MapReduce and to the parallel FP-growth.
• Execution time and speedup for multiple values of σ.
[Figure: Efficiency : Wikipedia Articles]
[Figure: Speedup : ClueWeb]
Conclusion
• Big data : a game-changing revolution !
• DCIM : a reliable and efficient parallel algorithm for CFI mining,
• DCIM shows significantly better performance than state-of-the-art approaches,
• An efficient data model : prime number processing
→ The approach is effective and efficient
• CFI mining in data streams
Thank you !
Questions ?
