31st of August, 2015, TPC-TC'2015@Hawaii
Big-SeqDB-Gen: a Formal & Scalable Approach for
Parallel Generation of Big Synthetic Sequence
Databases
Rim Moussa
CS Dept., Engineering School of Carthage
University of Carthage
rim.moussa@esti.rnu.tn
7th TPC Technology Conference (TPC-TC)
Hawaii, USA. 2015
31st of August, 2015
Context
●Sequence Database Applications
»Customer shopping sequences
»Cabs (ground transportation) trajectories
»Natural disasters (e.g. earthquakes)
»Computer logs
»Stocks & Markets
»Telephone calling patterns
»Weblog click streams
●Challenges
»Scalable Processing
● Mining hidden patterns (frequent or outliers)
● Clustering sequences
»Availability of Big Datasets?
●Companies do not disclose their data
●Unavailability of a synthetic generator for big sequence DBs?
Outline
●What is a Sequence Database ?
●Parallel Generation of a Sequence Database
»Requirements
»Whitney Enumerator
●Implementation & Performance Study
●Related Work
●Conclusions & Future Work
Sequence Database
●Definition
»A Sequence Database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time (Jiawei Han, Micheline Kamber and Jian Pei).
●Length
»{1}{3,5} is a 3-sequence
●Count(s)
»Number of sequences that contain s
»Count({1}{3}{3}) = 3 (s1, s2 and s4)
●Support(s)
»Count(s) / DB size
»Support({1}{3}{3}) = 0.75 (75%)
●Subsequence, Super-sequence
»{1}{3}{3} is a subsequence of s1
»s1 is a super-sequence of {1}{3}{3}
ID   Sequence
s1   {1}{1,2,3}{1,3}{1,4}
s2   {1,4}{3}{2,3}{1,5}
s3   {5,6}{1,2}{4,6}{3}{2}
s4   {5}{7}{1,6}{3}{2}{3}
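The definitions of Count and Support can be checked mechanically. The following is a minimal Python sketch, assuming the usual containment rule (each element of the pattern is a subset of a distinct element of the sequence, with order preserved). The function names contains, count and support are illustrative and are not part of the author's code.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence`.

    Both are lists of sets; each pattern element must be a subset of a
    distinct sequence element, and the order of elements is preserved.
    """
    pos = 0
    for element in pattern:
        # Greedy match: take the earliest remaining element that covers it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def count(db, pattern):
    """Number of sequences of the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in db.values())


def support(db, pattern):
    """Count divided by the number of sequences in the database."""
    return count(db, pattern) / len(db)


# The four-sequence example database from the slide.
db = {
    "s1": [{1}, {1, 2, 3}, {1, 3}, {1, 4}],
    "s2": [{1, 4}, {3}, {2, 3}, {1, 5}],
    "s3": [{5, 6}, {1, 2}, {4, 6}, {3}, {2}],
    "s4": [{5}, {7}, {1, 6}, {3}, {2}, {3}],
}
pattern = [{1}, {3}, {3}]
containing = [sid for sid, seq in db.items() if contains(seq, pattern)]
print(containing, count(db, pattern), support(db, pattern))
```

On this database the sketch prints ['s1', 's2', 's4'] 3 0.75: s3 has only one element containing item 3 after its first occurrence of item 1, so it does not contain the pattern.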
Parallel Sequence Generator
-- 5 Requirements by Huppler K.
Relevance: PSG implements Whitney Enumerators to generate distinct sequences.
Repeatability: for multiple runs with the same input parameters, PSG must output a sequence DB with the same characteristics.
Economy: PSG is open-source and hardware- and platform-independent.
Fairness: data sets generated by PSG do not overfit a particular algorithm.
Performance: PSG reports metrics assessing its velocity, scalability and scale-out performance.
Whitney Enumerator
●Whitney Number
»Investigated by Raissi C. and Pei J. for bounding sequential patterns (KDD'2011)
»Given,
●I = {i1, i2, i3, …, in}, a finite set of n items
●k: sequence length
●Wk: number of sequential patterns of length k
●We propose the Whitney Enumerator for the enumeration of Whitney Numbers
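A minimal Python sketch of how Whitney Numbers can be computed and the corresponding sequences enumerated. The recurrence used here is an assumption, not taken from the slides: the first element of a sequence is an itemset of j items (C(n, j) choices) and the remainder is a sequence of k - j items. It reproduces the W5 values listed in the capacities table for n = 10 and n = 50. The names whitney_number and enumerate_sequences are hypothetical and do not refer to the author's Java package.

```python
from itertools import combinations
from math import comb


def whitney_number(n, k):
    """Number of sequences with k items in total over an alphabet of n items.

    Assumed recurrence: W_0 = 1 and
    W_k = sum over j = 1..min(n, k) of C(n, j) * W_(k - j).
    """
    w = [1] + [0] * k  # w[0] = 1 stands for the empty sequence
    for length in range(1, k + 1):
        w[length] = sum(
            comb(n, j) * w[length - j] for j in range(1, min(n, length) + 1)
        )
    return w[k]


def enumerate_sequences(items, k):
    """Yield every sequence with exactly k items, as a tuple of itemsets."""
    if k == 0:
        yield ()
        return
    for j in range(1, min(len(items), k) + 1):
        for first in combinations(items, j):  # first element: j distinct items
            for rest in enumerate_sequences(items, k - j):
                yield (first,) + rest


print(whitney_number(10, 5))                            # 392002
print(whitney_number(50, 5))                            # 1370262510
print(sum(1 for _ in enumerate_sequences((1, 2, 3), 2)))  # 12
```

The enumerator is exhaustive and therefore only practical for small n and k; it is meant to show that the number of enumerated sequences agrees with the recurrence, not to stand in for the NextSequence method of the actual implementation.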
Whitney Enumerators by Example (1/3): WE4
Whitney Enumerators by Example (2/3)
Whitney Enumerators by Example (3/3)
Whitney Numbers Capacities
n | W5 | W10 | W15
10 | 392,002 | 76,096,000 | 108 Quadrillion (108,056,434,303,249,000)
50 | 1.37 Billion (1,370,262,510) | 7 Trillion (7,031,803,751,400) | 4.87 Octillion (4,876,551,905,744,990,0...,000)
100 | 44.46 Billion (44,462,112,520) | 919 Trillion (919,027,432,253,800) | 167.7 Nonillion (167,755,438,441,193,0...,000)
Billion: 10^9, Trillion: 10^12, Quadrillion: 10^15, Octillion: 10^27, Nonillion: 10^30
n: number of elements composing the sequences
Wk: number of sequences of length k
Implementation & Performance Study
●Directions towards an efficient implementation
»Enumerate Source Sequences at low cost
● NextSequence method
»Enumerate different patterns of Source Sequences
● Specify a quota for each pattern type
»Add noise to Source Sequences
»Enumerate Source Sequences in parallel
● Each Sequence generator emits a different source sequence
● Or, in order to distribute sequential patterns among different files, each Sequence generator emits the sequence a number of times
●Apache Hadoop YARN (2.4) – MapReduce Framework
»HDFS files can be processed by Apache Spark, Apache Storm
»The Whitney package is coded in Java and can be reused to switch to a traditional parallel implementation
Implementation & Performance Study
●Parallel Sequence Generator
»PSG for short
»Mapper Set up
●parse job parameters: WEk, #sequences, support, ...
»Mapper (sequence generator, worker)
●Iterate #source sequences / #mappers times
●Generate a new Source Sequence: s
●Iterate avg(count(s)) times
●Write the sequence = noise(s) to the mapper data file
●Emit #sequences and ∑ bytes written to HDFS
»Reducer
●Calculates ∑#sequences and ∑ bytes
●PSG versus TestDFSIO
»in order to highlight sequence generation overhead
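The mapper and reducer steps listed above can be simulated in a single process. This is a Python sketch under stated assumptions, not the Hadoop job itself: the noise model (inserting one random single-item element), the round-robin split of source sequences among mappers and the names noise, mapper, reducer and run_job are all hypothetical. A fixed seed is used so that repeated runs give the same output, in the spirit of the Repeatability requirement.

```python
import random


def noise(source, alphabet, rng):
    """Hypothetical noise: insert one random single-item element.

    The source sequence stays a subsequence of the noisy sequence.
    """
    pos = rng.randrange(len(source) + 1)
    return source[:pos] + [{rng.choice(alphabet)}] + source[pos:]


def format_sequence(seq):
    """Render a sequence in the {a,b}{c} notation used on the slides."""
    return "".join("{" + ",".join(map(str, sorted(e))) + "}" for e in seq)


def mapper(mapper_id, sources, avg_count, alphabet, seed):
    """Write avg_count noisy copies of each source; emit (#sequences, bytes)."""
    rng = random.Random(seed * 1_000_003 + mapper_id)  # repeatable per mapper
    lines = []
    for s in sources:
        for _ in range(avg_count):
            lines.append(format_sequence(noise(s, alphabet, rng)))
    n_bytes = sum(len(line) + 1 for line in lines)  # +1 for the newline
    return lines, len(lines), n_bytes


def reducer(emitted):
    """Sum the (#sequences, bytes) pairs emitted by all mappers."""
    return sum(n for n, _ in emitted), sum(b for _, b in emitted)


def run_job(source_sequences, n_mappers, avg_count, alphabet, seed=42):
    files, emitted = [], []
    for m in range(n_mappers):
        share = source_sequences[m::n_mappers]  # disjoint share per mapper
        lines, n, b = mapper(m, share, avg_count, alphabet, seed)
        files.append(lines)  # stands in for the mapper's data file
        emitted.append((n, b))
    return files, reducer(emitted)


sources = [[{1}, {2, 3}], [{2}, {1}, {3}], [{1, 2, 3}], [{3}, {3}, {1}]]
files, (total_sequences, total_bytes) = run_job(
    sources, n_mappers=2, avg_count=3, alphabet=[1, 2, 3, 4, 5]
)
print(total_sequences, total_bytes)
```

With 4 source sequences, 2 mappers and 3 repetitions each, the job writes 12 sequences in total; the byte count depends on the noise drawn from the seeded generator.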
Implementation & Performance Study
●TestDFSIO
»MapReduce implementation of a Distributed IO benchmark tool for
Hadoop Distributed File System
»Supports Read and Write workloads
»Write workload example: create 10 files of 10 GB each in order to generate a 100 GB dataset
● Each data file is generated by a Mapper
● At set up, each Mapper creates a buffer (1 MB by default) and repeatedly appends the buffer to its data file until the file size reaches the specified input value (10 GB for instance)
●TeraGen
»MapReduce implementation of a parallel record generator for the TeraSort benchmark
»Cons: generates keys in order to simulate input data files for the MapReduce job
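The TestDFSIO write workload described above amounts to a simple loop per mapper. The Python sketch below imitates it on local temporary files with scaled-down sizes; it is not the TestDFSIO source code and does not touch HDFS, and write_file and write_workload are hypothetical names.

```python
import os
import tempfile


def write_file(path, target_bytes, buffer_size=1024 * 1024):
    """Append a fixed buffer to `path` until it holds `target_bytes` bytes."""
    buffer = bytes(buffer_size)  # created once, at set-up
    written = 0
    with open(path, "wb") as f:
        while written < target_bytes:
            chunk = buffer[: target_bytes - written]  # trim the last append
            f.write(chunk)
            written += len(chunk)
    return written


def write_workload(directory, n_files, file_bytes, buffer_size=1024 * 1024):
    """One file per simulated mapper; returns the total bytes written."""
    return sum(
        write_file(os.path.join(directory, f"part-{i:05d}"), file_bytes, buffer_size)
        for i in range(n_files)
    )


# Scaled-down run: 10 files of 10 KiB each, written with a 4 KiB buffer.
with tempfile.TemporaryDirectory() as d:
    total = write_workload(d, n_files=10, file_bytes=10 * 1024, buffer_size=4096)
    sizes = [os.path.getsize(os.path.join(d, name)) for name in sorted(os.listdir(d))]
print(total, sizes)
```

Because the buffer content is fixed, this loop does almost no computation per byte, which is why it serves as a baseline for measuring the overhead of generating real sequences in PSG.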
Implementation & Performance Study
●French GRID platform - Grid5000
»Suno nodes @Sophia site
● 32 GB of memory
● 2 CPUs per node and 4 cores per CPU
● each CPU is Intel Xeon E5520, 2.27 GHz
»Cluster of 3 to 11 Suno nodes
● 1 master and 2 to 10 slaves
»10Gbps Ethernet
●Set up
»Automatic deployment of Apache Hadoop YARN on a cluster of
Grid5K nodes
»JRE 1.7
Implementation & Performance Study
●Frequent Sequential pattern use case
●PSG Key parameters
»Cluster size N: 3, 6 or 11 nodes
● 1 master and N-1 slaves (workers)
»Number of Sequence Generators (mappers)
»Number of Source Sequences to inject
● Source sequences of length 5, i.e. WE5
● Avg size of each sequence is 420 bytes
● On average, each source sequence repeats a number of times equal to 5% of the number of source sequences injected
●TestDFSIO parameters
»Cluster size
»Number of mappers (i.e. number of files)
»Size of each file (GB)
●Key performance factors
»Write Throughput in #sequences per second for PSG
»Write Throughput in bytes per second for PSG and TestDFSIO
Implementation & Performance Study
Max. sequence DB of 450 GB, more than 2 billion sequences
Max. throughput of 315 MBps for 90K source sequences (DB is 91 GB)
Implementation & Performance Study
Max. sequence DB of 1.8 TB, more than 8 billion sequences at 694 MBps
Max. throughput of 741 MBps for 180K source sequences (DB is 365 GB)
Implementation & Performance Study
Max. throughput of 3 million sequences per second
Implementation & Performance Study
Max. sequence DB of 4 TB, more than 18 billion sequences at 1.2 GBps (in less than one hour: 57 minutes)
Implementation & Performance Study
Max. sequence DB of 4 TB, more than 18 billion sequences at 5.3 million sequences per second
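A quick consistency check of the figures reported for the 4 TB run, as a Python sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 GBps = 10^9 bytes per second) and treats "more than 18 billion" as exactly 18 billion, so the results are approximations only.

```python
def minutes(amount, rate_per_second):
    """Elapsed time in minutes to produce `amount` at `rate_per_second`."""
    return amount / rate_per_second / 60


# 18 billion sequences at 5.3 million sequences per second
print(round(minutes(18e9, 5.3e6), 1))  # 56.6
# 4 TB at 1.2 GBps (decimal units assumed)
print(round(minutes(4e12, 1.2e9), 1))  # 55.6
```

Both estimates land just under the 57 minutes quoted for this run, which is expected since the rounded inputs slightly understate the actual volume.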
Implementation & Performance Study
Limited by storage capacities of each cluster
Implementation & Performance Study
n times the number of data nodes results in n times the write throughput
Related Work
●IBM Quest Synthetic Data Generator
»Generates sequential patterns
»Capacity and velocity issues
● IBM Quest Synthetic Data Generator was not designed for fast generation of
big sequence databases
»Functioning issues
● Random corruption of the base sequential patterns used for populating the sequence database: no guarantee that a sequential pattern repeats a number of times proportional to the database size.
●Other Big Data Generators with different objectives
»TeraSort: sorts a large number of records generated by TeraGen
»PDGF: parallel generation of a big data warehouse
Conclusions
●Proposal, implementation & tests of Whitney Enumerators
»Whitney Enumerators allow the generation of a large number of distinct sequences
»Open source Java code
●Parallel Sequence Generator
»Scalable and formal approach for Parallel Generation of Big Synthetic Sequence Databases satisfying both user-specified sequences' characteristics and velocity requirements
»Scalability and Scale-out performance study
»Open source Java code
●Project Homepage
»https://sites.google.com/site/rimmoussa/miningbigseqdb
Future Work
●Mining of frequent Sequential patterns
»A-priori based algorithms
● GSP (done)
»Pattern growth algorithms
● PrefixSpan (done)
»Optimized algorithms (on-going)
● Fast pruning of candidates
● Efficient mining of long sequential patterns
●Investigate (on-going)
»Cabs' trajectories: Rome Cabs (Feb. 2014, 1.6 GB)
»Computer logs: Dartmouth Univ. log data set (6 GB, 2001-2006)
»Customer shopping transactions in the TPC-H benchmark
●Generate and Mine more Kinds of Patterns
»Sequential patterns: a → b
»Parallel patterns: a // b
»Regular expressions of patterns
Thank you!
Q & A
A Formal & Scalable Approach for Parallel Generation of Big Synthetic
Sequence Database
Rim Moussa
7th TPC-TC@Hawaii, 31st of August, 2015

VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 

Parallel Sequence Generator

  • 1. 31st of August, 2015, TPC-TC'2015@Hawaii
    Big-SeqSB-Gen: a Formal & Scalable Approach for Parallel Generation of Big Synthetic Sequence Databases
    Rim Moussa, CS Dept., Engineering School of Carthage, University of Carthage, rim.moussa@esti.rnu.tn
    7th TPC Technology Conference (TPC-TC), Hawaii, USA, 2015
  • 2. Context
    ●Sequence Database Applications
      »Customer shopping sequences
      »Cabs (ground transportation) trajectories
      »Natural disasters (e.g. earthquakes)
      »Computer logs
      »Stocks & Markets
      »Telephone calling patterns
      »Weblog click streams
    ●Challenges
      »Scalable Processing
        ● Mining hidden patterns (frequent or outliers)
        ● Clustering sequences
      »Availability of Big Datasets?
        ● Companies do not disclose their data
        ● Unavailability of a synthetic generator for big sequence DBs?
  • 3. Outline
    ●What is a Sequence Database?
    ●Parallel Generation of a Sequence Database
      »Requirements
      »Whitney Enumerator
    ●Implementation & Performance Study
    ●Related Work
    ●Conclusions & Future Work
  • 4. Sequence Database
    ●Definition
      »A Sequence Database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time (Jiawei Han, Micheline Kamber and Jian Pei).
    ●Length
      »{1}{3,5} is a 3-sequence
    ●Count(s)
      »Number of sequences that contain s
      »Count({1}{3}{3}) = 2 (s1 and s2)
    ●Support(s)
      »Count(s) / DB size
      »Support({1}{3}{3}) = 0.5 (50%)
    ●Subsequence, Super-sequence
      »{1}{3}{3} is a subsequence of s1
      »s1 is a super-sequence of {1}{3}{3}
    ●Example database
      ID   Sequence
      s1   {1}{1,2,3}{1,3}{1,4}
      s2   {1,4}{3}{2,3}{1,5}
      s3   {5,6}{1,2}{4,6}{3}{2}
      s4   {5}{7}{1,6}{3}{2}{3}
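The containment test behind Count and Support can be sketched as follows. This is a minimal illustration, assuming the usual definition (a pattern is contained in a sequence when its elements can be matched, in order, to supersets among the sequence's elements); the helper names `seq`, `length`, `contains`, `count` and `support` are ours, not the author's.

```python
from typing import FrozenSet, Iterable, List

Seq = List[FrozenSet[int]]


def seq(*itemsets: Iterable[int]) -> Seq:
    """Build a sequence from its elements (itemsets), in order."""
    return [frozenset(e) for e in itemsets]


def length(s: Seq) -> int:
    """Length of a sequence = total number of items over all its elements."""
    return sum(len(e) for e in s)


def contains(s: Seq, pattern: Seq) -> bool:
    """Greedy left-to-right test: each pattern element must be a subset of
    a later and later element of s."""
    pos = 0
    for element in pattern:
        while pos < len(s) and not element <= s[pos]:
            pos += 1
        if pos == len(s):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True


def count(db: List[Seq], pattern: Seq) -> int:
    """Number of sequences of db that contain the pattern."""
    return sum(contains(s, pattern) for s in db)


def support(db: List[Seq], pattern: Seq) -> float:
    """Count divided by the database size."""
    return count(db, pattern) / len(db)


# Two rows of the example table
s1 = seq({1}, {1, 2, 3}, {1, 3}, {1, 4})
s3 = seq({5, 6}, {1, 2}, {4, 6}, {3}, {2})
pattern = seq({1}, {3}, {3})

print(length(seq({1}, {3, 5})))  # 3
print(contains(s1, pattern))     # True
print(contains(s3, pattern))     # False
```

Matching each pattern element at the earliest possible position is sufficient here: an earlier match never removes a later opportunity.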
  • 5. Parallel Sequence Generator (PSG): 5 Requirements by Huppler K.
    ●Relevance: PSG implements Whitney Enumerators for the generation of distinct sequences.
    ●Repeatability: for multiple runs with the same input parameters, PSG must output a sequence DB with the same characteristics.
    ●Economy: PSG is open-source and hardware and platform independent.
    ●Fairness: data sets generated by PSG do not overfit a particular algorithm.
    ●Performance: PSG reports metrics assessing its velocity, scalability and scale-out performance.
  • 6. Whitney Enumerator
    ●Whitney Number
      »Investigated by Raissi C. and Pei J. for bounding sequential patterns (KDD'2011)
      »Given
        ● I = {i1, i2, i3, …, in}, a finite set of n items
        ● k: sequence length
        ● Wk: number of sequential patterns of length k [formula image not captured in the transcript]
    ●We propose the Whitney Enumerator for the enumeration of Whitney Numbers
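The closed-form expression for Wk is not captured in this transcript. As a hedged illustration only, the count can be obtained with a simple recurrence, assuming (our reading, not stated on the slide) that Wk counts sequences whose elements are non-empty itemsets over the n items and whose items total k, in line with the length convention of slide 4. The function name `count_sequences` is hypothetical and the recurrence is not necessarily the formula used by the author.

```python
from math import comb


def count_sequences(n: int, k: int) -> int:
    """Number of sequences with k items in total, where every element is a
    non-empty itemset drawn from n distinct items."""
    w = [1] + [0] * k  # w[0] = 1 accounts for the empty sequence
    for j in range(1, k + 1):
        # Pick the size i of the first element (comb(n, i) choices),
        # then count the ways to complete it with j - i further items.
        w[j] = sum(comb(n, i) * w[j - i] for i in range(1, min(n, j) + 1))
    return w[k]


print(count_sequences(10, 5))  # 392002
print(count_sequences(50, 5))  # 1370262510
```

Both values agree with the exact figures listed for W5 on the capacities slide (n = 10 and n = 50). Under the same recurrence, 76,096,000 is the value obtained for n = 10 at k = 7.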
  • 7. Whitney Enumerators by Example (1/3): WE4 [figure]
  • 8. Whitney Enumerators by Example (2/3) [figure]
  • 9. Whitney Enumerators by Example (3/3) [figure]
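Since the example figures are not part of the transcript, the idea of enumerating every distinct sequence of a given length can be sketched with a brute-force generator. This is an assumption-laden illustration, not the author's Whitney Enumerator or its NextSequence method: `enumerate_sequences` is a hypothetical name, and sequences are taken to be ordered lists of non-empty itemsets totalling k items.

```python
from itertools import combinations
from typing import Iterator, List, Tuple

Sequence = List[Tuple[int, ...]]


def enumerate_sequences(n: int, k: int) -> Iterator[Sequence]:
    """Yield, in a fixed order, every sequence over items 1..n whose
    elements are non-empty itemsets totalling k items."""
    items = range(1, n + 1)

    def extend(prefix: Sequence, remaining: int) -> Iterator[Sequence]:
        if remaining == 0:
            yield list(prefix)  # copy, because prefix is mutated below
            return
        for size in range(1, min(n, remaining) + 1):
            for itemset in combinations(items, size):
                prefix.append(itemset)
                yield from extend(prefix, remaining - size)
                prefix.pop()

    yield from extend([], k)


for s in enumerate_sequences(2, 2):
    print(s)
# [(1,), (1,)]
# [(1,), (2,)]
# [(2,), (1,)]
# [(2,), (2,)]
# [(1, 2)]
```

Because the enumeration order is deterministic, two runs with the same n and k emit the same sequences, which is the property the Repeatability requirement asks for.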
  • 10. Whitney Numbers Capacities
    n     W5                                W10                                  W15
    10    392,002                           76,096,000                           108 Quadrillion (108,056,434,303,249,000)
    50    1.37 Billion (1,370,262,510)      7 Trillion (7,031,803,751,400)       4.87 Octillion (4,876,551,905,744,990,0...,000)
    100   44.46 Billion (44,462,112,520)    919 Trillion (919,027,432,253,800)   167.7 Nonillion (167,755,438,441,193,0...,000)
    Billion: 10^9, Trillion: 10^12, Quadrillion: 10^15, Octillion: 10^27, Nonillion: 10^30
    n: number of elements composing the sequences
    Wk: number of sequences of length k
  • 11. Implementation & Performance Study (parts: Implementation, PSG MR Job, TestDFSIO, HPC platform, Experiments)
    ●Directions towards an efficient implementation
      »Enumerate Source Sequences at lower cost
        ● NextSequence method
      »Enumerate different patterns of Source Sequences
        ● Specify a quota for each pattern type
      »Add noise to Source Sequences
      »Enumerate Source Sequences in parallel
        ● Each Sequence generator emits a different source sequence
        ● Or, in order to distribute sequential patterns among different files, each Sequence generator emits the sequence a number of times
    ●Apache Hadoop YARN (2.4), MapReduce Framework
      »HDFS files can be processed by Apache Spark, Apache Storm
      »The Whitney package is coded in Java and can be used to switch to a traditional parallel implementation
  • 12. Implementation & Performance Study: PSG MR Job
    ●Parallel Sequence Generator
      »PSG for short
      »Mapper set up
        ● Parse job parameters: WEk, #sequences, support, ...
      »Mapper (sequence generator, worker)
        ● Iterate (#source sequences / #mappers) times
        ● Generate a new Source Sequence: s
        ● Iterate avg(count(s)) times
        ● Write the sequence = noise(s) to the mapper data file
        ● Emit #sequences and ∑ bytes written to HDFS
      »Reducer
        ● Calculates ∑ #sequences and ∑ bytes
    ●PSG versus TestDFSIO
      »In order to highlight the sequence generation overhead
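The mapper and reducer steps above can be simulated in plain Python. This is a sketch under stated assumptions, not the author's Hadoop job: it uses no MapReduce API, the write to HDFS is replaced by counting the bytes of each rendered line, and `noise` is a hypothetical stand-in (it appends one random single-item element, so the source sequence is still contained in the noisy one). All function names are ours.

```python
import random
from typing import Iterable, List, Tuple

Sequence = List[Tuple[int, ...]]


def render(s: Sequence) -> str:
    """Text form of a sequence, e.g. {1}{2,3}."""
    return "".join("{" + ",".join(map(str, e)) + "}" for e in s)


def noise(s: Sequence, rng: random.Random, n_items: int) -> Sequence:
    """Hypothetical noise: append one random single-item element."""
    return s + [(rng.randint(1, n_items),)]


def mapper(mapper_id: int, sources: List[Sequence], repeats: int,
           n_items: int, seed: int = 0) -> Tuple[int, int]:
    """Emit (#sequences, bytes) for this mapper's share of source sequences."""
    rng = random.Random(seed * 1000 + mapper_id)  # per-mapper seed: repeatable runs
    n_sequences = n_bytes = 0
    for s in sources:
        for _ in range(repeats):  # plays the role of avg(count(s))
            line = render(noise(s, rng, n_items)) + "\n"
            n_sequences += 1
            n_bytes += len(line.encode("utf-8"))  # simulated write
    return n_sequences, n_bytes


def reducer(partials: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the per-mapper counters."""
    partials = list(partials)
    return sum(p[0] for p in partials), sum(p[1] for p in partials)


def run_job(sources: List[Sequence], n_mappers: int, repeats: int,
            n_items: int) -> Tuple[int, int]:
    # Mapper i gets every n_mappers-th source sequence, starting at i.
    shares = [sources[i::n_mappers] for i in range(n_mappers)]
    return reducer(mapper(i, share, repeats, n_items)
                   for i, share in enumerate(shares))


sources = [[(1,), (2, 3)], [(2,), (1,)], [(3,)], [(1, 2, 3)]]
total_sequences, total_bytes = run_job(sources, n_mappers=2, repeats=3, n_items=3)
print(total_sequences)  # 12
```

Seeding each simulated mapper from its identifier keeps two runs with the same parameters identical, in the spirit of the Repeatability requirement.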
  • 13. Implementation & Performance Study: TestDFSIO
    ●TestDFSIO
      »MapReduce implementation of a distributed I/O benchmark tool for the Hadoop Distributed File System
      »Supports Read and Write workloads
      »Write workload example: create 10 files of 10 GB each in order to generate a 100 GB dataset
        ● Each data file is generated by a Mapper
        ● At set up, each Mapper creates a buffer (1 MB by default) and repeatedly appends the buffer to its data file until the file size reaches the specified input value (a 10 GB file, for instance)
    ●TeraGen
      »MapReduce implementation of a parallel record generator for the TeraSort Benchmark
      »Cons: generates keys in order to simulate input data files for the MapReduce job
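The buffer-append write loop described for TestDFSIO can be sketched as below. This is not TestDFSIO's code: it is a local, in-memory stand-in (an `io.BytesIO` replaces the HDFS file), the sizes are scaled down, and `write_file` is a hypothetical name.

```python
import io
from typing import BinaryIO


def write_file(out: BinaryIO, file_size: int, buffer_size: int = 1024 * 1024) -> int:
    """Append one pre-allocated buffer to `out` until file_size bytes are written."""
    buffer = bytes(buffer_size)  # created once, at set up (all zero bytes)
    written = 0
    while written < file_size:
        chunk = buffer[: min(buffer_size, file_size - written)]
        out.write(chunk)
        written += len(chunk)
    return written


# Scaled-down stand-in for "10 files of 10 GB each": 10 files of 2,500 bytes
files = [io.BytesIO() for _ in range(10)]
total = sum(write_file(f, file_size=2500, buffer_size=1024) for f in files)
print(total)  # 25000
```

Since the loop only copies a fixed buffer, its throughput reflects raw write cost, which is why comparing PSG against it isolates the sequence generation overhead.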
  • 14. Implementation & Performance Study: HPC platform
    ●French GRID platform: Grid5000
      »Suno nodes @Sophia site
        ● 32 GB of memory
        ● 2 CPUs per node and 4 cores per CPU
        ● Each CPU is an Intel Xeon E5520, 2.27 GHz
      »Cluster of 3 to 11 Suno nodes
        ● 1 master and 2 to 10 slaves
      »10 Gbps Ethernet
    ●Set up
      »Automatic deployment of Apache Hadoop YARN on a cluster of Grid5K nodes
      »JRE 1.7
  • 15. Implementation & Performance Study: Experiments
    ●Frequent sequential pattern use case
    ●PSG key parameters
      »N, the cluster size, is 3, 6 or 11 nodes
        ● 1 master and N-1 slaves (workers)
      »Number of Sequence Generators (mappers)
      »Number of Source Sequences to inject
        ● Source sequences are 5-sequences, i.e. WE5
        ● Avg size of each sequence is 420 bytes
        ● Each source sequence repeats on average 5% of the number of source sequences injected
    ●TestDFSIO parameters
      »Cluster size
      »Number of mappers (i.e. number of files)
      »Size of each file (GB)
    ●Key performance factors
      »Write throughput in #sequences per second for PSG
      »Write throughput in bytes per second for PSG and TestDFSIO
  • 16. Implementation & Performance Study [charts: scalability and scale-out for 2, 5 and 10 slaves]
    ● Max. sequence DB of 450 GB, more than 2 billion sequences
    ● Max. throughput of 315 MBps for 90K source sequences (DB is 91 GB)
  • 17. Implementation & Performance Study [charts: scalability and scale-out for 2, 5 and 10 slaves]
    ● Max. sequence DB of 1.8 TB, more than 8 billion sequences at 694 MBps
    ● Max. throughput of 741 MBps for 180K source sequences (DB is 365 GB)
  • 18. Implementation & Performance Study [charts: scalability and scale-out for 2, 5 and 10 slaves]
    ● Max. throughput of 3 million sequences per second
  • 19. Implementation & Performance Study [charts: scalability and scale-out for 2, 5 and 10 slaves]
    ● Max. sequence DB of 4 TB, more than 18 billion sequences at 1.2 GBps (less than one hour: 57 minutes)
  • 20. Implementation & Performance Study [charts: scalability and scale-out for 2, 5 and 10 slaves]
    ● Max. sequence DB of 4 TB, more than 18 billion sequences at 5.3 million sequences per second
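The figures quoted for the 4 TB run on slides 19 and 20 can be cross-checked with a line of arithmetic each. This is a rough sanity check, assuming decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and taking "more than 18 billion" as exactly 18 billion; `duration_minutes` is a hypothetical helper.

```python
def duration_minutes(total: float, rate_per_second: float) -> float:
    """Time needed to produce `total` units at a constant rate, in minutes."""
    return total / rate_per_second / 60


# 4 TB written at 1.2 GB per second
print(round(duration_minutes(4e12, 1.2e9), 1))  # 55.6
# 18 billion sequences at 5.3 million sequences per second
print(round(duration_minutes(18e9, 5.3e6), 1))  # 56.6
```

Both estimates land just under an hour, in line with the run time reported on slide 19; binary units or a sequence count slightly above 18 billion move the estimates up by about a minute.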
  • 21. Implementation & Performance Study [charts: scalability and scale-out for 2, 5 and 10 slaves]
    ● Limited by the storage capacity of each cluster
  • 22. Implementation & Performance Study [charts: scalability and scale-out for 2, 5 and 10 slaves]
    ● n times the number of data nodes results in n times the write throughput
  • 23. Related Work
    ●IBM Quest Synthetic Data Generator
      »Generates sequential patterns
      »Capacity and velocity issues
        ● IBM Quest Synthetic Data Generator was not designed for fast generation of big sequence databases
      »Functioning issues
        ● Random corruption of the base sequential patterns used for populating the sequence database
        ● No guarantee that a sequential pattern repeats a number of times proportional to the database size
    ●Other Big Data Generators with different objectives
      »TeraSort: sorts a large number of records generated by TeraGen
      »PDGF: parallel generation of a big data warehouse
  • 24. Conclusions
    ●Proposal, implementation & tests of Whitney Enumerators
      »Whitney Enumerators allow the generation of a large number of distinct sequences
      »Open source Java code
    ●Parallel Sequence Generator
      »Scalable and formal approach for the parallel generation of big synthetic sequence databases, satisfying both user-specified sequence characteristics and velocity requirements
      »Scalability and scale-out performance study
      »Open source Java code
    ●Project Homepage
      »https://sites.google.com/site/rimmoussa/miningbigseqdb
  • 25. Future Work
    ●Mining of frequent sequential patterns
      »A-priori based algorithms
        ● GSP (done)
      »Pattern growth algorithms
        ● PrefixSpan (done)
      »Optimized algorithms (on-going)
        ● Fast pruning of candidates
        ● Efficient mining of long sequential patterns
    ●Investigate (on-going)
      »Cabs' trajectories: Rome Cabs (Feb. 2014, 1.6 GB)
      »Computer logs: Dartmouth Univ. log data set (6 GB, 2001-2006)
      »Customers' shopping transactions in the TPC-H benchmark
    ●Generate and mine more kinds of patterns
      »Sequential patterns: a → b
      »Parallel patterns: a // b
      »Regular expressions of patterns
  • 26. Thank you! Q & A