1. 31st
of August, 2015 TPC-TC'2015@Hawaii 1
Big-SeqSB-Gen: a Formal & Scalable Approach for
Parallel Generation of Big Synthetic Sequence
Databases
Rim Moussa
CS Dept., Engineering School of Carthage
University of Carthage
rim.moussa@esti.rnu.tn
7th TPC Technology Conference (TPC-TC)
Hawaii, USA, 2015
31st of August, 2015
Context
●Sequence Database Applications
»Customer shopping sequences
»Cabs (ground transportation) trajectories
»Natural disasters (e.g. earthquakes)
»Computer logs
»Stocks & Markets
»Telephone calling patterns
»Weblog click streams
●Challenges
»Scalable Processing
● Mining hidden patterns (frequent patterns or outliers)
● Clustering sequences
»Availability of big datasets?
●Companies do not disclose their data
●Lack of a synthetic generator for big sequence DBs?
Outline
●What is a Sequence Database ?
●Parallel Generation of a Sequence Database
»Requirements
»Whitney Enumerator
●Implementation & Performance Study
●Related Work
●Conclusions & Future Work
Sequence Database
●Definition
»A Sequence Database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time (Jiawei Han, Micheline Kamber and Jian Pei).
●Length
»{1}{3,5} is a 3-sequence
●Count(s)
»Number of sequences that contain s
»Count({1}{3}{3}) = 3 (s1, s2 and s4)
●Support(s)
»Count(s) / DB size
»Support({1}{3}{3}) = 0.75 (75%)
●Subsequence, Super-sequence
»{1}{3}{3} is a subsequence of s1
»s1 is a super-sequence of {1}{3}{3}
ID   Sequence
s1   {1}{1,2,3}{1,3}{1,4}
s2   {1,4}{3}{2,3}{1,5}
s3   {5,6}{1,2}{4,6}{3}{2}
s4   {5}{7}{1,6}{3}{2}{3}
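A minimal Java sketch of Count and Support on the example database above. It assumes the usual containment rule: each itemset of the pattern must be included in a distinct itemset of the sequence, in the same order. The class and method names (SequenceSupport, parse, contains, count, support) are illustrative and are not taken from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class SequenceSupport {
    /** Parses the "{1}{1,2,3}" notation into an ordered list of itemsets. */
    static List<Set<Integer>> parse(String text) {
        List<Set<Integer>> seq = new ArrayList<>();
        for (String part : text.split("\\}")) {           // "{1", "{1,2,3", ...
            Set<Integer> itemset = new TreeSet<>();
            for (String item : part.replace("{", "").split(",")) {
                itemset.add(Integer.parseInt(item.trim()));
            }
            seq.add(itemset);
        }
        return seq;
    }

    /** True if pattern is a subsequence of seq (greedy earliest match). */
    static boolean contains(List<Set<Integer>> seq, List<Set<Integer>> pattern) {
        int next = 0; // index of the next pattern itemset to match
        for (Set<Integer> itemset : seq) {
            if (next < pattern.size() && itemset.containsAll(pattern.get(next))) {
                next++;
            }
        }
        return next == pattern.size();
    }

    /** Number of sequences of db that contain pattern. */
    static long count(List<List<Set<Integer>>> db, List<Set<Integer>> pattern) {
        long n = 0;
        for (List<Set<Integer>> seq : db) {
            if (contains(seq, pattern)) n++;
        }
        return n;
    }

    /** Count divided by the database size. */
    static double support(List<List<Set<Integer>>> db, List<Set<Integer>> pattern) {
        return (double) count(db, pattern) / db.size();
    }

    /** The four sequences s1..s4 of the example table. */
    static List<List<Set<Integer>>> exampleDb() {
        List<List<Set<Integer>>> db = new ArrayList<>();
        for (String s : Arrays.asList("{1}{1,2,3}{1,3}{1,4}", "{1,4}{3}{2,3}{1,5}",
                                      "{5,6}{1,2}{4,6}{3}{2}", "{5}{7}{1,6}{3}{2}{3}")) {
            db.add(parse(s));
        }
        return db;
    }

    public static void main(String[] args) {
        List<List<Set<Integer>>> db = exampleDb();
        List<Set<Integer>> pattern = parse("{1}{2,3}");
        // prints: Count = 2, Support = 0.5
        System.out.println("Count = " + count(db, pattern) + ", Support = " + support(db, pattern));
    }
}
```

The greedy scan is sufficient here because matching a pattern itemset at the earliest possible position never rules out a later match.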
Parallel Sequence Generator
5 Requirements by Huppler K.
Relevance: PSG implements Whitney Enumerators to generate distinct sequences.
Repeatability: for multiple runs with the same input parameters, PSG must output a sequence DB with the same characteristics.
Economy: PSG is open-source and independent of hardware and platform.
Fairness: data sets generated by PSG do not overfit a particular algorithm.
Performance: PSG reports metrics assessing its velocity, scalability and scale-out performance.
Whitney Enumerator
●Whitney Number
»Investigated by Raissi C. and Pei J. for bounding sequential patterns (KDD'2011)
»Given,
●I = {i1, i2, i3, …, in}, a finite set of n items
●k: sequence length
●Wk: the number of sequential patterns of length k
●We propose the Whitney Enumerator for the enumeration of Whitney Numbers
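A minimal Java sketch for computing Wk, assuming the recurrence W0 = 1 and Wk = C(n,1)·W(k-1) + C(n,2)·W(k-2) + … + C(n,k)·W0, obtained by conditioning on the size j of the last itemset of the sequence. This recurrence is an assumption of the sketch rather than a formula quoted from the talk; it does reproduce the value 392,002 reported in the talk for n = 10 and k = 5. The class name WhitneyNumbers and its methods are hypothetical.

```java
import java.math.BigInteger;

final class WhitneyNumbers {
    /** Exact binomial coefficient C(n, r). */
    static BigInteger binomial(int n, int r) {
        if (r < 0 || r > n) return BigInteger.ZERO;
        BigInteger result = BigInteger.ONE;
        for (int i = 1; i <= r; i++) {
            // After step i, result equals C(n - r + i, i), so the division is exact.
            result = result.multiply(BigInteger.valueOf(n - r + i)).divide(BigInteger.valueOf(i));
        }
        return result;
    }

    /** Number of sequences of total length k over n items (assumed recurrence). */
    static BigInteger whitney(int n, int k) {
        BigInteger[] w = new BigInteger[k + 1];
        w[0] = BigInteger.ONE; // the empty sequence
        for (int len = 1; len <= k; len++) {
            BigInteger sum = BigInteger.ZERO;
            for (int j = 1; j <= len; j++) {
                // j = size of the last itemset; the prefix has length len - j
                sum = sum.add(binomial(n, j).multiply(w[len - j]));
            }
            w[len] = sum;
        }
        return w[k];
    }

    public static void main(String[] args) {
        System.out.println(whitney(10, 5)); // prints 392002
    }
}
```

BigInteger is used because the values exceed the range of a 64-bit long well before k = 15 for n = 50 or 100.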
Whitney Enumerators by Example (1/3)
WE4
Whitney Numbers Capacities
n     W5                               W7                                   W15
10    392,002                          76,096,000                           108 Quadrillion (108,056,434,303,249,000)
50    1.37 Billion (1,370,262,510)     7 Trillion (7,031,803,751,400)       4.87 Octillion (4,876,551,905,744,990,0...,000)
100   44.46 Billion (44,462,112,520)   919 Trillion (919,027,432,253,800)   167.7 Nonillion (167,755,438,441,193,0...,000)
Billion: 10^9; Trillion: 10^12; Quadrillion: 10^15; Octillion: 10^27; Nonillion: 10^30
n: number of items composing the sequences
Wk: number of sequences of length k
Implementation & Performance Study
Implementation PSG *MR Job TestDFSIO HPC platform Experiments
●Directions towards an efficient implementation
»Enumerate Source Sequences at low cost
● NextSequence method
»Enumerate different patterns of Source Sequences
● Specify a quota for each pattern type
»Add noise to Source Sequences
»Enumerate Source Sequences in parallel
● Each Sequence generator emits a different source sequence
● Or, in order to distribute sequential patterns among different files, each Sequence generator emits the sequence a number of times
●Apache Hadoop YARN (2.4), MapReduce framework
»HDFS files can be processed by Apache Spark, Apache Storm
»The Whitney package is coded in Java and can be reused in a traditional parallel implementation
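To illustrate what enumerating source sequences involves, here is a small Java sketch that lists every sequence of total length k over the items 1..n. It is not the NextSequence method of the Whitney package: the class SequenceEnumerator and its methods are hypothetical, and the sketch materializes the whole list in memory, which is only practical for tiny n and k.

```java
import java.util.ArrayList;
import java.util.List;

final class SequenceEnumerator {
    /** Every sequence of total length k over items 1..n, in "{..}{..}" notation. */
    static List<String> enumerate(int n, int k) {
        List<String> out = new ArrayList<>();
        extend(n, k, "", out);
        return out;
    }

    /** Appends one more itemset to prefix until no items remain to place. */
    private static void extend(int n, int remaining, String prefix, List<String> out) {
        if (remaining == 0) {
            out.add(prefix);
            return;
        }
        for (int size = 1; size <= Math.min(remaining, n); size++) {
            for (String itemset : combinations(n, size)) {
                extend(n, remaining - size, prefix + itemset, out);
            }
        }
    }

    /** All itemsets of the given size over items 1..n, items in increasing order. */
    static List<String> combinations(int n, int size) {
        List<String> out = new ArrayList<>();
        choose(n, size, 1, new ArrayList<Integer>(), out);
        return out;
    }

    private static void choose(int n, int size, int next, List<Integer> picked, List<String> out) {
        if (picked.size() == size) {
            StringBuilder sb = new StringBuilder("{");
            for (int i = 0; i < picked.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(picked.get(i));
            }
            out.add(sb.append('}').toString());
            return;
        }
        for (int item = next; item <= n; item++) {
            picked.add(item);
            choose(n, size, item + 1, picked, out);
            picked.remove(picked.size() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        // n = 2, k = 2 yields {1}{1}, {1}{2}, {2}{1}, {2}{2} and {1,2}
        System.out.println(enumerate(2, 2));
    }
}
```

A generator meant for big databases would instead produce the next sequence on demand, so that each parallel worker can start from its own position in the enumeration without storing earlier sequences.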
Implementation & Performance Study
●Parallel Sequence Generator
»PSG for short
»Mapper set up
●Parse job parameters: WEk, #sequences, support, ...
»Mapper (sequence generator, worker)
●Iterate (#source sequences / #mappers) times
●Generate a new Source Sequence: s
●Iterate avg(count(s)) times
●Write the sequence noise(s) to the mapper data file
●Emit #sequences and ∑ bytes written to HDFS
»Reducer
●Calculates ∑ #sequences and ∑ bytes
●PSG versus TestDFSIO
»In order to highlight the sequence generation overhead
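The mapper and reducer steps above can be sketched in plain Java, without the Hadoop API. Everything in this sketch is an assumption made for illustration: the class PsgWorkerSketch, the in-memory StringBuilder standing in for the mapper data file, and in particular the noise function, which here inserts one random singleton itemset so that the source sequence remains a subsequence of the noisy one. A fixed seed makes two runs with the same parameters produce the same output.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class PsgWorkerSketch {
    /** What a worker emits: number of sequences and number of bytes written. */
    static final class Stats {
        long sequences;
        long bytes;
    }

    /** Hypothetical noise: inserts one random singleton itemset at a random position. */
    static List<String> noise(List<String> s, int nItems, Random rnd) {
        List<String> noisy = new ArrayList<>(s);
        int position = rnd.nextInt(noisy.size() + 1);
        noisy.add(position, "{" + (1 + rnd.nextInt(nItems)) + "}");
        return noisy;
    }

    /** Worker loop: each source sequence is written 'repeats' times, with noise. */
    static Stats runWorker(List<List<String>> sources, int repeats, int nItems,
                           long seed, StringBuilder dataFile) {
        Random rnd = new Random(seed); // fixed seed: same inputs give the same output
        Stats stats = new Stats();
        for (List<String> s : sources) {
            for (int r = 0; r < repeats; r++) {
                String line = String.join("", noise(s, nItems, rnd)) + "\n";
                dataFile.append(line);
                stats.sequences++;
                stats.bytes += line.getBytes(StandardCharsets.UTF_8).length;
            }
        }
        return stats;
    }

    /** Reducer step: sums the statistics emitted by all workers. */
    static Stats reduce(List<Stats> perWorker) {
        Stats total = new Stats();
        for (Stats s : perWorker) {
            total.sequences += s.sequences;
            total.bytes += s.bytes;
        }
        return total;
    }
}
```

In this sketch each worker receives its own share of the source sequences as the sources argument, which mirrors the split of #source sequences across mappers.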
Implementation & Performance Study
●TestDFSIO
»MapReduce implementation of a distributed IO benchmark tool for the Hadoop Distributed File System
»Supports Read and Write workloads
»Write workload example: create 10 files of 10 GB each in order to generate a 100 GB dataset
● Each data file is generated by a Mapper
● At set up, each Mapper creates a buffer (1 MB by default) and repeatedly appends the buffer to its data file until the file size reaches the specified input value (10 GB for instance)
●TeraGen
»MapReduce implementation of a parallel record generator for the TeraSort Benchmark
»Cons: generates keys only in order to simulate input data files for the MapReduce job
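The TestDFSIO write loop described above (append a fixed buffer until the target file size is reached) can be sketched as follows. This is not the Hadoop TestDFSIO source code: the class BufferedFill is hypothetical, and any java.io.OutputStream stands in for the HDFS data file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class BufferedFill {
    /** Appends a fixed buffer to out until targetBytes are written; returns the bytes written. */
    static long fill(OutputStream out, long targetBytes, int bufferSize) {
        byte[] buffer = new byte[bufferSize]; // created once, reused for every append
        long written = 0;
        try {
            while (written < targetBytes) {
                // The last append may be shorter than the buffer.
                int chunk = (int) Math.min(bufferSize, targetBytes - written);
                out.write(buffer, 0, chunk);
                written += chunk;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        long written = fill(file, 2500, 1024); // two full buffers plus 452 bytes
        System.out.println(written + " bytes written");
    }
}
```

Because the buffer content is never recomputed, such a loop measures raw write throughput only, which is why comparing it with PSG isolates the cost of generating sequences.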
Implementation & Performance Study
●French grid platform: Grid5000
»Suno nodes @Sophia site
● 32 GB of memory
● 2 CPUs per node and 4 cores per CPU
● each CPU is Intel Xeon E5520, 2.27 GHz
»Cluster of 3 to 11 Suno nodes
● 1 master and 2 to 10 slaves
»10Gbps Ethernet
●Set up
»Automatic deployment of Apache Hadoop YARN on a cluster of
Grid5K nodes
»JRE 1.7
Implementation & Performance Study
●Frequent sequential pattern use case
●PSG key parameters
»Cluster size N is 3, 6 or 11 nodes
● 1 master and N-1 slaves (workers)
»Number of Sequence Generators (mappers)
»Number of Source Sequences to inject
● Source sequences are 5-sequences, i.e. WE5
● Avg size of each sequence is 420 bytes
● On average, each source sequence repeats a number of times equal to 5% of the number of source sequences injected
●TestDFSIO parameters
»Cluster size
»Number of mappers (i.e. number of files)
»Size of each file (GB)
●Key performance factors
»Write throughput in #sequences per second for PSG
»Write throughput in bytes per second for PSG and TestDFSIO
Implementation & Performance Study
2 slaves 5 slaves 10 slaves Scalability Scale-out
Max. sequence DB of 450 GB, more than 2 billion sequences
Max. throughput of 315 MBps for 90K source sequences (DB is 91 GB)
Implementation & Performance Study
Max. sequence DB of 1.8 TB, more than 8 billion sequences, at 694 MBps
Max. throughput of 741 MBps for 180K source sequences (DB is 365 GB)
Implementation & Performance Study
Max. throughput of 3 million sequences per second
Implementation & Performance Study
Max. sequence DB of 4 TB, more than 18 billion sequences, at 1.2 GBps
(in less than one hour: 57 minutes)
Implementation & Performance Study
Max. sequence DB of 4 TB, more than 18 billion sequences, at 5.3 million sequences per second
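As a quick sanity check of these figures, the generation time implied by a sequence count and a throughput can be computed directly. The sketch below uses the rounded values quoted above (18 billion sequences, 5.3 million sequences per second), so the result is only approximate; the class GenerationTime is hypothetical.

```java
final class GenerationTime {
    /** Minutes needed to write totalSequences at a steady sequencesPerSecond rate. */
    static double minutes(double totalSequences, double sequencesPerSecond) {
        return totalSequences / sequencesPerSecond / 60.0;
    }

    public static void main(String[] args) {
        // About 56.6 minutes with the rounded figures, i.e. under one hour.
        System.out.printf("%.1f minutes%n", minutes(18e9, 5.3e6));
    }
}
```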
Implementation & Performance Study
Limited by storage capacities of each cluster
Implementation & Performance Study
n times the number of data nodes results in n times the write throughput
Related Work
●IBM Quest Synthetic Data Generator
»Generates sequential patterns
»Capacity and velocity issues
● IBM Quest Synthetic Data Generator was not designed for fast generation of big sequence databases
»Functioning issues
● Random corruption of the base sequential patterns used for populating the sequence database: no guarantee that a sequential pattern repeats a number of times proportional to the database size.
●Other Big Data Generators with different objectives
»TeraSort: sorts a large number of records generated by TeraGen
»PDGF: parallel generation of a big data warehouse
Conclusions
●Proposal, implementation & tests of Whitney Enumerators
»Whitney Enumerators allow the generation of a large number of distinct sequences
»Open source Java code
●Parallel Sequence Generator
»Scalable and formal approach for Parallel Generation of Big Synthetic Sequence Databases satisfying both user-specified sequence characteristics and velocity requirements
»Scalability and scale-out performance study
»Open source Java code
●Project Homepage
»https://sites.google.com/site/rimmoussa/miningbigseqdb
Future Work
●Mining of frequent Sequential patterns
»A-priori based algorithms
● GSP (done)
»Pattern growth algorithms
● PrefixSpan (done)
»Optimized algorithms (on-going)
● Fast pruning of candidates
● Efficient mining of long sequential patterns
●Investigate (on-going)
»Cabs' trajectories: Rome Cabs (Feb. 2014, 1.6 GB)
»Computer logs: Dartmouth Univ. log data set (6 GB, 2001-2006)
»Customers' shopping transactions in the TPC-H benchmark
●Generate and Mine more Kinds of Patterns
»Sequential patterns: a → b
»Parallel patterns: a // b
»Regular expressions of patterns
Thank you!
Q & A
A Formal & Scalable Approach for Parallel Generation of Big Synthetic
Sequence Database
Rim Moussa
7th TPC-TC@Hawaii
31st of August, 2015