2019 HPCC Systems® Community Day
Challenge Yourself – Challenge the Status Quo
Leveraging Intra-Node Parallelization in HPCC Systems
Fabian Fier
Motivation
Parallelize Set Similarity Join
• Many applications need to identify similar
pairs of documents:
• Plagiarism detection
• Community mining in social networks
• Near-duplicate web page detection
• Document clustering
• ...
• Operation: Set similarity join (SSJ)
• Find all pairs of records (r, s) where
sim(r, s) ≥ t (r ∈ R, s ∈ S)
• Nice to have in a distributed system
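The slides leave the concrete similarity function open (only sim(r, s) ≥ t); the experiments later use Jaccard with t = 0.7. A minimal, illustrative C++ sketch of Jaccard similarity over token sets:

```cpp
#include <algorithm>
#include <iterator>
#include <set>

// Jaccard similarity of two token sets: |r ∩ s| / |r ∪ s|.
// Illustrative only -- the slides do not fix a concrete sim() function.
double jaccard(const std::set<int>& r, const std::set<int>& s) {
    std::set<int> inter;
    std::set_intersection(r.begin(), r.end(), s.begin(), s.end(),
                          std::inserter(inter, inter.begin()));
    double uni = (double)(r.size() + s.size() - inter.size());
    return uni == 0.0 ? 0.0 : inter.size() / uni;
}
```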
Naïve Approach to Compute SSJ
• …
• L_R computeSimilarity(L_R r, L_R s) := TRANSFORM
• SELF.RecordId1 := r.RecordId;
• SELF.RecordId2 := s.RecordId;
• SELF.Sim := (compute similarity);
• END;
• …
• resToFilter := JOIN(R, S, TRUE, computeSimilarity(LEFT, RIGHT), ALL);
• result := resToFilter(Sim > 90);
Issue a: memory exhaustion due to excessive replication
Parallelize Filter-and-Verification Approaches
• Use data characteristics to replicate and group independent data (inverted index)
Records:
  r1: a b e
  r2: a d e
  r3: b c d e f g
Inverted index:
  a -> r1, r2
  b -> r1, r3
  c -> r3
  d -> r2, r3
  e -> r1, r2, r3
  f -> r3
  g -> r3
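The inverted index above is a plain token-to-record-id map; a minimal C++ sketch of building it:

```cpp
#include <map>
#include <string>
#include <vector>

// Build the inverted index from the table above: token -> list of record
// ids containing it. Iterating a std::map visits record ids in sorted
// order (r1 < r2 < r3), so each posting list comes out sorted.
std::map<std::string, std::vector<std::string>>
buildInvertedIndex(const std::map<std::string, std::vector<std::string>>& records) {
    std::map<std::string, std::vector<std::string>> index;
    for (const auto& [rid, tokens] : records)
        for (const auto& tok : tokens)
            index[tok].push_back(rid);
    return index;
}
```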
Parallelize Filter-and-Verification Approaches
Issue b: straggling executors
Issue c: not scalable; only suitable for small datasets
Cf. Fier et al.: Set Similarity Joins on MapReduce: An Experimental Survey
Approach
Basic Ideas
1. Global replication and grouping -> addresses a, c
• Without data dependencies
• Respecting system restrictions (RAM)
2. Use local parallelization more efficiently (> 1 core per executor) -> addresses b
• Use existing approaches' local data structures, accessible by multiple cores
Wish list: a) Stay in RAM, b) Efficient use of CPUs, c) Scalability to Big Data
Potential!
Idea 1: Global Replication and Grouping
• Apply a hash: no data dependency
• Choose the hash such that
• #groups < #executors
• groups fit into the RAM of an executor
Example: self-join with four hash groups p1–p4; the upper-triangular group pairs cover all candidate pairs:

      p1      p2      p3      p4
p1  p1⋈p1  p1⋈p2  p1⋈p3  p1⋈p4
p2          p2⋈p2  p2⋈p3  p2⋈p4
p3                  p3⋈p3  p3⋈p4
p4                          p4⋈p4
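A sketch of Idea 1 for a self-join, assuming k hash groups (function names are illustrative, not from the actual plugin): each record maps to one group purely by hash, and every group pair p_i ⋈ p_j with i ≤ j becomes one independent task.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Assign a record to one of k groups purely by hashing its id: no data
// dependency, in contrast to replication based on data characteristics.
size_t groupOf(const std::string& recordId, size_t k) {
    return std::hash<std::string>{}(recordId) % k;
}

// Enumerate the upper-triangular group pairs of a self-join; together
// they cover every candidate record pair exactly once.
std::vector<std::pair<size_t, size_t>> selfJoinTasks(size_t k) {
    std::vector<std::pair<size_t, size_t>> tasks;
    for (size_t i = 0; i < k; ++i)
        for (size_t j = i; j < k; ++j)
            tasks.emplace_back(i, j);
    return tasks;
}
```

For k = 4 this yields the ten tasks shown in the self-join grid.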
Idea 2: Leverage Local Parallelization
• HPCC Systems allows multiple executors per node
• However, executors cannot share data without copying
• Use multithreading in each executor, with access to a global inverted index
• C++ std threads within one executor
• allow fine-grained control over threads, especially pinning, to avoid CPU migrations (NUMA effects)
• Multithreaded user-defined functions are not officially supported… ;-)
• Necessary to write a plugin; embedded code doesn't work
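A sketch of pinning a worker thread to a fixed core, assuming Linux (pthread_setaffinity_np); this is illustrative and not the plugin's actual code:

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin the calling thread to one core so the scheduler does not migrate
// it between CPUs later (avoids NUMA effects). Linux-specific sketch.
void pinSelfToCore(int core) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

// Spawn a thread, pin it before any real work, report which core it ran on.
int pinnedProbe(int core) {
    int cpu = -1;
    std::thread t([&] {
        pinSelfToCore(core);   // pin first, so no migration afterwards
        cpu = sched_getcpu();  // core the thread now runs on
    });
    t.join();                  // join() synchronizes the write to cpu
    return cpu;
}
```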
Details
• main thread (void ppj2())
• copies the input into InputDS: an array of structs plus pointers to token arrays (necessary for random access)
• creates the inverted index
• spawns threads
• copies threadResults into the resultDS dataset
• worker thread
• processes batchSize records
• writes results back to a shared vector threadResults -> synchronization necessary
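The main-thread/worker-thread split above can be sketched as follows; threadResults and batchSize follow the slide, while the per-record work is a dummy stand-in for the actual probe-and-verify step:

```cpp
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

struct ResultPair { int id1, id2; };

// Each worker processes a batch of records into a private vector, then
// appends it to the shared threadResults; that single write-back is the
// only place needing synchronization.
void runWorkers(const std::vector<int>& inputDS, int numThreads,
                std::vector<ResultPair>& threadResults) {
    std::mutex resultMutex;
    int batchSize = ((int)inputDS.size() + numThreads - 1) / numThreads;
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            std::vector<ResultPair> local;  // per-thread, lock-free
            int begin = t * batchSize;
            int end = std::min(begin + batchSize, (int)inputDS.size());
            for (int i = begin; i < end; ++i)
                local.push_back({inputDS[i], inputDS[i]});  // dummy "result"
            std::lock_guard<std::mutex> g(resultMutex);     // write back once
            threadResults.insert(threadResults.end(),
                                 local.begin(), local.end());
        });
    }
    for (auto& w : workers) w.join();
}
```

Batching the write-back keeps lock contention low: each worker takes the mutex once per batch, not once per result.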
[Diagram: executors 1–4, each holding InputDS, the inverted index, threadResults, and ResultDS; a main thread spawns multiple worker threads.]
Hands On!
Compile and Install Plugin
• Download the HPCC source code (same version as on the cluster)
• Make it compile ;-)
• Refer to plugins/exampleplugin
• C++ Mappings in the ECL documentation help with "undefined symbol" errors
• Add the new plugin to the cmake config files
• Compile and deploy the .so file to each cluster node
• Cluster in a "blocked" state: pkill the executors on all slave nodes
• Use DBGLOG() to write to the ECL logs
Monitoring: netdata
Installation: bash <(curl -Ss https://my-netdata.io/kickstart.sh)
Graphs render in the browser; custom dashboards can show multiple nodes.
Experiments: Data Scalability
• DBLP dataset 1x-25x
• threshold(Jaccard)=0.7
• numThreads=2
[Chart: runtime (s) vs. dataset scale, 1x–25x]
Wish list: a) Stay in RAM, b) Efficient use of CPUs, c) Scalability to Big Data
Experiments: Thread Scalability
• DBLP dataset 25x
• threshold(Jaccard)=0.7
• numThreads=2-32
[Chart: runtime (s) vs. number of threads per executor]
Current Work
• Utilize local parallelization better
• Optimize for NUMA effects by pinning threads that share datasets to cores of one CPU
Lessons Learned
• Less (complexity) is more
• Hash-based replication and grouping is more robust than relying on data characteristics
• Fine-grained filter optimizations (filter-and-verification approach) do not have a big effect on the overall runtime in a distributed environment. In fact, we didn't use any sophisticated filter here.
Thank you!
Special thanks to LexisNexis for providing a research grant
View this presentation on YouTube:
https://www.youtube.com/watch?v=nTWpfa0wdDk&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=7&t=0s (3:13)


Editor's Notes

  • #20: 5 executors/node, 24 cores (4 CPUs)
  • #21: It is not the synchronization; even without it, it does not get faster