2019 HPCC Systems® Community Day
Challenge Yourself – Challenge the Status Quo
Leveraging Intra-Node Parallelization in HPCC Systems
Fabian Fier
Motivation
Parallelize Set Similarity Join
• Many applications need to identify similar
pairs of documents:
• Plagiarism detection
• Community mining in social networks
• Near-duplicate web page detection
• Document clustering
• ...
• Operation: Set similarity join (SSJ)
• Find all pairs of records (r, s) where
sim(r, s) ≥ t (r ∈ R, s ∈ S)
• Nice to have in a distributed system
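The slides leave the concrete similarity function open (only sim(r, s) ≥ t); the experiments later use Jaccard with t = 0.7. A minimal, illustrative C++ sketch of Jaccard similarity over token sets:

```cpp
#include <algorithm>
#include <iterator>
#include <set>

// Jaccard similarity of two token sets: |r ∩ s| / |r ∪ s|.
// Illustrative only -- the slides do not fix a concrete sim() function.
double jaccard(const std::set<int>& r, const std::set<int>& s) {
    std::set<int> inter;
    std::set_intersection(r.begin(), r.end(), s.begin(), s.end(),
                          std::inserter(inter, inter.begin()));
    double uni = (double)(r.size() + s.size() - inter.size());
    return uni == 0.0 ? 0.0 : inter.size() / uni;
}
```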
Naïve Approach to Compute SSJ
• …
• L_R computeSimilarity(L_R r, L_R s) := TRANSFORM
• SELF.RecordId1 := r.RecordId;
• SELF.RecordId2 := s.RecordId;
• SELF.Sim := (compute similarity);
• END;
• …
• resToFilter := JOIN(R, S, TRUE, computeSimilarity(LEFT, RIGHT), ALL);
• result := resToFilter(Sim > 90);
Issue a: memory exhaustion due to excessive replication
Parallelize Filter-and-Verification Approaches
• Use data characteristics to replicate and group independent data (inverted index)
Records:
  r1: a b e
  r2: a d e
  r3: b c d e f g
Inverted index:
  a -> r1, r2
  b -> r1, r3
  c -> r3
  d -> r2, r3
  e -> r1, r2, r3
  f -> r3
  g -> r3
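The inverted index above is a plain token-to-record-id map; a minimal C++ sketch of building it:

```cpp
#include <map>
#include <string>
#include <vector>

// Build the inverted index from the table above: token -> list of record
// ids containing it. Iterating a std::map visits record ids in sorted
// order (r1 < r2 < r3), so each posting list comes out sorted.
std::map<std::string, std::vector<std::string>>
buildInvertedIndex(const std::map<std::string, std::vector<std::string>>& records) {
    std::map<std::string, std::vector<std::string>> index;
    for (const auto& [rid, tokens] : records)
        for (const auto& tok : tokens)
            index[tok].push_back(rid);
    return index;
}
```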
Parallelize Filter-and-Verification Approaches
Issue b: straggling executors
Issue c: not scalable; only suitable for small datasets
Cf. Fier et al.: Set Similarity Joins on MapReduce: An Experimental Survey
Approach
Basic Ideas
1. Global replication and grouping -> addresses a, c
• Without data dependencies
• Respecting system restrictions (RAM)
2. Use local parallelization more efficiently (> 1 core per executor) -> addresses b
• Use existing approaches' local data structures, accessible by multiple cores
Wish list: a) Stay in RAM, b) Efficient use of CPUs, c) Scalability to Big Data
Potential!
Idea 1: Global Replication and Grouping
• Apply a hash: no data dependency
• Choose the hash such that
• #groups < #executors
• groups fit into the RAM of an executor
Example: self-join with four hash groups p1–p4; the upper-triangular group pairs cover all candidate pairs:

      p1      p2      p3      p4
p1  p1⋈p1  p1⋈p2  p1⋈p3  p1⋈p4
p2          p2⋈p2  p2⋈p3  p2⋈p4
p3                  p3⋈p3  p3⋈p4
p4                          p4⋈p4
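A sketch of Idea 1 for a self-join, assuming k hash groups (function names are illustrative, not from the actual plugin): each record maps to one group purely by hash, and every group pair p_i ⋈ p_j with i ≤ j becomes one independent task.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Assign a record to one of k groups purely by hashing its id: no data
// dependency, in contrast to replication based on data characteristics.
size_t groupOf(const std::string& recordId, size_t k) {
    return std::hash<std::string>{}(recordId) % k;
}

// Enumerate the upper-triangular group pairs of a self-join; together
// they cover every candidate record pair exactly once.
std::vector<std::pair<size_t, size_t>> selfJoinTasks(size_t k) {
    std::vector<std::pair<size_t, size_t>> tasks;
    for (size_t i = 0; i < k; ++i)
        for (size_t j = i; j < k; ++j)
            tasks.emplace_back(i, j);
    return tasks;
}
```

For k = 4 this yields the ten tasks shown in the self-join grid.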
Idea 2: Leverage Local Parallelization
• HPCC Systems allows multiple executors per node
• However, executors cannot share data without copying
• Use multithreading in each executor, with access to a global inverted index
• C++ std threads within one executor
• allow fine-grained control over threads, especially pinning, to avoid CPU migrations (NUMA effects)
• Multithreaded user-defined functions are not officially supported… ;-)
• Necessary to write a plugin; embedded code doesn't work
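A sketch of pinning a worker thread to a fixed core, assuming Linux (pthread_setaffinity_np); this is illustrative and not the plugin's actual code:

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin the calling thread to one core so the scheduler does not migrate
// it between CPUs later (avoids NUMA effects). Linux-specific sketch.
void pinSelfToCore(int core) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

// Spawn a thread, pin it before any real work, report which core it ran on.
int pinnedProbe(int core) {
    int cpu = -1;
    std::thread t([&] {
        pinSelfToCore(core);   // pin first, so no migration afterwards
        cpu = sched_getcpu();  // core the thread now runs on
    });
    t.join();                  // join() synchronizes the write to cpu
    return cpu;
}
```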
Details
• main thread (void ppj2())
• copies the input into InputDS: an array of structs plus pointers to token arrays (necessary for random access)
• creates the inverted index
• spawns threads
• copies threadResults into the resultDS dataset
• worker thread
• processes batchSize records
• writes results back to a shared vector threadResults -> synchronization necessary
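The main-thread/worker-thread split above can be sketched as follows; threadResults and batchSize follow the slide, while the per-record work is a dummy stand-in for the actual probe-and-verify step:

```cpp
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

struct ResultPair { int id1, id2; };

// Each worker processes a batch of records into a private vector, then
// appends it to the shared threadResults; that single write-back is the
// only place needing synchronization.
void runWorkers(const std::vector<int>& inputDS, int numThreads,
                std::vector<ResultPair>& threadResults) {
    std::mutex resultMutex;
    int batchSize = ((int)inputDS.size() + numThreads - 1) / numThreads;
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            std::vector<ResultPair> local;  // per-thread, lock-free
            int begin = t * batchSize;
            int end = std::min(begin + batchSize, (int)inputDS.size());
            for (int i = begin; i < end; ++i)
                local.push_back({inputDS[i], inputDS[i]});  // dummy "result"
            std::lock_guard<std::mutex> g(resultMutex);     // write back once
            threadResults.insert(threadResults.end(),
                                 local.begin(), local.end());
        });
    }
    for (auto& w : workers) w.join();
}
```

Batching the write-back keeps lock contention low: each worker takes the mutex once per batch, not once per result.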
[Diagram: executors 1–4, each holding InputDS, the inverted index, threadResults, and ResultDS; a main thread spawns multiple worker threads.]
Hands On!
Compile and Install Plugin
• Download the HPCC source code (same version as on the cluster)
• Make it compile ;-)
• Refer to plugins/exampleplugin
• C++ Mappings in the ECL documentation help with "undefined symbol" errors
• Add the new plugin to the cmake config files
• Compile and deploy the .so file to each cluster node
• Cluster in a "blocked" state: pkill the executors on all slave nodes
• Use DBGLOG() to write to the ECL logs
Monitoring: netdata
Installation: bash <(curl -Ss https://my-netdata.io/kickstart.sh)
Graphs render in the browser; custom dashboards can show multiple nodes.
Experiments: Data Scalability
• DBLP dataset 1x-25x
• threshold(Jaccard)=0.7
• numThreads=2
[Chart: runtime (s) vs. dataset scale, 1x–25x]
Wish list: a) Stay in RAM, b) Efficient use of CPUs, c) Scalability to Big Data
Experiments: Thread Scalability
• DBLP dataset 25x
• threshold(Jaccard)=0.7
• numThreads=2-32
[Chart: runtime (s) vs. number of threads per executor]
Current Work
• Utilize local parallelization better
• Optimize for NUMA effects by pinning threads that share datasets to cores of one CPU
Lessons Learned
• Less (complexity) is more
• Hash-based replication and grouping is more robust than relying on data characteristics
• Fine-grained filter optimizations (filter-and-verification approach) do not have a big effect on the overall runtime in a distributed environment. In fact, we didn't use any sophisticated filter here.
Thank you!
Special thanks to LexisNexis for providing a research grant
View this presentation on YouTube:
https://www.youtube.com/watch?v=nTWpfa0wdDk&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=7&t=0s (3:13)


Editor's Notes

  • #20: 5 executors/node, 24 cores (4 CPUs)
  • #21: It is not the synchronization; even without it, it does not get faster