SlideShare a Scribd company logo
1 of 49
Download to read offline
Ben Langmead
Assistant Professor
Johns Hopkins University, Department of Computer Science
November 2016
Reshaping core genomics software
tools for the many-core era
3
The Intel Parallel Computing Center at
4
Outline
§  Introduction to sequencing data analysis & Bowtie
§  Thread scaling improvements using TBB
–  Choice of mutex
–  Two-stage parsing
§  AVX2, AVX512-KNC & AVX512-KNC improvements
§  Impact on the field
5
Sequencing
6
Sequencing
7
Sequencing
8
Read alignment
9
Read alignment
§  Needle in a haystack
§  Billions of reads from
a single week-long
sequencing run
§  Human reference
genome is ~3B bases
(letters) long
10
Bowtie and Bowtie 2
§  Together cited by
>12K other scientific
studies since 2009
§  Bundled with dozens
of other tools & many
Linux distros
11
HISAT
§  Based on Bowtie 2 and a leading spliced aligner for RNA sequencing data
§  Cited in >75 scientific studies since 2015
12
Design of Bowtie & Bowtie 2
Bowtie 1
Bowtie 2
13
Design of Bowtie & Bowtie 2
Bowtie 1
Bowtie 2
Random access to large index
data structure and minimal ILP
14
Design of Bowtie & Bowtie 2
Bowtie 1
Bowtie 2
Dynamic programming, lots of ILP
0
250
500
750
1000
0 25 50 75 100 125
# threads (unpaired)
Normalizedrunningtime
lock
TBB spin_mutex
tinythreads fast_mutex
15
Thread scaling
§  Switching to analogous TBB lock could bring big improvement
Ivy Bridge, 4 NUMA nodes,
120 threads
Vertical axis is per-
thread running time;
lower is better
Bowtie 1 unpaired
100
150
200
250
300
350
0 25 50 75 100 125
# threads (unpaired)
Normalizedrunningtime
lock
None (stubbed I/O)
TBB spin_mutex
version
Original parsing
16
Thread scaling
§  Removing synchronization by “stubbing” input lock gives further improvement
Bowtie 2 unpairedIvy Bridge, 4 NUMA nodes,
120 threads
Vertical axis is per-thread
running time; lower is better
17
Thread scaling
§  Vtune investigation indicates synchronization itself (e.g. see __TBB_LockByte)
is taking the time
18
Thread scaling
Bowtie 2 unpaired
How to close the
gap between
actual and ideal
performance?
19
Thread scaling
Bowtie 2 unpaired
Why does mutex
choice have outsize
effect?
CMU 15-418/618, Spring 2015
Test-and-set lock performance
Benchmark&executes:&
lock(L);&
critical>section(c)&
unlock(L);
Time(us)
Number of processors
Benchmark: total of N lock/unlock sequences (in aggregate) by P processors
Critical section time removed so graph plots only time acquiring/releasing the lock
Bus contention increases amount of
time to transfer lock (lock holder must
wait to acquire bus to release)
Not shown: bus contention also slows
down execution of critical section
Figure credit: Culler, Singh, and Gupta
20
Thread scaling
§  Mutex spinning on atomic op
(compare-and-swap, test-and-
set), spurs exchange of cache
coherence messages
§  Image by Kayvon Fatahalian,
Copyright 2015 Carnegie
Mellon University
21
Thread scaling
§  Even a standard pthreads mutex was outperforming the
spin lock when running one thread per available core
–  More evidence that cache coherence traffic is culprit
§  Queue locks are known to have better cache properties
–  Waiting thread spins on normal (non-atomic) read
–  Cache line read belongs exclusively to that thread
and can live in L1
22
Thread scaling
§  We hypothesized a NUMA-aware “cohort lock” could help further
Dice, David, Virendra J. Marathe, and Nir Shavit. "Lock cohorting: a general technique for
designing NUMA locks." ACM SIGPLAN Notices. Vol. 47. No. 8. ACM, 2012.
23
Cohort locking
class	
  CohortLock	
  {	
  
public:	
  
	
  	
  CohortLock()	
  :	
  lockers_numa_idx(-­‐1)	
  {	
  
	
  	
  	
  	
  starvation_counters	
  =	
  new	
  int[MAX_NODES]();	
  
	
  	
  	
  	
  own_global	
  =	
  new	
  bool[MAX_NODES]();	
  
	
  	
  	
  	
  local_locks	
  =	
  new	
  TKTLock[MAX_NODES];	
  
	
  	
  }	
  
	
  	
  ~CohortLock()	
  {	
  
	
  	
  	
  	
  delete[]	
  starvation_counters;	
  
	
  	
  	
  	
  delete[]	
  own_global;	
  
	
  	
  	
  	
  delete[]	
  local_locks;	
  
	
  	
  }	
  
	
  	
  void	
  lock();	
  
	
  	
  void	
  unlock();	
  
private:	
  
	
  	
  static	
  const	
  int	
  STARVATION_LIMIT	
  =	
  100;	
  
	
  	
  static	
  const	
  int	
  MAX_NODES	
  =	
  128;	
  
	
  	
  volatile	
  int*	
  	
  starvation_counters;	
  //	
  1	
  per	
  node	
  
	
  	
  volatile	
  bool*	
  own_global;	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  //	
  1	
  per	
  node	
  
	
  	
  volatile	
  int	
  	
  	
  lockers_numa_idx;	
  	
  	
  	
  //	
  1	
  per	
  node	
  
	
  	
  TKTLock*	
  local_locks;	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  //	
  1	
  per	
  node	
  
	
  	
  PTLLock	
  global_lock;	
  
};	
  
§  Each NUMA node has per-node ticket lock
§  Other per-node information tracks when to
pass lock to other threads on same node
§  Single global partitioned ticket lock
24
Cohort locking
void	
  CohortLock::lock()	
  {	
  
	
  	
  const	
  int	
  numa_idx	
  =	
  determine_numa_idx();	
  
	
  	
  local_locks[numa_idx].lock();	
  
	
  	
  if(!own_global[numa_idx])	
  {	
  
	
  	
  	
  	
  	
  	
  global_lock.lock();	
  
	
  	
  }	
  
	
  	
  starvation_counters[numa_idx]++;	
  
	
  	
  own_global[numa_idx]	
  =	
  true;	
  
	
  	
  lockers_numa_idx	
  =	
  numa_idx;	
  
}	
  
	
  
void	
  CohortLock::unlock()	
  {	
  
	
  	
  assert(lockers_numa_idx	
  !=	
  -­‐1);	
  
	
  	
  int	
  numa_idx	
  =	
  lockers_numa_idx;	
  
	
  	
  lockers_numa_idx	
  =	
  -­‐1;	
  
	
  	
  if(local_locks[numa_idx].q_length()	
  ==	
  1	
  ||	
  
	
  	
  	
  	
  	
  starvation_counters[numa_idx]	
  >	
  STARVATION_LIMIT)	
  
	
  	
  {	
  
	
  	
  	
  	
  global_lock.unlock();	
  
	
  	
  	
  	
  starvation_counters[numa_idx]	
  =	
  0;	
  
	
  	
  	
  	
  own_global[numa_idx]	
  =	
  false;	
  
	
  	
  }	
  
	
  	
  local_locks[numa_idx].unlock();	
  
}	
  
§  When locking:
–  Grab local lock
–  Once grabbed, grab global lock if not
already owned by this node
§  When unlocking:
–  Is another thread on same node queued?
If so, hand lock to next in queue
–  Otherwise release global & local locks
–  Override hand-off if others are starving
25
Cohort locking
§  Another implementation of cohort locking available in ConcurrencyKit:
http://concurrencykit.org
–  https://github.com/concurrencykit/ck/blob/master/include/ck_cohort.h
26
Thread scaling
§  Chris Wilks added TBB queue locks, JHU/TBB
Cohort locks (2 flavors) to Bowtie 2, Bowtie & HISAT
§  Available in public branches, with all but cohort locks
available in master branch and in recent releases
27
Thread scaling
§  Novel strategy splits input parsing into two “phases”
§  First (“light parsing”) rapidly detects record
boundaries, requiring synchronization but with very
brief critical section
§  Second (“full parsing”) fully parses each record
(pictured, right) with no synchronization
§  Minimizes time spent in crucial critical section
@ABC_123_1
GCTATTATGCTAT
+
JJSYEGGU8233^
@ABC_424_1
GTGATATGCAT
+
SYEG!U8@233
@ABCD_9_1
GCTATTATGCTATAAAC
+
JJSYEGGU8233^32FR
@D_91231_1
GCTATTATGCTAT
+
JJSYEGGU8233^
…
100
200
300
0 25 50 75 100 125
# threads (unpaired)
Normalizedrunningtime
lock
None (stubbed I/O)
TBB mutex
TBB queuing_mutex
TBB spin_mutex
TBB/JHU CohortLock
tinythreads fast_mutex
28
Thread scaling: Bowtie 2 unpaired
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  TBB queuing_mutex and TBB/JHU cohort lock perform best
100
200
300
0 25 50 75 100 125
# threads (unpaired)
Normalizedrunningtime
None (stubbed I/O)
TBB mutex
TBB queuing_mutex
TBB spin_mutex
TBB/JHU CohortLock
tinythreads fast_mutex
version
Optimized parsing
Original parsing
29
Thread scaling: Bowtie 2 unpaired
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  Two-phase parsing yields substantial thread-scaling boost; close to perfect up
to 120 threads, regardless of mutex
100
200
300
0 25 50 75 100 125
# threads (unpaired)
Normalizedrunningtime
lock
None (stubbed I/O)
TBB mutex
TBB queuing_mutex
TBB spin_mutex
TBB/JHU CohortLock
tinythreads fast_mutex
30
Thread scaling: Bowtie 2 paired-end
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  queuing_mutex and cohort lock again perform the best, near ideal
31
Thread scaling: Bowtie 2 paired-end
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  Two-phase parsing yields substantial thread-scaling boost; close to perfect up
to 120 threads, with mutex having smaller impact
0
100
200
300
0 25 50 75 100 125
# threads (paired−end)
Normalizedrunningtime
lock
None (stubbed I/O)
TBB mutex
TBB queuing_mutex
TBB spin_mutex
TBB/JHU CohortLock
tinythreads fast_mutex
32
Thread scaling: Bowtie
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  As with Bowtie 2, near-ideal scaling with queuing and cohort locks
0
100
200
300
400
0 25 50 75 100 125
# threads
Normalizedrunningtime
version
Optimized parsing
Original parsing
lock
TBB queuing_mutex
tinythreads fast_mutex
33
Thread scaling: HISAT unpaired
Vertical axis is per-thread
running time; lower is better
Ivy Bridge, 4 NUMA nodes,
120 threads
§  Huge improvements with queuing_lock and two-phase parsing
34
Thread scaling
§  Further gains possible with batch parsing, where the first phase “lightly” parses
several reads at once, reducing # critical section entrances
● ● ● ● ●
● ● ● ● ●
● ● ● ● ● ● ● ●
● ●
● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
20 30 40
bowtie # threads
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
100
200
300
0 25 50 75 100 125
bowtie # threads
Normalizedrunningtime
●
● ●
●
●
●
● ● ●
●
● ● ● ● ●
● ●
●
●
●
● ● ● ● ● ● ●
● ●
●
● ● ●
● ● ● ●
●
●
●
20 30 40
bowtie2 # threads
● ●
● ● ● ● ● ● ● ● ● ●
● ● ●
●
● ●
● ● ● ● ● ● ● ● ● ● ●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
● ●
● ● ●
●
●
● ● ●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
●
● ● ●
● ● ●
● ●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
30
40
50
60
70
0 25 50 75 100 125
bowtie2 # threads
Normalizedrunningtime
●
●
●
●
●
●
●
●
● ● ●
●
● ● ● ● ● ● ● ● ●
● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
100
150
200
250
lizedrunningtime
20 30 40
bowtie # threads
0 25 50 75 100 125
bowtie # threads
●
● ●
●
●
●
● ● ●
●
● ● ● ● ●
● ●
●
●
●
● ● ● ● ● ● ●
● ●
●
● ● ●
● ● ● ●
●
●
●
20 30 40
bowtie2 # threads
● ●
● ● ● ● ● ● ● ● ● ●
● ● ●
●
● ●
● ● ● ● ● ● ● ● ● ● ●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
● ●
● ● ●
●
●
● ● ●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
●
● ● ●
● ● ●
● ●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
30
40
50
60
70
0 25 50 75 100 125
bowtie2 # threads
Normalizedrunningtime
● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ●
● ●
● ● ● ● ● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
20 30 40
hisat # threads
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
50
100
150
200
250
0 25 50 75 100 125
hisat # threads
Normalizedrunningtime
lock MP tinythreads fast_mutex TBB queuing_mutex
version ● ● ●Batch parsing Original parsing Two−phase parsing
35
Thread scaling: Bowtie 2 on Broadwell
§  Experiment conducted by John Oneill at Intel
§  TBB + optimized parsing yields speedups of 1.1x - 1.8x on 88 threads on
Broadwell E5-2699 v4 part. TBB/JHU Cohort lock outperforms other mutexes.
36
Thread scaling: Bowtie 2 on Knight’s Landing
§  Experiment conducted by John Oneill at Intel
§  TBB + optimized parsing yields speedups of 2x - 2.7x on 192 threads on KNL
B0 bin3 part. TBB/JHU Cohort lock outperforms other mutexes.
37
Thread scaling: summary
§  Using a queue mutex / cohort lock can yield big improvement over spin /
normal lock
§  Achieved near-ideal scaling up to 120 threads with (a) queue/cohort locks and
(b) cleaner parsing for Bowtie, Bowtie 2.
§  Promising scaling results on KNC & KNL; more to do
§  Cohort locks were best option in Broadwell & KNL experiments
§  Cohort locks seem to put KNL in a better position to outperform Xeon on
genomics workloads
38
Vectorization of Bowtie 2 inner loop
§  Dynamic programming alignment not unique to Bowtie 2
§  Common to many sequence alignment problems
39
Vectorization of Bowtie 2 inner loop
40
Vectorization of Bowtie 2 inner loop
The wider the vector word, the more times the fixup loop iterates
§  Mitigates the benefit of
having wider words
41
Vectorization of Bowtie 2 inner loop
…but in some situations, the fixup loop can be skipped with little or no downside
§  Important future work is to
determine whether selective
suppression of fixup loop
can remove most or all of
the downside of having
wider words
42
Impact on the field
§  As of Bowtie 1.0.1 release / Bowtie 2 2.2.0 release, Intel improvements are “in
the wild,” assisting life science researchers
43
Impact on the field
§  Added TBB to Bowtie 1.1.2, Bowtie 2 2.2.6. Also added to public branch of
HISAT. Plan to make TBB the default threading library in upcoming release.
44
Impact on the field
§  Daehwan Kim of JHU IPCC team parallelized the index building process in
Bowtie 2; TBB version of parallel index building available as of 2.2.7
45
Impact on the field
§  With changes fully reflected in
Bowtie 1.2.0 and Bowtie 2 2.3.0,
JHU team drafting manuscript
describing improvements and
lessons learned
46
Future directions
§  Where and why does the cohort lock help?
§  Does cohort lock have a future in TBB?
§  Can selective suppression of Bowtie 2 fixup loop
unlock power of wider vector words?
§  Can all of the above yield a big Knight’s Landing
throughput win?
47
Other resources
§  http://www.langmead-lab.org
§  https://www.coursera.org/learn/dna-sequencing
–  YouTube videos for above: http://bit.ly/ADS1_videos
48
Thank you
§  John Oneill, Ram Ramanujam, Kevin O’leary, and many other great Intel
engineers we spoke to and worked with
§  Lisa Smith, Brian Napier and others in IPCC program
§  Langmead lab team: Chris Wilks, Valentin Antonescu
§  Salzberg lab team: Steven Salzberg, Daehwan Kim
§  Intel
Thank you for your time
Ben Langmead
langmea@cs.jhu.edu
www.intel.com/hpcdevcon

More Related Content

What's hot

Varnish in action pbc10
Varnish in action pbc10Varnish in action pbc10
Varnish in action pbc10Combell NV
 
Угадываем пароль за минуту
Угадываем пароль за минутуУгадываем пароль за минуту
Угадываем пароль за минутуPositive Hack Days
 
Hacking (with) WebSockets
Hacking (with) WebSocketsHacking (with) WebSockets
Hacking (with) WebSocketsSergey Shekyan
 
Low latency & mechanical sympathy issues and solutions
Low latency & mechanical sympathy  issues and solutionsLow latency & mechanical sympathy  issues and solutions
Low latency & mechanical sympathy issues and solutionsJean-Philippe BEMPEL
 
Kernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power ManagementKernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power ManagementAnne Nicolas
 
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernelKernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernelAnne Nicolas
 
Kalyna block cipher presentation in English
Kalyna block cipher presentation in EnglishKalyna block cipher presentation in English
Kalyna block cipher presentation in EnglishRoman Oliynykov
 
[BlackHat USA 2016] Nonce-Disrespecting Adversaries: Practical Forgery Attack...
[BlackHat USA 2016] Nonce-Disrespecting Adversaries: Practical Forgery Attack...[BlackHat USA 2016] Nonce-Disrespecting Adversaries: Practical Forgery Attack...
[BlackHat USA 2016] Nonce-Disrespecting Adversaries: Practical Forgery Attack...Aaron Zauner
 
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Tier1 App
 
RSA NetWitness Log Decoder
RSA NetWitness Log DecoderRSA NetWitness Log Decoder
RSA NetWitness Log DecoderSusam Pal
 
XDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareXDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareC4Media
 
Neutron: br-ex is now deprecated! what is modern way?
Neutron: br-ex is now deprecated! what is modern way?Neutron: br-ex is now deprecated! what is modern way?
Neutron: br-ex is now deprecated! what is modern way?Akihiro Motoki
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_partyOpen Party
 
Ethereum Blockchain and DApps - Workshop at Software University
Ethereum Blockchain and DApps  - Workshop at Software UniversityEthereum Blockchain and DApps  - Workshop at Software University
Ethereum Blockchain and DApps - Workshop at Software UniversityOpen Source University
 
VPC Implementation In OpenStack Heat
VPC Implementation In OpenStack HeatVPC Implementation In OpenStack Heat
VPC Implementation In OpenStack HeatSaju Madhavan
 
[CONFidence 2016]: Alex Plaskett, Georgi Geshev - QNX: 99 Problems but a Micr...
[CONFidence 2016]: Alex Plaskett, Georgi Geshev - QNX: 99 Problems but a Micr...[CONFidence 2016]: Alex Plaskett, Georgi Geshev - QNX: 99 Problems but a Micr...
[CONFidence 2016]: Alex Plaskett, Georgi Geshev - QNX: 99 Problems but a Micr...PROIDEA
 

What's hot (20)

Varnish in action pbc10
Varnish in action pbc10Varnish in action pbc10
Varnish in action pbc10
 
Угадываем пароль за минуту
Угадываем пароль за минутуУгадываем пароль за минуту
Угадываем пароль за минуту
 
Hacking (with) WebSockets
Hacking (with) WebSocketsHacking (with) WebSockets
Hacking (with) WebSockets
 
Low latency & mechanical sympathy issues and solutions
Low latency & mechanical sympathy  issues and solutionsLow latency & mechanical sympathy  issues and solutions
Low latency & mechanical sympathy issues and solutions
 
Kernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power ManagementKernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power Management
 
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernelKernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
 
Kalyna block cipher presentation in English
Kalyna block cipher presentation in EnglishKalyna block cipher presentation in English
Kalyna block cipher presentation in English
 
Salsa20 Cipher
Salsa20 CipherSalsa20 Cipher
Salsa20 Cipher
 
[BlackHat USA 2016] Nonce-Disrespecting Adversaries: Practical Forgery Attack...
[BlackHat USA 2016] Nonce-Disrespecting Adversaries: Practical Forgery Attack...[BlackHat USA 2016] Nonce-Disrespecting Adversaries: Practical Forgery Attack...
[BlackHat USA 2016] Nonce-Disrespecting Adversaries: Practical Forgery Attack...
 
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
 
RSA NetWitness Log Decoder
RSA NetWitness Log DecoderRSA NetWitness Log Decoder
RSA NetWitness Log Decoder
 
XDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @CloudflareXDP in Practice: DDoS Mitigation @Cloudflare
XDP in Practice: DDoS Mitigation @Cloudflare
 
community detection
community detectioncommunity detection
community detection
 
Neutron: br-ex is now deprecated! what is modern way?
Neutron: br-ex is now deprecated! what is modern way?Neutron: br-ex is now deprecated! what is modern way?
Neutron: br-ex is now deprecated! what is modern way?
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_party
 
BGP zombie routes
BGP zombie routesBGP zombie routes
BGP zombie routes
 
Ethereum Blockchain and DApps - Workshop at Software University
Ethereum Blockchain and DApps  - Workshop at Software UniversityEthereum Blockchain and DApps  - Workshop at Software University
Ethereum Blockchain and DApps - Workshop at Software University
 
Cloud Monitors Cloud
Cloud Monitors CloudCloud Monitors Cloud
Cloud Monitors Cloud
 
VPC Implementation In OpenStack Heat
VPC Implementation In OpenStack HeatVPC Implementation In OpenStack Heat
VPC Implementation In OpenStack Heat
 
[CONFidence 2016]: Alex Plaskett, Georgi Geshev - QNX: 99 Problems but a Micr...
[CONFidence 2016]: Alex Plaskett, Georgi Geshev - QNX: 99 Problems but a Micr...[CONFidence 2016]: Alex Plaskett, Georgi Geshev - QNX: 99 Problems but a Micr...
[CONFidence 2016]: Alex Plaskett, Georgi Geshev - QNX: 99 Problems but a Micr...
 

Similar to Reshaping Core Genomics Software Tools for the Manycore Era

osdi23_slides_lo_v2.pdf
osdi23_slides_lo_v2.pdfosdi23_slides_lo_v2.pdf
osdi23_slides_lo_v2.pdfgmdvmk
 
L3-.pptx
L3-.pptxL3-.pptx
L3-.pptxasdq4
 
Futex Scaling for Multi-core Systems
Futex Scaling for Multi-core SystemsFutex Scaling for Multi-core Systems
Futex Scaling for Multi-core SystemsDavidlohr Bueso
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
Search at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterSearch at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterLucidworks
 
New hope is comming? Project Loom.pdf
New hope is comming? Project Loom.pdfNew hope is comming? Project Loom.pdf
New hope is comming? Project Loom.pdfKrystian Zybała
 
synopsys logic synthesis
synopsys logic synthesissynopsys logic synthesis
synopsys logic synthesisssuser471b66
 
Theta and the Future of Accelerator Programming
Theta and the Future of Accelerator ProgrammingTheta and the Future of Accelerator Programming
Theta and the Future of Accelerator Programminginside-BigData.com
 
Revisão: Forwarding Metamorphosis: Fast Programmable Match-Action Processing ...
Revisão: Forwarding Metamorphosis: Fast Programmable Match-Action Processing ...Revisão: Forwarding Metamorphosis: Fast Programmable Match-Action Processing ...
Revisão: Forwarding Metamorphosis: Fast Programmable Match-Action Processing ...Bruno Castelucci
 
Apache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentialsApache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentialsJulien Anguenot
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...DataStax
 
Instaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandraInstaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandraInstaclustr
 
AOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on itAOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on itZubair Nabi
 
LAGraph 2021-10-13
LAGraph 2021-10-13LAGraph 2021-10-13
LAGraph 2021-10-13Jason Riedy
 
Codemotion 2015 Infinispan Tech lab
Codemotion 2015 Infinispan Tech labCodemotion 2015 Infinispan Tech lab
Codemotion 2015 Infinispan Tech labUgo Landini
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]Aleksei Voitylov
 
OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017HBaseCon
 

Similar to Reshaping Core Genomics Software Tools for the Manycore Era (20)

osdi23_slides_lo_v2.pdf
osdi23_slides_lo_v2.pdfosdi23_slides_lo_v2.pdf
osdi23_slides_lo_v2.pdf
 
L3-.pptx
L3-.pptxL3-.pptx
L3-.pptx
 
Futex Scaling for Multi-core Systems
Futex Scaling for Multi-core SystemsFutex Scaling for Multi-core Systems
Futex Scaling for Multi-core Systems
 
ZERO WIRE LOAD MODEL.pptx
ZERO WIRE LOAD MODEL.pptxZERO WIRE LOAD MODEL.pptx
ZERO WIRE LOAD MODEL.pptx
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Search at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterSearch at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, Twitter
 
New hope is comming? Project Loom.pdf
New hope is comming? Project Loom.pdfNew hope is comming? Project Loom.pdf
New hope is comming? Project Loom.pdf
 
synopsys logic synthesis
synopsys logic synthesissynopsys logic synthesis
synopsys logic synthesis
 
Theta and the Future of Accelerator Programming
Theta and the Future of Accelerator ProgrammingTheta and the Future of Accelerator Programming
Theta and the Future of Accelerator Programming
 
Revisão: Forwarding Metamorphosis: Fast Programmable Match-Action Processing ...
Revisão: Forwarding Metamorphosis: Fast Programmable Match-Action Processing ...Revisão: Forwarding Metamorphosis: Fast Programmable Match-Action Processing ...
Revisão: Forwarding Metamorphosis: Fast Programmable Match-Action Processing ...
 
Apache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentialsApache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentials
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
 
Instaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandraInstaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandra
 
AOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on itAOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on it
 
Percona FT / TokuDB
Percona FT / TokuDBPercona FT / TokuDB
Percona FT / TokuDB
 
LAGraph 2021-10-13
LAGraph 2021-10-13LAGraph 2021-10-13
LAGraph 2021-10-13
 
Codemotion 2015 Infinispan Tech lab
Codemotion 2015 Infinispan Tech labCodemotion 2015 Infinispan Tech lab
Codemotion 2015 Infinispan Tech lab
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]
 
Memory model
Memory modelMemory model
Memory model
 
OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017
 

More from Intel® Software

AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology Intel® Software
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaIntel® Software
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciIntel® Software
 
AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.Intel® Software
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Intel® Software
 
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Intel® Software
 
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Intel® Software
 
AWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchAWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchIntel® Software
 
Intel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel® Software
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019Intel® Software
 
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019Intel® Software
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Intel® Software
 
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Intel® Software
 
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Intel® Software
 
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...Intel® Software
 
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesAIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesIntel® Software
 
AIDC India - AI Vision Slides
AIDC India - AI Vision SlidesAIDC India - AI Vision Slides
AIDC India - AI Vision SlidesIntel® Software
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Intel® Software
 

More from Intel® Software (20)

AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and Anaconda
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
 
AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
 
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
 
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
 
AWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchAWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI Research
 
Intel Developer Program
Intel Developer ProgramIntel Developer Program
Intel Developer Program
 
Intel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview Slides
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019
 
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
 
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
 
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
 
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
 
AIDC India - AI on IA
AIDC India  - AI on IAAIDC India  - AI on IA
AIDC India - AI on IA
 
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesAIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino Slides
 
AIDC India - AI Vision Slides
AIDC India - AI Vision SlidesAIDC India - AI Vision Slides
AIDC India - AI Vision Slides
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Reshaping Core Genomics Software Tools for the Manycore Era

  • 1.
  • 2. Ben Langmead Assistant Professor Johns Hopkins University, Department of Computer Science November 2016 Reshaping core genomics software tools for the many-core era
  • 3. 3 The Intel Parallel Computing Center at
  • 4. 4 Outline §  Introduction to sequencing data analysis & Bowtie §  Thread scaling improvements using TBB –  Choice of mutex –  Two-stage parsing §  AVX2, AVX512-KNC & AVX512-KNC improvements §  Impact on the field
  • 9. 9 Read alignment §  Needle in a haystack §  Billions of reads from a single week-long sequencing run §  Human reference genome is ~3B bases (letters) long
  • 10. 10 Bowtie and Bowtie 2 §  Together cited by >12K other scientific studies since 2009 §  Bundled with dozens of other tools & many Linux distros
  • 11. 11 HISAT §  Based on Bowtie 2 and a leading spliced aligner for RNA sequencing data §  Cited in >75 scientific studies since 2015
  • 12. 12 Design of Bowtie & Bowtie 2 Bowtie 1 Bowtie 2
  • 13. 13 Design of Bowtie & Bowtie 2 Bowtie 1 Bowtie 2 Random access to large index data structure and minimal ILP
  • 14. 14 Design of Bowtie & Bowtie 2 Bowtie 1 Bowtie 2 Dynamic programming, lots of ILP
  • 15. 0 250 500 750 1000 0 25 50 75 100 125 # threads (unpaired) Normalizedrunningtime lock TBB spin_mutex tinythreads fast_mutex 15 Thread scaling §  Switching to analogous TBB lock could bring big improvement Ivy Bridge, 4 NUMA nodes, 120 threads Vertical axis is per- thread running time; lower is better Bowtie 1 unpaired
  • 16. 100 150 200 250 300 350 0 25 50 75 100 125 # threads (unpaired) Normalizedrunningtime lock None (stubbed I/O) TBB spin_mutex version Original parsing 16 Thread scaling §  Removing synchronization by “stubbing” input lock gives further improvement Bowtie 2 unpairedIvy Bridge, 4 NUMA nodes, 120 threads Vertical axis is per-thread running time; lower is better
  • 17. 17 Thread scaling §  Vtune investigation indicates synchronization itself (e.g. see __TBB_LockByte) is taking the time
  • 18. 18 Thread scaling Bowtie 2 unpaired How to close the gap between actual and ideal performance?
  • 19. 19 Thread scaling Bowtie 2 unpaired Why does mutex choice have outsize effect?
  • 20. CMU 15-418/618, Spring 2015 Test-and-set lock performance Benchmark&executes:& lock(L);& critical>section(c)& unlock(L); Time(us) Number of processors Benchmark: total of N lock/unlock sequences (in aggregate) by P processors Critical section time removed so graph plots only time acquiring/releasing the lock Bus contention increases amount of time to transfer lock (lock holder must wait to acquire bus to release) Not shown: bus contention also slows down execution of critical section Figure credit: Culler, Singh, and Gupta 20 Thread scaling §  Mutex spinning on atomic op (compare-and-swap, test-and- set), spurs exchange of cache coherence messages §  Image by Kayvon Fatahalian, Copyright 2015 Carnegie Mellon University
  • 21. 21 Thread scaling §  Even a standard pthreads mutex was outperforming the spin lock when running one thread per available core –  More evidence that cache coherence traffic is culprit §  Queue locks are known to have better cache properties –  Waiting thread spins on normal (non-atomic) read –  Cache line read belongs exclusively to that thread and can live in L1
  • 22. 22 Thread scaling §  We hypothesized a NUMA-aware “cohort lock” could help further Dice, David, Virendra J. Marathe, and Nir Shavit. "Lock cohorting: a general technique for designing NUMA locks." ACM SIGPLAN Notices. Vol. 47. No. 8. ACM, 2012.
  • 23. 23 Cohort locking class  CohortLock  {   public:      CohortLock()  :  lockers_numa_idx(-­‐1)  {          starvation_counters  =  new  int[MAX_NODES]();          own_global  =  new  bool[MAX_NODES]();          local_locks  =  new  TKTLock[MAX_NODES];      }      ~CohortLock()  {          delete[]  starvation_counters;          delete[]  own_global;          delete[]  local_locks;      }      void  lock();      void  unlock();   private:      static  const  int  STARVATION_LIMIT  =  100;      static  const  int  MAX_NODES  =  128;      volatile  int*    starvation_counters;  //  1  per  node      volatile  bool*  own_global;                    //  1  per  node      volatile  int      lockers_numa_idx;        //  1  per  node      TKTLock*  local_locks;                              //  1  per  node      PTLLock  global_lock;   };   §  Each NUMA node has per-node ticket lock §  Other per-node information tracks when to pass lock to other threads on same node §  Single global partitioned ticket lock
  • 24. 24 Cohort locking void  CohortLock::lock()  {      const  int  numa_idx  =  determine_numa_idx();      local_locks[numa_idx].lock();      if(!own_global[numa_idx])  {              global_lock.lock();      }      starvation_counters[numa_idx]++;      own_global[numa_idx]  =  true;      lockers_numa_idx  =  numa_idx;   }     void  CohortLock::unlock()  {      assert(lockers_numa_idx  !=  -­‐1);      int  numa_idx  =  lockers_numa_idx;      lockers_numa_idx  =  -­‐1;      if(local_locks[numa_idx].q_length()  ==  1  ||            starvation_counters[numa_idx]  >  STARVATION_LIMIT)      {          global_lock.unlock();          starvation_counters[numa_idx]  =  0;          own_global[numa_idx]  =  false;      }      local_locks[numa_idx].unlock();   }   §  When locking: –  Grab local lock –  Once grabbed, grab global lock if not already owned by this node §  When unlocking: –  Is another thread on same node queued? If so, hand lock to next in queue –  Otherwise release global & local locks –  Override hand-off if others are starving
  • 25. 25 Cohort locking §  Another implementation of cohort locking available in ConcurrencyKit: http://concurrencykit.org –  https://github.com/concurrencykit/ck/blob/master/include/ck_cohort.h
  • 26. 26 Thread scaling §  Chris Wilks added TBB queue locks, JHU/TBB Cohort locks (2 flavors) to Bowtie 2, Bowtie & HISAT §  Available in public branches, with all but cohort locks available in master branch and in recent releases
  • 27. 27 Thread scaling §  Novel strategy splits input parsing into two “phases” §  First (“light parsing”) rapidly detects record boundaries, requiring synchronization but with very brief critical section §  Second (“full parsing”) fully parses each record (pictured, right) with no synchronization §  Minimizes time spent in crucial critical section @ABC_123_1 GCTATTATGCTAT + JJSYEGGU8233^ @ABC_424_1 GTGATATGCAT + SYEG!U8@233 @ABCD_9_1 GCTATTATGCTATAAAC + JJSYEGGU8233^32FR @D_91231_1 GCTATTATGCTAT + JJSYEGGU8233^ …
  • 28. 100 200 300 0 25 50 75 100 125 # threads (unpaired) Normalizedrunningtime lock None (stubbed I/O) TBB mutex TBB queuing_mutex TBB spin_mutex TBB/JHU CohortLock tinythreads fast_mutex 28 Thread scaling: Bowtie 2 unpaired Vertical axis is per-thread running time; lower is better Ivy Bridge, 4 NUMA nodes, 120 threads §  TBB queuing_mutex and TBB/JHU cohort lock perform best
  • 29. 100 200 300 0 25 50 75 100 125 # threads (unpaired) Normalizedrunningtime None (stubbed I/O) TBB mutex TBB queuing_mutex TBB spin_mutex TBB/JHU CohortLock tinythreads fast_mutex version Optimized parsing Original parsing 29 Thread scaling: Bowtie 2 unpaired Vertical axis is per-thread running time; lower is better Ivy Bridge, 4 NUMA nodes, 120 threads §  Two-phase parsing yields substantial thread-scaling boost; close to perfect up to 120 threads, regardless of mutex
  • 30. 100 200 300 0 25 50 75 100 125 # threads (unpaired) Normalizedrunningtime lock None (stubbed I/O) TBB mutex TBB queuing_mutex TBB spin_mutex TBB/JHU CohortLock tinythreads fast_mutex 30 Thread scaling: Bowtie 2 paired-end Vertical axis is per-thread running time; lower is better Ivy Bridge, 4 NUMA nodes, 120 threads §  queuing_mutex and cohort lock again perform the best, near ideal
  • 31. 31 Thread scaling: Bowtie 2 paired-end Vertical axis is per-thread running time; lower is better Ivy Bridge, 4 NUMA nodes, 120 threads §  Two-phase parsing yields substantial thread-scaling boost; close to perfect up to 120 threads, with mutex having smaller impact
  • 32. 0 100 200 300 0 25 50 75 100 125 # threads (paired−end) Normalizedrunningtime lock None (stubbed I/O) TBB mutex TBB queuing_mutex TBB spin_mutex TBB/JHU CohortLock tinythreads fast_mutex 32 Thread scaling: Bowtie Vertical axis is per-thread running time; lower is better Ivy Bridge, 4 NUMA nodes, 120 threads §  As with Bowtie 2, near-ideal scaling with queuing and cohort locks
  • 33. 0 100 200 300 400 0 25 50 75 100 125 # threads Normalizedrunningtime version Optimized parsing Original parsing lock TBB queuing_mutex tinythreads fast_mutex 33 Thread scaling: HISAT unpaired Vertical axis is per-thread running time; lower is better Ivy Bridge, 4 NUMA nodes, 120 threads §  Huge improvements with queuing_lock and two-phase parsing
  • 34. 34 Thread scaling §  Further gains possible with batch parsing, where the first phase “lightly” parses several reads at once, reducing # critical section entrances ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 30 40 bowtie # threads ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 100 200 300 0 25 50 75 100 125 bowtie # threads Normalizedrunningtime ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 30 40 bowtie2 # threads ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 30 40 50 60 70 0 25 50 75 100 125 bowtie2 # threads Normalizedrunningtime ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 100 150 200 250 lizedrunningtime 20 30 40 bowtie # threads 0 25 50 75 100 125 bowtie # threads ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 30 40 bowtie2 # threads ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 30 40 50 60 70 0 25 50 75 100 125 bowtie2 # threads Normalizedrunningtime ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 30 40 hisat # threads ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 50 100 150 200 250 0 25 50 75 100 125 hisat # threads Normalizedrunningtime lock MP tinythreads fast_mutex TBB queuing_mutex version ● ● ●Batch parsing Original parsing Two−phase parsing
  • 35. 35 Thread scaling: Bowtie 2 on Broadwell §  Experiment conducted by John Oneill at Intel §  TBB + optimized parsing yields speedups of 1.1x - 1.8x on 88 threads on Broadwell E5-2699 v4 part. TBB/JHU Cohort lock outperforms other mutexes.
  • 36. 36 Thread scaling: Bowtie 2 on Knight’s Landing §  Experiment conducted by John Oneill at Intel §  TBB + optimized parsing yields speedups of 2x - 2.7x on 192 threads on KNL B0 bin3 part. TBB/JHU Cohort lock outperforms other mutexes.
  • 37. 37 Thread scaling: summary §  Using a queue mutex / cohort lock can yield big improvement over spin / normal lock §  Achieved near-ideal scaling up to 120 threads with (a) queue/cohort locks and (b) cleaner parsing for Bowtie, Bowtie 2. §  Promising scaling results on KNC & KNL; more to do §  Cohort locks were best option in Broadwell & KNL experiments §  Cohort locks seem to put KNL in a better position to outperform Xeon on genomics workloads
  • 38. 38 Vectorization of Bowtie 2 inner loop §  Dynamic programming alignment not unique to Bowtie 2 §  Common to many sequence alignment problems
  • 40. 40 Vectorization of Bowtie 2 inner loop The wider the vector word, the more times the fixup loop iterates §  Mitigates the benefit of having wider words
  • 41. 41 Vectorization of Bowtie 2 inner loop …but in some situations, the fixup loop can be skipped with little or no downside §  Important future work is to determine whether selective suppression of fixup loop can remove most or all of the downside of having wider words
  • 42. 42 Impact on the field §  As of Bowtie 1.0.1 release / Bowtie 2 2.2.0 release, Intel improvements are “in the wild,” assisting life science researchers
  • 43. 43 Impact on the field §  Added TBB to Bowtie 1.1.2, Bowtie 2 2.2.6. Also added to public branch of HISAT. Plan to make TBB the default threading library in upcoming release.
  • 44. 44 Impact on the field §  Daehwan Kim of JHU IPCC team parallelized the index building process in Bowtie 2; TBB version of parallel index building available as of 2.2.7
  • 45. 45 Impact on the field §  With changes fully reflected in Bowtie 1.2.0 and Bowtie 2 2.3.0, JHU team drafting manuscript describing improvements and lessons learned
  • 46. 46 Future directions §  Where and why does the cohort lock help? §  Does cohort lock have a future in TBB? §  Can selective suppression of Bowtie 2 fixup loop unlock power of wider vector words? §  Can all of the above yield a big Knight’s Landing throughput win?
  • 47. 47 Other resources §  http://www.langmead-lab.org §  https://www.coursera.org/learn/dna-sequencing –  YouTube videos for above: http://bit.ly/ADS1_videos
  • 48. 48 Thank you §  John Oneill, Ram Ramanujam, Kevin O’leary, and many other great Intel engineers we spoke to and worked with §  Lisa Smith, Brian Napier and others in IPCC program §  Langmead lab team: Chris Wilks, Valentin Antonescu §  Salzberg lab team: Steven Salzberg, Daehwan Kim §  Intel
  • 49. Thank you for your time Ben Langmead langmea@cs.jhu.edu www.intel.com/hpcdevcon