Software + Babies
How to design software and APIs for parallelism on
modern hardware
Richard Parker
Mathematician and freelance computer
programmer in Cambridge, England
Cologne, 20.01.2016
Maths research 2011-2014.
• My original (hobby) project was to work
with matrices mod 2, obsessively fast.
• I considered even a 5% speedup
compulsory.
• Highly optimized programs, some up to 30
years old, have been used for many years.
• Mine were ~300 times faster in the end.
I learned a lot of tricks.
• I tried to consider every possible trick.
• I had to understand how all the parts of the
(x86) microprocessor worked.
• And Intel AVX-512F when it comes.
• I have been blogging to meataxe64 (on
wordpress) if anyone is interested.
Is this useful commercially?
• I felt there must be situations where
performance improvements of x10 or x100
would be useful in the real world.
• I have recently been applying these ideas
in the context of ArangoDB database.
• My conclusion is that there are three main
software considerations which can have a
major (x10) influence on performance.
3 pillars of performance.
• Multi-Core. If you use a lot of cores, you
can get a lot more done per second.
• Use Cache. Fetches from real memory
take x100 as long as from cache.
Real-memory bandwidth matters too.
• Single thread parallelism. Even one
processor can do many things at once.
Multi-Core
• 16 cores can do a lot more work per
second than one core.
• Using them is not always easy, and the
results are often disappointing. . .
• because the memory system often gets
saturated, perhaps even with 4 cores, and
using more cores doesn't help.
Make better use of cache.
• This often requires a new algorithm.
• This work is still in its infancy, but it is
gradually becoming mainstream.
• Two algorithms with the same “complexity”
such as quicksort and heapsort can easily
differ by a factor of 10 because of cache.
Single Thread Parallelism
• This Haswell processor is designed to
issue four instructions every clock cycle.
• Three of those four can be “vector”
instructions, operating on the 256-bit
vector registers.
• The result is that, roughly speaking, the
processor can do 20-40 similar things in
the same time it takes to do one.
Higher level implications.
• In a sense, these are all low-level tricks.
• But to make good use of all these low-level
tricks, the high-level software must also
make changes. . . and one is critical.
• Any subroutine call should do a batch of
stuff, not just one thing.
Fred Brooks.
• In “The Mythical Man-Month”, Brooks made
the point that nine women can't make a baby
in one month.
• He was talking about writing programs, not
running them, but the same principle is
important, and getting more important, for
program execution.
Ask for more Babies.
• In its simplest form, don't write
for (i = 0; i < 20; i++) MakeOneBaby();
• instead write
MakeSomeBabies(20);
• The first version takes 15 years to execute.
• It must surely be clear that the second
version can be done faster, depending on
the resources available. :-)
Example - Cosine
• The computation of the cosine of an angle
takes about 50 clock cycles.
double cosine(double ang);
• The “babies” version can manage about
one cosine every 5 clock cycles once it
gets going. Ten times faster.
void cosines(int ct,
double *ang, double *a);
Why is Babies cosine faster?
• 4 identical operations can be done in the
vector registers at the same speed as
one. This gives an immediate factor of 4.
• The hardware can start another cosine
rather than wait for an intermediate result.
• Duplicated work (in this case loading the
coefficients of the series expansion) can
be done just once.
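A minimal sketch of what a batch cosine could look like in plain C, not the author's routine. It assumes every angle is already in [-pi/2, pi/2] (a real routine would reduce the argument first) and uses a truncated Taylor series as a stand-in for whatever series expansion the real code uses. The `const` qualifier is an addition to the signature shown on the slide.

```c
/* Sketch only: valid for angles in [-pi/2, pi/2]. */
void cosines(int ct, const double *ang, double *a)
{
    /* Taylor coefficients of cos(x) in powers of x*x. They are set up
       once per call rather than once per angle. */
    static const double c[8] = {
        1.0, -1.0 / 2, 1.0 / 24, -1.0 / 720, 1.0 / 40320,
        -1.0 / 3628800, 1.0 / 479001600, -1.0 / 87178291200.0
    };

    for (int i = 0; i < ct; i++) {
        double x2 = ang[i] * ang[i];
        /* Horner's rule. Iteration i does not depend on iteration i-1,
           so the hardware is free to overlap several of them. */
        double p = c[7];
        for (int k = 6; k >= 0; k--)
            p = p * x2 + c[k];
        a[i] = p;
    }
}
```

Whether a compiler turns this loop into vector instructions depends on the compiler and its flags. The point of the sketch is only that the batch signature makes the independent work visible.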
Even if count is 1 . . .
• With just a little care, the babies version
with a count of 1 is no slower than the non-
babies version.
• For example, the routine could start by
testing whether the count is 1, and
branching to new code if the count is not 1.
• The (not-taken) branch will be correctly
predicted if the count always is 1, and
usually no time at all will be lost.
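The count-of-one test might be sketched as follows. `reciprocals` is a hypothetical routine chosen only because it is short; the same shape applies to a batch cosine.

```c
/* Hypothetical batch routine: out[i] = 1 / x[i]. */
void reciprocals(int ct, const double *x, double *out)
{
    /* Fast path: when callers always pass ct == 1, this branch is
       predicted correctly and the batch code below is never entered. */
    if (ct == 1) {
        out[0] = 1.0 / x[0];
        return;
    }

    /* General path. In a real routine this is where any setup cost
       (loading constants, an unrolled vector loop) would live. */
    for (int i = 0; i < ct; i++)
        out[i] = 1.0 / x[i];
}
```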
Throughput, not Latency.
• for (i = 0; i < 20; i++) MakeOneBaby();
This depends on the Latency - time from
starting one to finishing that one.
• MakeSomeBabies(20);
This depends on the Throughput -
number that can be done on average in
unit time.
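One common way to see the difference inside a single routine is to sum an array with one accumulator versus several. This is an illustrative sketch; the function names are made up, and the two versions can differ in the last bits for general floating-point data because the additions happen in a different order.

```c
#include <stddef.h>

/* One accumulator: each add must wait for the previous one, so the
   loop is bound by the latency of a floating-point add. */
double sum_chain(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four accumulators: four independent chains can be in flight at
   once, so the loop moves toward being throughput-bound. */
double sum_unrolled(const double *v, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```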
Latency constant.
Throughput increasing.
• We seem to have hit laws of physics with
latency. It is quite hard to get electronics to
(say) do a double-precision multiply and
get the answer faster than ~2 ns.
• But there is little to prevent it doing a lot at
once. Indeed it has been able to do forty
since Sandy Bridge.
Think of x86 as many units.
• Each single x86 core . . .
• Has a huge number of execution units just
waiting for the chance to do something
useful for the program.
• If you can use a lot more of them at once,
you can get a lot more done per unit time!
Not just arithmetic
• For example, fetches from L1 cache take
two clock cycles . . .
• But fetching 32 bytes is no slower than
fetching 8 (or less), and you can issue two
of these every clock cycle
• So 8-byte loads done in bulk go sixteen
times faster than loads done one at a time:
2 fetches per cycle x 4 values per fetch
x 2 cycles of latency = 16.
Much the same with memory
• The fetch of a (64-byte) cache-line on this
computer has a 180 clock-cycle latency.
• But it can usually do three at once
• and then you have 180 clock cycles to be
getting on with other work that does not
depend on those fetches.
• If, that is, you have something to do.
Example - database updates.
• When applying updates to a database,
particularly if it is distant (i.e. has a high
ping-time) it can be important to batch
them up and send off the whole batch at
once.
• This is an example of the “babies” concept
that is already mainstream: the data
transfer object (DTO).
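A rough sketch of client-side batching, under the assumption that the transport can carry several updates in one round trip. All names here (`batcher`, `send_fn`, `count_round_trip`) are hypothetical and do not come from any real database client.

```c
#include <stddef.h>

#define BATCH_CAP 4     /* tiny on purpose, to keep the sketch readable */

struct update { int key; int value; };

/* Hypothetical transport: sends n updates in one round trip. */
typedef void (*send_fn)(const struct update *u, size_t n, void *ctx);

struct batcher {
    struct update buf[BATCH_CAP];
    size_t n;           /* updates currently buffered */
    send_fn send;
    void *ctx;
};

/* Send whatever is buffered as a single batch. */
void batcher_flush(struct batcher *b)
{
    if (b->n > 0) {
        b->send(b->buf, b->n, b->ctx);
        b->n = 0;
    }
}

/* Buffer one update; a round trip happens only when the buffer fills. */
void batcher_add(struct batcher *b, int key, int value)
{
    b->buf[b->n].key = key;
    b->buf[b->n].value = value;
    if (++b->n == BATCH_CAP)
        batcher_flush(b);
}

/* Stand-in for a real network send: it only counts round trips. */
void count_round_trip(const struct update *u, size_t n, void *ctx)
{
    (void)u;
    (void)n;
    ++*(int *)ctx;
}
```

With a capacity of 4, ten updates followed by a final flush cost three round trips instead of ten.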
Example - heap.
• A heap allows you to put a key/value pair in,
and take out the one with the smallest key.
• I could tell you about a version that runs
considerably faster. You need to put a few
pairs in, and take a few pairs (sorted) out at
each step.
• But even if you don't know (yet) how to do
that, it is sensible to use the “babies”
interface anyway.
• Perhaps you'll think of something later.
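As an illustration of "babies interface first, clever implementation later", here is an ordinary binary min-heap behind a batch interface. It is not the faster version mentioned above. The names `heap_put_some` and `heap_take_some` and the fixed capacity are assumptions made for the sketch.

```c
#include <stddef.h>

#define HEAP_CAP 1024

struct pair { long key; long value; };
struct heap { struct pair item[HEAP_CAP]; size_t n; };

static void swap_pairs(struct pair *a, struct pair *b)
{
    struct pair t = *a; *a = *b; *b = t;
}

/* Insert up to ct pairs; returns how many were inserted. Today this is
   a plain one-at-a-time loop; the batch signature leaves room for
   something smarter later without changing any caller. */
size_t heap_put_some(struct heap *h, const struct pair *in, size_t ct)
{
    size_t done = 0;
    while (done < ct && h->n < HEAP_CAP) {
        size_t i = h->n++;
        h->item[i] = in[done++];
        while (i > 0 && h->item[(i - 1) / 2].key > h->item[i].key) {
            swap_pairs(&h->item[(i - 1) / 2], &h->item[i]);
            i = (i - 1) / 2;
        }
    }
    return done;
}

/* Remove up to ct pairs in ascending key order; returns the count. */
size_t heap_take_some(struct heap *h, struct pair *out, size_t ct)
{
    size_t done = 0;
    while (done < ct && h->n > 0) {
        out[done++] = h->item[0];
        h->item[0] = h->item[--h->n];
        size_t i = 0;
        for (;;) {
            size_t l = 2 * i + 1, r = l + 1, m = i;
            if (l < h->n && h->item[l].key < h->item[m].key) m = l;
            if (r < h->n && h->item[r].key < h->item[m].key) m = r;
            if (m == i) break;
            swap_pairs(&h->item[i], &h->item[m]);
            i = m;
        }
    }
    return done;
}
```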
Example - Hash table
• Here the dead time is waiting for a memory
fetch.
• It should be clear what the interface should
be. . .
• You look-up (insert, delete, whatever) many
things in one call. Give it a chance to be
clever and fast.
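One way a batch lookup can use the dead time is to issue all the memory fetches first and only then start probing. The sketch below assumes GCC or Clang (for `__builtin_prefetch`), a fixed-size open-addressing table that is never full, and keys that are never 0. The table layout, hash and names are illustrative and not taken from any real system.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024u    /* power of two; home_slot() assumes 2^10 */
#define EMPTY_KEY  0u       /* key 0 is reserved to mark an empty slot */

struct slot { uint64_t key; uint64_t value; };
struct table { struct slot slots[TABLE_SIZE]; };

static size_t home_slot(uint64_t key)
{
    /* Multiplicative hash keeping the top 10 bits (0..1023). */
    return (size_t)((key * 0x9E3779B97F4A7C15ull) >> 54);
}

/* Insert or overwrite; assumes key != 0 and the table is not full. */
void table_insert(struct table *t, uint64_t key, uint64_t value)
{
    size_t i = home_slot(key);
    while (t->slots[i].key != EMPTY_KEY && t->slots[i].key != key)
        i = (i + 1) & (TABLE_SIZE - 1);     /* linear probing */
    t->slots[i].key = key;
    t->slots[i].value = value;
}

/* Look up ct keys. found[k] is 1 and out[k] is set on a hit. */
void table_lookup_some(const struct table *t, size_t ct,
                       const uint64_t *keys, uint64_t *out, int *found)
{
    /* Pass 1: start every memory fetch before waiting on any of them. */
    for (size_t k = 0; k < ct; k++)
        __builtin_prefetch(&t->slots[home_slot(keys[k])]);

    /* Pass 2: probe. With luck the first slot is already in cache. */
    for (size_t k = 0; k < ct; k++) {
        size_t i = home_slot(keys[k]);
        while (t->slots[i].key != EMPTY_KEY && t->slots[i].key != keys[k])
            i = (i + 1) & (TABLE_SIZE - 1);
        found[k] = (t->slots[i].key == keys[k]);
        if (found[k])
            out[k] = t->slots[i].value;
    }
}
```

A one-key-per-call interface gives the routine no second key to prefetch while the first fetch is outstanding, which is the argument for the batch signature.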
Example - GeoIndex
• I have very recently looked at how one
can find the nearest points in a GeoIndex.
• There are several slowish steps - memory
fetch, trig functions, distance
computations, sorting the results etc.
• Every one of these steps can benefit
greatly by handling multiple points rather
than just one.
Compilers good without babies.
• If one implements, say, the cosine function
(one-at-a-time interface) the code is
dependency bound anyway. The exact
algorithm wanted must be carefully coded
to minimize the dependency chain, but if
this is done in (say) C, it will run well.
• An assembler programmer will be hard
pressed to get more than a few percent
improvement.
“Babies” changes everything.
• If you want to write a “babies” version of
cosine, it is an uphill struggle to get a
compiler to generate good code.
• The use of vector instructions, the partial
loop unrolling, the use of conditional
moves rather than branch etc. etc. etc.
make the challenge too hard for even a
modern compiler.
Compilers may get better.
• Once programmers are asking more of
their compilers (by writing babies versions)
this may give a wake-up call.
• I would love to see a language which
helps me write top-performance code!
• As of today, however, the answer is often
to write the “leaf” babies routines in
assembler code.
“Ninja Assembler”
• Today's assemblers have changed little in
50 years. . .
• But it is not hard to imagine a language
designed for writing high-performance x86
code, but still offering many of the luxuries
of a high-level language.
• One issue is portability. Portability is a big
obstacle to making best use of the
machine you actually have!
Portability problems.
• Laying out data for the vector instructions.
• Aligning parts of a data structure.
• Specializing types for vector use.
• When is a branch unpredictable?
• What is the width of the vector registers?
• When may a data fetch miss cache?
• Cramming data into a single cache-line.
• At the 5% level there are many more.
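To make two of these items concrete (alignment and cramming data into a single cache line), here is a small C11 sketch. The 64-byte line size is an assumption that matches the machine described earlier; the `bucket` layout is hypothetical.

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; other hardware may differ */

/* Four key/value pairs packed so one bucket is exactly one cache line.
   Aligning the first member forces the whole struct onto a line
   boundary, so a bucket never straddles two lines. */
struct bucket {
    alignas(CACHE_LINE) uint64_t keys[4];
    uint64_t values[4];
};

/* Fails at compile time if the layout ever drifts off one line. */
_Static_assert(sizeof(struct bucket) == CACHE_LINE,
               "bucket must fill exactly one cache line");
```

The constant 64 is exactly the kind of machine-specific decision that portable code cannot easily make.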
The 80/20 rule.
• Even for “peak performance” code, ~80%
of the time it is in ~20% of the code.
• We accept writing 20% in assembler
• But this determines the data layout.
• So it is very hard, at the moment anyway,
to write the 80% in a high-level language.
• Mainly because of portability!
Strategic view
• As of today, it seems to me that a properly
programmed x86 can outperform anything.
• What I am less clear about is whether the
software community can afford the extra
time and cost needed to program it
properly.
• My aim is to try to reduce that cost.
Summary
• Try to use “babies” interfaces at all times -
even if you can't immediately see how or
why you might want to make it faster
• As of today, a small, assembler, babies
routine can make a big difference to the
speed of a whole system.
• And we can hope for better software
development tools in future.

More Related Content

What's hot

Flickr Architecture Presentation
Flickr Architecture PresentationFlickr Architecture Presentation
Flickr Architecture Presentationeraz
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon
 
Fluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerFluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerTreasure Data, Inc.
 
Protocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterKevin Weil
 
Complex queries in a distributed multi-model database
Complex queries in a distributed multi-model databaseComplex queries in a distributed multi-model database
Complex queries in a distributed multi-model databaseMax Neunhöffer
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
ApacheCon 2020 - Flink SQL in 2020: Time to show off!
ApacheCon 2020 - Flink SQL in 2020: Time to show off!ApacheCon 2020 - Flink SQL in 2020: Time to show off!
ApacheCon 2020 - Flink SQL in 2020: Time to show off!Timo Walther
 
CouchDB – A Database for the Web
CouchDB – A Database for the WebCouchDB – A Database for the Web
CouchDB – A Database for the WebKarel Minarik
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSSN Masahiro
 
Event sourcing Live 2021: Streaming App Changes to Event Store
Event sourcing Live 2021: Streaming App Changes to Event StoreEvent sourcing Live 2021: Streaming App Changes to Event Store
Event sourcing Live 2021: Streaming App Changes to Event StoreShivji Kumar Jha
 
Presto in my_use_case
Presto in my_use_casePresto in my_use_case
Presto in my_use_casewyukawa
 
Tale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench ToolsTale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench ToolsSATOSHI TAGOMORI
 
It's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureYaroslav Tkachenko
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 

What's hot (20)

Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Flickr Architecture Presentation
Flickr Architecture PresentationFlickr Architecture Presentation
Flickr Architecture Presentation
 
Building a spa_in_30min
Building a spa_in_30minBuilding a spa_in_30min
Building a spa_in_30min
 
Using Embulk at Treasure Data
Using Embulk at Treasure DataUsing Embulk at Treasure Data
Using Embulk at Treasure Data
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
 
Fluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerFluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker container
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
 
Protocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at Twitter
 
CouchDB
CouchDBCouchDB
CouchDB
 
tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
 
Complex queries in a distributed multi-model database
Complex queries in a distributed multi-model databaseComplex queries in a distributed multi-model database
Complex queries in a distributed multi-model database
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
ApacheCon 2020 - Flink SQL in 2020: Time to show off!
ApacheCon 2020 - Flink SQL in 2020: Time to show off!ApacheCon 2020 - Flink SQL in 2020: Time to show off!
ApacheCon 2020 - Flink SQL in 2020: Time to show off!
 
CouchDB – A Database for the Web
CouchDB – A Database for the WebCouchDB – A Database for the Web
CouchDB – A Database for the Web
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSS
 
Event sourcing Live 2021: Streaming App Changes to Event Store
Event sourcing Live 2021: Streaming App Changes to Event StoreEvent sourcing Live 2021: Streaming App Changes to Event Store
Event sourcing Live 2021: Streaming App Changes to Event Store
 
Presto in my_use_case
Presto in my_use_casePresto in my_use_case
Presto in my_use_case
 
Tale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench ToolsTale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench Tools
 
It's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda Architecture
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 

Viewers also liked

Polyglot Persistence & Multi Model-Databases at JMaghreb3.0
Polyglot Persistence & Multi Model-Databases at JMaghreb3.0Polyglot Persistence & Multi Model-Databases at JMaghreb3.0
Polyglot Persistence & Multi Model-Databases at JMaghreb3.0ArangoDB Database
 
Domain driven design @FrOSCon
Domain driven design @FrOSConDomain driven design @FrOSCon
Domain driven design @FrOSConArangoDB Database
 
Extensibility of a database api with js
Extensibility of a database api with jsExtensibility of a database api with js
Extensibility of a database api with jsArangoDB Database
 
Creating data centric microservices
Creating data centric microservicesCreating data centric microservices
Creating data centric microservicesArangoDB Database
 
Microservice-based software architecture
Microservice-based software architectureMicroservice-based software architecture
Microservice-based software architectureArangoDB Database
 
Polyglot Persistence & Multi-Model Databases (FullStack Toronto)
Polyglot Persistence & Multi-Model Databases (FullStack Toronto)Polyglot Persistence & Multi-Model Databases (FullStack Toronto)
Polyglot Persistence & Multi-Model Databases (FullStack Toronto)ArangoDB Database
 
Processing large-scale graphs with Google(TM) Pregel
Processing large-scale graphs with Google(TM) PregelProcessing large-scale graphs with Google(TM) Pregel
Processing large-scale graphs with Google(TM) PregelArangoDB Database
 
Performance comparison: Multi-Model vs. MongoDB and Neo4j
Performance comparison: Multi-Model vs. MongoDB and Neo4jPerformance comparison: Multi-Model vs. MongoDB and Neo4j
Performance comparison: Multi-Model vs. MongoDB and Neo4jArangoDB Database
 
Handling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph DatabaseHandling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph DatabaseArangoDB Database
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBArangoDB Database
 
Polyglot Persistence & Multi-Model Databases
Polyglot Persistence & Multi-Model DatabasesPolyglot Persistence & Multi-Model Databases
Polyglot Persistence & Multi-Model DatabasesArangoDB Database
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databasesArangoDB Database
 

Viewers also liked (14)

Polyglot Persistence & Multi Model-Databases at JMaghreb3.0
Polyglot Persistence & Multi Model-Databases at JMaghreb3.0Polyglot Persistence & Multi Model-Databases at JMaghreb3.0
Polyglot Persistence & Multi Model-Databases at JMaghreb3.0
 
Domain driven design @FrOSCon
Domain driven design @FrOSConDomain driven design @FrOSCon
Domain driven design @FrOSCon
 
Extensibility of a database api with js
Extensibility of a database api with jsExtensibility of a database api with js
Extensibility of a database api with js
 
Creating data centric microservices
Creating data centric microservicesCreating data centric microservices
Creating data centric microservices
 
Guacamole
GuacamoleGuacamole
Guacamole
 
Microservice-based software architecture
Microservice-based software architectureMicroservice-based software architecture
Microservice-based software architecture
 
Polyglot Persistence & Multi-Model Databases (FullStack Toronto)
Polyglot Persistence & Multi-Model Databases (FullStack Toronto)Polyglot Persistence & Multi-Model Databases (FullStack Toronto)
Polyglot Persistence & Multi-Model Databases (FullStack Toronto)
 
Processing large-scale graphs with Google(TM) Pregel
Processing large-scale graphs with Google(TM) PregelProcessing large-scale graphs with Google(TM) Pregel
Processing large-scale graphs with Google(TM) Pregel
 
Performance comparison: Multi-Model vs. MongoDB and Neo4j
Performance comparison: Multi-Model vs. MongoDB and Neo4jPerformance comparison: Multi-Model vs. MongoDB and Neo4j
Performance comparison: Multi-Model vs. MongoDB and Neo4j
 
Handling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph DatabaseHandling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph Database
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDB
 
Polyglot Persistence & Multi-Model Databases
Polyglot Persistence & Multi-Model DatabasesPolyglot Persistence & Multi-Model Databases
Polyglot Persistence & Multi-Model Databases
 
NoSQL meets Microservices
NoSQL meets MicroservicesNoSQL meets Microservices
NoSQL meets Microservices
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databases
 

Similar to Software + Babies

12. Parallel Algorithms.pptx
12. Parallel Algorithms.pptx12. Parallel Algorithms.pptx
12. Parallel Algorithms.pptxMohAlyasin1
 
Interactions complicate debugging
Interactions complicate debuggingInteractions complicate debugging
Interactions complicate debuggingSyed Zaid Irshad
 
Coding For Cores - C# Way
Coding For Cores - C# WayCoding For Cores - C# Way
Coding For Cores - C# WayBishnu Rawal
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesDavid Martínez Rego
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++Mike Acton
 
Internet of Things, TYBSC IT, Semester 5, Unit IV
Internet of Things, TYBSC IT, Semester 5, Unit IVInternet of Things, TYBSC IT, Semester 5, Unit IV
Internet of Things, TYBSC IT, Semester 5, Unit IVArti Parab Academics
 
Scratching the itch, making Scratch for the Raspberry Pie
Scratching the itch, making Scratch for the Raspberry PieScratching the itch, making Scratch for the Raspberry Pie
Scratching the itch, making Scratch for the Raspberry PieESUG
 
BDM37 - Simon Grondin - Scaling an API proxy in OCaml
BDM37 - Simon Grondin - Scaling an API proxy in OCamlBDM37 - Simon Grondin - Scaling an API proxy in OCaml
BDM37 - Simon Grondin - Scaling an API proxy in OCamlBig Data Montreal
 
PostgreSQL worst practices, version FOSDEM PGDay 2017 by Ilya Kosmodemiansky
PostgreSQL worst practices, version FOSDEM PGDay 2017 by Ilya KosmodemianskyPostgreSQL worst practices, version FOSDEM PGDay 2017 by Ilya Kosmodemiansky
PostgreSQL worst practices, version FOSDEM PGDay 2017 by Ilya KosmodemianskyPostgreSQL-Consulting
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
Open west 2015 talk ben coverston
Open west 2015 talk ben coverstonOpen west 2015 talk ben coverston
Open west 2015 talk ben coverstonbcoverston
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Brian Brazil
 
Real time system_performance_mon
Real time system_performance_monReal time system_performance_mon
Real time system_performance_monTomas Doran
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLArnab Biswas
 
Concurrency Programming in Java - 01 - Introduction to Concurrency Programming
Concurrency Programming in Java - 01 - Introduction to Concurrency ProgrammingConcurrency Programming in Java - 01 - Introduction to Concurrency Programming
Concurrency Programming in Java - 01 - Introduction to Concurrency ProgrammingSachintha Gunasena
 
Lecture 1
Lecture 1Lecture 1
Lecture 1Mr SMAK
 

Similar to Software + Babies (20)

12. Parallel Algorithms.pptx
12. Parallel Algorithms.pptx12. Parallel Algorithms.pptx
12. Parallel Algorithms.pptx
 
Interactions complicate debugging
Interactions complicate debuggingInteractions complicate debugging
Interactions complicate debugging
 
Coding For Cores - C# Way
Coding For Cores - C# WayCoding For Cores - C# Way
Coding For Cores - C# Way
 
Lecture1
Lecture1Lecture1
Lecture1
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Internet of Things, TYBSC IT, Semester 5, Unit IV
Internet of Things, TYBSC IT, Semester 5, Unit IVInternet of Things, TYBSC IT, Semester 5, Unit IV
Internet of Things, TYBSC IT, Semester 5, Unit IV
 
Scratching the itch, making Scratch for the Raspberry Pie
Scratching the itch, making Scratch for the Raspberry PieScratching the itch, making Scratch for the Raspberry Pie
Scratching the itch, making Scratch for the Raspberry Pie
 
BDM37 - Simon Grondin - Scaling an API proxy in OCaml
BDM37 - Simon Grondin - Scaling an API proxy in OCamlBDM37 - Simon Grondin - Scaling an API proxy in OCaml
BDM37 - Simon Grondin - Scaling an API proxy in OCaml
 
PostgreSQL worst practices, version FOSDEM PGDay 2017 by Ilya Kosmodemiansky
PostgreSQL worst practices, version FOSDEM PGDay 2017 by Ilya KosmodemianskyPostgreSQL worst practices, version FOSDEM PGDay 2017 by Ilya Kosmodemiansky
PostgreSQL worst practices, version FOSDEM PGDay 2017 by Ilya Kosmodemiansky
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Scaling tappsi
Scaling tappsiScaling tappsi
Scaling tappsi
 
Gpgpu intro
Gpgpu introGpgpu intro
Gpgpu intro
 
Open west 2015 talk ben coverston
Open west 2015 talk ben coverstonOpen west 2015 talk ben coverston
Open west 2015 talk ben coverston
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
 
Real time system_performance_mon
Real time system_performance_monReal time system_performance_mon
Real time system_performance_mon
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Concurrency Programming in Java - 01 - Introduction to Concurrency Programming
Concurrency Programming in Java - 01 - Introduction to Concurrency ProgrammingConcurrency Programming in Java - 01 - Introduction to Concurrency Programming
Concurrency Programming in Java - 01 - Introduction to Concurrency Programming
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Surge2012
Surge2012Surge2012
Surge2012
 

More from ArangoDB Database

ATO 2022 - Machine Learning + Graph Databases for Better Recommendations (3)....
ATO 2022 - Machine Learning + Graph Databases for Better Recommendations (3)....ATO 2022 - Machine Learning + Graph Databases for Better Recommendations (3)....
ATO 2022 - Machine Learning + Graph Databases for Better Recommendations (3)....ArangoDB Database
 
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022ArangoDB Database
 
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022ArangoDB Database
 
ArangoDB 3.9 - Further Powering Graphs at Scale
ArangoDB 3.9 - Further Powering Graphs at ScaleArangoDB 3.9 - Further Powering Graphs at Scale
ArangoDB 3.9 - Further Powering Graphs at ScaleArangoDB Database
 
GraphSage vs Pinsage #InsideArangoDB
GraphSage vs Pinsage #InsideArangoDBGraphSage vs Pinsage #InsideArangoDB
GraphSage vs Pinsage #InsideArangoDBArangoDB Database
 
Webinar: ArangoDB 3.8 Preview - Analytics at Scale
Webinar: ArangoDB 3.8 Preview - Analytics at Scale Webinar: ArangoDB 3.8 Preview - Analytics at Scale
Webinar: ArangoDB 3.8 Preview - Analytics at Scale ArangoDB Database
 
Graph Analytics with ArangoDB
Graph Analytics with ArangoDBGraph Analytics with ArangoDB
Graph Analytics with ArangoDBArangoDB Database
 
Getting Started with ArangoDB Oasis
Getting Started with ArangoDB OasisGetting Started with ArangoDB Oasis
Getting Started with ArangoDB OasisArangoDB Database
 
Custom Pregel Algorithms in ArangoDB
Custom Pregel Algorithms in ArangoDBCustom Pregel Algorithms in ArangoDB
Custom Pregel Algorithms in ArangoDBArangoDB Database
 
Hacktoberfest 2020 - Intro to Knowledge Graphs
Hacktoberfest 2020 - Intro to Knowledge GraphsHacktoberfest 2020 - Intro to Knowledge Graphs
Hacktoberfest 2020 - Intro to Knowledge GraphsArangoDB Database
 
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?ArangoDB Database
 
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoDB Database
 
ArangoDB 3.7 Roadmap: Performance at Scale
ArangoDB 3.7 Roadmap: Performance at ScaleArangoDB 3.7 Roadmap: Performance at Scale
ArangoDB 3.7 Roadmap: Performance at ScaleArangoDB Database
 
Webinar: What to expect from ArangoDB Oasis
Webinar: What to expect from ArangoDB OasisWebinar: What to expect from ArangoDB Oasis
Webinar: What to expect from ArangoDB OasisArangoDB Database
 
ArangoDB 3.5 Feature Overview Webinar - Sept 12, 2019
ArangoDB 3.5 Feature Overview Webinar - Sept 12, 2019ArangoDB 3.5 Feature Overview Webinar - Sept 12, 2019
ArangoDB 3.5 Feature Overview Webinar - Sept 12, 2019ArangoDB Database
 
Webinar: How native multi model works in ArangoDB
Webinar: How native multi model works in ArangoDBWebinar: How native multi model works in ArangoDB
Webinar: How native multi model works in ArangoDBArangoDB Database
 
An introduction to multi-model databases
An introduction to multi-model databasesAn introduction to multi-model databases
An introduction to multi-model databasesArangoDB Database
 
Running complex data queries in a distributed system
Running complex data queries in a distributed systemRunning complex data queries in a distributed system
Running complex data queries in a distributed systemArangoDB Database
 
Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?ArangoDB Database
 

Software + Babies

  • 1. Software + Babies How to design software and APIs for parallelism on modern hardware Richard Parker Mathematician and freelance computer programmer in Cambridge, England Cologne, 20.01.2016
  • 2. Maths research 2011-2014. • My original (hobby) project was to work with matrices mod 2, obsessively fast. • I considered even a 5% speedup compulsory. • Highly optimized programs, some up to 30 years old, have been used for many years. • Mine were ~300 times faster in the end.
  • 3. I learned a lot of tricks. • I tried to consider every possible trick. • I had to understand how all the parts of the (x86) microprocessor worked. • And Intel AVX-512F when it comes. • I have been blogging to meataxe64 (on wordpress) if anyone is interested.
  • 4. Is this useful commercially? • I felt there must be situations where performance improvements of x10 or x100 would be useful in the real world. • I have recently been applying these ideas in the context of ArangoDB database. • My conclusion is that there are three main software considerations which can have a major (x10) influence on performance.
  • 5. 3 pillars of performance. • Multi-Core. If you use a lot of cores, you can get a lot more done per second. • Use Cache. Fetches from real memory take x100 as long as from cache. Real-memory bandwidth matters too. • Single thread parallelism. Even one processor can do many things at once.
  • 6. Multi-Core • 16 cores can do a lot more work per second than one core. • Using them is not always easy, and the results are often disappointing. . . • because the memory system often gets saturated, perhaps even with 4 cores, and using more cores doesn't help.
  • 7. Make better use of cache. • This often requires a new algorithm. • This work is still in its infancy, but it is gradually becoming mainstream. • Two algorithms with the same “complexity” such as quicksort and heapsort can easily differ by a factor of 10 because of cache.
  • 8. Single Thread Parallelism • This Haswell processor is designed to issue four instructions every clock cycle. • Three of those four can be “vector” instructions, operating on the 256-bit vector registers. • The result is that, roughly speaking, the processor can do 20-40 similar things in the same time it takes to do one.
  • 9. Higher level implications. • In a sense, these are all low-level tricks. • But to make good use of all these low-level tricks, the high-level software must also make changes. . . and one is critical. • any subroutine call should do a batch of stuff, not just one thing.
  • 10. Fred Brooks. • In “The Mythical Man-Month”, Brooks wrote "nine women can't make a baby in one month". • He was talking about writing programs, not running them, but the same principle is important, and getting more important, for program execution.
  • 11. Ask for more Babies. • In its simplest form, don't write for(i=0;i<20;i++) MakeOneBaby() • instead write MakeSomeBabies(20) • The first version takes 15 years to execute. • It must surely be clear that the second version can be done faster, depending on the resources available. :-)
  • 12. Example - Cosine • The computation of the cosine of an angle takes about 50 clock cycles. double cosine(double ang) • The “babies” version can manage about one cosine every 5 clock cycles once it gets going. Ten times faster. void cosines(int ct, double * ang, double * a)
  • 13. Why is Babies cosine faster? • 4 identical operations can be done in the vector registers at the same speed as one. This gives an immediate factor of 4. • The hardware can start another cosine rather than wait for an intermediate result. • Duplicated work (in this case loading the coefficients of the series expansion) can be done just once.
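The two cosine interfaces can be sketched in portable C. This is only an illustration under stated assumptions, not the speaker's routine: it uses a plain Taylor polynomial with no range reduction, so it assumes angles in roughly [-pi, pi] (worst-case error of a few parts in a million at the ends), and the `COS_COEF` table and the `const` qualifier on `ang` are choices made for this sketch. The batch routine keeps four independent Horner chains going and fetches each coefficient once per group of four, which mirrors the three reasons given on the slide. Whether a given compiler turns the inner loops into vector instructions is not guaranteed.

```c
#include <assert.h>

/* Taylor coefficients of cos(x) in powers of x*x: 1, -1/2!, +1/4!, ... -1/14!. */
static const double COS_COEF[8] = {
    1.0, -1.0 / 2.0, 1.0 / 24.0, -1.0 / 720.0, 1.0 / 40320.0,
    -1.0 / 3628800.0, 1.0 / 479001600.0, -1.0 / 87178291200.0
};

/* One-at-a-time interface: a single Horner chain, each step waits for the last. */
double cosine(double ang)
{
    double x2 = ang * ang;
    double p = COS_COEF[7];
    for (int j = 6; j >= 0; j--)
        p = p * x2 + COS_COEF[j];
    return p;
}

/* "Babies" interface: ct angles in, ct cosines out. */
void cosines(int ct, const double *ang, double *a)
{
    assert(ct >= 0);
    int i = 0;
    for (; i + 4 <= ct; i += 4) {
        double x2[4], p[4];
        for (int k = 0; k < 4; k++) {
            x2[k] = ang[i + k] * ang[i + k];
            p[k] = COS_COEF[7];
        }
        /* Each coefficient is fetched once and shared by four independent chains. */
        for (int j = 6; j >= 0; j--) {
            double c = COS_COEF[j];
            for (int k = 0; k < 4; k++)
                p[k] = p[k] * x2[k] + c;
        }
        for (int k = 0; k < 4; k++)
            a[i + k] = p[k];
    }
    for (; i < ct; i++)          /* leftover angles, one at a time */
        a[i] = cosine(ang[i]);
}
```

A caller with 20 angles makes one call, `cosines(20, ang, a)`, instead of 20 calls to `cosine`.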
  • 14. Even if count is 1 . . . • With just a little care, the babies version with a count of 1 is no slower than the non-babies version. • For example, the routine could start by testing whether the count is 1, and branching to new code if the count is not 1. • The (not-taken) branch will be correctly predicted if the count always is 1, and usually no time at all will be lost.
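A minimal sketch of the count-of-1 test, using a made-up routine `squares` (not from the talk) in place of a real leaf function. The single-item work sits on the fall-through path and the batch code is reached only when the count is not 1. How the compiler actually lays out the branch is its own decision, so treat this as the shape of the idea rather than a guarantee about the generated code.

```c
#include <assert.h>

/* Hypothetical batch code: out[i] = in[i] * in[i] for every i below ct. */
static void squares_batch(int ct, const double *in, double *out)
{
    for (int i = 0; i < ct; i++)
        out[i] = in[i] * in[i];
}

/* "Babies" entry point that costs almost nothing when ct is always 1. */
void squares(int ct, const double *in, double *out)
{
    assert(ct >= 0);
    if (ct != 1) {               /* not taken, and well predicted, if ct is always 1 */
        squares_batch(ct, in, out);
        return;
    }
    out[0] = in[0] * in[0];      /* the same work the one-at-a-time routine would do */
}
```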
  • 15. Throughput, not Latency. • for(i=0;i<20;i++) makeonebaby() This depends on the Latency - time from starting one to finishing that one. • makesomebabies(20) This depends on the Throughput - number that can be done on average in unit time.
  • 16. Latency constant. Throughput increasing. • We seem to have hit laws of physics with latency. It is quite hard to get electronics to (say) do a double-precision multiply and get the answer faster than ~2 ns. • But there is little to prevent it doing a lot at once. Indeed it can do forty since Sandy Bridge.
  • 17. Think of x86 as many units. • Each single x86 core . . . • Has a huge number of execution units just waiting for the chance to do something useful for the program. • If you can use a lot more of them at once, you can get a lot more done per unit time!
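One common way to give the idle execution units something to do is to break a single dependency chain into several independent ones. The sketch below sums an array twice: once with one accumulator, where every add waits for the previous add, and once with four accumulators that the core can work on at the same time. The function names are invented for this illustration, and because floating-point addition is not associative the two versions can differ in the last bits for general input.

```c
#include <assert.h>
#include <stddef.h>

/* One dependency chain: bounded by the latency of a single add. */
double sum_one_chain(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent chains: the adds can overlap, so throughput is the limit. */
double sum_four_chains(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)           /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```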
  • 18. Not just arithmetic • For example, fetches from L1 cache take two clock cycles . . . • But fetching 32 bytes is no slower than fetching 8 (or fewer), and you can issue two of these every clock cycle. • So 8-byte loads can be streamed at sixteen times the rate that the two-cycle latency alone would suggest: four values per fetch, two fetches per cycle, over two cycles.
  • 19. Much the same with memory • The fetch of a (64-byte) cache-line on this computer has a 180 clock-cycle latency. • But it can usually do three at once • and then you have 180 clock cycles to be getting on with other work that does not depend on those fetches. • If, that is, you have something to do.
  • 20. Example - database updates. • When applying updates to a database, particularly if it is distant (i.e. has a high ping-time) it can be important to batch them up and send off the whole batch at once. • This is an example of the “babies” concept that is already mainstream - DTO.
  • 21. Example - heap. • A heap allows you to put a key/value pair in, and take out the one with the smallest key. • I could tell you about a version that runs considerably faster. You need to put a few pairs in, and take a few pairs (sorted) out at each step. • But even if you don't know (yet) how to do that, it is sensible to use the “babies” interface anyway. • Perhaps you'll think of something later.
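As a sketch of what such an interface might look like, here is an ordinary binary min-heap hidden behind put-many and take-many calls. This is deliberately the naive implementation (each call simply loops over the classic single-item operation), not the faster version the slide alludes to. The names `Heap`, `Pair`, `heap_put_many`, `heap_take_many` and the fixed capacity are assumptions made for the example; the point is that the internals could later be replaced without touching any caller.

```c
#include <assert.h>

#define HEAP_CAP 256             /* hypothetical fixed capacity */

typedef struct { long key; long value; } Pair;
typedef struct { Pair item[HEAP_CAP]; int size; } Heap;

static void swap_items(Pair *a, Pair *b) { Pair t = *a; *a = *b; *b = t; }

static void sift_up(Heap *h, int i)
{
    while (i > 0) {
        int parent = (i - 1) / 2;
        if (h->item[parent].key <= h->item[i].key)
            break;
        swap_items(&h->item[parent], &h->item[i]);
        i = parent;
    }
}

static void sift_down(Heap *h, int i)
{
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->size && h->item[l].key < h->item[m].key) m = l;
        if (r < h->size && h->item[r].key < h->item[m].key) m = r;
        if (m == i)
            break;
        swap_items(&h->item[m], &h->item[i]);
        i = m;
    }
}

/* Put ct pairs into the heap in one call. */
void heap_put_many(Heap *h, int ct, const Pair *in)
{
    assert(ct >= 0 && h->size + ct <= HEAP_CAP);
    for (int k = 0; k < ct; k++) {
        h->item[h->size] = in[k];
        h->size++;
        sift_up(h, h->size - 1);
    }
}

/* Take up to ct pairs out, smallest key first; returns how many were taken. */
int heap_take_many(Heap *h, int ct, Pair *out)
{
    int n = 0;
    while (n < ct && h->size > 0) {
        out[n++] = h->item[0];
        h->size--;
        h->item[0] = h->item[h->size];
        sift_down(h, 0);
    }
    return n;
}
```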
  • 22. Example - Hash table • Here the dead time is waiting for a memory fetch. • It should be clear what the interface should be. . . • You look-up (insert, delete, whatever) many things in one call. Give it a chance to be clever and fast.
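A many-at-once lookup lets the routine start all the memory fetches before it needs any of them. The sketch below is a small open-addressing table (linear probing) whose `lookup_many` first issues a prefetch for every key's home slot and only then probes. Everything here is an assumption made for illustration: the table layout, the hash constant, the reserved key 0, and the names. `__builtin_prefetch` is a GCC/Clang extension, so it is wrapped in a macro that does nothing elsewhere. A production version would probably work through very large batches in chunks so that prefetched lines are still in cache when they are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch(p)   /* compiler hint, not standard C */
#else
#define PREFETCH(p) ((void)(p))
#endif

#define TABLE_SLOTS 1024u   /* hypothetical size; must be a power of two */
#define EMPTY_KEY 0u        /* key 0 is reserved to mark an empty slot */

typedef struct { uint64_t key, value; } Slot;
typedef struct { Slot slot[TABLE_SLOTS]; } Table;   /* zero-initialise before use */

static size_t home(uint64_t key)
{
    return (size_t)((key * 0x9E3779B97F4A7C15ull) >> 32) & (TABLE_SLOTS - 1);
}

/* Insert or overwrite ct pairs. Assumes non-zero keys and a table that never fills. */
void insert_many(Table *t, int ct, const uint64_t *key, const uint64_t *value)
{
    for (int i = 0; i < ct; i++) {
        assert(key[i] != EMPTY_KEY);
        size_t s = home(key[i]);
        while (t->slot[s].key != EMPTY_KEY && t->slot[s].key != key[i])
            s = (s + 1) & (TABLE_SLOTS - 1);
        t->slot[s].key = key[i];
        t->slot[s].value = value[i];
    }
}

/* Look up ct non-zero keys; found[i] says whether value[i] was filled in. */
void lookup_many(const Table *t, int ct, const uint64_t *key,
                 uint64_t *value, unsigned char *found)
{
    for (int i = 0; i < ct; i++)            /* pass 1: start all the fetches */
        PREFETCH(&t->slot[home(key[i])]);
    for (int i = 0; i < ct; i++) {          /* pass 2: probe, ideally from cache */
        size_t s = home(key[i]);
        while (t->slot[s].key != EMPTY_KEY && t->slot[s].key != key[i])
            s = (s + 1) & (TABLE_SLOTS - 1);
        found[i] = (t->slot[s].key == key[i]);
        if (found[i])
            value[i] = t->slot[s].value;
    }
}
```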
  • 23. Example - GeoIndex • I have very recently looked at how one can find the nearest points in a GeoIndex. • There are several slowish steps - memory fetch, trig functions, distance computations, sorting the results etc. • Every one of these steps can benefit greatly by handling multiple points rather than just one.
  • 24. Compilers good without babies. • If one implements, say, the cosine function (one-at-a-time interface) the code is dependency bound anyway. The exact algorithm wanted must be carefully coded to minimize the dependency chain, but if this is done in (say) C, it will run well. • An assembler programmer will be hard pressed to get more than a few percent improvement.
  • 25. “Babies” changes everything. • If you want to write a “babies” version of cosine, it is an uphill struggle to get a compiler to generate good code. • The use of vector instructions, partial loop unrolling, conditional moves rather than branches, and so on make the challenge too hard for even a modern compiler.
  • 26. Compilers may get better. • Once programmers are asking more of their compilers (by writing babies versions) this may give a wake-up call. • I would love to see a language which helps me write top-performance code! • As of today, however, the answer is often to write the “leaf” babies routines in assembler code.
  • 27. “Ninja Assembler” • Today's assemblers have changed little in 50 years. . . • But it is not hard to imagine a language designed for writing high-performance x86 code, but still offering many of the luxuries of a high-level language. • One issue is portability. Portability is a big obstacle to making best use of the machine you actually have!
  • 28. Portability problems. • Laying out data for the vector instructions. • Aligning parts of a data structure. • Specializing types for vector use. • When is a branch unpredictable? • What is the width of the vector registers? • When may a data fetch miss cache? • Cramming data into a single cache-line. • At the 5% level there are many more.
  • 29. The 80/20 rule. • Even for “peak performance” code, ~80% of the time is spent in ~20% of the code. • We accept writing that 20% in assembler. • But this determines the data layout. • So it is very hard, at the moment anyway, to write the other 80% in a high-level language. • Mainly because of portability!
  • 30. Strategic view • As of today, it seems to me that a properly programmed x86 can outperform anything. • What I am less clear about is whether the software community can afford the extra time and cost needed to program it properly. • My aim is to try to reduce that cost.
  • 31. Summary • Try to use “babies” interfaces at all times - even if you can't immediately see how or why you might want to make it faster • As of today, a small, assembler, babies routine can make a big difference to the speed of a whole system. • And we can hope for better software development tools in future.