I have about fifteen years of experience in data mining and data analysis. I’ve worked in a variety of industries: financial services, pharmaceuticals, internet companies.
And I’ve written a couple of books on data analysis. Today’s talk isn’t about a subject in either book, but it is inspired by a passage in the second book.
Before I start today’s talk, I want to explain to you why I’m talking about this topic. In my book, one of my chapters is devoted to performance tips. One of my performance tips was about how to quickly look up a value in a table of values.
Then, I was reading through some old comments on R mailing lists and ran into this message. How many people in the room own a copy of this book? <Pick up MASS book> (For those who don’t, how many have used the MASS library?) So, the guy who wrote this email is the guy who wrote this book (and the MASS package). This made me feel really nervous that I had written something incorrect, so I decided to take a closer look at how tables are implemented in R. Today, I’m going to tell you about how lookups in R work, how I tested their performance, and how you can use this information to help you write faster R code.
Today, I’m going to tell you the story of how I tested the performance of different lookup methods in R. I’m going to give a short introduction to different types of objects in R, then explain to you how I tested performance (testing performance used some interesting features in R). Next, I will tell you about the results. And if you’re all still awake, I will tell you how to optimize your program’s performance.
Everything in R is an object. We will start by looking at a few simple data types in R. The data type that you will probably encounter most frequently in R is the numeric vector. Numeric vectors represent numeric values. The class function tells you the class of an object; the class tells R what methods (or functions) can be applied to an object.
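A minimal sketch of the kind of session this slide describes. The variable name x is my own choice for illustration.

```r
# A bare number typed at the prompt is stored as a numeric vector of length one.
x <- 4

class(x)    # "numeric"
typeof(x)   # "double": the underlying storage type
length(x)   # 1: even a single value is a vector
```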
Here is another example of a data type in R: integers. Notice that I use the function as.integer to explicitly request an integer. If you were to just type 4, R would return a numeric value.
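Something like the following shows the difference. The variable names are illustrative, and the L suffix is a second way to ask for an integer that the slide does not mention.

```r
# as.integer() explicitly converts the value to an integer.
i <- as.integer(4)
class(i)    # "integer"

# A bare 4 is numeric (double), not integer.
n <- 4
class(n)    # "numeric"

# The L suffix also produces an integer literal.
class(4L)   # "integer"
```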
Here is another important example of an object. Character vectors represent text values. In many other languages, these are called strings.
Another example data type is the logical vector. All of the examples so far have been vectors with one element. But of course, vectors can have multiple elements. Let’s look at a couple of examples.
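A short sketch of character and logical vectors, with values I made up for illustration:

```r
# A character vector holds text; other languages would call this a string.
s <- "hello"
class(s)    # "character"

# A logical vector holds TRUE/FALSE values.
b <- TRUE
class(b)    # "logical"

# Both are still vectors with a single element.
length(s)   # 1
length(b)   # 1
```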
The colon operator is used to define a sequence of values. When the starting value is a whole number, it returns integers. (A trick to return a single integer is to just have a range from one value to itself.) The combine function (“c”) is used to combine a set of values together into a vector.
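Here is a small sketch of both operators. The specific values are only examples.

```r
# The colon operator builds a sequence; with whole-number endpoints the result is integer.
r <- 1:5
class(r)        # "integer"
length(r)       # 5

# The trick from the talk: a range from a value to itself gives a single integer.
single <- 7:7
class(single)   # "integer"

# c() combines values into one vector.
v <- c(1, 2, 3)
class(v)        # "numeric"
w <- c("a", "b", "c")
class(w)        # "character"
```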
If you need to represent a heterogeneous collection of objects, you can use a list. A very common type of list is a data frame. Data frames are like database tables (or tables in Excel); they contain multiple columns representing different variables in a data set.
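A sketch with a made-up list and a made-up data frame; the element and column names are hypothetical.

```r
# A list can hold elements of different types.
l <- list(number = 1, text = "two", flag = TRUE)
class(l)        # "list"

# A data frame is a list of equal-length columns.
df <- data.frame(id = 1:3, label = c("a", "b", "c"))
class(df)       # "data.frame"
is.list(df)     # TRUE: underneath, a data frame is a list
ncol(df)        # 2
```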
Everything in R is an object. Even functions.
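For example, a function can be assigned to a name, inspected, and passed to another function like any other value. The function f here is a hypothetical one written for illustration.

```r
# Define a function and bind it to a name, just like any other object.
f <- function(x) x + 1

class(f)     # "function"
class(sum)   # "function": built-in functions are objects too

# Because functions are objects, they can be passed as arguments.
sapply(1:3, f)
```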
Let’s move on to another important type of object. If you work with R, you have probably used vectors and lists. You have also used environment objects, but you may not have realized it. At any time in R, there is a set of objects that you can access. You may have given these objects names. R represents these relationships as environments. In the example session that I show here, I created three objects, named “one”, “two” and “three”. R stored information mapping these names to these values in an environment called the global environment. I assigned the symbol “e” to point to the global environment (environments are just objects, like everything else in R). Then I showed the class of “e”. I also used the objects function to show the objects defined in this environment. Notice that the objects include one, two, three, and e.
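A session along these lines would look roughly like the sketch below. The values assigned to one, two, and three are my assumption; the talk only gives the names.

```r
# Create three objects at the top level; the values are illustrative.
one <- 1
two <- 2
three <- 3

# Bind the global environment itself to the name e.
e <- globalenv()
class(e)      # "environment"

# List the names defined in that environment. Run at the top level of a
# fresh session, the listing includes "e", "one", "three", and "two".
objects(e)
```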
Now, let’s talk about how you look up a value in an object in R. To do this, we’ll define a simple example vector. Here, I defined a vector named “a” with ten values. You can use the bracket operator to refer to a specific location. In this example, I looked up the third item in a, which was the value 3.
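A sketch of that lookup, assuming the ten values are simply 1 through 10 (which is consistent with the third item being 3, but is my guess at the slide's contents):

```r
# A vector with ten values; 1:10 is an assumption about the slide.
a <- 1:10

# The bracket operator selects by position (R counts from 1).
a[3]      # 3

# Double brackets select exactly one element.
a[[3]]    # 3
```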
(next page shows algorithm) (then walks through example)
As an example, we will show how R looks up the value with the label “F” in the array “a.20”. To do this, R iterates through each value in the names array to find the index of the correct value. Then R returns the correct value. <next slide>
R looks up the first item in the names array, which does not match.
Then, R looks up the second item and checks if it matches.
R continues to iterate through the names array until it finds the match.
Ah, found the matching value. The index for the match is 5
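The scan walked through above can be written out in plain R as a rough illustration. This is not R's internal code, and linear_lookup is a hypothetical helper. The contents and names of a.20 are also my assumption (twenty values labeled with the first twenty capital letters), so the position of the match in this sketch need not equal the one on the slide.

```r
# Hypothetical test vector: 20 values with single-letter names.
a.20 <- 1:20
names(a.20) <- LETTERS[1:20]

# Walk the names one at a time until the label matches, then return the value.
linear_lookup <- function(v, label) {
  nms <- names(v)
  for (i in seq_along(nms)) {
    if (nms[i] == label) {
      return(v[[i]])   # found: return the value stored at the matching index
    }
  }
  NULL                 # no name matched
}

linear_lookup(a.20, "F")
a.20[["F"]]            # the built-in lookup gives the same value
```

The cost of this approach grows with the position of the label in the names array, which is why a label near the end takes longer to find than one near the start.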
Here is a simple example of how a hash table works. I’m leaving out some important details here.
- Most importantly, I don’t explain what to do when two labels hash to the same value (this is called a hash collision).
- Nor do I talk about how you choose the hash table size, or the hash function.
- A full discussion of hash functions is beyond the scope of this talk. (It’s beyond the scope of most algorithms classes!)
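To make the idea concrete, here is a toy hash table in plain R. Everything in it is an assumption for illustration: the hash function (sum of character codes modulo the table size), the table size of 8, and the helper names toy_hash, put, and lookup. It is not how R implements environments. Labels that land in the same bucket are simply kept together in a small list.

```r
# Toy hash function: sum of the character codes, folded into 1..size.
toy_hash <- function(label, size) sum(utf8ToInt(label)) %% size + 1L

# Store a value in the bucket that its label hashes to.
put <- function(buckets, label, value) {
  i <- toy_hash(label, length(buckets))
  buckets[[i]] <- c(buckets[[i]], setNames(list(value), label))
  buckets
}

# Look up a label: jump straight to its bucket, then search only that bucket.
lookup <- function(buckets, label) {
  buckets[[toy_hash(label, length(buckets))]][[label]]
}

buckets <- vector("list", 8)        # 8 empty buckets
buckets <- put(buckets, "A", 1)
buckets <- put(buckets, "F", 6)
buckets <- put(buckets, "N", 14)    # "F" and "N" hash to the same bucket here

lookup(buckets, "F")   # 6
lookup(buckets, "N")   # 14
lookup(buckets, "Z")   # NULL: nothing stored under that label
```

The point of the sketch is that a lookup goes directly to one bucket instead of scanning every name, so the work does not grow with the total number of labels as long as the buckets stay small.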
Notice that R doesn’t print out environment objects in a friendly way.
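A small sketch of what that looks like, using a fresh environment with two made-up entries. Printing the environment shows only an address, so you need functions such as ls or mget to see what is inside.

```r
# Build a small environment by hand; the names and values are illustrative.
e <- new.env()
assign("one", 1, envir = e)
assign("two", 2, envir = e)

# Printing shows something like <environment: 0x...>, not the contents.
print(e)

# ls() lists the names; mget() fetches the values for those names.
ls(e)
mget(ls(e), envir = e)

# Individual values can be fetched by name.
get("one", envir = e)   # 1
e[["two"]]              # 2
```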
For testing, I generated a set of different arrays and environments with between 1024 and 32768 elements. I generated one object for each power of two. <go to next page>
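One way to build such a test set is sketched below. The naming scheme for the elements ("V1", "V2", ...) and the helpers make_array and make_env are my assumptions; the talk does not say how the objects were labeled.

```r
# One size per power of two from 1024 (2^10) to 32768 (2^15).
sizes <- 2^(10:15)

# A named vector with n elements; the "V<i>" labels are hypothetical.
make_array <- function(n) {
  v <- seq_len(n)
  names(v) <- paste0("V", seq_len(n))
  v
}

# An environment holding the same names and values.
make_env <- function(n) {
  list2env(as.list(make_array(n)),
           envir = new.env(hash = TRUE, size = as.integer(n)))
}

arrays <- lapply(sizes, make_array)
envs   <- lapply(sizes, make_env)

sapply(arrays, length)   # 1024 2048 4096 8192 16384 32768
```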
To test the lookup speed, I wrote a function called “test expressions” that would:
- Print a message
- Time how long it took to apply a function to a set of different sized data objects many times
You can specify the message, the function, the set of data objects, and the number of repetitions (for each object). Notice that this function takes another function as an argument! In the example here, I show how I tested the performance of looking up the first value in each object by index. (I calculated a sum rather than just returning values.)
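A minimal sketch of a timing helper with that shape. This is my reconstruction from the description, not the function from the slides: the name test_expressions, the argument names, the use of system.time, and the way the sum is accumulated are all assumptions.

```r
# Print a message, then time `reps` applications of `fun` to each object.
# Returns the elapsed seconds for each object.
test_expressions <- function(msg, fun, objects, reps = 1000) {
  cat(msg, "\n")
  sapply(objects, function(obj) {
    total <- 0
    timing <- system.time(
      for (i in seq_len(reps)) {
        total <- total + fun(obj)   # accumulate a sum so each result is used
      }
    )
    timing[["elapsed"]]
  })
}

# Small hypothetical test objects so the sketch runs quickly.
test_arrays <- lapply(c(1024, 2048, 4096), seq_len)

# Look up the first value in each object by index.
timings <- test_expressions("first value by index",
                            function(v) v[1],
                            test_arrays,
                            reps = 1000)
timings
```

Passing the lookup as a function argument means the same helper can time every lookup style just by swapping in a different function.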
Here are the results from my tests. How many people think that I should use a chart to present this data? As a show of hands, how many people in this room have read Tufte’s books? How many people raised your hand for both? Seriously, I don’t think that this is enough data to bother plotting. It’s hard to read on the screen (because the type is small), but the trends are so clear that you can see them by just looking at numbers. Let me show you some interesting trends.
First, let’s look at the array lookups by name. Notice that these values increase linearly with the number of elements in the array.
Now, let’s focus on the results for the biggest arrays. <change to next slide>
There are two key takeaways. First, looking up a single value in an array (by index), or an environment (by symbol) is very fast, regardless of table size. Next, notice that lookups by name are much, much slower in arrays. The only exception is looking up the first value in an array by double bracket. Double bracket notation is a little faster. So, what does this mean? <turn to next page>
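For reference, the lookup styles being compared look roughly like this in code. The vector v and its names are made up for illustration, and the sketch shows only the notation, not the timings.

```r
# A small named vector and an environment with the same contents.
v <- c(first = 10, second = 20, third = 30)
e <- list2env(as.list(v), envir = new.env())

v[1]                      # array lookup by index (single bracket)
v[[1]]                    # array lookup by index (double bracket)
v["first"]                # array lookup by name (single bracket)
v[["first"]]              # array lookup by name (double bracket)
e$first                   # environment lookup by symbol
get("first", envir = e)   # environment lookup by name with get()
```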
You could always use environment objects instead of vectors to store tables of values. But I think that will lead you to write more code. You should use whatever method is simplest and easiest to implement your program. When you know that it runs correctly, then you can optimize it. Here is the process that I use to write efficient code.
By the way, even R language expressions are objects in R. That’s how I can show how R parses this expression here.
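A sketch of the idea, using a[3] as a stand-in expression (the expression on the slide is not given in these notes, so this choice is an assumption):

```r
# quote() captures an expression without evaluating it.
ex <- quote(a[3])
class(ex)       # "call"

# The parsed expression is a call to the `[` function with two arguments.
as.list(ex)     # the function `[`, the symbol a, and the number 3

# The captured expression can be evaluated later, here with a supplied value of a.
eval(ex, list(a = 1:10))   # 3
```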