I dream of data.tables
by Stefano Ugliano, Lead AI Engineer at 5Analytics
A few reasons why I’m in love with R’s smartest class
As one’s proficiency increases, R becomes more than a simple tool or programming language – it becomes a way of thinking, of approaching problems. One of the first signs of such proficiency, as heaps of answers on StackExchange attest, concerns loops: good R code almost entirely avoids explicit loops – not so much because of running speed, but because there is usually a better, neater, simpler way (the apply family of functions is only one of many examples).
Another, less-discussed symptom of your R game advancing is a growing affection towards certain packages, sometimes bordering on devotion or lust (or both). Some packages considerably extend the potential of R; others polish or simplify what is already there, often speeding up both the coding work for us humans and the run time for our beloved fellow machines.
In my case the data.table package did both and then some, turbo-boosting the already brilliant native data.frame class beyond what I could possibly imagine and adding an extra flavor of functional programming. What I present here is definitely not an extensive guide, nor a proper introduction, to the package; see it rather as a teaser, or as well-deserved praise after all it has done for me in recent months.
The data.table class, around which the package was first conceived some 10 years ago, is in many respects very similar to base R’s data.frame, but it holds the major advantage of being able to modify an object by reference rather than by copy. This alone would be enough to make data.table excel at dealing with “big-ish data” – meaning those data objects that R could process, if memory were well managed and we had plenty of time – but there is actually much more. Take for example the seldom pleasant experience of importing data from a large .csv file: my personal measurements show that data.table’s own fread() is faster than the base function read.csv() by a quite gorgeous factor of 20, and other users report similar if not better improvements!
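To see the difference on your own machine, here is a minimal, self-contained sketch: it writes a throwaway sample file (the file name demo.csv and the row count are my own choices, not from any benchmark) and then times both readers. Real-world speedups like the factor of 20 above show up on much larger files than this toy one.

```r
library(data.table)

# Build a sample file to read back; "demo.csv" is just a placeholder name.
n <- 1e6
fwrite(data.table(x = rnorm(n), y = sample(letters, n, replace = TRUE)),
       "demo.csv")

system.time(df <- read.csv("demo.csv"))  # base R reader
system.time(dt <- fread("demo.csv"))     # data.table's fast reader

unlink("demo.csv")  # clean up the temporary file
```

The exact ratio depends on file size, column types, and hardware, so treat any single number with a grain of salt.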
Almost-instant data retrieval is another of the aces up data.table’s sleeve. By setting one or more columns of the table as key, the whole object is pre-sorted so as to allow much faster subsetting: will you be querying your big table for all the transactions performed by this or that customer? Just set customer as key, and you’ve saved yourself and your code precious time – how much time? We are talking binary search vs vector scan, i.e. O(log n) vs O(n). One simple benchmark from the package’s introductory vignette reports an improvement of roughly a factor of 1400. (And all of a sudden that factor of 20 from before is not so impressive anymore ...)
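A small sketch of the keyed-lookup idea, on made-up data (the table tx, the column names customer and amount, and the sizes are all invented for illustration):

```r
library(data.table)

# A hypothetical transactions table: many rows, repeated customer ids.
set.seed(42)
tx <- data.table(customer = sample(1e5, 1e6, replace = TRUE),
                 amount   = runif(1e6))

setkey(tx, customer)   # pre-sorts the table by customer, once

tx[J(12345)]           # keyed lookup: binary search, O(log n)
tx[tx$customer == 12345, ]  # classic vector scan: O(n)
```

Both lines return the same rows; the keyed version simply gets there by binary search instead of scanning every row, which is where the dramatic speedup comes from on large tables.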
I’m sure the why of this package is now clear to everyone, so I’d like to briefly present the how. As stated above, all these speed improvements come alongside a slightly different syntax that allows data.table to do much more than its sibling classes. The canonical form of a data.table query is
DT[i, j, by]
- DT is the object itself;
- i is the criterion according to which the rows will be subset;
- j is the column selection;
- by is the aggregation rule.
These rules have an immediate equivalence in SQL queries, with the correspondence between the two languages being the following:
R vs SQL
i = WHERE
j = SELECT
by = GROUP BY
To see this in action, here’s a brief example that uses the omnipresent mtcars data:
> library(data.table)
> dt <- data.table(mtcars)
First we can pick all the information for the cars whose “cyl” is 6; instead of the classical dt[dt$cyl == 6, ] we are allowed to save a little typing (I love that) and write directly
> dt[cyl == 6]
which would be equivalent to the query SELECT * FROM dt WHERE cyl = 6;
To get the average weight (“wt”) of all the above cars, we simply add within the brackets:
> dt[cyl == 6, mean(wt)]
Isn’t it brilliant? And now let’s go fancy: we want the average weight of the cars once we group them by “gear” – and furthermore, the rows should be ordered by increasing average weight:
> dt[cyl == 6, .(avgwt = mean(wt)), by = gear][order(avgwt)]
Notice here the use of the period before the parentheses enclosing a more complex request for j: .() is simply an alias for list(), and it would also have been necessary in order to query several columns at once. Also notice that we could directly append a second pair of brackets to specify the ordering: each pair of brackets returns a data.table, so in principle we could chain a complete new “i, j, by”-triplet onto the result – and then another one after that, and so on ...
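To make the .() notation and the chaining concrete, here is a short sketch on the same mtcars data (the column labels avg_wt and avg_hp are my own, purely illustrative names):

```r
library(data.table)
dt <- data.table(mtcars)

# .() is an alias for list(): compute several columns in j at once
dt[cyl == 6, .(avg_wt = mean(wt), avg_hp = mean(hp)), by = gear]

# Chaining: each [] returns a data.table, so queries can be stacked.
# Here: average weight per (cyl, gear) group, sorted descending, then filtered.
dt[, .(avg_wt = mean(wt)), by = .(cyl, gear)][order(-avg_wt)][avg_wt > 3]
```

Chains like this read left to right as a small pipeline, which is part of what gives data.table its functional-programming flavor.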
There’s plenty more to show and tell about data.table, but I won’t delve any deeper. Rather, I will consider myself satisfied if I managed to spark your curiosity about this masterpiece of a library, and I will let you have fun by exploring it by yourselves. I am sure, though, that you won’t regret trying it out!
Additional material and further reading:
I can’t pretend I have been the first to write about this wonderful package; on the data.table GitHub wiki, in the articles section, you will find plenty of links to the works of my predecessors. The very same wiki collects plenty of interesting material to help you when approaching data.tables for the first time.
An interesting place to learn some basics of data.table in the shape of video lectures is the first chapter of this datacamp course, taught by Matt Dowle (the inventor of data.table himself!). Only the first chapter is free, but it’s always better than nothing ...
Another school of thought prefers the modular formalism of the dplyr package – and who am I to judge them just because of their sins? Here’s an interesting comment by Hadley Wickham, father of dplyr (amongst many other packages), on the data.table vs dplyr feud. Although I am quite charmed by the “pipe formalism”, I am planning to stick to this side for a while.
 It is something of an urban legend that a single for loop will drag your whole code to a halt: R can actually handle them quite decently, unlike my dreaded memories of loops in Mathematica ...
 You know, it’s better getting ready for the Singularity anyways.