Notes from the Messyverse: How to tidy nested lists in R
Hi! I'm Ryan Moore, NBA fan & PhD candidate in Eric Wommack's viral ecology lab @ UD. Follow me on Twitter!
You’re a fairly recent convert to the Tidyverse, and you’re still using an unholy amalgamation of Tidy verbs and base R throughout your code. What’s more, you’ve got reams of legacy code that’s not going to magically tidy itself up anytime soon. Don’t feel bad about your heathen love of the apply family (after all, even Hadley says that
map is just a fancier version of
lapply). Rather, embrace those old-school lists of lists! All your nice, modern tibbles are just a few short verbs away.
So you’re walking down the hall to the office breakroom when you overhear a conversation some of your colleagues are having. You’re coming closer and things must be starting to get a little intense in there. One of them starts to shout. “By Jove, it’s lists all the way down!” By this time you’re starting to hurry…are they really talking about…? You fly around the corner and see a grad student with Professor Gamgee and his flowing grey beard huddled in front of an old CRT workstation, with what looks like–gasp–an Emacs buffer filled top to bottom with–no, wait is that a JSON file?–the biggest, most nested, list-of-lists you’ve ever seen!
“Oh good, you made it,” one of them says.
“Yes, you’re the one who makes all those pretty graphs right? With something called the ggplot?”
You cough, “Maybe a little.”
“Well take a seat then! Take a seat!”
The CRT flickers as you take the offered chair. Gamgee gives it a quick thwack. It clears up, and you give it a look:
“What’s with the all-caps variable names?”
“Huh? Oh that, well, you see, I used to love Common Lisp, and, well, the reader was always case converting, so….”
“Wow, Common Lisp, eh? I guess, that explains the Emacs….”
“Alright kid, so we’ve got the watermelon data in…you know the experiment, right? No? Well, we were testing out some new fertilizer on our melons. As you can see, we did a couple of different experiments. Each experiment has two groups. The
Control group used the standard husbandry procedures, and the
Treatment group got our new fertilization strategy. Got it? Alright then, we’ll leave the plotting to you. See you in an hour or so.”
Stomach growling, you shoot a sideways glance at the fridge. Soups and sandwiches will just have to wait.
First things first, you get rid of all those upcased variable names. A few quick Emacs macros and you’re in business:
Being a lover of all things Tidy, you think, “Wow! lists are okay, but I really prefer tibbles….it’s soo easy to plot them with ggplot!” You remember someone mentioning a function to convert untidy things to tibbles. What was that again? Oh yeah,
as_tibble. Well, why not give it a try? You decide to take it step-by-step so you pull out the first experiment and work with that to start.
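Your first attempt might look something like the sketch below. The numbers and the field names (`weight`, `diameter`) are made up for illustration, since only the shape matters here: a named list holding one named sublist per group.

```r
library(tibble)

# Hypothetical stand-in for the first experiment (values invented for this sketch).
experiment_a <- list(
  control   = list(weight = 12.1, diameter = 30.2),
  treatment = list(weight = 14.3, diameter = 33.8)
)

# as_tibble() treats each top-level element as a column,
# so every group ends up as its own list-column.
wide <- as_tibble(experiment_a)
print(wide)
```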
Hmm, that’s not quite what you want–it looks like there is a column for each list. You seem to remember a function for applying other functions to elements of vectors. Ah yes, it’s called
map from the purrr package. But wait, you think, I have lists not vectors, and the title of the help page is clear:
Apply a function to each element of a vector
It turns out that lists are just vectors. (Try running
list() %>% is.vector.) Oh, and look at that: there are special versions of map called map_dfr and
map_dfc, which return data frames by row-binding and column-binding respectively.
map returns a vector the same length as its first argument (e.g.,
map returns a list,
map_int returns an integer vector, and
map_dfr returns a data frame by row binding.) Since you want each of the nested lists to be a row in your data frame, you’ll need
map_dfr. But what kind of function do you need to apply to each of the lists? It turns out that you need a function that returns its argument as is, something like
function(l) l. You give that a try.
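A minimal sketch of that attempt, assuming a made-up `experiment_a` with hypothetical `weight` and `diameter` fields:

```r
library(purrr)  # map_dfr() also needs dplyr installed for the row-binding step

# Hypothetical stand-in for the first experiment (values invented for this sketch).
experiment_a <- list(
  control   = list(weight = 12.1, diameter = 30.2),
  treatment = list(weight = 14.3, diameter = 33.8)
)

# Return each sublist unchanged and let map_dfr() bind the results by row.
rows <- map_dfr(experiment_a, function(l) l)
print(rows)
```

Recent purrr releases mark the `_dfr` variants as superseded, but they still work.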
Hey that worked! But writing that whole
function(l) l seems kind of unnecessary, so you wonder if there is a better way. The help page for
map mentions that you can write anonymous functions with formulas (e.g.,
~ .x + 2, would be converted to
function(x) x + 2). Given your love of syntax sugar, you give it a shot.
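With the formula shorthand, the same call might look like this (again on invented data with hypothetical field names):

```r
library(purrr)  # map_dfr() also needs dplyr installed for the row-binding step

# Hypothetical stand-in for the first experiment (values invented for this sketch).
experiment_a <- list(
  control   = list(weight = 12.1, diameter = 30.2),
  treatment = list(weight = 14.3, diameter = 33.8)
)

# ~ .x is a one-argument anonymous function that returns its input as is.
rows <- map_dfr(experiment_a, ~ .x)
print(rows)
```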
Now, you could write
~ . since it is just a single argument function, but Hadley says you should avoid this. The
~ .x thing looks kind of cool, but it is definitely a little obscure for someone who isn’t too familiar with the Tidyverse.
Just then, you remember that base R has a function called
identity, which returns its argument as is. Identity functions are much beloved by functional programmers and mathematicians alike, and Tidyverse feels pretty functional. Not to mention that it’s clearer to be explicit about what you’re doing rather than to use sweet syntax sugar. So you adjust your code once more.
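The adjusted code could be sketched like so, with the same kind of made-up experiment data:

```r
library(purrr)  # map_dfr() also needs dplyr installed for the row-binding step

# Hypothetical stand-in for the first experiment (values invented for this sketch).
experiment_a <- list(
  control   = list(weight = 12.1, diameter = 30.2),
  treatment = list(weight = 14.3, diameter = 33.8)
)

# identity() comes with base R and simply returns its argument.
rows <- map_dfr(experiment_a, identity)
print(rows)
```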
That’s not too shabby. But wait…isn’t it kind of overkill to use
map_dfr if you’re just passing in the identity function anyway? Back on the help page you see this:
map_dfr() and map_dfc() return data frames created by row-binding and column-binding respectively.
So if you’re only passing in the identity function to
map_dfr, then really you’re just exploiting
map_dfr for its ability to make data frames (or tibbles!) through row-binding. In that case, couldn’t you just use bind_rows directly?
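Here is a sketch of calling `bind_rows` from dplyr straight on the list, using an invented `experiment_a` whose field names are only placeholders:

```r
library(dplyr)

# Hypothetical stand-in for the first experiment (values invented for this sketch).
experiment_a <- list(
  control   = list(weight = 12.1, diameter = 30.2),
  treatment = list(weight = 14.3, diameter = 33.8)
)

# Each named sublist is treated as one row of the result.
rows <- bind_rows(experiment_a)
print(rows)
```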
Yes! Now this is programming!
The bind_rows trick works with your data because each sublist has names.
If you remove the names and try it again, you’ll see that
bind_rows doesn’t work anymore.
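To see this for yourself, you could strip the names from a made-up experiment and capture whatever happens. In the dplyr versions I know of this raises an error, though the exact message varies between releases, so the sketch only records the outcome.

```r
library(dplyr)

# Hypothetical stand-in for the first experiment (values invented for this sketch).
experiment_a <- list(
  control   = list(weight = 12.1, diameter = 30.2),
  treatment = list(weight = 14.3, diameter = 33.8)
)

# Drop the field names inside each sublist but keep the values.
unnamed <- lapply(experiment_a, unname)

# Capture the outcome rather than letting a possible error stop the script.
result <- tryCatch(bind_rows(unnamed), error = function(e) e)
print(class(result))
```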
Luckily, your colleagues are very well-organized and each of the lists you were given has names, so you stick with the bind_rows approach.
Now, you’re thinking that it would be a good idea to include the sample name and the experiment name in the tibble. That will make it easier to color and group elements of the figure. For that, you use mutate.
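One way this could look, assuming the sample names are the names of the sublists. The column names `sample` and `experiment` are my own choice for this sketch, and the data is invented.

```r
library(dplyr)

# Hypothetical stand-in for the first experiment (values invented for this sketch).
experiment_a <- list(
  control   = list(weight = 12.1, diameter = 30.2),
  treatment = list(weight = 14.3, diameter = 33.8)
)

tidy_a <- bind_rows(experiment_a) %>%
  mutate(
    sample     = names(experiment_a),  # one name per row, in list order
    experiment = "experiment_a"        # a single value, recycled to every row
  )
print(tidy_a)
```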
While that’s a good solution for a single experiment, you remember that you’ve got a whole list of experiments. Remember that
map will return something that is the same size as your input, which will be the experiments list.
Now, that list contains three experiments, each of which is just like your
experiment_a test case. What you want to do is map that little pipeline that you used to convert
experiment_a into a tibble onto each of the lists in
experiments. Sound good? But first, you decide to encapsulate the process into a named function so it’s easier to work with.
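A sketch of such a function follows. The name `tidy_experiment`, its argument names, and the output column names are hypothetical, as is the sample data used to try it out.

```r
library(dplyr)

# Hypothetical helper: turn one experiment (a named list of named sublists)
# into a tibble, tagging each row with its sample and experiment name.
tidy_experiment <- function(x, experiment_name) {
  bind_rows(x) %>%
    mutate(sample = names(x), experiment = experiment_name)
}

# Invented data, only for checking that the function runs.
experiment_a <- list(
  control   = list(weight = 12.1, diameter = 30.2),
  treatment = list(weight = 14.3, diameter = 33.8)
)

tidy_a <- tidy_experiment(experiment_a, "experiment_a")
print(tidy_a)
```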
Now you’re ready to map this function onto each of the elements of the experiments list.
Whoops, that’s not quite right. Back to the
map help page. Okay, so it looks like you can pass additional arguments to the mapped function. Something like this:
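Here is a sketch of passing an extra argument through `map_dfr`. The helper `tidy_experiment`, the placeholder label, and the data are all made up for illustration.

```r
library(purrr)
library(dplyr)

# Invented data: three experiments, each with a control and a treatment group.
experiments <- list(
  experiment_a = list(control   = list(weight = 12.1, diameter = 30.2),
                      treatment = list(weight = 14.3, diameter = 33.8)),
  experiment_b = list(control   = list(weight = 11.7, diameter = 29.5),
                      treatment = list(weight = 13.9, diameter = 32.4)),
  experiment_c = list(control   = list(weight = 12.4, diameter = 30.9),
                      treatment = list(weight = 15.0, diameter = 34.6))
)

# Hypothetical helper that tidies a single experiment.
tidy_experiment <- function(x, experiment_name) {
  bind_rows(x) %>%
    mutate(sample = names(x), experiment = experiment_name)
}

# Anything after the function is handed unchanged to every call,
# so all rows end up with the same placeholder label.
all_rows <- map_dfr(experiments, tidy_experiment, "some experiment")
print(all_rows)
```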
It worked at least, but you want the actual experiment names in there. So you try something else:
Nope, that doesn’t work either. Hold on…what you need is a function sort of like
map except that it can map over multiple inputs instead of just one. You pop over to your web browser and search for
map over multiple arguments tidyverse.
The purrr package has a function for this:
map2. It works more or less the same way as the regular
map variants, except that you can iterate over multiple arguments simultaneously. This way, you can pass in the experiment names as an additional argument before the function argument. (Arguments that come before the function argument
.f will all be vectorized, but function arguments coming after the
.f are supplied to each call directly.) Something like this:
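A sketch using `map2_dfr`, the row-binding flavor of `map2`. The helper `tidy_experiment` and the data are hypothetical.

```r
library(purrr)
library(dplyr)

# Invented data: three experiments, each with a control and a treatment group.
experiments <- list(
  experiment_a = list(control   = list(weight = 12.1, diameter = 30.2),
                      treatment = list(weight = 14.3, diameter = 33.8)),
  experiment_b = list(control   = list(weight = 11.7, diameter = 29.5),
                      treatment = list(weight = 13.9, diameter = 32.4)),
  experiment_c = list(control   = list(weight = 12.4, diameter = 30.9),
                      treatment = list(weight = 15.0, diameter = 34.6))
)

# Hypothetical helper that tidies a single experiment.
tidy_experiment <- function(x, experiment_name) {
  bind_rows(x) %>%
    mutate(sample = names(x), experiment = experiment_name)
}

# The experiments and their names are walked in parallel,
# one (experiment, name) pair per call.
all_rows <- map2_dfr(experiments, names(experiments), tidy_experiment)
print(all_rows)
```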
That works, but wouldn’t it be nice if there was a
map variant that could apply a function to each element in the
experiments along with the name of the experiment automatically? You flip over to Google once more and are rewarded with the
imap function. The first line of the help page looks promising:
Apply a function to each element of a vector, and its index
As before, you want to iterate over a list and not a vector, but that’s okay, since you know that deep down, lists are just special vectors. In fact, the help page assures you that all will be well:
imap_xxx(x, …), an indexed map, is short hand for map2(x, names(x), …) if x has names
That’s perfect for your use case. In fact, that’s exactly what you were doing, so you just replace the map2 variant with
imap_dfr and drop the
names(experiments) bit like so:
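A sketch of the `imap_dfr` version, once more with a hypothetical `tidy_experiment` helper and invented data:

```r
library(purrr)
library(dplyr)

# Invented data: three experiments, each with a control and a treatment group.
experiments <- list(
  experiment_a = list(control   = list(weight = 12.1, diameter = 30.2),
                      treatment = list(weight = 14.3, diameter = 33.8)),
  experiment_b = list(control   = list(weight = 11.7, diameter = 29.5),
                      treatment = list(weight = 13.9, diameter = 32.4)),
  experiment_c = list(control   = list(weight = 12.4, diameter = 30.9),
                      treatment = list(weight = 15.0, diameter = 34.6))
)

# Hypothetical helper that tidies a single experiment.
tidy_experiment <- function(x, experiment_name) {
  bind_rows(x) %>%
    mutate(sample = names(x), experiment = experiment_name)
}

# imap_dfr() supplies each element's name as the second argument.
all_rows <- imap_dfr(experiments, tidy_experiment)
print(all_rows)
```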
Alright, now you’re cooking with Crisco! Finally you’ve got a nice tibble ready to pipe into ggplot.
Phew! That wasn’t so bad, was it? Right on cue, Professor Gamgee turns the corner into the breakroom and sits down at your table. “How’s the data looking, kid?”
“Tidy, professor. It’s looking Tidy.”