I was recently helping a friend who was trying to summarise the results of their analysis without resorting to repeated and manual calculations. We were working in
R, but what’s in this post is relevant to any language (as long as creating bespoke functions is possible).
We spent some time working through what was needed, and discussing the syntax of how to do it in
R. After a while it became clear that several of the summaries were essentially the same calculation, just of different fields and perhaps broken down by different categories in the data.
So I wrote a general function that we could use to get at pretty much all the summaries that were needed in just a few lines of code.
Afterwards we were chatting about my approach and my friend asked “how did you know to do it like that? I just wouldn’t have thought to do it!” I said that they shouldn’t be so hard on themselves and that a lot of it is just knowing how to “think in data” and comes with experience. But it got me thinking about my approach, specifically about when to stop ad-hoc coding and write a custom function.
Little snippets of instructions (code) that can be run multiple times for slightly different data, functions are actually quite new to me. I’ve been working with
R for about 3 years, but only started really creating my own functions day-to-day within the last year. I’m not sure what caused me to start using them more (maybe just the type of work I have on at the moment) but I tend to turn to them a lot now.
Here’s a list of the main situations in which I’d write a function. It’s definitely not an exhaustive list, and there are other reasons out there, I’m sure, but maybe these reasons might be helpful to someone else beyond me and my friend!
The times I pause and write a function
Copy-pasting the same code over and over, but changing one or two small bits. A sure-fire sign I need to take a step back and write a function. Occurs if I’m trying to create the same summaries of some data but split by different categorical fields; or if I want to perform the same aggregation on multiple, similarly structured, datasets. In these situations a simple function can help, with the arguments to the function being the one or two small bits that need to be changed. This is probably the most important reason to write a function.
The analysis will create a mess of intermediate objects. Some analyses can be performed in a single step (or sequence using
%>%pipes), but others can leave a host of useless intermediate objects trailing in their wake. In situations like this, I like to create a function (even one with no arguments) that runs my analysis, even if it’s just to keep my workspace tidy and save me the headache of cleaning up.
I want to increase code reuse. If I’ve written a little snippet of code that I find really useful, or I know that I’ll want to run the same analysis at a later date, but maybe with data that’s named differently, I write a function. If they’re really general I can save the functions/snippets on GitHub, if they’re more bespoke I’ll save them locally. If the functions are general this also helps me share code with others who are using the same/similar data to me.
Avoiding loops with manual calculations and hard-coded variables. Several times on my MSc. the classes I was in were advised to write loops to perform analysis. For beginner programmers these were probably ok, but they were very manual and inflexible to changes later in our projects. Using smarter functionality (either from
pandasin Python) and some custom functions I was able to write neater snippets that were more general, robust to changes in the data, and easily shared with my classmates.
That’s it for now. Happy programming!