Lessons learned teaching R

Whilst studying for an MSc in Analytics at Imperial College London this year, I’ve been involved with the Data Science Society (ICDSS), part of Imperial’s Data Science Institute.

My involvement has been in producing and delivering several tutorial sessions to get students started with R. This is something that I’d done whilst I was still working. I’d enjoyed it a lot and so was excited about helping out with the society, too.

My sessions for ICDSS are done now and I thought I’d write this (brief) post with some of my thoughts and tips from my experiences teaching R and/or data analysis. If you’re interested, you can grab all my course materials over on GitHub. Note that these sessions were aimed at complete beginners and I am slowly revising and adding to them based on some of the thoughts in this post.

Teach data

The first thing I learned about teaching R was to not actually teach R. This may sound odd, but teaching people how to “think in data” is critical, and more important than technical syntax. This is especially true for a data-focussed language like R.

I knew data before I knew R, and that made picking up R a LOT easier. Being able to think “my data looks like this, it needs to look like that” is a crucial skill, and helping students “get” that should be your first port of call when teaching beginners.

I’m not advocating ignoring the language altogether, but I am saying that any new syntax you introduce in an early-stage class should be linked to data. It should help people move from raw data to something more useful. This provides context and interest, and gives people something to “do” with the code. Knowing how to write a for loop is great, but knowing how to extract information from raw data is even better.
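As a rough illustration of the “looks like this, needs to look like that” idea, here is a minimal base R sketch. The data frame, its column names and its values are all made up for the example; they are not taken from my course materials.

```r
# Made-up raw data: one row per journey (hypothetical columns and values)
journeys <- data.frame(
  date = as.Date(c("2015-03-02", "2015-03-02", "2015-03-03", "2015-03-07")),
  fare = c(2.30, 2.30, 2.80, 1.50)
)

# "My data looks like this": one row per journey.
# "It needs to look like that": one row per day, with the total spent.
daily <- aggregate(fare ~ date, data = journeys, FUN = sum)
names(daily)[2] <- "total_fare"

print(daily)
```

The syntax here (a formula and a call to aggregate) only matters because it gets us from the first shape to the second, which is the point I would want a beginner to take away.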

Packages are great, but…

Packages are what make R great. They make tasks easier, add functionality and keep R relevant. They can also slow a beginner down, generate horrible error messages when they don’t install properly, and introduce all sorts of quirks that you weren’t expecting.

Make sure you allow plenty of time in an early session to explain what packages are, and to have students install them. I had provided instructions for installing R ahead of my sessions and a nice script from the O’Reilly Machine Learning for Hackers book to help people get set up ahead of time. But things didn’t go as smoothly as planned and a small number of students had problems getting set up.

I’d decided to use the readr package for reading data, and this hiccup meant that a few students were a bit stuck until the (fantastic) TAs helping me got around to them. If I’d instead just used read.csv there would have been less of a problem.
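One way to soften this kind of hiccup, sketched below, is to fall back to base R when a package is missing. The helper name read_data and the demo file are hypothetical, invented for this example; this is not what I used in the sessions.

```r
# Hypothetical helper: use readr if it is installed, otherwise fall back
# to base R's read.csv so nobody is blocked by a failed install.
read_data <- function(path) {
  if (requireNamespace("readr", quietly = TRUE)) {
    readr::read_csv(path)
  } else {
    read.csv(path, stringsAsFactors = FALSE)
  }
}

# Demo on a small temporary file (made-up contents) so the sketch runs anywhere
tmp <- tempfile(fileext = ".csv")
writeLines(c("station,fare", "Bank,2.30", "Oval,2.80"), tmp)

df <- read_data(tmp)
print(df)
```

The two readers do not return exactly the same kind of object (readr gives a tibble, read.csv a plain data frame), so this is only a stopgap to keep a class moving, not a drop-in replacement.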

Lesson learned: use packages when they simplify things, but be prepared for the headaches they might bring.

Don’t fear complexity

It’s true that beginners need the basics, but in my experience students can deal with a lot of content thrown at them really quickly, even if it’s complicated. So don’t be afraid to get complex if you think you need to.

It’s important to keep things in perspective, though. Part of the code for my sessions used regular expressions for tidying some messy data. At first this seemed like a lot to throw at complete beginners, but I was able to put that complexity into context to keep things moving. I used simplifying language, saying “this is just a way to search for patterns in the data, don’t worry about how it works right now, you can figure that out later”. This meant that students had the information they needed to understand the complexity later, but didn’t need to worry about the details in the session.
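To give a flavour of what “searching for patterns” can look like, here is a small base R sketch. The strings are made up and the patterns are only an assumption about the sort of tidying involved; they are not the actual expressions from my sessions.

```r
# Made-up messy values (hypothetical, for illustration only)
raw <- c("Bank to Oval", "Bus journey, route 88", "Oval to Bank ")

# "Search for a pattern": which entries look like "<somewhere> to <somewhere>"?
is_trip <- grepl("^.+ to .+$", raw)

# Keep the text before " to " as the origin and the text after it as the
# destination; entries that are not trips become NA.
origin <- ifelse(is_trip, trimws(sub(" to .*$", "", raw)), NA_character_)
destination <- ifelse(is_trip, trimws(sub("^.* to ", "", raw)), NA_character_)

print(data.frame(raw, is_trip, origin, destination))
```

In a session, the one-line explanation (“this finds the bit before and after the word to”) is enough; the details of the pattern syntax can wait.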

Students seemed to like this: the vast majority of the class feedback said that the pace and depth were “just right”. When planning your own sessions, remember that complexity is OK if you can simplify it for your students. I’d originally intended to cut that content. Now I’m glad I didn’t.

Provide messy data, ask a few questions, tell a story

Most students learn best from doing, rather than listening, when it comes to programming and data analysis. So give them something to do. I was lucky in that working with my Oyster card data had given me a ready-made, messy, real-life data set that I could tell some simple stories from.

I provided the data in a pretty-much-raw format (I’d combined hundreds of weekly files into one larger .csv) and the sessions used it whilst discussing basic R syntax, dplyr and ggplot2.

Working with the iris data is fine, but it’s not the most stimulating data set out there. Try to find data on something that interests you, and use that in your tutoring. Being able to pose questions of the data beyond “what is the average value of this field” drives interest in the session and helps engage students. It also helps with my first point about “teaching data” as it gets students thinking about what transformations they will need to perform to answer the questions you’ve set.
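As an example of a question that goes beyond an average, “what time of day do I travel most?” forces a transformation before it can be answered. The sketch below uses made-up timestamps and base R only; the variable names are hypothetical.

```r
# Made-up journey start times (hypothetical values)
starts <- as.POSIXct(
  c("2015-03-02 08:15", "2015-03-02 18:05",
    "2015-03-03 08:40", "2015-03-04 08:20"),
  tz = "UTC"
)

# Transformation needed to answer the question: timestamp -> hour of day
hour_of_day <- format(starts, "%H")

# Count journeys per hour and pick the busiest one
journeys_per_hour <- table(hour_of_day)
busiest_hour <- names(which.max(journeys_per_hour))

print(journeys_per_hour)
print(busiest_hour)
```

The answer itself is less important than the step in the middle: students have to work out that the raw timestamp needs reshaping into an hour before any counting can happen.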

Wrapping up, I’d like to thank the Imperial College Data Science Society for inviting me to teach with them. It’s something I get a lot out of and it was great to see so many students in the room engaging with R and with data analysis. I’d also like to thank all those students who got in touch via email or Twitter with questions, comments and ideas that they were working on. If you’re a tutor about to put on some R sessions, good luck: teaching is great fun and pushes you to learn more about the language, too. If you’re a student, even better; get out there, find some data and start doing something interesting with it!