Create function and adapt structure

2021-11-30

Introduction

The basis of reproducibility is writing R functions instead of duplicating your code. Using functions helps you to have cleaner scripts and helps others to better understand your code: if you have weel-written functions with associated documentation, then others only have to read the documentation of your functions to understand your code. It’s way better than reading thousands of comments, right? No, let’s have a look at this wonderful and easy code technique that will revolutionize your code!

Pokemon Artwork from Le Hollandais Volant

A function is a R object which take as inputs objects that will feed the function, apply to these inputs some operations you define and then give as outputs the results you want to have.

In this tutorial we will create together one function from A to Z and the associated documentation which is really important. This function will clean the Pokemon Dataset and only keep Pokemons from the first, second and third generations and …

1 - Creating a new R file that will contain the function

You have to create a new R script in the R folder of your Research Compendium. This R folder must contain all the functions you code by yourself and want to use in your analysis.

At the begining of this new R script, you can add rows which explain what is in the script, when it was coded, by who etc as follow:

Screenshot of the begining of a R script

The name of this R script that will contain a function, must have a really clear name (according to what you do in your script) and you can use a number at the begining of the script name to know thereafter in which order scripts and functions must be used. In our example, can name the R script “01_clean_data_function.R”. The name explain that in this script which have to be used in the first place, a function is coded to clean a dataset

Then, your file is ready to welcome your wonderful new function!

2 - Defining your function

The first step when writing a function is to define it.

To do so you just have to:

chose the name you will give to your function: in our example, the function will be called pokeclean as it cleans the Pokemon dataset.
chose which objects are needed as inputs for your function to work: in our example, we just need the Pokemon dataset called poke
follow the following coding style:

function_name <- function(list here the inputs) {

}

The definition of the function cleaning the Pokemon dataset will thus looks as follow:

pokeclean <- function(poke){
  ...
}

Great, you have defined your first function!

Meme from Chualbox

3 - Coding your function

Then, the next step is to complete the function with all the operations we want to do, replacing the “…” in the above code.

To do so, you can create intermediate objects that help you to store intermediate objects needed for the computation of your function, you can call other functions that you have already created or that belong to R base or R packages etc. Never forget to comment your function! Because if other people (or you several months from now) want to understand each line of your function, they should be able to do so!

In our example the function looks as follow:

pokeclean <- function(poke){
  
  # only keep 1st, 2nd and 3rd generations:
  poke2 <- poke[-poke$generation %in% c("4", "5", "6", "7"), ]
  
  # remove japanese name and classification informations: 
  poke3 <- poke2[,-which(names(poke2) %in% c("japanese_name","classfication"))]
}

When you code your function, never forget these Coding Styles:

if you call a function from a R package, use package_name::function_name() instead of only function_name(). Why? If you use function_name(), then you will have to call the package it belongs to by using the library(package_name) command. But, you should never use library(package_name) because different functions belonging to different packages can have the same name, thus using this coding style helps the code readability and prevent conflicts.
if you call a function from a R package, you should add this package to the DESCRIPTION file. Do do so, you can directly write in the consol usethis:use_package(package_name) and it will automatiquely add it to the DESCRIPTION file.

Then, you have to chose the object(s) to return as output(s) of your function. To do so, you use the return() function (quite easy right?).

In our example, we want to return the cleaned dataset called poke3, thus the full function looks as follow:

pokeclean <- function(poke){
  
  # only keep 1st, 2nd and 3rd generations:
  poke2 <- poke[-poke$generation %in% c("4", "5", "6", "7"), ]
  
  # remove japanese name and classification informations: 
  poke3 <- poke2[,-which(names(poke2) %in% c("japanese_name","classfication"))]
  
  # return the cleaned dataset:
  return(pokecleaned = poke3)
}

Your first function is now finished, congratulations!

Artwork from Pinterest (https://www.pinterest.fr/pin/626774473124355096/)

You now need one last step which is essential if you want others (or you in several months) to use your function…

3 - Documenting your function

… Yes, you need to document your function which means:

explain what is the purpose of your function
describe each input your function needs, that is to say describing its type and what it represents
describe what the function returns
add any note you think is interesting for the user

It will allow you to have an acces to the help menu doing ?myfunction.

To do so, you must create a Roxygen squeleton created by the the roxygen2 package. Don’t worry, it is very simple: move your cursor inside the function you just created (like the following picture) and then, in Rstudio, go on Code > Insert Roxygen Squeleton (you can also write the roxygen squeletton by yourself).

Move your cursor inside the function definition

In our example, it will render something like that:

#’ Title
#’ @param poke
#‘
#’ @return<br> #’
#’ @export
#‘
#’ @examples

Now we are going to go through each row:

the first row #’Title must be the … Title of your function. This question was quite difficult I know ;) In our example, we will write: Function which clean the Pokemon dataset by selecting only 1st, 2nd and 3rd generations and removing japanese name and classification In fact, you can also have a first row refering to the global name of your function (In our example, we can write: Cleaning the Pokemon dataset) and then a second row going more into details about your function (In our example, we can write: this function cleans the Pokemon dataset by selecting only 1st, 2nd and 3rd generations and removing japanese name and classification).
the second row #’@param must be followed by the name of the input. You should have as many @param lines as you have inputs. It must look like that @param name_of_the_input description of the input. The description of the inputs must describe the type of input (list, vector, dataframe, matrix etc), what it contains and any information you think is usefull to understand the input. In our example, we will write: @param poke a dataframe gathering the information about Pokemons. It was downloaded at https://www.kaggle.com/rounakbanik/pokemon. In this function pokeclean, we do not have any other inputs so we will only have one line called @param.
the third row #’@return must contain the name of the output(s) and what they contain. You only have one @return row even if the function as several outputs. In fact, if your function returns several outputs, they will be gathered in a unique list or vector, thus the output will be one unique object. In our example, we will write: @return pokecleaned a dataframe gathering the information about Pokemons with only Pokemons from the 1st, 2nd and 3rd generations and with no information about their japanese names and classifications.
the fourth row, #’@export tells the roxygen2 to add this function to the NAMESPACE file, so that it will be accessible when you want to use it.
the fifth row #’@example can contain an working example of how to use your function. It is not mandatory and here we will not give any example.
then, you can add other rows (called tags in the the roxygen2 language)

So, in our example, the updated roxygen squeleton will look as follow (with no empty lines:

#’ Cleaning the Pokemon dataset
#’
#’ This function cleans the Pokemon dataset by selecting only 1st, 2nd and 3rd generations and removing
#’ japanese name and classification
#’
#’ @param poke a dataframe gathering the information about Pokemons. It was downloaded at
#’ https://www.kaggle.com/rounakbanik/pokemon.
#’
#’ @return pokecleaned a dataframe gathering the information about Pokemons with only Pokemons from the #’ 1st, 2nd and 3rd generations and with no information about their japanese names and classifications.
#’
#’ @export

And now it is done! You have successfully coded and documented your function, you are an intermediate warrior in code reproducibility!

Artwork from Amino App

5 - Let’s use this new function!

Now that you have coded your function, you can use it! To do so you create a new R script in the analysisfolder. As said in 1 - Create a research Compendium:

the analysis folder of your Research Compendium gathers scripts used for analysis. In these scripts you can call functions you have coded.
functions you have coded are gathered in the R folder of your Research Compendium

As when creating a new R script for function, the top of your R script can contain some rows to explain what is doing the script, who wrote it etc. It is very recommended to do so for each R script you are coding.

You must also give a file name which clearly explains what is inside the analysis file and you can use a number at the begining of the script name to know thereafter in which order scripts must be used. For example, we will call this new analysis script “01_clean_data.R”

Then, you can write code for analysis and comments (never forget that commenting is the basis of any reproducible workflow ;))!
In this analysis, the code is really short and just call the function we just coded as you can see below:

First R analysis script

As we just said, it is a very very simple example: most of the time, the script has several lines which call or not one or several function coded by the owner of the research compendium… or not! Just keep in mind that you gather in such a script the coding stuff for a given analysis.

So yay! First R analysis created! Now, you have to create as many functions and as many analysis as needed in your Research Compendium and then, have a look at the Create an automatised workflow: using make.R tutorial.

6 - Create new functions and new analysis scripts

In this Research Compendium, we have created three other R functions saved in the R folder of the Research Compendium:

one for cleaning a pokemon dataframe (prep.pokeplot() function)
one for plotting the frequence pokemons’s types and relationships between attack capacities and pokemon’s weight according to Pokemon’s generation given two pokemon datasets (poke.plot01() function). What is interesting in this function is that we used a save argument which can be set to TRUE or FALSE to save or not the plot in the outputs folder. It is really easy to do so, you can have a look at the associated code in the poke.plot01 function: in this function we used the ggsave function from the ggplot2 package but you can use whatever function you want:

How to save a graph/table according to a save argument

one for computing the linear regression model of attack ~ weight (regression.attack.weight() function)

And created two associated analysis scripts saved in the analysis folder of the Research Compendium::

one for plotting the correlation between weight and attack using prep.pokeplot() and poke.plot01() functions
one for computing the linear regression model of attack ~ weight using the regression.attack.weight() function

You can have a look at these functions and analysis scripts to be sure to fully understand the function structure and their use!