A Data Scientist's Guide To Packaging Custom R Functions

What is R?

R is an open-source language and environment for statistical analysis, typically used by data scientists.

Why package custom R functions?

Some analytical procedures are extremely specific and can only be used under a very specific set of conditions. However, most operations performed in R are meant to be reused, either by various data scientists or on various datasets. In order to reduce the amount of time and effort spent writing the main logic and handling edge cases, R allows the user to create a workspace and then group the datasets and functions into a custom R package so that they may be shared as appropriate. The created package contains placeholders for documentation that will be visible natively through R, so that the user will know exactly what the custom package contains and is meant to do.

What do I need to get started?

Linux users need only a working R installation.
Windows users will also need to install Rtools, which is available through the CRAN website.

I have everything, now what?

This post will go through a contrived example to illustrate the steps. In this case we will create the SpryR package that will contain a small dataset and some functions to get basic information about the dataset.

Note: The dataset we make consists of the versions of various Hadoop packages, current at the time of writing, for two Hadoop distributions (Apache and CDH3).

Step 1: Create the functions and datasets that should be packaged


IMPORTANT NOTE: Make sure to start with a fresh R session so that we don't accidentally include anything that shouldn't go into the package.

First, let's create the datasets that we want to include in the package.

packages <- c('Hadoop','Hive','Pig','Oozie','Sqoop','Whirr','Zookeeper','HBase')
apache <- c('1.1.1','0.10.0','0.10.1','3.3.1','1.99.1','0.8.1','3.4.5','0.94.4')
cdh3u5 <- c('0.20.2+923.421','0.7.1+42.56','0.8.1+28.39','2.3.2+27.23','1.3.0+5.88','0.5.0+4.14','3.3.5+19.5','0.90.6+84.73')
# stringsAsFactors = FALSE keeps the versions as plain strings on older R releases
distributionGrid <- data.frame(package=packages, apacheVersion=apache, CDH3u5Versions=cdh3u5, stringsAsFactors=FALSE)

Next, let's create two functions: one to get the current Apache version of a particular package, and the other to get its CDH3u5 version.

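spry_getApacheVersion <- function(packageName) {
   # Match on the data frame's own package column rather than the separate
   # 'packages' vector, so the lookup depends only on distributionGrid
   return(distributionGrid[distributionGrid$package == packageName, "apacheVersion"])
}

spry_getCDH3U5Version <- function(packageName) {
   # Same lookup, returning the CDH3u5 column instead
   return(distributionGrid[distributionGrid$package == packageName, "CDH3u5Versions"])
}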
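
Before moving on, it can help to confirm that the workspace holds exactly what Step 1 defined, since anything else in it would end up in the package as well. The snippet below is only a sketch: it assumes the object names used above, and it runs inside local() so that its own helper variables do not become part of the workspace.

```r
local({
  # Names defined in Step 1; anything else in the workspace is a stray object
  expected <- c("packages", "apache", "cdh3u5", "distributionGrid",
                "spry_getApacheVersion", "spry_getCDH3U5Version")
  present <- ls(envir = globalenv())

  cat("Missing from workspace:", setdiff(expected, present), "\n")
  cat("Unexpected extras:", setdiff(present, expected), "\n")

  # Try a lookup only when everything from Step 1 is actually defined
  if (all(expected %in% present)) {
    print(spry_getApacheVersion("Hive"))
  }
})
```

If anything shows up under "Unexpected extras", remove it with rm() or restart R before continuing.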

Step 2: Create the package skeleton


Now that we have a workspace containing our small dataset and functions, we are ready to package things up. In the session with everything that should be packaged, run this command:

package.skeleton(name="SpryR")

This command tells R that we would like to create a new package called "SpryR" containing everything in the current workspace.

Navigate to the folder it created (on Windows it may be under C:\Users\<Username>\Documents), and examine the files it has created automatically. The data folder contains several .rda files, each one representing one of the datasets we defined in the workspace. The R folder contains one file for each function we defined. The man folder contains empty R documentation files (.Rd) that we will need to modify. In the root directory there are a few additional files, one of which is called "Read-and-delete-me". You can just follow the guidance for that one.
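
If you would rather not depend on the state of the whole workspace, package.skeleton() also accepts an explicit list of object names, the environment to take them from, and an output path. The sketch below is an alternative to the one-liner above, not a replacement for it: it uses a cut-down two-row stand-in for the dataset, a temporary output directory, and a hypothetical pkg_env environment so that it can run on its own.

```r
# Stand-in objects so the sketch runs by itself; in practice these come from Step 1
pkg_env <- new.env()
pkg_env$distributionGrid <- data.frame(
  package = c("Hadoop", "Hive"),
  apacheVersion = c("1.1.1", "0.10.0"),
  CDH3u5Versions = c("0.20.2+923.421", "0.7.1+42.56"),
  stringsAsFactors = FALSE
)
pkg_env$spry_getApacheVersion <- function(packageName) {
  distributionGrid[distributionGrid$package == packageName, "apacheVersion"]
}

# Write the skeleton into a temporary directory instead of the working directory
out_dir <- tempfile("skeleton_")
dir.create(out_dir)
package.skeleton(
  name = "SpryR",
  list = c("distributionGrid", "spry_getApacheVersion"),  # only these objects
  environment = pkg_env,
  path = out_dir
)

# Show what was generated (DESCRIPTION, R/, data/, man/, ...)
print(list.files(file.path(out_dir, "SpryR"), recursive = TRUE))
```

Naming the objects explicitly means stray variables in the session are never picked up, at the cost of having to keep the list up to date.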

Step 3: Edit all documentation template files


As mentioned a little bit earlier, there are several documentation files that need to be modified so that the help function can work correctly for this package inside R. The files to look for are "DESCRIPTION" in the root directory of the package, and all of the .Rd files inside the man folder. 
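
The DESCRIPTION file is plain text and can simply be edited by hand. If you prefer to script the change, base R can read and write the format with read.dcf() and write.dcf(). The sketch below works on a stand-in file in a temporary directory; the placeholder field values, the new title, and the GPL-2 license are assumptions chosen for illustration, so substitute whatever fits your package.

```r
# Stand-in for a generated DESCRIPTION file (placeholder values are assumed)
desc_file <- file.path(tempdir(), "DESCRIPTION")
writeLines(c(
  "Package: SpryR",
  "Type: Package",
  "Title: What the package does (short line)",
  "Version: 1.0",
  "License: What license is it under?"
), desc_file)

# read.dcf() returns a one-row character matrix with one column per field
desc <- read.dcf(desc_file)
desc[1, "Title"] <- "Hadoop Package Versions by Distribution"
desc[1, "License"] <- "GPL-2"  # illustrative choice only

# Write the updated fields back and confirm the result
write.dcf(desc, file = desc_file)
print(read.dcf(desc_file, fields = c("Title", "License")))
```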

If you have experience working with LaTeX files, then the .Rd format is very similar. If you haven't had the opportunity to work with LaTeX, here are some guidelines: 
  • In the .Rd files, a line beginning with "\" starts a section, and the section's content is enclosed in braces. For example, "\title{}", "\description{}", etc. You should replace the current contents with what is appropriate for your package.
  • Lines beginning with "%" are considered to be comments and will not show up in any interface reading these files. All other lines will be seen - including the lines that start with "#" in the examples section, so be sure to replace those lines with something more relevant.
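
To make those guidelines concrete, here is a sketch of what a filled-in .Rd file for spry_getApacheVersion might look like. The wording of each section is only a suggestion. The content is written out from R (which is why every backslash is doubled inside the strings) so that tools::checkRd() can report any problems with it before the package is built.

```r
rd_lines <- c(
  "\\name{spry_getApacheVersion}",
  "\\alias{spry_getApacheVersion}",
  "\\title{Look up the Apache version of a Hadoop package}",
  "\\description{",
  "  Returns the Apache version recorded in distributionGrid",
  "  for the named package.",
  "}",
  "\\usage{",
  "spry_getApacheVersion(packageName)",
  "}",
  "\\arguments{",
  "  \\item{packageName}{Name of the package, for example \"Hive\".}",
  "}",
  "\\value{",
  "  The version recorded for the matching row.",
  "}",
  "\\examples{",
  "spry_getApacheVersion(\"Hive\")",
  "}",
  "% Lines starting with a percent sign are comments and are not rendered"
)

# Write the file to a temporary location and let R check its structure
rd_file <- file.path(tempdir(), "spry_getApacheVersion.Rd")
writeLines(rd_lines, rd_file)
problems <- tools::checkRd(rd_file)
print(problems)
```

In the real package the file lives in the man folder, where each backslash appears only once.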

Step 4: Build the package

Once the documentation is written the way you would like, the package can be built into a format that R can directly load.

On the command line, in the directory containing the root of the package, run this command:

$ <path_to_R>\R CMD build SpryR

This will create a SpryR_1.0.tar.gz file, containing all of the datasets and functions we wanted packaged, that can be distributed across platforms.
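
The same build can also be started from inside R, which avoids typing the path to the R executable by hand. This is a sketch that assumes the working directory is the one containing the SpryR folder; it does nothing except print a message if that folder is not found.

```r
# Path to the R executable that belongs to the running session
r_bin <- file.path(R.home("bin"), "R")
pkg_dir <- "SpryR"

if (dir.exists(pkg_dir)) {
  # Equivalent to running "R CMD build SpryR" on the command line
  status <- system2(r_bin, c("CMD", "build", pkg_dir))
  cat("Exit status:", status, "\n")  # 0 indicates success
  print(list.files(pattern = "^SpryR_.*\\.tar\\.gz$"))
} else {
  message("No ", pkg_dir, " directory in ", getwd(), "; change directory first.")
}
```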

How do I know it worked?

In a clean R session, run the following commands:

install.packages("<path_to_package>\\SpryR_1.0.tar.gz", repos=NULL, type="source")
library(SpryR)

Once the library has loaded successfully, try one of the functions we defined at the beginning. If it returns the expected version, everything was successful! If it doesn't work, go through the values used inside the documentation files to ensure that there are no mistakes.
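
A quick check might look like the sketch below. It assumes the package installed under the name SpryR with the functions from Step 1. The data() call is a precaution: unless the DESCRIPTION file sets LazyData: true, packaged datasets are not available until they are loaded explicitly, and a lookup would otherwise fail with an "object not found" error.

```r
spryr_installed <- requireNamespace("SpryR", quietly = TRUE)

if (spryr_installed) {
  library(SpryR)
  # Load the dataset explicitly in case it is not lazy-loaded
  data(distributionGrid, package = "SpryR")
  print(spry_getApacheVersion("Hive"))
  print(spry_getCDH3U5Version("Hive"))
  # The documentation written in Step 3 should be listed here
  print(help(package = "SpryR"))
} else {
  message("SpryR is not installed; run install.packages() on the tarball first.")
}
```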

Where can I go for more information?

These are just the basics; the R packaging tool is very sophisticated. For all of the details on everything that is possible with packaging R functions and datasets, head to the R-Project website.

Feel free to leave your question or comment below!
