Chapter 3 The Computing Environment

The data-driven business analyst we aspire to be has to master the business analyst workflow depicted in Figure 3.1. While many books focus on the modelling component of this workflow, they neglect the context within which modelling is done. This is a fatal flaw, as exclusively data-driven insight, absent strategy, interpretability, or causal reasoning, is often of little use.

Figure 3.1: The business analyst transforms strategy and data into actionable insights that improve business outcomes.

To ensure our models are not cut off from real-world impact by poorly integrated modelling tools, we will learn an ecosystem of tools that lets us move through the entire business analyst workflow without obstacles. While I wish I could say there was one magic-bullet tool to aid us, no such tool exists. What does exist, however, is a rich ecosystem of tools that play fairly well with each other and can handle all of the handoffs (i.e. the arrows in Figure 3.1) of the business analyst’s workflow. At the core of this ecosystem is the R programming language (R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/), and this chapter will create a computational foundation on which we will build in subsequent chapters. Please follow the instructions below to set up your computing environment with installations of R and RStudio.

3.1 Installing R

Figure 3.2: R logo.

R is a programming language built with statistical calculation as its primary goal. It is free and is maintained and extended by an open-source community. We will be writing code in this language to aid the transformation of data into actionable insight. Before you can use R on your computer, you need to make the language accessible by installing it on your operating system. (Preferably, you will have access to a computer on which you can install your own software and which runs a Windows, Mac OS X, or Linux operating system. An alternative setup is the RStudio Cloud service, for which you only need a browser and an internet connection; see https://rstudio.cloud/.) Here are the basic steps (if asked to choose a mirror, i.e. the location from which you will download your installation files, just pick a location that is somewhat close to you):

  1. Navigate to the R download site: https://cran.r-project.org/.
  2. Click “Download R for <your operating system>”.
  3. Follow on-screen instructions. Accept all default values.

3.2 Installing RStudio Desktop

Figure 3.3: RStudio logo.

Even though R is our programming language of choice, we will not rely on the supporting suite of software that accompanies the R installation to take advantage of its power. Instead, all interaction with the R language will be done using RStudio, the result of another free and open-source software project. RStudio is an integrated development environment (IDE) designed to make a programmer’s life easier. As business analysts who program, we will find RStudio to be our best friend. I promise it will astound you with its ease of use and capabilities. For now, let’s just get it installed:

  1. Navigate to the RStudio website: https://www.rstudio.com/.
  2. Click “Download RStudio”.
  3. Find and click the download button for RStudio Desktop - Open Source License.
  4. Download and then run the RStudio installation file.

3.3 Getting Help

The best thing you can do is use Google and YouTube to walk you through the installation process in more detail. Searching YouTube for “installing r and rstudio on <your operating system name>” where you replace <your operating system name> with Windows, Mac, or Linux will get you some great resources for a slower walkthrough of the process than is provided here.

3.4 Verify the installation

After progressing through the above install steps, check your installation is working by following these steps:

Figure 3.4: Even though people say they are using R, most people access R through the RStudio integrated development environment. We will always use RStudio as our entry point to the R programming environment; so when accessing R using an icon, always use the icon on the right.

  1. Open RStudio. If opening via an icon, make sure you are selecting one that looks like the icon on the right in Figure 3.4 and not like the one on the left, which opens R directly (i.e. open RStudio, not R). If no icon is available, use the Windows search box or the Mac Finder sidebar and search for RStudio.

  2. Start a new R-script by clicking the following menu options: File -> New File -> R Script. You should now see four panels as shown in Figure 3.5.

Figure 3.5: The RStudio user interface.

  3. The bottom-left panel is known as the Console window, and you will use this window to execute commands that will not be part of your final data analysis program. To test this window, type 2 + 2 <ENTER> as shown below:

    > 2 + 2
    ## [1] 4

    If your Console window looks similar to the above and you see the resulting answer of 4, then your R and RStudio installations have been successful. (There is slightly more printed on the screen than simply the answer of 4: the ## precedes output resulting from an executed command, and the [1] signals that the first element of the output is being shown.)
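A few more Console commands make the same check a little more informative. This is a minimal sketch using only functions that ship with base R; the variable name answer is my own and not part of the original steps.

```r
# Basic arithmetic: the same check as above, stored in a variable.
answer <- 2 + 2
print(answer)

# Report which version of R RStudio is connected to.
print(R.version.string)

# Printing a longer vector shows why output lines start with an index
# such as [1]: each line is labelled with the position of its first element.
print(1:30)
```

If the last output wraps onto a second line, that line begins with the position of its first value rather than with [1].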

3.5 Install and verify R-packages

R’s strength as an analytics environment is largely due to its extensibility through packages. A package is simply a container used to distribute code and data, such as specialized statistical techniques, cool graphical capabilities, simplified reporting capabilities, and interesting datasets.

We will install the following packages into our R-environment:

  • ggplot2: an enhanced data visualization package for R.
  • dplyr: makes manipulating data intuitive and fast.
  • tidyr: puts data into a clean format for munging (with dplyr), visualization (with ggplot2), and modelling (with R’s hundreds of modelling packages).
  • lubridate: makes it easier to work with dates and times.
  • stringr: aims to provide a clean, modern interface to common string operations.

To install these packages, navigate to the Packages tab in the lower right panel of RStudio and press the Install button. In the dialog box that opens, enter the package names (case-sensitive), separated by commas, as shown in Figure 3.6. Press Install.

Figure 3.6: Installing packages via RStudio’s user interface.
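The same installation can also be run from the Console with R's install.packages() function. The sketch below is one way to do it, under the assumption that you only want to install packages that are not already present; install_if_missing is a hypothetical helper of my own, not something that ships with R, and the plain call install.packages(c("ggplot2", "dplyr", "tidyr", "lubridate", "stringr")) would work as well.

```r
# Packages named in this chapter.
course_packages <- c("ggplot2", "dplyr", "tidyr", "lubridate", "stringr")

# Hypothetical helper (not part of R): install only the packages
# that are not yet on this computer, and return their names invisibly.
install_if_missing <- function(pkgs) {
  already_installed <- rownames(installed.packages())
  missing_pkgs <- setdiff(pkgs, already_installed)
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  invisible(missing_pkgs)
}
```

Typing install_if_missing(course_packages) in the Console would then download whatever is absent; an internet connection is needed for that step.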


Your system will now download the computer code for those packages. This process may take several minutes. (Most of the time you will see R users call the install.packages() function instead of using the Install button of the user interface. We show the Install button here to keep things simple for now.) After completion, verify that all packages are installed. As an example, we will verify that the ggplot2 package installed properly:

  1. Ensure that the ggplot2 package is visible in the Packages tab as shown in Figure 3.7. It will be unchecked.

  2. Click the checkbox for the ggplot2 package as shown in Figure 3.7. Ensure that the checkbox remains checked. Even though the package is already on your computer, this step makes the package available for use in your current R session (it is equivalent to running the command library(ggplot2)).

    Figure 3.7: Making the ggplot2 package available in your current R session.
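The checkbox steps above can also be expressed as Console commands. The sketch below assumes the five packages listed earlier in this section; requireNamespace() and library() are standard R functions, while the variable names are my own.

```r
course_packages <- c("ggplot2", "dplyr", "tidyr", "lubridate", "stringr")

# TRUE for each package that R can find and load, FALSE otherwise.
is_installed <- vapply(course_packages, requireNamespace,
                       FUN.VALUE = logical(1), quietly = TRUE)
print(is_installed)

# Equivalent of ticking the ggplot2 checkbox: attach the package
# to the current session, but only if it is actually installed.
if (is_installed[["ggplot2"]]) {
  library(ggplot2)
}
```

Any package reported as FALSE did not install properly and is worth installing again.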


The code below uses the qplot function from the ggplot2 package to produce a plot:

    > library("ggplot2")
    > qplot(Sepal.Length, Petal.Length, data = iris, color = Species)

Figure 3.8: Sepal length vs. petal length, colored by species.

Type the above code into your console window and press <ENTER> after entering each line. You should have a nice plot appear (similar to the plot in Figure 3.8). It shows data used for classifying Iris flowers. The dataset is described here: https://en.wikipedia.org/wiki/Iris_flower_data_set
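For reference, the same plot can be built with ggplot2's main ggplot() function, which later ggplot2 releases favor over qplot() (to my knowledge, qplot() is marked as deprecated from version 3.4.0 onward, though it still runs). This is a sketch of an equivalent call rather than code from the original text, and the variable name iris_plot is my own.

```r
library(ggplot2)

# Map sepal length to x, petal length to y, and species to color,
# then draw one point per flower.
iris_plot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()

# Printing the plot object displays it.
print(iris_plot)
```

Storing the plot in a variable before printing it makes it easy to add further layers later.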

Congratulations!!! Your computer is now ready to start your journey. To supplement this text, it is recommended you familiarize yourself with R using the short tutorial available here: https://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf.