The data-driven business analyst we aspire to be has to master the business analyst workflow depicted in Figure 3.1. While many books focus on the modelling component of this workflow, they neglect the context within which modelling is done. This is a fatal flaw because exclusively data-driven insight, devoid of things like strategy, interpretability, or causal reasoning, is often of little use.
To ensure our models are not cut off from real-world impact by poorly integrated modelling tools, we will learn an ecosystem of tools that lets us move through the entire business analyst workflow without impediment. While I wish I could say there was one magic-bullet tool to aid us, none exists. What does exist, however, is a rich ecosystem of tools that play fairly well with each other and can handle all of the handoffs (i.e. the arrows in Figure 3.1) of the business analyst’s workflow. At the core of this ecosystem is the R programming language (R Core Team 2018), and this chapter creates a computational foundation on which we will build in subsequent chapters. Please follow the instructions below and set up your computing environment with installations of R and RStudio.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Figure 3.2: R logo.
R is a programming language built with statistical calculation as its primary goal. It is free and is maintained and extended by an open-source community. We will be writing code[^5] in this language to aid the transformation of data into actionable insight. Before you can use R on your computer, you need to make the language accessible by installing it on your operating system. (Preferably, you will have access to a computer that runs a Windows, (Mac) OS X, or Linux operating system and on which you can install your own software. An alternative setup is to use the RStudio cloud service, where you only need a browser and an internet connection; see https://rstudio.cloud/.) Here are the basic steps (if asked to choose a mirror, i.e. the location from which you will download your installation files, just pick a location that is somewhat close to you):
Go to the R download site: https://cran.r-project.org/.
Click the link “Download R for <your operating system>”.
Figure 3.3: RStudio logo.
While R is our programming language of choice, we will not rely on the supporting suite of software that accompanies the R installation to take advantage of its power. Instead, all interaction with the R language will be done using RStudio, the result of another free and open-source software project. RStudio is an integrated development environment (IDE) designed to make a programmer’s life easier. And as business analysts who program, RStudio will be our best friend. I promise it will astound you with its ease of use and capabilities. For now, let’s just get it installed:
The best thing you can do is use Google and YouTube to walk you through the installation process in more detail. Searching YouTube for “installing r and rstudio on <your operating system name>”, where you replace <your operating system name> with your own operating system (e.g. Linux), will get you some great resources for a slower walkthrough of the process than is provided here.
After progressing through the above install steps, check that your installation is working by following these steps:
Figure 3.4: Even though people say they are using R, most people access R through the RStudio integrated development environment. We will always use RStudio as our entry point to the R programming environment; so when accessing R using an icon, always use the icon on the right.
Open RStudio. If opening via an icon, make sure you select the one that looks like the icon on the right in Figure 3.4 and not the one on the left, which opens R directly (i.e. open RStudio, not R). If no icon is available, use the Windows search box or the Mac Finder sidebar and search for RStudio.
Start a new R script by clicking the following menu options: File -> New File -> R Script. You should now see four panels as shown in Figure 3.5.
The bottom-left panel is known as the Console window, and you will use this window to execute commands that will not be part of your final data analysis program. To test this window, type 2 + 2 and press <ENTER>, as shown below:
2 + 2
## [1] 4
If your Console window looks similar to the above and you see the resulting answer of 4, then your R and RStudio installations have been successful. (There is slightly more printed on the screen than simply the answer of 4. The ## precedes output resulting from an executed command, and the [1] signals that the first element of the output is being shown.)
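To see why that bracketed index is useful, it can help to print something longer than a single number. The lines below are a small sketch to type into the Console; the variable name `x` is an arbitrary choice for illustration and is not used elsewhere in this chapter.

```r
# Arithmetic typed at the Console prints its result immediately.
2 + 2

# Store a longer result so it can be inspected; the name `x` is arbitrary.
x <- 1:30

# Printing a long vector wraps onto several lines. Each line starts with
# the index, in square brackets, of the first element shown on that line.
print(x)
```

How many numbers fit on each printed line depends on the width of your Console window, so the bracketed indices you see may differ from someone else's.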
R’s strength as an analytics environment is largely due to its extensibility through packages. A package is simply a container used to distribute code and data, such as specialized statistical techniques, cool graphical capabilities, simplified reporting capabilities, interesting datasets, etc.
We will install the following packages:
ggplot2 is an enhanced data visualization package for R.
dplyr makes manipulating data intuitive and fast.
tidyr puts data into a clean format for munging (with dplyr), visualization (with ggplot2), and modeling (with R’s hundreds of modelling packages).
An R package that makes it easier to work with dates and times.
The stringr package aims to provide a clean, modern interface to common string operations.
To install these packages, navigate to the Packages tab in the lower-right panel of RStudio and press the Install button. In the dialog box that opens, enter the package names (case-sensitive), separated by commas, as shown in Figure 3.6. Then press Install.
Your system will now download the computer code for those packages. This process may take several minutes. (Most of the time you will see R users use the install.packages() function instead of the Install button of the user interface. We show the Install button here to keep things simple for now.) After completion, verify that all packages are installed. As an example, we will verify that the ggplot2 package installed properly:
Ensure that the ggplot2 package is visible in the Packages tab as shown in Figure 3.7. It will be unchecked.
Click the checkbox for the ggplot2 package as shown in Figure 3.7. Ensure that the checkbox remains checked. Even though the package is already on your computer, checking the box makes the package available for use in your current R session (this is equivalent to running the command library(ggplot2)).
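The Install button and the checkbox both have Console equivalents, and seeing them once makes the point-and-click steps less mysterious. The following is a minimal sketch rather than a required step: the vector `pkgs` contains only the packages named explicitly in the list above, and the names `pkgs` and `missing_pkgs` are illustrative choices.

```r
# Packages named in the list above; add any others by name.
pkgs <- c("ggplot2", "dplyr", "tidyr", "stringr")

# Which of these are not yet in the local package library?
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Console equivalent of the Install button. The interactive() guard is an
# addition for this sketch so that nothing is downloaded when the code is
# run non-interactively (e.g. from a script).
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Console equivalent of ticking a package's checkbox. The check avoids an
# error if ggplot2 has not been installed yet.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
}
```

Installing happens once per computer, whereas loading with library() has to be repeated in every new R session.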
The code below uses the qplot function from the ggplot2 package to produce a plot:
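A minimal sketch of such a call is given here. It assumes R's built-in iris data frame and the column names Sepal.Length, Petal.Length, and Species, chosen to match the caption of Figure 3.8; the exact arguments are an assumption, not necessarily the call that produced the figure. Recent ggplot2 releases mark qplot() as deprecated, so an equivalent ggplot() call is included as well, stored under the illustrative name `p`.

```r
library(ggplot2)

# Assumed call: columns of the built-in `iris` data frame, chosen to match
# the caption of Figure 3.8. Newer ggplot2 versions may print a
# deprecation warning for qplot(), but the plot is still drawn.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species)

# The same scatter plot written with ggplot(); `p` is an arbitrary name.
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()

# Typing the name of a stored plot draws it in the Plots tab.
p
```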
Figure 3.8: Sepal length vs. petal length, colored by species.
Type the above code into your Console window and press <ENTER> after entering each line. A nice plot should appear (similar to the plot in Figure 3.8). It shows data used for classifying Iris flowers. The dataset is described here: https://en.wikipedia.org/wiki/Iris_flower_data_set
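To get a feel for the data behind the plot, it can help to look at the table itself. The lines below are a short sketch using only functions that ship with R; the name `species_counts` is an illustrative choice.

```r
# `iris` ships with R, so no package needs to be loaded first.
head(iris)   # the first six rows

# Number of rows and columns: 150 flowers, 5 recorded variables.
dim(iris)

# How many flowers of each species were measured (50 of each).
species_counts <- table(iris$Species)
species_counts
```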
Congratulations! Your computer is now ready to start your journey. To supplement this text, it is recommended that you familiarize yourself with R using the short tutorial available here: https://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf.
To get the most out of RStudio, a few customization options are helpful. Within RStudio, click on Tools -> Global Options (see Figure 3.9). A dialog box will open whose left side lists categories of options (e.g. General, Code, Appearance, etc.). Navigate through the left-side categories specified below and ensure the following option selections:
Figure 3.9: Finding the global options screen.
Restore .RData into workspace at startup.
Save workspace to .RData on exit.
Soft-wrap R source files (in the Code category).
Use tab for multiline autocompletions (in the Code category).
While you might not understand all of the above at this point in your journey, the above customizations will provide an easier-to-use experience with RStudio than just leaving all the default values.