Crash course in R and BioConductor

bcbbslides 1,249 views 79 slides May 03, 2017


Bioinformatics and Computational Biosciences Branch  
NIAID Office of Cyber Infrastructure and Computational Biology 
 
NIH Intranet: http://bioinformatics.niaid.nih.gov 
[email protected] 






Training Manual

Crash Course:
R & BioConductor



Jeff Skinner, M.S.
Sudhir Varma, Ph.D.














Download a copy of this manual and all related training materials at:
http://collab.niaid.nih.gov/sites/research/SIG/Bioinformatics/

Crash Course: R & BioConductor


Table of Contents

Ch. 1. Introduction to R ......................................................................................................................................... 1
 
1.1 What is R? What is BioConductor? ........................................................................................................ 1  
1.2 A Brief History of R ................................................................................................................................. 1  
1.3 Download R and BioConductor ............................................................................................................... 2  
1.4 Licensing Concerns .................................................................................................................................. 3  
1.5 Helpful Resources .................................................................................................................................... 3  
Ch. 2. Basics of Using R ........................................................................................................................................ 4  
2.1 Computing Environments ........................................................................................................................ 4  
2.2 R GUI ....................................................................................................................................................... 5  
2.3 Basic Arithmetic ....................................................................................................................................... 8  
2.4 Searching the Help Menus ..................................................................................................................... 12  
2.5 Installing R Packages and Source Scripts .............................................................................................. 14  
2.6 Entering and Importing Data .................................................................................................................. 16  
2.7 Data Types .............................................................................................................................................. 22  
2.8 Manipulating Data in R .......................................................................................................................... 26  
2.9 Saving and Exporting Data .................................................................................................................... 31  
2.10 Changing Directories .......................................................................................................................... 32  
2.11 Sample Problems for Students ............................................................................................................ 32  
Ch. 3. Graphics and Figures in R ......................................................................................................................... 34  
3.1 Basic Types of Graphics and Figures ..................................................................................................... 34  
3.2 Custom Titles, Subtitles and Axes Labels .............................................................................................. 40  
3.3 Custom Color and Layout Options ......................................................................................................... 44  
3.4 Multi-step Graphics ................................................................................................................................ 48  
3.5 Figure Legends and Overlaid Text ......................................................................................................... 54  
3.6 Multi-panel Layouts ............................................................................................................................... 58  
3.7 Exporting R Graphics ............................................................................................................................. 60  
3.8 Sample Problems for Students ............................................................................................................... 60  
Ch. 4. Basic Statistical Tests and Analyses in R.................................................................................................. 62  
4.1 Student’s T-test ...................................................................................................................................... 62  
4.2 Linear Regression and ANOVA ............................................................................................................ 63  
4.3 R Commander ........................................................................................................................................ 68  
4.4 Sample Problems for Students ............................................................................................................... 69  
Ch. 5. Writing Basic Scripts in R ......................................................................................................................... 70  
5.1 Text Editors ............................................................................................................................................ 70  
5.2 Hello World! .......................................................................................................................................... 71  
5.3 Use Scripts to Automate and Save Workflows ...................................................................................... 72  
5.4 Computation and Output Options .......................................................................................................... 73  
5.5 Sample Problems for Students ............................................................................................................... 77  
Literature Cited ..................................................................................................................................................... 77  


Ch. 1. Introduction to R

1.1 What is R? What is BioConductor?

Many biologists and researchers have heard about the powerful analysis and visualization capabilities of
R and BioConductor, even if they have not used R or BioConductor themselves. It can be tempting to think of
R as a typical statistics software package, but that would belie its true power and capabilities. R combines an
open source software platform for statistics and data visualization with a powerful scripting language that can
be used to create new analyses and workflows. Both the software package itself and its scripting language are
called R. While most people use R for statistical analyses and data visualization, R can be used for matrix
algebra computations, data management and enterprise reporting. Advanced users will find that R interacts well
with databases, some commercial statistics software packages and many programming languages (e.g. Perl,
Python, Java, Fortran, C, HTML, TeX and LaTeX), so R can be utilized in many complicated computing
problems.
BioConductor is an open source software development project that creates new tools for the analysis and
comprehension of genomic data. The BioConductor project is almost entirely concerned with the development and
distribution of R package libraries for the analysis of microarray and other genomic data. Packages are
available for doing various kinds of annotation, normalization, filtering, statistical analysis and visualization of
the experimental data from microarray and other genomic studies.

1.2 A Brief History of R

The history of R begins at AT&T Bell Laboratories in the 1970’s with the development of the S
statistics package by John Chambers, Richard Becker and others. In the early 1970’s, researchers and
statisticians at Bell Labs were using a library of FORTRAN programs called Statistical Computing Subroutines
(SCS) to compute all their statistical analyses (Becker 1994). This FORTRAN library was preferable to the
commercial statistics packages available in the 1970’s, because the statisticians at Bell Labs were constantly
developing new statistical methods and they wanted specialized reports of their statistical results. However, this
SCS library was too cumbersome for many simple statistical analyses and graphs, like Student’s t-tests or linear
regression methods. The S statistics package was created to provide an interactive programming language and
computing environment to simplify the procedures in the SCS FORTRAN library, while still providing a
flexible platform to program and develop new statistical and graphical methods.
To make statistical computing more interactive, the S programming language was designed to have the
most natural grammar and syntax possible. The goal was to create a higher-level programming language that
would be similar to regular English. Most users would write their S code using basic function statements in this
higher-level S language, while more advanced users and developers could still create new code in lower-level
languages like FORTRAN. The original S language featured advanced text editing and powerful graphics, with
the usual statistical tests.
The S software was used internally at Bell Labs in the late 1970’s and it was distributed publicly by the
early 1980’s. The first textbook about the S language was published in 1984 (Becker and Chambers 1984) and
the S software was made publicly available through AT&T’s software sales group. Later, the S software was
rewritten in C and combined with another quantitative computing project at Bell Labs to create New S (Becker
et al. 1988). By the early 1990’s, the S statistics software had found thousands of users and a handful of books
had been published on the S language.
The R statistical software package was initially published and released in 1996 (Ihaka and Gentleman
1996). The goal was to create a flexible statistics software and programming language that utilized the best
features of the S statistics software package and a functional programming language called Scheme (Sussman
and Steele 1975). The name R was chosen to represent the first names of its developers, Ross Ihaka and Robert
Gentleman, and also as a play on the name of the S software and programming language (Hornik 2008).


Shortly after its release, the Comprehensive R Archive Network (CRAN) was opened and R became an official
part of the GNU Project (http://www.gnu.org). The first stable release of R was offered in February, 2000.

1.3 Download R and BioConductor

If you need to download R, visit the Comprehensive R Archive Network (CRAN) website (http://cran.r-project.org/) and look for the download links (Figure 1). There are at least three platform-specific download
links for Linux, Mac OSX and Windows operating systems. Click these links to download a ready-to-use
installation of R software. There are additional links to download individual source or binary files, so expert
users can build their own custom installation of R. These links to the source and binary files also include “daily
snapshots” of future versions of R. Remember that R is open source software, so everyone is welcome to
modify its code and contribute to upcoming versions of the software. Most biology researchers should probably
use the platform-specific download links, but remember that the custom installation options are available.
After you have downloaded R, you may want to download the BioConductor packages from their
website (Figure 2). If you follow the installation instructions from the BioConductor website, you need to type
the commands

> source("http://bioconductor.org/biocLite.R")
> biocLite()

into the R command line. These two commands will download and install all of the basic BioConductor
packages on your computer. There are many additional BioConductor packages available for download, but
you will want to download them as needed. You can also download the basic packages one at a time, but
many of them are required by more advanced BioConductor packages, so it makes sense to install them all
with biocLite() now. Installing R packages will be covered in greater detail in Section 2.5 of
the manual.
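If you only need one additional BioConductor package and already know its name, biocLite() also accepts package names directly. The sketch below uses "limma" purely as an example package name; substitute the package you actually need. This reflects the biocLite() installer described above and requires an internet connection.

```r
# Load the biocLite() installer from the BioConductor website
source("http://bioconductor.org/biocLite.R")

# Install a single additional BioConductor package by name
# ("limma" is just an example package)
biocLite("limma")

# Load the newly installed package into the current R session
library(limma)
```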


Figure 1. The Comprehensive R Archive Network (CRAN) website.



Figure 2. The BioConductor website.

1.4 Licensing Concerns

It is important to remember that R is open source software, distributed under the GNU Public License
(GPL). It may be a good idea to review the terms of the GPL before delving into a project using R. If you just
intend to use existing R packages to analyze and visualize data in published experiments, then you probably do
not have much to worry about. However, if you want to modify and distribute R software, or more
importantly if you want to use R software as a part of a patented process or product, you should review the
license very carefully.

1.5 Helpful Resources

Because R is a free, open source software program, there is no corporate office to call or email for
technical support. However, there are many resources available to help users learn to use R. Visit the R project
website (http://www.R-project.org) to find free manuals, a FAQ page, a list of published books on R, the R
Wiki and various mailing lists. You can find extensive documentation of individual R functions by using its
help() commands, as demonstrated in section 2.4 of this manual. Some historic books on the S and R software
packages include “the blue book” (Becker et al. 1988), “the white book” (Chambers and Hastie 1992) and “the
green book” (Chambers 1998), but there are now dozens of statistics and programming text books for the R and
S languages. The R-help mailing list is sometimes your best bet for person-to-person help with R and its
functions, but it is important to read their posting guide before posting new messages to the mailing list.


Ch. 2. Basics of Using R

2.1 Computing Environments

Most biologists or researchers will use R and BioConductor on a Windows PC or Mac computer using
the standard R GUI interface for their platform. However, R and BioConductor can also be run as command
line applications on a PC, Mac, Linux or Unix machine, a server or even a high performance parallel computing
cluster. There is no real advantage or disadvantage to using R from the command line or the R GUI. The
features of R remain the same, no matter how you choose to access R. However, some users with programming
experience may feel more comfortable using R from the command line.

2.1.1 MS Windows Command Prompt

On a Windows PC, you can access R from the command line by opening the MS Windows Command
Prompt (Figure 3). For many Windows PC users, the command line can be found by clicking > Start > All
Programs > Accessories > Command Prompt. At the Command Prompt, type the capital letter “R” and hit
the <Return> key to open the R software package. You should immediately see a message from R, reporting
your version number and the license information. If R does not open at the Command Prompt, you may need to
specify a path within the MS Windows operating system.


Figure 3. The MS Windows Command Prompt.

2.1.2 Mac OSX Terminal

On an Apple Macintosh computer, you can access R from the command line by opening the Terminal
(Figure 4). You will likely find Terminal.app in the Utilities folder within your Applications folder in the OSX
Finder. If you cannot find Terminal.app, try searching for “terminal” in the OSX Spotlight. At the Terminal,
type the capital letter “R” and hit the <Return> key to open the R software package. You should immediately
see a message from R, reporting your version number and the license information.

2.1.3 UNIX shells and SSH clients

Linux and UNIX users can access R from the command line in the Bourne shell (sh), Bourne-Again
shell (bash), C shell (csh) or other command line terminals. Type the capital letter “R” and hit the <Return>
key at the command line to open the R software package, and you should immediately see the message from R
to report your version number and license information. Another possible option is to access R from a secure
shell (SSH) client. This option allows you to install R on a powerful UNIX machine or server, then access the
R software on this powerful machine from another machine connected to the internet.

Figure 4. The Apple Macintosh OSX 10.5.5 Terminal.

2.2 R GUI

If you are not comfortable accessing R from the command line, you can access R from the R GUI that is
included in the usual Windows or Mac download. The R GUI provides a few point-and-click buttons and
menus to help you open and save files, download new R packages or even edit R scripts. All the features from
these point-and-click buttons and menus can be accessed from the command line, but some users may prefer to
have these commonly used features accessible from a GUI button instead of memorizing their specific commands.
Note that specific R GUI features differ slightly between the MS Windows GUI and the Apple Mac OSX GUI.
Both are described below.

2.2.1 Windows PC GUI

The current R GUI in MS Windows features seven clickable menus and eight clickable buttons to help
you access commonly used features (Figure 5). The File menu allows you to Source R code, create a New
script, Open script…, Display file(s)…, Load workspace…, Save workspace…, Load history…, Save history…,
Change dir…, Print…, Save to file… and Exit. Note the menu options Open script…, Load workspace… and
Save workspace… are also available in the first three clickable buttons from the left on the Windows R GUI,
while the Print option is available as the last button on the right of the R GUI. The Edit menu allows users to
Copy, Paste, Paste commands only, Copy and Paste, Select all, Clear console, open the Data editor… or
change the GUI preferences…, if necessary. Note the Copy, Paste and Copy and Paste commands are also
available as the fourth, fifth and sixth clickable buttons from the left. The View menu is used to hide or display
the Toolbar of buttons at the top of the R GUI window and the Statusbar of system messages at the bottom of
the R GUI (Figure 6). The Misc menu is used to Stop current computation, Stop all computations, Buffered
output, Word completion, File name completion, List objects, Remove all objects and List search path. The
Buffered output option prevents R from printing any messages while a command is running. The Word
completion and File name completion options allow you to complete the names of R commands and file names,
respectively, by hitting the <TAB> key after typing the first letters of a command or filename. The List objects,
Remove all objects and List search path options are equivalent to the ls(), rm(list = ls()) and
search() commands, respectively. Note the option to Stop current computation can also be accessed by the
second clickable button from the right.

Figure 5. The MS Windows R GUI.

Figure 6. The Toolbar and Statusbar of the Windows R GUI.

2.2.2 Mac OSX GUI

The R GUI in Mac OSX includes a menu bar at the top of your screen (Figure 7), and the GUI itself
includes 10 clickable icons to provide access to commonly used features or features specific to the Mac OSX R
GUI (Figure 8). The stop sign icon allows you to stop processing the most recently submitted R command, or
Interrupt current R computation. This is a useful feature, because some R procedures can require lengthy
processing times that may stall or freeze some computers. The R icon is used to Source script or load data in R.
If you would like to write your own scripts to automate analysis workflows or create new analyses, this button
will allow you to quickly load the source files for your scripts. It also provides an easy way to load data. The
bar chart icon allows you to Open a new Quartz device window. On a Mac computer, the quartz window is
used to produce all graphics figures, like scatterplots and histograms. Opening a Quartz graphics device will
allow you to view your graphics figures as you build them. The X11 icon allows you to open an X11 window in
Mac OSX. This is a critical feature for the R GUI in Mac OSX, because many important R functions require an
open X11 window to work properly. The lock icon is used to Authorize R to run system commands as root.
This allows you to overwrite protected files and directories, so use this option with extreme caution.

Figure 7. The Mac OSX R GUI Menu Bar.

Figure 8. The Mac OSX R GUI console.

The table icon is used to Show / Hide R command history (Figure 9). The command history lists all of the commands
recently submitted to R, which can be a useful feature during lengthy R sessions when older commands may run
off the screen. The color wheel icon is used to Set R console colors (Figure 10). These options allow you to
change the color schemes within the R GUI console, the R GUI editor and the R GUI Quartz window. The R
sheet and blank sheet icons are used to Open document in editor and Create new, empty document in the editor,
respectively. The R GUI editor is a text editor environment within the Mac OSX R GUI that is typically used to
edit R source scripts. The R GUI editor for Mac OSX includes some helpful automatic text formatting features
to help you write and edit R code. The print icon is used to Print this document, which will print the R console.
Be careful with the print button, because the R console could contain hundreds of statements and produce a
lengthy printout. The switch icon is used to Quit R, which closes the current R session and the R GUI.



Figure 9. The R command history window for the R GUI in Mac OSX.


Figure 10. The Set R console colors menu.

2.3 Basic Arithmetic

2.3.1 Addition, subtraction and other basic operations

Before you open your first R data set, it may be useful to explore some basic arithmetic operations in R.
Type a simple addition statement (e.g. 3 + 4) in the R prompt and hit <Return> to view the result (Figure 11).



Figure 11. An arithmetic operation entered into the R GUI.

From this point forward, I will describe all R commands in boxed Courier New text as shown below:

> 3 + 4
[1] 7
>
Note that user entered commands will be preceded by the “>” character, while output from R will typically be
preceded by an index number (e.g. [1]) or it will be displayed with special formatting.
Some keyboard characters are reserved for special functions in R. One special character in R is the “#”
symbol, which is used to add notes to R commands, scripts and code. In most situations, all text or code
preceded by the “#” symbol will be ignored in R. Try it for yourself by entering

> # 3 + 4
>
into the R prompt. Notice the addition statement was not evaluated as before, because the sum was not
calculated. You can enter any kind of information after the “#” symbol, without any fear of an error or ruined R
scripts. You can even add these notes after a valid command in the same line of code, as seen below:

> 3 + 4 # This command produces the sum of 3 and 4
[1] 7

Now, enter the following commands into R to explore some basic arithmetic operations:


> 3 + 4 # Addition
[1] 7
> 3 - 4 # Subtraction
[1] -1
> 3*4 # Multiplication
[1] 12
> 3/4 # Division
[1] 0.75
> 3^4 # Exponents
[1] 81
> 3**4 # Another way to enter exponents
[1] 81
> log(3) # Natural logarithm (i.e. log base e)
[1] 1.098612
> log10(3) # Log base 10
[1] 0.4771213
> log2(3) # Log base 2
[1] 1.584963
> log(81,base=3) # Logarithms computed to any other base
[1] 4
> exp(1) # Base of the natural logarithm
[1] 2.718282
> pi # The constant pi
[1] 3.141593

You can use the equal sign “=” or a left-facing arrow “<-” to define variables and compute simple
algebraic expressions in R. Note the equal sign is sometimes used inconsistently in R and sometimes creates
problems in lengthy scripts.

> a = 4 # Define a = 4
> b <- 3 # Define b = 3 (alternative coding)
> a*b # Multiply a*b
[1] 12
2.3.2 Inequalities

Beyond these simple mathematical operations, R can be used to evaluate many mathematical
inequalities:

> 3 < 4 # Strictly less than or greater than
[1] TRUE
> 3**4 >= 100 # Greater than or equal to
[1] FALSE
> log(81,3) == 4 # Equal to
[1] TRUE

Note that the double equal sign command “==” is used to evaluate an equality statement, because the single
equal sign command “=” is used to define variables. Soon, you will see that inequalities can provide a powerful
means to identify subsets of data, when used with an index.
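As a small preview of that technique, the sketch below applies an inequality to a vector and then uses the resulting logical values as an index (the vector x is just an illustration):

```r
# Define a small numeric vector
x <- c(2, 9, 4, 7, 1)

# An inequality applied to a vector is evaluated element-wise,
# producing a logical vector of TRUE/FALSE values
x > 4
# [1] FALSE  TRUE FALSE  TRUE FALSE

# Using that logical vector as an index keeps only the matching entries
x[x > 4]
# [1] 9 7
```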

2.3.3 Matrix algebra

More advanced users may want to use R for matrix algebra computations, like matrix multiplication or
matrix inversions. If matrix computations interest you, I believe you will find that R is a very powerful
platform for matrix computations that nearly rivals Matlab and similar platforms.


> aa <- c(1,2,3,4) # Define a column vector “aa”
> aa # Display the column vector “aa”
[1] 1 2 3 4
> t(aa) # Transpose “aa”
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4

The command c() is used to create a column vector in R. In the example above, the command
aa <- c(1,2,3,4) was used to define a column vector “aa” with entries 1, 2, 3, and 4. The command t() is
used to transpose vectors and matrices, so t(aa) will convert “aa” from a column vector to a row vector.
Notice the difference between how the column vector aa and the row vector t(aa) are displayed. The column
vector aa is displayed in one line of output, where the vector is preceded by [1] and its entries are only
separated by spaces. The row vector t(aa) is displayed on two separate lines, which report its column and row
entries. The output [1,] denotes an entry on the first row of a matrix, and the output [,1] denotes an entry on
the first column of a matrix.
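The difference in display can be confirmed by checking the dim attribute: a plain vector has none, while t() returns a true 1 x 4 matrix. A short sketch using the same vector aa:

```r
aa <- c(1, 2, 3, 4)

# A plain vector carries no dim attribute, so it prints on a single line
dim(aa)
# NULL

# t() returns a 1 x 4 matrix, which prints with row and column labels
dim(t(aa))
# [1] 1 4
```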

> aa*t(aa) # Element-wise multiplication
     [,1] [,2] [,3] [,4]
[1,]    1    4    9   16
> t(aa)*aa # Element-wise multiplication
     [,1] [,2] [,3] [,4]
[1,]    1    4    9   16
> aa%*%t(aa) # Matrix multiplication
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    2    4    6    8
[3,]    3    6    9   12
[4,]    4    8   12   16
> t(aa)%*%aa # Matrix multiplication
     [,1]
[1,]   30

The usual multiplication operator “*” is not used for matrix multiplication. The “*” operator multiplies
vectors and matrices element-wise, while the “%*%” operator is used for matrix multiplications. Both operators
can be useful, but be careful to use the correct operator symbol for your calculations.

> # Define a 3 x 3 matrix
> bb <- matrix(c(1,2,3,4,0,5,1,3,1),nrow=3,ncol=3)
>
> # Display the matrix
> bb
     [,1] [,2] [,3]
[1,]    1    4    1
[2,]    2    0    3
[3,]    3    5    1
>
> # Invert the matrix
> solve(bb)
           [,1]        [,2]        [,3]
[1,] -0.6521739  0.04347826  0.52173913
[2,]  0.3043478 -0.08695652 -0.04347826
[3,]  0.4347826  0.30434783 -0.34782609

The matrix() command is used to define a matrix from a vector. The matrix command is very robust
and can be used to create simple or complicated matrices with a few keystrokes. E.g. the command
matrix(4,nrow=3,ncol=5) would create a 3 x 5 matrix with every entry equal to 4. More details on these
techniques will be given in section 2.6. The command solve() will invert most square, nonsingular matrices,
but more complicated Cholesky inverse and generalized inverse methods are also available.
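A quick way to check an inverse from solve() is to multiply it back against the original matrix, which should return the identity matrix (up to rounding):

```r
# Define the same 3 x 3 matrix used above
bb <- matrix(c(1, 2, 3, 4, 0, 5, 1, 3, 1), nrow = 3, ncol = 3)

# A matrix times its inverse yields the identity matrix;
# round() cleans up tiny floating point errors in the result
round(bb %*% solve(bb), digits = 10)
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]    0    1    0
# [3,]    0    0    1
```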

2.4 Searching the Help Menus

There are many books, guides and manuals available to help you learn how to use R, but inevitably
every R user must search the help menus. The R help menus can help you find new functions or provide more
detailed explanations of the inputs and outputs of a function you have already used. There are several useful
help commands in R to find the documentation that you need.

2.4.1 Help documentation with help() and ?

> help(t.test) # Find documentation for the function t.test
> ?t.test # (same as above)
The two functions above are used to find help documentation for a specific function, when you already
know the function’s command. Try entering help(log) or ?log for another example. These help commands
are most useful if you need a detailed explanation of a function you already use or if you would like to
investigate a function you found in a paper or on the internet. The two commands are equivalent. Both produce
an HTML-formatted manual for the specified function (Figure 12). You will notice that most of the help
documentation for R functions follows very strict formatting. Each help page provides the command
used to call the function, the function’s name, a description of the function’s purpose, details about its
usage within R, details about its arguments, details about its output and typically a specific example that you
can copy-and-paste into R for demonstration purposes.
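Relatedly, the example() function will run the Examples section of a help page for you, as an alternative to copying and pasting it by hand:

```r
# Run the examples from the help page for log() directly in the console
example(log)
```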


Figure 12. Help documentation for the function t.test.


2.4.2 Keyword searches with help.search()

Another type of search is required when you do not know the command of a specific function. Type
help.search(“keyword”) to search for keywords and find all the command names of functions related to your
keyword. For example, the command help.search(“students t test”) will produce a list of all functions
related to the Student’s t-test (Figure 13). In this example, only the t.test() function is found by our search,
but other requests may generate many results.
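A related tool is apropos(), which searches the names of functions and objects currently visible on the search path by pattern, complementing the documentation search performed by help.search():

```r
# List visible object names matching a pattern;
# the result includes "t.test" from the stats package
apropos("t.test")
```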


Figure 13. List of help files from help.search keyword search.

2.4.3 Google and other search engines

One final suggestion is to use Google.com or other search engines to find help with R. I recommend
that you always include the keyword “CRAN” or “BioConductor” in your R-related Google searches, because it
can help direct you to search results that are most directly related to R packages and concepts (Figure 14).
Rseek (http://www.rseek.org/) is a search engine that only queries the R help files and related websites.



Figure 14. Searching Google for help with R packages.

2.5 Installing R Packages and Source Scripts

Help searches and basic arithmetic functions are included in the base R software, but researchers
often need to use specialized research tools that are not included in it. These
specialized tools are often available as free downloadable packages or source scripts. Packages are user-
submitted R scripts and functions that have been posted online by CRAN or BioConductor. Packages
are downloaded and installed from the R GUI or the command line. The code for these packages is typically
downloaded as a source file, written in the R programming language, or as a binary, compiled from a
language like C or FORTRAN. Some functions and scripts have not been submitted as packages to CRAN or
BioConductor, but they may still be loaded into your installation of R as a source file.

2.5.1 R packages

Click > Packages > Install Package(s)… to install packages from the Windows R GUI. If you are
installing packages for the first time, you may be prompted to Set CRAN mirror… or Select
Repositories… before you continue (Figure 15). Remember, a CRAN mirror site is a server that contains the
most recent R software downloads and packages; for convenience, choose one of the CRAN mirrors nearest you.
The package repositories are specific lists of packages from CRAN and BioConductor. Choose
only the repositories you need, because selecting more repositories will create a longer list of packages for you
to browse.



Figure 15. Set CRAN mirror…, Select Repositories… and R packages menus in Windows R GUI.

Use the scroll bar to browse through the list of R and BioConductor packages and select the packages
you need for installation (Figure 15). You can hold the Ctrl key to select multiple packages, if necessary. Note
the list of packages can be very long, especially if several repositories were selected. Once you have selected
the R packages you need, click OK to download and install the packages.
Alternatively, if you know the name and repository address of the packages you need, you
can download and install a package from the command line using the command install.packages(). There
is no particular advantage to using either the R GUI or the command line to install a package, but the
install.packages() command can be very helpful when scripting: using it in your source code will ensure
any users of your script have all the necessary R packages. As the packages download, you may see
log messages in your R console to keep you informed of the download progress and any potential errors.
After the packages have finished downloading, enter a library() or require() command to load the
contents of the package into your R workspace. Both commands do the same thing, but require() is
preferred for use within R functions.

> install.packages("gtools",repos="http://cran.r-project.org")
trying URL 'http://cran.r-project.org/bin/windows/contrib/2.6/gtools_2.4.0.zip'
Content type 'application/zip' length 157621 bytes (153 Kb)
opened URL
downloaded 153 Kb

package 'gtools' successfully unpacked and MD5 sums checked

The downloaded packages are in
C:\Documents and Settings\skinnerj\Local
Settings\Temp\RtmpIQ1HTc\downloaded_packages
updating HTML package descriptions
> library(gtools)


2.5.2 Source scripts

This manual will introduce the idea of R source scripts in Chapter 5, but keep in mind that you can also
load new functions using R source scripts. If you have found an R source script that you need to load, click
> File > Source R code… in the Windows R GUI or > File > Source File… in the Mac OSX R GUI. Use the
command source() if you prefer to load the source script from the command line. Most casual R users will
only use the file parameter of the source() command.

> # Load a source script file (.R extension)
> source("~/example.R")

2.6 Entering and Importing Data

There are dozens of ways to enter data into R. Many famous historical datasets are already
loaded and available for use in the base R software or in an R package. Hundreds of functions
make it easy to type and enter small data sets manually. Other functions can help generate huge
amounts of random or simulated data. Finally, a variety of functions are available to help you load your
own data files, whether they are stored as R workspace data (.RData), as plain text files (.txt, .csv, …), in
proprietary data file formats from other popular statistics packages (.sav, .sas7bdat, …), as MS Excel
spreadsheets (.xls, .xlsx, …) or even as tables from a database (e.g. MS Access, MySQL, …).

2.6.1 Base and package data in R

Enter the command data() or click > Packages & Data > Data Manager in the Mac OSX R GUI to
view a list of the datasets currently available in your installation of R (Figure 16). Try to find the data set "iris"
in the list. Select the "iris" data set from the list, or enter the help command ?iris, to read the documentation
describing this data set. The "iris" data set in R is the famous "Fisher's iris data", originally collected by
botanist Edgar Anderson. This classic data set has been used in countless statistics and biology textbooks, and
it will be used later in this manual.


Figure 16. A list of R data sets.



Enter the command iris to view Fisher's iris data set, already stored in R.

> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa

146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
>

The iris data set includes data for 150 iris plants, with 50 plants each from the Iris setosa, I. versicolor L.
and I. virginica L. species. Four measurements were taken from each of the 150 plants to record their sepal
length, sepal width, petal length and petal width. This data set is often used as an example in classification and
clustering problems, where statisticians use the data as a training set to predict the species of an iris plant based
on its sepal length, sepal width, petal length and petal width measurements.
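A few standard commands summarize the iris data without printing all 150 rows:

```r
# Inspect the built-in iris data without printing every row
head(iris)            # first six rows
str(iris)             # structure: 150 obs. of 5 variables
table(iris$Species)   # 50 plants per species
```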

2.6.2 Entering data manually

The base and package data available in R can be a useful resource, but most R users need to upload their
own data into R. Most researchers will already have their data stored in a large data file. However, for some
small data sets, it may be easiest to enter the data manually from the command line. In other situations,
researchers may need to simulate large amounts of data using procedures from the command line. Some
common R procedures will be used to generate a small dataset concerning the Alpha-fetoprotein (AFP) levels of
20 medical patients.

> # generate a list of subject IDs, numbered from 1 to 20
> #
> subject <- 1:20
> #
> # create 10 entries for male subjects
> #
> males <- rep("male",10)
> #
> # create 10 entries for female subjects
> #
> females <- rep("female",10)
> #
> # combine male and female entries into one column vector
> #
> gender <- c(males,females)
> #
> # bind subject ID and gender columns together
> #
> afp.data <- data.frame(subjectID = subject, gender)
> afp.data

Recall the command subject <- 1:20 can be used to generate the sequence of integers from 1 to 20.
Alternatively, subject <- seq(from = 1, to = 20, by = 1) could have been used to generate the same
sequence. These numeric sequences can be used to specify a series of patient IDs or subject IDs. Here the
IDs have been stored in a variable named subject. The command males <- rep("male",10) is used to
generate a vector of 10 replicated string values to label the male patients. A similar replicated vector is created
to identify 10 female patients, then the male and female labels are combined with the combine command
c() into a variable named gender. These two variables are joined together to create the subject IDs and
gender labels of a new data set named afp.data using the command data.frame(), which will be defined
later.

> # generate 10 male and 10 female random normal heights
> #
> height <- c(rnorm(10,70,2.5),rnorm(10,64,2.2))
> #
> # generate 10 male and 10 female random uniform weights
> #
> weight <- c(runif(10,155,320),runif(10,95,210))
> #
> # compute body mass index (BMI) for 10 men and 10 women
> #
> BMI <- (weight*703)/(height**2)
> #
> # enter five treatment levels of a new drug (ng/mL)
> #
> drug <- rep(x = seq(from = 0, to = 20, by = 5), times = 4)
> #
> # manually enter Alpha-fetoprotein (AFP) levels for 20 patients
> #
> AFP.before <- c(0.8,2.3,1.1,4.8,3.7,12.5,0.3,4.4,4.9,0.0,1.8,
2.4,23.6,8.9,0.7,3.3,3.1,0.5,2.7,4.5)
> #
> afp.data <- data.frame(afp.data,height,weight,BMI,drug,AFP.before)
> afp.data


We can use random data procedures to create additional variables for the afp.data data set. Suppose
we want to randomly generate height and weight values for our male and female patients. Furthermore, assume
male heights are normally distributed with mean 70 inches (i.e. 5 feet 10 inches) and standard deviation 2.5
inches, and female heights are normally distributed with mean 64 inches and standard deviation 2.2 inches. The
command rnorm(10,70,2.5) randomly generates 10 new observations from a Gaussian normal
distribution with mean 70 and standard deviation 2.5 to represent the heights of our male patients, while
rnorm(10,64,2.2) generates the data for our female patients. Next, we want to generate weight values for
male and female patients, assuming male weights are uniformly distributed between 155 pounds and 320
pounds (i.e. each weight between 155 lbs. and 320 lbs. is equally likely), while female weights are uniformly
distributed between 95 lbs. and 210 lbs. The commands runif(10,155,320) and runif(10,95,210)
generate the male and female data, respectively. We use the height and weight variables to compute a
new variable for body mass index (BMI) from the usual formula BMI = (703 × weight (lbs)) / (height (in))².
Next, suppose we want to add a variable to represent five increasing concentrations of a new drug, from 0
ng/mL to 20 ng/mL. The seq() command generates the sequence of drug concentrations 0 ng/mL, 5
ng/mL, 10 ng/mL, 15 ng/mL and 20 ng/mL, while the rep() statement repeats that sequence of
drug concentrations four times to fill out the rest of the column. Finally, we enter a vector of pre-
treatment AFP values to complete the data set. Join all the new variable columns together with the earlier
afp.data data set using the data.frame() command to finish, then view the results by entering the data set
name afp.data at the command line.


2.6.3 Importing previously saved R data workspaces (.RData)

Not surprisingly, it is possible to save and load R data sets in their own file format (file extension
.RData). If you need to import a previously saved .RData file, you can use the load() command. The load()
command includes only two parameters: the file parameter, used to specify the filepath of the .RData file to
be imported, and the more complicated envir parameter, which specifies an environment for the loaded R
workspace. Briefly, an R environment is a collection of named objects. The user's R session workspace
is an environment, for example. If, during one session of use, an R user defines the variable aa = 42.7, then
any reference to the variable aa will return the value 42.7 until the variable aa is redefined or the
workspace is closed. If you load a previously saved R data workspace, all the variables and objects
defined in that workspace will be restored.
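A minimal sketch of the save/load round trip (the file name example.RData is hypothetical):

```r
# Save two objects to an .RData file, remove them, then restore them
aa <- 42.7
bb <- c("x", "y")
save(aa, bb, file = "example.RData")   # hypothetical file name
rm(aa, bb)
load("example.RData")                  # aa and bb return to the workspace
aa
```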

2.6.4 Text file data (.txt, .csv) with read.table() and scan()

The most efficient way to import data into R is to upload a text file. Text files are typically smaller than
proprietary data formats, like MS Excel spreadsheets. Since plain text files are not organized by columns, rows
and cells like an MS Excel spreadsheet, users need to specify a character to separate the values of different
fields (i.e. a delimiter to separate columns) and a character to mark the end of each line of text (i.e. the end of a
row). Often, the first row of a text data file is used to name the fields (i.e. columns) of a data table. It is also
common to enclose strings of character data in single or double quotation marks to avoid possible conflicts
with delimiter symbols and numeric fields. All these issues are addressed in the parameters of the text file
import procedures.
The most popular way to import a plain text data file is with the read.table() command, the
most general method to read table-style data from a plain text data file. Suppose
you have two different text file data sets. The first data set is a tab-delimited text file named expression.txt.
The second data set is a comma separated value (.csv) plain text file named AdverseEvents.csv. Both files can
be opened with the read.table() command.

> # Import a tab-delimited text file for a data set named Expression
> #
> Expression <- read.table(file = "~/expression.txt",
                           header = TRUE, sep = "\t",
                           nrows = 40000,
                           stringsAsFactors = FALSE)
> Expression
> #
> # Import a comma separated value text file data set named 'AE'
> #
> AE <- read.table(file = "~/AdverseEvents.csv", header = TRUE,
                   sep = ",", stringsAsFactors = TRUE)
> AE

The read.table() procedure includes the parameter file to specify the quoted file path of the data file
to be imported. The header = TRUE parameter indicates that the data file has a header line defining the
column (i.e. variable) names of the data table. If a header is not present, column names can be entered using
the parameter col.names. The parameter sep = "\t" indicates that the Expression data file should be
imported as a tab-delimited text file (i.e. columns are separated by tab characters). Likewise, the parameter
sep = "," indicates that fields are separated by commas in the AE data. The parameter nrows specifies the
number of rows of data in the text file, to help speed up the import process for a large microarray expression
data set. When stringsAsFactors = FALSE, all string variables in the data file are stored as character data
rather than factor data. Many statistical and graphing procedures require character data to be specified as
factors, but it may be easier to modify data tables if the string variables are stored as character data. The
read.table() command includes many other parameters that could be useful. For example, the parameter
dec = "," specifies that numbers are recorded with European-style "comma" decimals (e.g. 2.63 = 2,63).
The read.csv() and read.csv2() commands are equivalent to read.table() with default parameters
optimized for comma separated value text files. Specifically, the read.csv() command is optimized for
'American' formatted .csv files, where fields are separated by commas and decimals are separated from integers
with a period. 'European' formatted .csv files, where fields are separated by semicolons and decimals are
separated from integers with a comma, should be imported with read.csv2(). Similarly, the read.delim()
and read.delim2() commands are equivalent to read.table() with default parameters optimized for
importing tab-delimited text files: read.delim() is optimized for 'American' formatted .txt files, where fields
are separated by tabs and decimals are separated from integers with a period, while 'European' formatted .txt
files, where fields are separated by tabs and decimals are separated from integers with a comma, should be
imported with the read.delim2() command.
The command scan() can also be used to import data tables from text files. The primary difference
between the read.table() family and the scan() command is that scan() reads the data as a
single large vector or list that must be shaped into a data table or data frame later. The scan() command can be
more difficult to use than read.table() and similar commands when the text file data is already formatted in
some way. The command read.fwf() is used to import text data files stored in fixed width format,
where fields are not separated by a specific character like a tab or comma; instead each field is read from a
specified number of characters from left to right in the text file (e.g. characters 1-6 store the first field,
characters 7-8 store the second field, characters 9-12 store the third field, …). In other words, fields might be
separated by zero or more space characters.
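As a hedged sketch of both commands (the file names values.txt and fixed.txt, and the column widths, are hypothetical):

```r
# scan() returns one long vector that must be reshaped afterwards
x <- scan("values.txt")                 # whitespace-separated numbers
m <- matrix(x, ncol = 4, byrow = TRUE)

# read.fwf() splits each line by character position:
# here, fields of width 6, 2 and 4 characters
fw <- read.fwf("fixed.txt", widths = c(6, 2, 4))
```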

2.6.5 MS Excel files and other proprietary formats
Historically, it has been difficult to import data from MS Excel spreadsheets into R. Most people
convert their MS Excel spreadsheets into tab-delimited text files (.txt) or comma-separated value text files
(.csv), then import these text files into R using the scan(), read.table() or read.csv() commands.
However, converting MS Excel spreadsheets into text files may be tiresome if dozens of files need to be
converted, and some users may not have access to MS Excel to convert .xls spreadsheets into .txt files. It is
now possible to import MS Excel spreadsheets directly using the read.xls() command from the xlsReadWrite
package library.
First, try loading MS Excel data by converting the MS Excel spreadsheet into a tab-delimited .txt file
or a .csv file. Start with a simple MS Excel data file (Figure 17).


Figure 17. An MS Excel spreadsheet data set.


Click > File > Save As… to open the Save As menu (Figure 18) and use the Save as type: drop-down
menu to save your file as Text (Tab delimited) or CSV (Comma delimited). This will convert your MS Excel
spreadsheet into a tab-delimited or comma separated value text file that can be loaded easily into R.


Figure 18. Saving an MS Excel spreadsheet as a tab-delimited text file.

Next, open R and import the text file with the read.table() or read.csv() statements seen below:

aa <- read.table(file = "C:/sample.txt", header = TRUE, sep = "\t")

Here, the statement aa <- read.table() implies that we are defining a data set named aa. The parameter
statement file = "C:/sample.txt" specifies the file path of our tab-delimited text file containing the data.
(Note that R file paths use forward slashes, even on Windows; a backslash must be escaped as "\\".) The
parameter statement header = TRUE specifies that the first row of data contains the column headings of our
data set, while the statement sep = "\t" specifies that the different columns (or fields) of our data table are
separated by tab characters (i.e. the file is tab-delimited). You can use a similar command to load a .csv file
with the read.csv() command.
Now, open the same file directly from MS Excel. Click > Packages > Install package(s)… in
the MS Windows R GUI or click > Packages & Data > Package Installer in the Mac OSX R GUI to find and
install the xlsReadWrite package library. Enter the command library(xlsReadWrite) to load the package
library into your workspace. Load your MS Excel data using the command:

bb <- read.xls(file = "C:/sample.xls", colNames = TRUE, sheet = 1)

Note, the statement bb <- read.xls() implies that we are defining a data set named bb, just like the previous
example. Similarly, the parameter statements file = "C:/sample.xls", colNames = TRUE
and sheet = 1 indicate the file path of the MS Excel spreadsheet, the choice to read column names from the
first row of data and the choice to read data only from the first sheet of the MS Excel file, respectively.


Other R package libraries are available to open data files created or saved by commercial statistics
software packages like SAS or SPSS. For example, SAS data files and SAS XPORT format libraries can be
imported with the commands read.ssd() and read.xport() from the foreign package library. It is also
possible to import SAS data sets and SAS Transport files using the sas.get() and sasxport.get()
commands from the Hmisc package library. Similarly, spss.get() and read.spss() from the package
libraries Hmisc and foreign, respectively, can both be used to open SPSS data files (.sav file extension) in R.
The command stata.get() from the Hmisc package library can be used to import Stata datasets into R, etc.
Data sets from most commercial statistics software packages can be imported directly into R.
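A hedged sketch of two of the commands above (the file names are hypothetical, and the foreign package must be installed first):

```r
library(foreign)
# SPSS data file, returned as a data frame rather than a list
sav <- read.spss("survey.sav", to.data.frame = TRUE)
# SAS Transport (XPORT) file
xpt <- read.xport("study.xpt")
```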
2.7 Data Types

2.7.1 Simple object types (E.g. numeric, character and logical)

One potentially frustrating problem with R is that you must carefully specify and manage how data is
stored within R. Consider the following R statements:

> a = 4.23 # Define a numeric object “a”
> b = "Fred Flintstone" # Define a character object “b”
> c = TRUE # Define a logical object “c”

The R commands above define three variables: a, b and c. Each of these variables is stored as an object
within the R framework, and every object in R has a specific storage mode. The variable a was defined to
be the real number 4.23, so the variable a will be stored as a numeric object. Likewise, the variable b was
defined to be the character string "Fred Flintstone", so it will be stored as a character object. Note
that character string objects are entered within double quotes in R. Finally, the variable c was defined as the
logical outcome TRUE, so it will be stored as a logical object. Note the logical values TRUE and FALSE
can be entered and stored without quote marks to create a logical object, but the character strings "TRUE",
"False" and other variations within double quotes would be stored as character objects. More specific
classes of objects exist, like complex() objects for storing complex numbers, integer() objects for storing
integer numeric data or factor() objects for character string data that will be used as factor effects in statistical
tests and graphs.
Objects from different storage modes have different properties within R. If an R user tried to compute
the sum 4.23 + "Fred Flintstone", then R would return an error message. This is a good thing, because
the sum 4.23 + "Fred Flintstone" obviously does not make sense. However, the different properties of these
object types can sometimes cause conflicts, especially when data gets entered incorrectly. For example, the
numeric value 4.23 can also be entered as the character string "4.23". Numeric data sometimes gets into R, or
other software programs, as character string data because of differences in how various software programs
handle missing values or other problems. Incorrectly storing R objects can create unwanted errors in R
statistical and graphing procedures, so it is often helpful to check the storage mode of an R object using the
storage.mode() command. You can test whether an object belongs to a specific class using commands like
is.numeric() or is.character(). You can also find the class of an object using the command
class(). Objects can often be coerced from one class into another using commands like as.numeric() or
as.character().

> a = 4.23             # Define a numeric object "a"
> is.numeric(a)        # Test if "a" is a numeric object
[1] TRUE
> a = as.character(a)  # Coerce "a" into a character object
> a                    # View the object "a"
[1] "4.23"
> is.numeric(a)        # Test if "a" is still a numeric object
[1] FALSE
> class(a)             # Find the storage mode class of "a"
[1] "character"
2.7.2 Larger object types (E.g. data frames, matrices and lists)

Larger collections of multiple values are also considered objects within R. For example, the data sets iris,
afp.data, Expression and AE defined in Section 2.6 are all R objects. Many collections of values can be
stored as vector(), matrix(), array(), table(), data.frame() or list() objects within the R framework.
Again, these object storage modes each have unique properties to ensure proper handling of different
types of data. You can identify the storage mode of larger R objects, like arrays or data frames,
with the command class() or with specific tests like is.data.frame(). You can coerce these larger R
objects into different storage modes using commands like as.list() or as.numeric().
A vector is a one-dimensional, mathematical list of values that can be used in linear algebra (or matrix
algebra) operations. The vector() and c() commands in R are used to define column vectors; row vectors
must be created using the transpose command t() or by defining a 1 × n matrix, where n is the length of the
row vector. A matrix is a two-dimensional list of values organized into m rows and n columns. Generally,
vectors and matrices in R should only be used for matrix algebra manipulation of numeric data. It is possible to
enter character or logical data into a vector or matrix, but it is not possible to mix different types of data (e.g.
numeric, character, logical, …) in the same vector or matrix. An array is similar to a vector or matrix, except the
array can exist in higher dimensions. The higher dimensions of an n-dimensional array could be described as
panels, pages, chapters, books, etc. For example, a six-dimensional array might have 3 books, 10 chapters, 150
pages, 16 panels, 8 rows and 45 columns.

> # Display a vector with length = 6
> Vector <- c(45,50,53,47,44,52)
> Vector
[1] 45 50 53 47 44 52
> # Report the length of Vector
> length(Vector)
[1] 6
> # Display a matrix with 4 rows and 6 columns,
> # built from a numeric vector x of length 24
> Matrix <- matrix(data = x, nrow = 4, ncol = 6, byrow = TRUE)
> Matrix
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 49 54 49 47 57 53
[2,] 49 45 50 50 49 46
[3,] 45 56 54 52 46 51
[4,] 47 46 48 48 55 48
> # Report the number of rows and columns of matrix
> nrow(Matrix)
[1] 4
> ncol(Matrix)
[1] 6
> # Display an array with 4 rows, 6 columns and 2 panels,
> # built from a numeric vector x of length 48
> Array <- array(data = x, dim = c(4,6,2))
> Array
, , 1

[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 50 48 52 52 48 44
[2,] 52 46 45 56 44 59
[3,] 45 41 51 51 48 44
[4,] 46 52 53 52 45 51


, , 2

[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 49 40 50 50 55 51
[2,] 44 50 46 50 53 57
[3,] 49 49 50 54 49 51
[4,] 56 55 45 49 51 47
> # Report the dimensions of Array
> dim(Array)
[1] 4 6 2

The combine function c() can be used to create column vectors in R. Even though vectors are
displayed in rows when printed in the R workspace, mathematically they behave as column vectors. The
size of a vector is reported with the length() command. The matrix() command builds a matrix from a
vector x, using the parameters nrow and ncol to specify the number of rows and columns. The parameter
byrow determines whether the matrix is filled with the vector values row-by-row or column-by-column.
The number of rows and columns in a matrix can be reported with the functions nrow() and ncol(),
respectively. The array() command builds an array from vector data, where the dim parameter specifies
both the number and length of its dimensions. For example, the entry dim = c(4,6,2) indicates the
array should have three dimensions of lengths 4, 6 and 2, respectively. The function dim() reports the
dimensions of an array object.
The table() storage mode is used to store n-dimensional tables of frequency data for two or more
categorical variables. The tables might be reported directly to display the frequency distribution among two or
more factors, or they could be used as the data format for specific statistical tests and graphical methods.
The table() command is helpful because it computes cell frequencies automatically. Unfortunately, the
table() command does not compute statistics (e.g. mean, sum, …) for numeric arguments, so it cannot be used
to compute pivot tables or statistical summaries.

> # Build a two-way table from the AE data
> Table <- with(AE, table(Gender, Severity))
> Table

Mild Moderate Severe
Female 6 16 9
Male 14 15 4
> # Build a three-way table from the AE data
> Table <- with(AE, table(Region, Severity, Gender))
> Table
, , = Female

Mild Moderate Severe
Midwest 1 6 1
Northeast 0 3 7
Northwest 0 4 0
Southeast 0 3 1
Southwest 5 0 0

, , = Male

Mild Moderate Severe
Midwest 0 7 1
Northeast 1 5 1
Northwest 0 1 0
Southeast 0 2 2
Southwest 13 0 0


The table() command uses two or more vectors of factor() or character() data to generate an n-
way table. The numbers in each cell are counts representing the number of observations with each combination
of factor levels. For example, a two-way table of gender versus severity reveals that there are 6 female patients
with mild AE symptoms. Breaking the data down further into a three-way table of region, severity and
gender, there is only one female patient with mild symptoms from the Midwest region and there are 5 female
patients with mild symptoms in the Southwest region. No additional parameters were specified here; however,
one interesting parameter is exclude, which allows you to hide the results for specified factor levels. For
example, exclude = "Midwest" would hide all the results for the Midwest region.
Most R data should be stored in the data.frame() storage mode. The vector, matrix, array and table
objects can only store values from one storage type (e.g. all numeric data, all character data, …). A
data frame is very useful because it can store data sets with multiple columns (i.e. variables) that maintain their
own unique storage modes (e.g. separate numeric, character and logical variables). The only limitation of
the data frame is that each variable, or column, must be the same length. Commands like read.table() and
read.csv() automatically store imported data as a data frame. Data frames share some of the properties of
the matrix, array and list storage modes, but you should be careful about using data frames in matrix algebra
calculations and other methods. The properties of data frames will not always work with commands that
require matrix data, and vice versa.

> # Define a data frame from AFP data
> afp.data <- data.frame(subjectID = subject, gender, stringsAsFactors = FALSE)
> afp.data$gender
[1] "female" "female" "female" "female" "female" "female"
[7] "female" "female" "female" "female" "male" "male" "male"
[14] "male" "male" "male" "male" "male" "male" "male"
> # Define a data frame from AFP data with strings as factors
> afp.data <- data.frame(subjectID = subject, gender, stringsAsFactors = TRUE)
> afp.data$gender
[1] female female female female female female female female
[9] female female male male male male male male male male male male
Levels: female male

The data.frame() command is used to join several vectors of the same length together to form a single
data set. Again, the individual vectors can have different storage modes, such as numeric or character; the only
limitation is that the vectors must share the same length. The stringsAsFactors parameter of the
data.frame() command determines whether character string variables in the data frame are stored as factor()
objects. The difference between a vector of character objects and a factor object is shown above. A vector of
character data is simply a collection of character strings. If we wanted to add new rows of data to the character
variable gender, we could add any new character string to the variable (e.g. "Did not report" or "intersex").
However, a factor object is a variable of string or numeric data with a fixed number of levels or outcomes.
When gender is stored as a factor, there are only two possible levels, female and male. If a new character
string (e.g. "Did not report") were added to the factor variable gender, it would produce a missing value (NA)
and a warning. Many statistical and graphical procedures in R require factor data.
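The fixed-levels behavior can be demonstrated directly; this short sketch shows the warning case described above:

```r
# A factor has a fixed set of levels
g <- factor(c("male", "female", "female"))
levels(g)            # "female" "male"
g[4] <- "intersex"   # not an existing level: NA is generated with a warning
g
```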
The list() storage mode is used to store a collection of ordered or named objects in R. A list shares
some properties with a vector, except a list can store a collection of objects with different
storage modes. For example, a single list could contain one numeric entry, three character entries, one logical
entry and an entry that is itself a vector or matrix. List entries can be named, as well as ordered.
Finally, a list can be built one element at a time. For this reason, lists are a handy way to store the output
from a statistical function, since tests often produce diagnostic results and data that may not be immediately
useful to all users (e.g. lists can store the residuals from a linear regression analysis).

Crash Course: R & BioConductor


26

> # Build a list of four numeric entries
> List <- list(45,13,21,87)
> List
[[1]]
[1] 45

[[2]]
[1] 13

[[3]]
[1] 21

[[4]]
[1] 87

> # Build a list of character, numeric and vector objects
> List <- list(Day = "Tuesday", Temperature = 70, WinningLotto = c(17,23,44,39,7))
> List
$Day
[1] "Tuesday"

$Temperature
[1] 70

$WinningLotto
[1] 17 23 44 39 7

> # Add a new list element to the list
> List[["Time"]] <- "4:30"
> List
$Day
[1] "Tuesday"

$Temperature
[1] 70

$WinningLotto
[1] 17 23 44 39 7

$Time
[1] "4:30"

The list() command allows you to enter a series of values or R objects to build the list, similar to the
vector() command. The first example shows a list of four numeric values. When the list is displayed in the R
workspace, each entry of the list is identified first in double brackets, then the individual rows of the entry are
displayed on the line below with the entry values on that row. For example, the third entry of the first list is
identified [[3]] with the value [1] 21. Since all of the entries are numeric, this list is equivalent to a numeric
vector of length 4. In the second example, the entries of the list are assigned names using the = operator.
Notice that the entries include a character object, a numeric object and a vector of numbers. Finally, new
entries to the list can be added after it has been defined, by adding a new name and value, e.g.
List[["Time"]] <- "4:30".

2.8 Manipulating Data in R

Once you have a data set imported and stored correctly in R, you may still need to manipulate the data
set to add data, remove data or to meet the formatting requirements of a graph or statistical test. You may want
to remove outlier values, or rename rows and columns of data, or maybe merge two data sets together. The R
command line language includes a wide variety of procedures for these needs. Understanding these commands


can be crucial, because R data is not stored in a viewable spreadsheet format, with simple copy-cut-and-paste
functions, like MS Excel and other programs.

2.8.1 Indexing

One of the most important concepts in R is the idea of indexing, because it applies to so many types of R
objects. Vectors, matrices, data frames, arrays and lists can all be indexed using similar command notations.
The index of an R object refers to the specific location of a value in a vector, matrix, array, data frame or list.
You can generalize this concept by thinking of the index as the row and column number of any value entry in a
spreadsheet, but remember that some R objects can have more than two dimensions or fewer than two
dimensions. Here are some examples:

> # Report the third entry from a vector of length = 6
> Vector[3]
[1] 53
> # Report the entry from the 2nd row and 5th column of a matrix
> Matrix[2,5]
[1] 49
> # Report the 3rd row, 2nd column and 2nd panel of an array
> Array[3,2,2]
[1] 49
> # Report the 3rd row, 2nd column and 'Female' gender of a table
> Table[3,2,"Female"]
[1] 4
> # Report the 1st entry of the 1st column from afp.data
> afp.data[1,1]
[1] 1
> # Report the 2nd entry from the 'WinningLotto' vector in a list
> List[["WinningLotto"]][2]
[1] 23
Generally, you refer to an indexed entry of an R object by adding square brackets after the object's name
(e.g. Vector[3] refers to the 3rd entry of the object Vector). The dimensions of an object are separated by
commas (e.g. Matrix[2,5] refers to the 2nd row and 5th column of the object Matrix). If the dimensions of an
object are named instead of numbered, then those dimensions can be specified with a quoted character string
(e.g. specify the "Female" level of the gender dimension). The examples above use indexing to report single values
from vectors, matrices, arrays, tables, data frames and lists, but an index can be used in more complicated ways.

> # Rows 2-3 and columns 1, 2 and 6 of a matrix
> Matrix[2:3,c(1,2,6)]
[,1] [,2] [,3]
[1,] 49 45 46
[2,] 45 56 51
> # Overwrite one value from a matrix
> Matrix[3,3] <- NA
> Matrix
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 49 54 49 47 57 53
[2,] 49 45 50 50 49 46
[3,] 45 56 NA 52 46 51
[4,] 47 46 48 48 55 48


> # Identify observations with % Body fat less than 10%
> AE[Percent.Body.Fat <= 10,]
Region Gender Severity Age Weight Percent.Body.Fat
2 Southwest Male Mild 34 148.5672 7
30 Southwest Male Mild 36 155.3823 8
49 Midwest Male Moderate 34 151.3767 9
A sequence of row or column numbers can be entered into an index to view more than one row or
column from a data table. These sequences can be entered using colon notation (e.g. 1:5), the
combine function (e.g. c(1,4,7)) and other methods. The individual indexed values of a matrix, array or data
frame can be overwritten without affecting any other values in the matrix, array or data frame. Sequences of
row numbers or column numbers can be generated with an inequality or conditional statement to find special
subsets of data. Indexing is a very powerful tool within R.
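As a further illustration, indexing also supports negative indices to drop entries, logical conditions to filter values, and empty dimensions to select whole rows or columns. This sketch uses small made-up objects rather than the manual's data sets:

```r
# Small made-up objects for illustration
Vector <- c(12, 40, 53, 8, 27, 31)
Vector[c(1, 3)]        # select entries 1 and 3
Vector[-2]             # a negative index drops the 2nd entry
Vector[Vector > 25]    # a logical index keeps only values over 25
Matrix <- matrix(1:12, nrow = 3)
Matrix[2, ]            # leave the column index empty for the whole 2nd row
Matrix[, 2]            # leave the row index empty for the whole 2nd column
```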

2.8.2 Column references and attach()

Indexing can be a great way to create, view or modify subsets of your data, but often it might be more
helpful to refer to specific columns, or variables, within a large data frame. We have already seen that the
objects in a list can be called by their names using double square brackets and the quoted name (e.g.
List[["Time"]] yields the value "4:30"). We can also call a specific column of a data frame using the
reserved dollar sign symbol (e.g. AE$Gender yields the gender column of the AE data set). Column references
and list name references can be simplified using the attach() command to "attach" a specific data frame or list
to the current R workspace. Once the object has been attached to the workspace, individual variables or list
items can be called by name.
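A brief sketch of the difference, using a tiny stand-in data frame rather than the full AE data set:

```r
# Tiny stand-in for the AE data frame
AE <- data.frame(Gender = c("Male", "Female"), Age = c(34, 36))
AE$Gender            # dollar-sign reference to one column
attach(AE)           # attach the data frame to the search path
Gender               # now the column can be called by name alone
detach(AE)           # detach when finished, to avoid masking surprises
```

Note that attach() can silently mask other objects that share a column's name, so many users prefer the explicit AE$Gender style of reference.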

2.8.3 Binding rows and columns

Indexing and column references allow you to manipulate smaller subsets of a large data set. The
functions cbind() and rbind() allow you to add columns and rows to data sets, respectively. It is also
possible to add columns to a data frame recursively, by redefining the data frame with its original data and the
new columns of data, e.g. Frame = data.frame(Frame,NewColumn). Obviously, new elements can be added to
lists at any time, by adding new named elements.
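For example, a minimal sketch with a made-up data frame:

```r
Frame <- data.frame(id = 1:3, score = c(88, 92, 75))
# cbind() adds a new column alongside the existing ones
Frame <- cbind(Frame, passed = Frame$score >= 80)
# rbind() appends a new row whose column names match
Frame <- rbind(Frame, data.frame(id = 4, score = 81, passed = TRUE))
Frame
```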

2.8.4 Sort and order data

It is often helpful to sort the results of a vector, array or data frame to reorganize a data set or result for
better insights. For example, you might need to sort the results from several statistical tests by their p-values, so
the most statistically significant results are easy to identify. The sort() function is used to sort the actual
values of a single vector in ascending or descending order. The order() command is used to generate an
index of the sorted row numbers from a data frame sorted by one or more variables in either ascending or
descending order.

> # Sort a single vector of values
> a
[1] 4 4 0 5 4 1 0 1
> sort(a)
[1] 0 0 1 1 4 4 4 5
> sort(a, decreasing = TRUE)
[1] 5 4 4 4 1 1 0 0
> index = order(a,b,c)
> index
[1] 3 7 6 8 2 5 1 4


> frame[index,]
a b c
3 0 0.0 89
7 0 0.6 92
6 1 0.2 99
8 1 0.8 83
2 4 0.0 84
5 4 0.4 100
1 4 0.9 81
4 5 0.1 92
> index = order(b,c,a)
> frame[index,]
a b c
2 4 0.0 84
3 0 0.0 89
4 5 0.1 92
6 1 0.2 99
5 4 0.4 100
7 0 0.6 92
8 1 0.8 83
1 4 0.9 81
2.8.5 Replace values

The replace() command is used to replace the values of a matrix, array or data frame according to an
index, which identifies the values that need to be replaced, and a vector, matrix or array of replacement
values. For example, you could use an equality statement to recode the values of a character or factor variable
(e.g. recode "m" and "f" to "male" and "female"). Alternatively, you could use an inequality to identify and
remove outliers from a numeric variable.

> # display a simple data frame
> frame
a b
1 f 3
2 f 7
3 m 1
4 f 1200
5 m 6
> # generate an index to identify "f" values for replacement
> indx.f <- frame == "f"
> # replace "f" values with "female"
> frame <- replace(frame,indx.f,"female")
> # generate an index to identify possible outlier values of b
> indx <- frame$b > 1000
> # replace outliers with NA to remove outliers
> frame$b <- replace(frame$b,indx,NA)
> # view results
> frame
a b
1 female 3
2 female 7
3 m 1
4 female NA
5 m 6
2.8.6 Stack, unstack and reshape data

Often data sets need to be stacked or split to reorganize data for use in statistical tests or to simplify the
recording of new observations. For example, suppose you were to record blood pressure, cholesterol and other


medical results for 6 patients on Monday, Tuesday and Wednesday of one week. It would make sense for the
doctor to record the measurements in three separate columns for Monday, Tuesday and Wednesday. But most
statistical tests would need one column of dates and one single column for each type of measurement (e.g. one
column of blood pressure measurements). The stack() and unstack() commands are used to stack and split
these kinds of data sets. More complex stack and split operations can be performed using reshape().

> # Display a data frame
> frame
Monday Tuesday Wednesday
1 96 76 156
2 100 78 163
3 102 80 163
4 106 82 159
5 105 82 153
6 103 78 162
# Stack the data from Monday, Tuesday and Wednesday
> frame = stack(frame)
> frame
Date Measurement
1 Monday 96
2 Monday 100
3 Monday 102
...
7 Tuesday 76
8 Tuesday 78
9 Tuesday 80
...
16 Wednesday 159
17 Wednesday 153
18 Wednesday 162
> # Unstack the measurement data by dates
> frame = unstack(frame)
> frame
Monday Tuesday Wednesday
1 96 76 156
2 100 78 163
...
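A minimal reshape() sketch (with made-up measurements), converting a wide table with one column per day into a long table with one row per measurement:

```r
wide <- data.frame(id = 1:3,
                   Monday = c(96, 100, 102),
                   Tuesday = c(76, 78, 80))
# direction = "long" gathers the day columns into Day/Measurement pairs
long <- reshape(wide, direction = "long",
                varying = c("Monday", "Tuesday"), v.names = "Measurement",
                timevar = "Day", times = c("Monday", "Tuesday"), idvar = "id")
long   # 6 rows: one per id-and-day combination
```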

2.8.7 Merge data sets

Two data frames can be joined together using the merge() command. The default option is to join the
data frames using any columns that share the same name among all the data frames. However, specific columns
can be matched to one another using the by, by.x or by.y parameters. For example, it might be necessary to
combine the medical records of a general practitioner, a cardiologist, a dentist and a psychologist according to
their patient id numbers or patient names. The merge() command can also be used for more complicated join
operations among database tables.

> psych
id ssri therepy
1 1 Y Y
2 1 Y N
3 2 N N
4 2 N Y
5 3 N Y
6 3 Y Y
7 4 N N
8 4 N Y


> cardio
id exercise lipids
1 1 Y high
2 1 Y high
3 1 Y norm
4 1 Y norm
5 3 Y norm
6 3 Y norm
7 4 Y norm
> records = merge(psych,cardio)
> records
id ssri therepy exercise lipids
1 1 Y Y Y high
2 1 Y Y Y high
3 1 Y Y Y norm
4 1 Y Y Y norm
5 1 Y N Y high
6 1 Y N Y high
7 1 Y N Y norm
8 1 Y N Y norm
9 3 N Y Y norm
10 3 N Y Y norm
11 3 Y Y Y norm
12 3 Y Y Y norm
13 4 N N Y norm
14 4 N Y Y norm
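When the key columns have different names, by.x and by.y match them explicitly, and all = TRUE keeps unmatched rows (an outer join, filled with NA). A sketch with hypothetical tables, not the manual's data:

```r
# Hypothetical tables whose key columns have different names
psych  <- data.frame(pid = c(1, 2, 3), ssri     = c("Y", "N", "N"))
cardio <- data.frame(id  = c(1, 3, 4), exercise = c("Y", "Y", "Y"))
# Match pid to id; all = TRUE keeps patients missing from either table
merge(psych, cardio, by.x = "pid", by.y = "id", all = TRUE)
```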

2.9 Saving and Exporting Data
2.9.1 Save workspace data with save()

With any software program, it is important to save your data. There are several options available to save
your data in R. Most of the options should be familiar from the data import commands in section 2.6 of this
manual. Click > File > Save Workspace... on the MS Windows R GUI or click > Workspace > Save
Workspace File... on the Mac OSX R GUI to save the entire R workspace. Alternatively, you could enter the
command save() or save.image() from the command line. The save() command requires you to specify the
R objects to be saved (e.g. save(AE, file="ae.RData") to save only the AE data set), while the save.image()
command saves all the R objects defined in the R workspace. Use the command ls() to view the objects in
your R workspace, and use the command rm() to remove individual objects from the workspace. The
command rm(list=ls()) will remove all objects from the R workspace.

> ls()
[1] "AFP.after" "AFP.before" "BMI"
[4] "InternetTest" "Monday" "R2HTML.test"
[7] "Tuesday" "Wednesday" "a"
[10] "aa" "afp.data" "b"
[13] "biocLite" "biocinstall" "biocinstallPkgGroups"
[16] "biocinstallRepos" "c" "cardio"
...
> rm(a,aa,b,c)
> save.image(file="~/workspace.RData")

2.9.2 Save data tables with write.table()

You can save an R data frame with the commands write.table() or write.csv() (some add-on
packages also provide a write.delim() convenience wrapper; it is not part of base R). The parameter options
are similar to the read.table() commands, but here you will choose whether the saved text file should have a
header or row names, if the fields should be separated by commas or tabs, etc. The na
parameter is important, because it controls how missing values will be saved in the text files; you may want to
choose the symbol for missing data carefully if you want to open the text file in another software package
like MS Excel, SAS, etc. Another powerful option in write.table() is the append parameter, which allows you to
add new data to an existing text file. This can be a useful option when you need to save large amounts of data
from a script or analysis that involves long computations. It is often helpful to save a data file one piece at a
time to avoid losing data during lengthy computations or to avoid problems when trying to save one gigantic
data file.

> # Save AE data as tab-delimited text optimized for SAS import
> write.table(AE,file="~/ae.txt",sep="\t",na=".",row.names=FALSE)
> # Save the AFP data as comma-separated text using write.csv()
> write.csv(afp.data,file="~/afp.csv",row.names=FALSE)
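A sketch of the append option, writing hypothetical results one chunk at a time (the column names are written only once):

```r
chunk1 <- data.frame(gene = c("g1", "g2"), p = c(0.01, 0.20))
chunk2 <- data.frame(gene = c("g3", "g4"), p = c(0.05, 0.70))
out <- tempfile(fileext = ".txt")
# First chunk: write the header and the data
write.table(chunk1, file = out, sep = "\t", row.names = FALSE)
# Later chunks: append = TRUE adds rows; col.names = FALSE avoids a repeated header
write.table(chunk2, file = out, sep = "\t", row.names = FALSE,
            col.names = FALSE, append = TRUE)
read.table(out, header = TRUE, sep = "\t")   # all four rows come back
```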

2.10 Changing Directories

When opening and saving files from R, it may be helpful to change the working directory. Changing the
directory will often allow you to specify only a file name, rather than a complete file path, when opening and
saving data files or source scripts. Change directories from the MS Windows GUI by clicking > File > Change
dir... on the menu bar; click > Misc > Change Working Directory... to change directories on the Mac OSX R
GUI. The Mac OSX GUI also allows you to click > Misc > Get Working Directory to find the current
working directory. The commands getwd() and setwd() allow you to find the current directory and change
the working directory, respectively.
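For example:

```r
old <- getwd()      # remember where we started
setwd(tempdir())    # move to a scratch directory for this session
getwd()             # confirm the change
setwd(old)          # return to the original directory when done
```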

2.11 Sample Problems for Students

#1. {Fisher’s iris data} Sir Ronald A. Fisher famously used this set of iris flower data as an example to test
his new linear discriminant statistical model. Now, the iris data set is used as a historical example for
new statistical classification models.

A) Search the help menu for the keyword “linear discriminant”, then report the names of the
functions and packages you find.

B) Search the help menus or a search engine for additional classification models that could be tested
with the iris data.

C) The measurements from the iris data set were made in centimeters, but suppose a researcher
wanted to compare the performance of their classifier for measurements in both cm and inches.
Remember 1 cm = 0.3937 inch and create a new iris data set with measurements in inches.

D) Use indexing to verify the 77th plant (i.e. row 77) has petal length of approximately 1.89 inches.

#2. {AFP data} Suppose alpha-fetoprotein (AFP) is a potential biomarker for liver
cancer and other cancer types. A researcher might be interested in AFP levels
before and after taking a new drug in one of four concentrations.

A) The example in section 2.7.2 of the manual provided a list of 20 AFP levels before drug
treatment. Use your own methods to enter a new column of 20 AFP levels after drug treatment,
then enter another column with the difference between the pre- and post-treatment AFP levels

B) Verify the storage mode of the data set afp.data. Verify the storage mode of the variable drug.
Verify the storage mode of the variable gender. Convert the storage mode of drug to factor.



C) Create a subset of the AFP data that only includes male patients with BMI > 25.5 or
weight > 180 lbs. How many men are included in the data subset?

D) Sort the entire data subset created in part C) by the BMI variable in descending order. What
is the row ordering of the sorted data subset? Save the data subset as a comma separated value
(.csv) text file, then remove the subset from your R workspace.

#3. {AE data} Doctors, epidemiologists and other researchers look at adverse events to explore the
symptoms and medical conditions affecting patients. A researcher might choose to look for associations
between adverse events and diet.

A) One of the adverse events in the data table is “Malaise”. Recode the AE data table, such that all
entries for “Malaise” read “Discomfort” instead.

B) Look at the results of your recoded adverse events. How many different types of adverse events
are there? Look through their names. Do you see any potential problems? Fix any problems
that you might find.

C) Create an adverse event table to examine relationship between different adverse event symptoms
and their severities. Make sure the “Discomfort” AE shows up in the table, instead of “Malaise”.

D) Search the help menus for the functions rowSums and colSums. Use these functions to count up
the number of patients with each adverse event and the number of patients with mild, moderate
and severe symptoms.

E) Define a new variable AEmatrix by converting the AE table into the matrix storage mode.
Define two new matrix variables using the commands LL = matrix(1,1,17) and RR = c(1,1,1).
Look at all these new matrices. Compute the products of LL by AEmatrix; AEmatrix by RR;
and LL by AEmatrix by RR.


Ch. 3. Graphics and Figures in R

3.1 Basic Types of Graphics and Figures

You can use R to produce dozens or hundreds of different kinds of graphics and figures. Many popular
types of graphs, like pie charts and histograms, have their own dedicated commands and procedures in the
graphics package library. Other types of graphs, like multifactor XY scatterplots, are most easily produced
using multiple commands from general graphing utilities, like plot() and legend(). Often, specialized
package libraries will include graphics commands that can help streamline the graphing process. Other graphs
can only be produced in the context of the appropriate statistical analysis. Several simple examples are
provided below.

3.1.1 Pie charts

Pie charts are used to quickly display the frequencies of each outcome of a single categorical variable.
The relative size of each slice of the pie chart represents the relative frequency of its respective outcome in the
sample. For example, we could use a pie chart to examine the proportion of samples from each iris species in
Edgar Anderson’s iris data (Figure 19). We could also use a pie chart to explore the frequencies of each
adverse event in our AE data set (Figure 20).

Figure 19. Pie chart of Edgar Anderson’s iris species. Figure 20. Pie chart of the adverse events (AE) data.

># Create the labels for the iris data pie chart
> labels <- levels(iris$Species)
># Create a vector with all three species counts
> counts <- summary(iris$Species)
># Define a vector with three color choices for the pie chart
> colors <- c("red","blue","yellow")
># Define a main title for the pie chart
> main <- "Pie Chart of Iris Species"
># Call the pie() command to produce the pie chart
> pie(x = counts, labels = labels, col = colors, main = main)


># Create the labels for the adverse events (AE) data pie chart
> labels <- levels(as.factor(AE$Adverse.Event))
># Create a vector with counts for all adverse events
> counts <- summary(as.factor(AE$Adverse.Event))
># Define a main title for the pie chart
> main <- "Adverse Events Pie Chart"
># Call the pie() command to produce the pie chart
> pie(x = counts, labels = labels, main = main)

The pie() command includes the parameter x to define the counts or frequencies in each slice of the pie
chart, the parameter labels to define the text labels in each slice of the pie chart and the parameter col to
define the colors of each slice in the pie chart. You can also add generic graphing parameters, like main and
others, to customize the pie chart with a main title and other features. Notice that the labels and the x (counts
or frequencies) parameters could have been computed and entered manually, but instead the commands
levels() and summary() were used to define labels and x, respectively. The levels() command lists all the
outcomes of a factor variable, while the summary() command adds up all the counts for each outcome of a
factor variable. Note, the species variable of the iris data set was already defined as a factor variable, while the
adverse events variable from the AE data set needed to be converted to a factor variable first, using the
as.factor() command. Also notice, in the second pie chart, that the col parameter was left undefined and R
automatically generated the color choices for each of the 18 adverse event slices.

3.1.2 Histograms

Histograms are used to quickly display the distribution of a single continuous numeric variable. Often
researchers want to determine if a variable might be normally distributed or non-normally distributed. Other
researchers want to estimate descriptive statistics like means, medians, variances or ranges. A key issue in the
construction of a histogram is the choice of the histogram “bins” or groupings. If too many bins are used, the
true shape of the distribution will be lost because the histogram will be too sparse, but if too few bins are used,
the true shape of the distribution will be lost because the bins are too dense to remain informative. The location
of the bin mid points and break points can also be important to the shape of the histogram. The importance of
binning is shown in two histograms of the height measurements from the AFP dataset shown below (Figure 21
and Figure 22).

Figure 21. Histogram of height from the AFP data set
with default number of bins.
Figure 22. Histogram of height from the AFP data set
with a larger number of bins.



> # Define a vector of height data
> height <- as.numeric(afp.data[,3])
> # Define a main title for the histogram
> main <- "Histogram of height from AFP data"
> # Call the hist() command to produce the histogram
> hist(x=height,xlab="height (inches)",main=main,col="wheat")
> # Call hist() command with extra breaks for a second histogram
> hist(x=height,breaks=30,xlim=c(15,45),...)

The hist() command includes many parameter options. The parameter x must be specified, to identify
the sample of continuous data displayed in the histogram. The breaks parameter specifies the binning
used in the histogram. It can be given as one of three automated binning algorithm choices (i.e. "Sturges",
"Scott" or "Freedman-Diaconis"), a single number suggesting the number of bins (R treats this number only as
a suggestion), a vector of specific break points or a function. In the first histogram, the default "Sturges"
method produced a histogram with six bins, which appears to show a normal distribution. In the second
histogram, breaks = 30 requested roughly 30 bins, and the resulting histogram was sparse and uninformative.
The parameter xlab specifies the label for the x-axis. As before, the parameters main and col specify the main
title and the color of the plotted bars, respectively.
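A sketch of the explicit break-point option, using simulated heights rather than the manual's AFP data:

```r
set.seed(1)
height <- rnorm(100, mean = 65, sd = 4)   # simulated heights in inches
# A vector of break points pins the bin edges exactly
h <- hist(height, breaks = seq(50, 80, by = 2.5),
          xlab = "height (inches)", main = "Histogram with explicit breaks")
h$breaks    # the edges actually used
h$counts    # the count in each bin
```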

3.1.3 Box plots

Box plots are an alternative to the histogram for researchers who want to quickly summarize the
distribution of continuous numeric variables. Box plots were introduced by statistician John Tukey in his
historic book Exploratory Data Analysis (Tukey 1977). The box plot is a graphical representation of the five
number summary, where the central line in the box plot represents the median of a sample, the outer edges of
the box in the box plot represent the 25th and 75th percentiles of the sample and the whiskers of the box plot
represent the minimum and maximum of a sample. Alternate versions of the box plot often use dots or asterisks
to identify outliers beyond the whiskers, which might represent the 5th and 95th percentiles of a distribution or
the smallest and largest "non-outlier" values of a distribution. Generally, a single box plot (Figure 23) provides
less information about the shape of a distribution than an analogous histogram. For example, a box plot cannot
be used to identify a bimodal distribution of female and male heights, while a histogram can. However, box
plots are often more appropriate than histograms when researchers want to compare the distributions of several
samples in the same figure (Figure 24).

> # Define a vector of height data
> height <- as.numeric(afp.data[,3])
> # Define a main title for the boxplot
> main <- "Boxplot of height from AFP data"
> # Call boxplot() command for boxplot of height from AFP data
> boxplot(x=height,main=main,xlab="height (inches)",col="wheat")
> #
> # Call boxplot() command for boxplot of calories from AE data
> boxplot(formula=Calories~Region,data=AE,range=1.5,...)

The boxplot() function can be used in at least two different ways: with a single vector of continuous
data, or with a formula to produce side-by-side box plots. A simple box plot of the height variable from the AFP
data set is produced using the boxplot() command with the parameter x = height to specify a single vector of
numeric data for the box plot. The parameters main, xlab and col were used to specify the main title, x-axis
label and the color of the box plot, respectively, as seen in the previous examples. In the second box plot, the
parameter formula = Calories ~ Region was used to create a graph with side-by-side box plots of the calories
variable for each of the five regions of the categorical region variable. The parameter data = AE specifies that
we only want to use variables from the AE data set, which is why we could call the calories and region
variables without defining them as vectors before the boxplot() command.

Figure 23. Box plot of patient height from AFP data

Figure 24. Box plot of calories among five regions.

The range parameter is used to identify outliers on the box plot figure. Here, range = 1.5 implies that any
calorie measurement smaller than Q1 - 1.5*IQR and any measurement larger than Q3 + 1.5*IQR will be
identified as an outlier, where Q1 represents the 25th percentile, Q3 represents the 75th percentile and IQR
represents the interquartile range (i.e. Q3 - Q1, the middle 50% of the data). Another parameter,
ylab = "Calories", was entered to specify the y-axis label, while main and col were used to define a main
title and the box plot color as above.

3.1.4 Simple bar charts

Researchers and statisticians often use the phrase “bar chart” to describe two subtly different types of
graphs. Sometimes a bar chart is used as an alternative to the pie chart to display the relative frequencies of
different outcomes from a categorical variable (e.g. gender or region), but in other situations a bar chart with
error bars might be used to display the mean response levels of a continuous variable (e.g. weight,
concentration) and its standard error among several categories as an alternative to a box plot. E.g. a bar chart
can be used to plot the frequencies of each adverse event in the AE data set (Figure 25), or a bar chart could be
used to plot the mean BMI levels for male and female patients in the AFP data set (Figure 26).

> # Create a vector of counts for the AE bar chart
> counts <- summary(AE$Adverse.Event)
> # Define a main title for the bar chart
> main <- "Bar chart of Adverse Events"
> # Call the barplot() command for a bar chart of adverse events
> barplot(height=counts,main=main,ylab="Counts")



Figure 25. A bar chart of adverse events from the AE data set


Figure 26. A bar chart of female and male BMI from the AFP data set.

> # compute mean BMI for male and female patients from AFP data
> BMI.females <- mean(afp.data[afp.data$gender=="female",5])
> BMI.males <- mean(afp.data[afp.data$gender=="male",5])
> mean.BMI <- c(BMI.females,BMI.males)
> # define labels for female and male bars
> names(mean.BMI) <- levels(as.factor(afp.data$gender))
> # Define colors for female and male bars
> colors <- c("pink","sky blue")
> # Specify a main title for the bar chart graph
> main <- "Bar chart of mean BMI by gender for AFP data"
> # Call the barplot() command for a bar chart of BMI responses
> barplot(height = mean.BMI,ylab="BMI", main=main,col=colors)



Before creating a bar chart of mean BMI levels for male and female patients from the AFP data, we first
need to compute the individual BMI means for male and female patients using subscripting. Names were
assigned to the vector of BMI means using the names() function, so the appropriate gender labels will show up
below the male and female BMI data. Colors and a main title were defined for the chart, and the barplot()
function was called with the appropriate options.

Note that more advanced bar charts of several categorical variables can be easily created with the
barchart() command from the lattice package library. Multiple categorical variables can be summarized
with the table() command, then the table of categorical variables is entered into the barchart() command
for easy clustered, stacked or paneled bar charts. Notice that numeric variables will not work appropriately in
the barchart() command. However, quick and easy bar charts with error bars can be created with the
bargraph.CI() command from the sciplot package library. Other helpful packages may exist to create more
variations on these types of bar charts.
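The same idea can be sketched with base graphics (made-up categorical data; a lattice barchart() call on the table would be analogous):

```r
# Summarize two made-up categorical variables with table()
Severity <- c("Mild", "Mild", "Severe", "Mild", "Severe", "Severe")
Gender   <- c("F", "M", "F", "F", "M", "M")
tab <- table(Severity, Gender)
# beside = TRUE clusters the bars; beside = FALSE would stack them
barplot(tab, beside = TRUE, legend.text = rownames(tab),
        main = "Clustered bar chart from a two-way table")
```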

3.1.5 Simple scatter plots and line plots

Scatter plots are used to display the relationship between two continuous variables that might be
analyzed using linear regression or nonlinear regression models. E.g. an XY scatter plot might be used to
examine the relationship between % body fat and weight (lbs) from the AE dataset (Figure 27). Line plots are
often used to plot survival curves, probability density functions (PDFs), cumulative distribution functions
(CDFs) and other continuous functions of interest. E.g. you might need a plot of the standard normal curve for
a class lecture or a statistics textbook (Figure 28).

Figure 27. Scatter plot of % Body Fat vs Weight (lbs) Figure 28. Plot of (Gaussian) normal density function.

> # Define a main title for a scatter plot
> main <- "Simple scatter plot of % Body Fat vs. Weight (lbs)"
> # Simple scatter plot of % Body Fat vs. Weight
> plot(AE$Weight,AE$Percent.Body.Fat,xlab="Weight (lbs)",ylab="% Body
Fat",main=main)


> # Define a continuous sequence Z ranging from -5 to +5
> Z <- seq(from=-5,to=5,length=8000)
> # Define a sequence representing the density of a normal curve
> fZ <- dnorm(Z)
> # Plot a normal curve
> plot(Z,fZ,type="l",ylab="Density",main="Normal Curve")
A main title was defined for the XY scatter plot of % body fat vs. weight (lbs) from the AE data, before
calling the plot() command with its xlab and ylab options to define the X- and Y-axis labels, respectively.
Since the probability density of a standard normal distribution is really a function f(x), two new variables Z and
fZ were defined to create a line plot of the standard normal density. First, the variable Z was defined as a
sequence of 8000 evenly spaced values from -5 to +5 using the seq() command. Second, the variable fZ was
defined as a sequence of 8000 numbers resulting from the function f(x) using the dnorm() command in R.
Finally, a line plot was created from the plot() function by using the parameter type = "l".

3.2 Custom Titles, Subtitles and Axes Labels

Most graphics procedures (e.g. pie(), hist(), plot(), ...) share common parameters that allow users to add specific text for the main title, subtitle and axis labels. Additional commands let you customize the look and feel of these labels for a more professional finish. The following sections offer some helpful tips for customizing the labels on a graph.
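For example, instead of passing main, sub, xlab and ylab to the graphing command itself, the base R title() and mtext() commands can add titles and margin text after a plot has been drawn. A minimal sketch with simulated data (the label text here is a placeholder):

```r
# Draw a plot with annotations suppressed, then add titles afterward
x <- rnorm(50)
plot(x, ann = FALSE)                  # ann = FALSE suppresses default labels
title(main = "Simulated Data", sub = "n = 50 random normal values")
title(xlab = "Index", ylab = "Value")
mtext("A note in the right margin", side = 4)
```

This is often convenient when the same plotting call is reused with different titles.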

3.2.1 Adding and removing groups from a factor variable

Take a close look at the pie chart (Figure 7) and bar chart (Figure 12) created from the adverse events of
the AE data. You may have noticed a possible typo in the data set, because the data contains two very similar
groups “myalgia” and “mylagia”. The “mylagia” group is a typo, but can it be removed from the plot?

> # Examine the 18 levels of the Adverse.Event variable
> AE$Adverse.Event
[1] Tenderness Arthralgia Mylagia Erythema Erythema Anemia
Anemia
...
[57] Nausea Headache Nodule Anemia Swelling
Leukopenia Elavated CH50
[64] Headache
18 Levels: Anemia Arthralgia Dimpling Ecchymosis Elavated CH50 Erythema Headache
Induration ... Tenderness
> # Store the list of variable names as a new variable
> new.labels <- levels(AE$Adverse.Event)
> # Verify the list still has 18 levels
> length(new.labels)
[1] 18
> # Use indexing to replace the "Mylagia" label with "Myalgia"
> new.labels[12] <- "Myalgia"
> # Assign these new labels to the levels of Adverse.Event
> levels(AE$Adverse.Event) <- new.labels
> # Verify Adverse.Event now has only 17 levels
> AE$Adverse.Event
[1] Tenderness Arthralgia Myalgia Erythema Erythema Anemia
Anemia
...
[57] Nausea Headache Nodule Anemia Swelling
Leukopenia Elavated CH50
[64] Headache
17 Levels: Anemia Arthralgia Dimpling Ecchymosis Elavated CH50 Erythema Headache
Induration ... Tenderness

> # Redefine the counts to create a pie chart
> counts <- summary(AE$Adverse.Event)
> # Generate a new pie chart of Adverse Events
> pie(x=counts,main="Adverse Events",sub="Factor level typo 'Mylagia' has been
corrected")

The first step is to display the factor variable AE$Adverse.Event to view the number of factor levels and their names. There are 18 levels, including both "Myalgia" and "Mylagia". Next, the levels() command is used on the right-hand side of the assignment arrow to define a new variable new.labels, a character vector of all 18 factor levels. Indexing is used to replace the 12th element of new.labels, which contains the incorrect label "Mylagia". After the incorrect label has been replaced, the levels() command can be used on the left-hand side of the assignment arrow to redefine the factor levels of the AE$Adverse.Event variable. Displaying AE$Adverse.Event afterward reveals that the duplicate levels have been merged, leaving only 17. Finally, the summary() and pie() commands are used to generate a corrected pie chart of the data (Figure 16).
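The same correction can also be written more compactly by indexing the levels() replacement function directly. A small self-contained sketch with a toy factor (the AE data are not needed here):

```r
# A toy factor containing the misspelled level "Mylagia"
ae <- factor(c("Myalgia", "Mylagia", "Anemia"))
levels(ae)
# [1] "Anemia"  "Myalgia" "Mylagia"
# Replace the misspelled level in one step; R merges the
# duplicate "Myalgia" levels automatically
levels(ae)[levels(ae) == "Mylagia"] <- "Myalgia"
levels(ae)
# [1] "Anemia"  "Myalgia"
```

Matching on the level name rather than its position is also safer if the ordering of the levels changes.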

3.2.2 Changing fonts, colors and label sizes

Later in this manual, the par() command will be used to create multi-panel figures and make other global changes to graphs. For now, it is worth reading the help documentation for par(), because many of its graphing parameters can be passed to other graphing procedures like pie(), hist(), plot(), etc. Among other things, these parameters can change the fonts, colors and label sizes of the main title, subtitle and X- and Y-axis labels in pie charts (Figure 29), histograms (Figure 30) and other graphics (Figure 31).



Figure 29. A pie chart with custom fonts and colors Figure 30. Histogram with custom fonts and colors


Figure 31. Bar chart with custom fonts and colors

> # Generate a pie chart from AE data with customized labels
> pie(x=counts, main=main, sub=subtitle, font=2, family="serif", cex=0.8,
cex.main=2.6, col.main="red", col.sub="blue")
> # Generate a histogram from AFP data with customized labels
> hist(x=height, xlab=xlab, main=main, col="wheat", col.axis="green",
col.lab="purple", fg="red",font.main=3)
> # Generate a barplot from AE data with customized labels
> barplot(counts,main="Adverse Events",ylab="Counts",las=2)

The pie(), hist() and barplot() commands were used to generate pie charts, histograms and bar charts in previous sections, but here additional graphing parameters from the par() procedure are used to customize the graph labels. The font parameter specifies normal text (font = 1), boldface text (font = 2), italicized text (font = 3), bold italicized text (font = 4) or symbol text (font = 5). The related parameters font.axis, font.lab, font.main and font.sub assign text styles to the axis units, the axis labels, the main title and the subtitle, respectively. The family parameter changes the font family for the entire figure, with options for monospace ("mono"), serif ("serif"), sans-serif ("sans") and symbol ("symbol") fonts. Additional font families are available through the Hershey vector fonts (e.g. family = "HersheyScript"). The cex parameter shrinks or enlarges text by a percentage (e.g. cex = 0.8 yields text at 80% of its original size); the related parameters cex.axis, cex.lab, cex.main and cex.sub affect axis ticks, axis labels, main titles and subtitles, respectively. The col parameter typically affects the objects in a graph (e.g. pie chart slices or histogram bars), while col.axis, col.lab, col.main and col.sub control font colors for the axis ticks, axis labels, main titles and subtitles. The fg parameter adjusts the color of the foreground, which typically includes the graph axes. Finally, the las parameter adjusts the orientation of the axis tick labels (e.g. las = 2 creates labels perpendicular to the axis).

3.2.3 Subscripts, superscripts and custom symbols

Sometimes it is necessary to use subscript or superscript text in a graph or figure. For example, you may want to plot the residuals from a linear regression against a log10-transformed predictor variable (Figure 32) or against the quadratic term from a polynomial regression (Figure 33). You may also need special characters, like Greek or math symbols, mixed with regular text (Figure 34). These custom labels can be generated with the expression() and paste() commands. In practice, the expression() command generates mathematical expressions for use as text labels in graphs, figures and reports, while the paste() command concatenates, or joins, two or more character strings into a single string. Together, these functions can combine regular text with mathematical display styles, such as superscript or subscript text, true fractions, Greek symbols, etc.

Figure 32. Scatter plot with subscript in X-axis label Figure 33. Scatter plot with superscript in X-axis label


Figure 34. Line graph with Greek symbols and formulas in title.

> # Generate a vector of simulated "residuals"
> residuals <- rnorm(20,0,3.5)
> # Generate log10 transform of drug variable
> log10drug <- log(AFP$drug,10)
> # Generate drug-squared quadratic term
> drugSq <- AFP$drug*AFP$drug
> # Residuals plot with subscript labels
> plot(log10drug, residuals, main="Residuals Plot",
xlab=expression(paste(log[10]," drug concentration")))
> # Residuals plot with superscript labels
> plot(drugSq, residuals, main="Residuals Plot",
xlab=expression(paste("drug concentration"^2)))
> # Generate sequence from 0 to 1
> x <- seq(from=0,to=1,length=8000)
> # Define a function f(x)
> f <- function(x) 23*x^2 / (1 + 10^((6*log10(x) - 15)*0.7))
> # Plot function with math formula in title
> plot(x,f(x),type="l",main=expression(paste("Plot of ",f(x) == frac(beta*x^2 ,
(1 + 10^((psi*log[10](x) - theta)*gamma))))))
In the first graph (Figure 32), square brackets specify the subscripted text within the expression() command, and the paste() command is used inside expression() to join the mathematical expression log10 with the rest of the text string. In the second graph (Figure 33), the caret symbol "^" identifies the superscripted text. The third graph (Figure 34) features a much more complicated mathematical formula in the expression() command. Note how the double equal sign "==" creates a regular equal sign within the mathematical expression, and the function frac(a,b) generates a large fraction in the main title. Greek symbols are generated within the expression() command using their usual English names (e.g. theta for θ, pi for π).

3.3 Custom Color and Layout Options

The previous sections of this chapter described how to produce basic graphs and customize their labels,
but often the graphic itself needs to be customized with new colors or a more informative layout. For example,
you have already seen that histogram graphs can provide very different interpretations of the same sample of
data just by changing the number of break points, or bins, in the histogram. A histogram with too few bins or
too many bins may not be very informative. Similarly, the layout of the X- and Y-axes of a boxplot, bar chart,
scatter plot or line plot figure can impact the effectiveness of a graph. For example, the Y-axis in the bar chart
of mean BMI from the AFP data (Figure 13) is too short, so it is difficult to estimate the mean BMI for the male
patients. You may have also noticed that the default layout can sometimes produce missing or incomplete axis
ticks and labels in figures. For example, the bar charts of adverse events from the AE data (Figure 12 and
Figure 18) either have missing horizontal labels or vertical labels that run off the page. Finally, the default
coloring options for the pie chart of adverse events (Figure 7 and Figure 16) are not particularly eye-catching,
but it seems difficult to specify 17 different colors by name for an effective layout. This section will describe
options to change the layout and coloring of these figures.

3.3.1 Custom X- and Y-axes

There are several features to customize on the X- or Y-axis of a graph, including the minimum and maximum values of the axis and the spacing of the numbered "major" ticks. Most of these features can be modified directly using parameters from the par() command. For example, the bar chart of mean BMI for male and female patients is easier to interpret if the Y-axis extends past 30 (Figure 35), and the XY scatter plot of weight and % body fat from the AE data is easier to interpret if the X- and Y-axes have more tick marks, providing finer resolution of the individual values (Figure 36). More complicated changes to the X- or Y-axes, like axis breaks or separate left and right Y-axes, may require additional R package libraries.

> # Call the barplot() command for a bar chart of BMI responses
> barplot(height = mean.BMI,ylab="BMI", main=main,col=colors, ylim = c(0,50),
yaxp=c(0,50,10))
> # Simple scatter plot of % Body Fat vs. Weight
> plot(AE$Weight,AE$Percent.Body.Fat,xlab="Weight (lbs)",ylab="% Body Fat",
main=main,xlim=c(85,290),xaxp=c(100,275,7),ylim=c(5,50),yaxp=c(5,50,9))

Figure 35. Bar chart with customized Y-axis limits and tick intervals Figure 36. Scatter plot with customized X- and Y-axis limits and tick intervals

The ylim parameter was used to extend the Y-axis of the barplot (Figure 35) from BMI = 30 to BMI = 50. Notice that ylim and the analogous xlim parameter require users to define the lower and upper limits of each axis with a vector input (e.g. ylim = c(0,50) specifies a lower limit of 0 and an upper limit of 50). The xaxp and yaxp parameters add finer-resolution tick marks to the XY scatter plot (Figure 36). These parameters require users to specify the lower and upper limits of the tick marks and the number of intervals between them (e.g. xaxp = c(100,275,7) indicates that the ticks should start at 100 and end at 275, divided into 7 intervals). Notice that the limits of xlim and xaxp (or ylim and yaxp) can differ, if the tick marks should not start at the very beginning of an axis.
Some other features related to axis customization include the par() parameters bg, bty, fg, tck, tcl, xlog and ylog. The parameters bg and fg control the colors of the background and foreground of the graph, respectively. The bty parameter determines the style of the box drawn around the graph. The tck and tcl parameters set the length of the axis ticks relative to the size of the plotting region and to the height of a line of text, respectively. The xlog and ylog parameters plot continuous variables on a log10-transformed axis.
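As a brief sketch of a few of these parameters (using simulated data rather than the AE variables):

```r
# An L-shaped box (bty), inward ticks (positive tcl) and a
# log-scaled X-axis requested with log = "x" inside plot()
par(bty = "l", tcl = 0.3)
x <- 10^seq(from = 0, to = 3, length = 50)   # simulated doses
y <- log10(x) + rnorm(50, sd = 0.2)          # simulated responses
plot(x, y, log = "x", xlab = "Dose", ylab = "Response")
```

Note that inside a high-level command like plot(), the log = "x" argument is the usual way to request a logarithmic axis; the xlog and ylog settings in par() record the same state for the device.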

3.3.2 Custom graph sizes and positions

Controlling the size of an image can be critical, especially if the image will be printed or exported to another software program for further editing. Explicit control over the size of your graph, the graph margins and other features will ensure your figures look professional and are easily understood. For example, you may have noticed that the group labels of the adverse events variable in the AE data set did not have enough space in the previous two bar charts (Figure 25 and Figure 31), while there is too much white space in the pie chart figures (e.g. Figure 29). These problems can be fixed by changing the size of the figure and its margins.
The par() parameters controlling the size of the plotting region, the size of the graph and the size of the margins typically need to be specified with a par() or plot.new() command before a specific graphing procedure like pie(), hist() or barplot() is called. It may help to close all open graphs, either manually or with the dev.off() command, before you begin to resize a figure. Once all graphs are closed, you can adjust the size parameters with a par() command and then call the usual graphing commands to generate the figure in the customized space.

Figure 37. Bar chart with custom margin sizes Figure 38. Pie chart with custom margin sizes

> # Adjust margins from mar = c(5,4,4,2) to mar = c(8,4,4,2)
> # where mar = c(bottom, left, top, right)
> par(mar=c(8,4,4,2))
> # plot bar chart as usual
> barplot(counts,main="Adverse Events",ylab="Counts",las=2)
> # Close the current graphic device window
> dev.off()
null device
1
> # Adjust the margins for a pie chart
> par(mar=c(0.5,1,4,1))
> # Plot pie chart as usual
> pie(counts, main = main,cex.main=2.6,font=2,family="serif")

In these examples, the inadequate X-axis margin of the bar chart (Figure 37) and the excessive white space of the pie chart (Figure 38) were both corrected using the mar parameter, which sets the size of the graph margins in multiples of one line of text. The default setting is mar = c(bottom = 5, left = 4, top = 4, right = 2), so the bar chart was given extra space in the bottom margin to accommodate the lengthy X-axis labels, and the pie chart was given smaller margins all around to eliminate unnecessary white space. Margin space can also be specified in inches with the mai parameter or by an expansion factor with the mex parameter. Another alternative is to adjust the size of the graph and plotting region with the fig or fin parameters.

3.3.3 Color gradient functions and translucent colors

Previous examples have shown that par() and other R graphics procedures allow users to change the colors of almost every element in a graph. However, color choices must be specified at the command line, so it helps to know the names of the available colors. Entering the command colors() at the R command line displays a list of 657 available color choices, ranging from simple primary colors (e.g. "red", "blue", "yellow") to subtle gray gradients (e.g. "gray0" through "gray100") and sophisticated palette choices (e.g. "azure", "burlywood", "chartreuse", etc.). Each of these colors must be specified by name, but they provide more than enough options for most users.
Some R procedures, like the pie() command, apply default color choices if the user specifies nothing. The grDevices package includes a handful of functions that generate a known gradient of colors for a given number of groups. For example, the rainbow(n=10) command generates a list of 10 colors drawn from a gradient of "rainbow" colors running from red through orange, yellow, green, blue and purple and back to red again (Figure 39). The heat.colors() command similarly generates a list of colors from yellow through orange to red for use in heatmap figures. These high-level color gradient functions use the more generic rgb(), hsv() or hcl() functions to generate colors with "red-green-blue", "hue-saturation-value" or "hue-chroma-luminance" methods. Advanced users may call rgb(), hsv() or hcl() directly, either by calling the functions themselves or by using their output values (Figure 40 and Figure 41). Note that the rgb(), hsv() and hcl() commands include an alpha parameter for specifying transparent colors.


Figure 39. A bar chart of adverse events from the AE data set using rainbow() colors

Figure 40. Histogram with rgb() color. Figure 41. Histogram with hsv() color

> # Use the rainbow() command to generate rainbow colors
> barplot(height=counts,col=rainbow(18),cex.names=0.7,las=2,...)
> # Use rgb() to generate a custom color for a histogram
> hist(rnorm(1000),col=rgb(red=0.7,green=0.1,blue=0.3,alpha=0.7,
maxColorValue=1))
> # View the value name of the custom rgb() color
> rgb(red=0.7,green=0.1,blue=0.3,alpha=0.7,maxColorValue=1)
[1] "#B21A4CB2"
> # Call the custom rgb() color by its value name
> hist(rnorm(1000),col="#B21A4CB2")
> # View the value name of a custom hsv() color
> hsv(h=0.3,s=0.7,v=0.8,alpha=0.5)
[1] "#5ACC3D80"
> # Call the custom hsv() color by its value name
> hist(rnorm(1000),col="#5ACC3D80")
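The other gradient functions work the same way; for example, heat.colors() and the lower-level gray() function (shown here with a simple simulated bar chart):

```r
# A yellow-to-red gradient, as used in heatmap figures
barplot(height = 1:10, col = heat.colors(10),
        main = "heat.colors() gradient")
# A grayscale gradient: gray() takes levels between 0 and 1
barplot(height = 1:10, col = gray(seq(from = 0.1, to = 0.9, length = 10)),
        main = "gray() gradient")
```

Both functions return vectors of hexadecimal color strings, so their output can be stored and reused just like the rgb() and hsv() values above.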

3.4 Multi-step Graphics

Most graphs and figures can be produced with a single R command, but some complicated graphs require multiple statements. Often users need to overlay very similar types of content onto the same graph; for example, data from several groups might be displayed on the same scatter plot using points with different colors or shapes. Other figures overlay related types of information; for example, error bars representing the standard error of the mean might be added to a bar chart displaying the mean BMI of male and female patients. Sometimes users may want to overlay entirely different types of content; for example, the fitted density of a standard normal curve might be displayed over a histogram to show that the variable is normally distributed. The next few examples demonstrate some general principles for creating overlay figures.

3.4.1 Overlay two different graphs using par(new=TRUE)

In previous sections, the hist() command was used to generate a histogram of a single sample and the plot() command was used to plot the density of a normal distribution. The par(new=TRUE) command overlays one figure on top of another, which can be used to add a fitted normal density over a histogram (Figure 42) or to overlay two histograms using translucent colors (Figure 43).

Figure 42. Histogram with fitted normal density Figure 43. Two histograms overlay figure.

> # Generate two number sequences from -4 to +4
> aa <- seq(from=-4,to=4,length=8000)
> bb <- seq(from=-4,to=4,length=40)
> # Create a histogram from random normal data
> hist(rnorm(1000),col="#B21A4CB2",breaks=bb,freq=FALSE,xlab="X",ylim=c(0,0.5))
> # Call par(new=TRUE) to overlay a new figure on the histogram
> par(new=TRUE)
> # Overlay the density of a standard normal distribution
> plot(aa,dnorm(aa),xlim=c(-4,4),ylim=c(0,0.5),type="l", xlab="",ylab="")
> # Close the current graphics window
> dev.off()
>
> # Load the mnormt and sn package libraries to generate
> # skewed normal data for two new histograms
> library(mnormt)
> library(sn)
> # Generate a sequence from 0 to 1
> uu <- seq(from=0,to=1,length=10000)
> # Create random samples from right- and left-skewed normal data
> rr <- qsn(uu,96.6276,5,8)
> ll <- qsn(uu,103.37245,5,-8)
> # Generate a histogram from the left-skewed data
> hist(ll,breaks=30,xlim=c(85,115),ylim=c(0,2000),col="blue", main="Right-skewed
and Left-skewed distributions with Median = 100",xlab="Response")
> # Call par(new=TRUE) to overlay a new figure on the histogram
> par(new=T)
> # Generate a histogram from the right-skewed data
> hist(rr,breaks=30,xlim=c(85,115),ylim=c(0,2000),col="#FF00004F",main="",xlab="",
ylab="")
There are several important considerations when one graph is overlaid on another. You must make sure that both graphs share the same X- and Y-axis ranges and tick intervals, so the two graphs line up appropriately, and you should apply the main title, X- and Y-axis labels and subtitle to only one of the two graphs. With this in mind, two sequences from -4 to +4 were created to define the domain of the normal density function (aa) and the break points of the histogram (bb). The break points define the range of the X-axis for the histogram, and matching ylim parameters ensure the two graphs align properly. The par(new=TRUE) command was called after the hist() command to indicate that the subsequent plot() command should draw on top of the existing histogram. The result is a histogram with a fitted normal density (Figure 42). The dev.off() command then closes the current graphics window so the next graph can be created.
The mnormt and sn package libraries were installed and loaded with the library() command to generate the figure with overlapping histograms (Figure 43). The qsn() command from the sn package was used to generate histograms representing left- and right-skewed normal distributions. Notice that ll and rr are not random samples from these distributions; rather, they are evenly spaced quantiles of the complete distributions, generated by applying the skew-normal quantile function qsn() to a uniform sequence of 10,000 numbers from 0 to 1 (uu). The histogram of ll was created first, followed by a par(new=TRUE) statement and the histogram of rr drawn with a transparent color from the rgb() command. The figure shows left- and right-skewed normal distributions that both have median = 100, a counterexample in which a Wilcoxon test can yield a significant statistic even though both samples have the same median.

3.4.2 Add error bars to a bar chart using segments() or arrows()

In section 2.1.4, bar charts were used to show the frequency of outcomes for categorical variables, like the adverse events variable in the AE data set, or the mean response for continuous variables, like the mean BMI levels for male and female patients in the AFP data set. A bar chart of mean or median response levels typically accompanies a Student's t-test, Wilcoxon test or analysis of variance (ANOVA), but these figures are much more informative with error bars of some kind. The base packages in R do not compute error bars automatically, so users must compute their own error statistics and manually overlay the error bars on an existing barplot() graphic using the segments() or arrows() command (Figure 44).


Figure 44. Bar chart with custom error bars

> # compute std dev of BMI for male and female patients
> sd.females <- sd(AFP[AFP$gender=="female",5])
> sd.males <- sd(AFP[AFP$gender=="male",5])
> sd.BMI <- c(sd.females,sd.males)
> # call the barplot() command for a bar chart of BMI
> mp <- barplot(height=mean.BMI,ylab="BMI",ylim=c(0,50),main=main,col=colors)
> # define the starting and ending points of each arrow
> X0 <- X1 <- mp
> Y0 <- mean.BMI - sd.BMI
> Y1 <- mean.BMI + sd.BMI
> # call the arrows command to overlay error bars over barplot
> arrows(X0,Y0,X1,Y1,code=3,angle=90)

Again, creating an overlay figure in R requires a little forethought. The variable mean.BMI was defined in section 2.1.4 to hold the mean BMI response for male and female patients. Here, similar methods compute the standard deviations of BMI for both groups and store them as sd.BMI. Next, the barplot() command generates the bar chart, but this time its return value is stored as the variable mp so the midpoint of each bar is available. Each error bar needs starting and stopping locations on both the X- and Y-axis: the starting location (X0) and stopping location (X1) on the X-axis are both defined as the midpoints (mp) from the barplot procedure, while the starting and stopping points on the Y-axis (Y0 and Y1) are computed as one standard deviation below and one standard deviation above the mean BMI, respectively. Finally, the error bars are drawn by passing these locations to the arrows() or segments() command. The arrows() command produces capped whiskers, with the angle = 90 parameter set to produce square whisker caps (the default angle produces arrowhead caps), while the segments() command produces whiskers with no end caps.

3.4.3 Add regression lines using abline() or plot()

In section 2.1.5, the plot() command was used to produce a simple scatter plot of % Body Fat vs. Weight from the AE data set. Researchers might analyze this kind of data using simple linear regression or nonlinear regression methods. Without exploring the statistical details of these models, the next examples demonstrate how regression lines can be added to a scatter plot.

Figure 45. Scatter plot with linear regression line Figure 46. Scatter plot with nonlinear curve fit

> # Estimate a linear regression fit using lm()
> reg.line <- lm(Percent.Body.Fat ~ Weight,data=AE)
> # Create a simple scatter plot of % Body Fat vs Weight
> plot(AE$Weight,AE$Percent.Body.Fat,xlab="Weight (lbs)",ylab="% Body Fat",
main="Linear Regression")
> # Add a linear regression line using abline()
> abline(reg.line,col="red",lwd=2)
> # Close the graphics device
> dev.off()
>
> # Estimate a smoothed loess fit
> loess.line <- loess(Percent.Body.Fat~Weight,data=AE,span=0.85)
> # Create a simple scatter plot of % Body Fat vs Weight
> plot(AE$Weight,AE$Percent.Body.Fat,xlab="Weight (lbs)",ylim=c(5,50),
ylab="% Body Fat",main="Loess Fit")
> # Add the loess fit to the scatter plot
> par(new=TRUE)
> plot(sort(AE$Weight),predict(loess.line),ylim=c(5,50),type="l",lwd=2,col="blue",
xlab="",ylab="")

The linear model procedure lm() fits a simple linear regression of the continuous response % Body Fat on the continuous predictor Weight from the AE data set (Figure 45), and the results are stored as reg.line. A scatter plot of % Body Fat vs. Weight is created with plot() and a regression line is added with the abline() command. There was no need for a par(new=TRUE) command, since abline() was called immediately after plot(): the abline() function is specially designed to add straight lines to the most recent plot, using only the intercept and slope of the line as its inputs. Using par(new=TRUE) and plot() would have been much more cumbersome for this simple regression line.
Fitting a curved trend to the scatter plot is a little more difficult (Figure 46). The loess() command fits a local polynomial regression to the data, and the results are stored as loess.line. The statistical details of this procedure are not particularly important here, except that it produces a curved fit. The plot() procedure produces the initial scatter plot, then par(new=TRUE) is called so the fitted curve can be drawn over the current graphic. Finally, a second plot() command draws the loess line, using the sorted Weight variable as the X-values and the predicted values from the stored loess.line object as the Y-values.

3.4.4 Plot multiple groups using points() or lines()

Looking at the simple scatter plot of % Body Fat vs. Weight, it is fair to ask whether additional variables like Gender or Region affect the relationship between % Body Fat and Weight. It may be useful to separate the Gender or Region groups on the scatter plot to look for differences in the relationship among these groups. Multiple groups can be added to a scatter plot one at a time using the points() command, and groups can be distinguished by multiple colors, symbol types, symbol sizes or all of the above.

> # Create main title and other features
> main.gender = "Percent Body Fat versus Weight by Gender"
> main.region = "Percent Body Fat versus Weight by Region"
> xlab = "Weight"
> ylab = "% Body Fat"
> # Create a scatter plot of the female data in pink
> plot(AE[AE$Gender=="Female",5],AE[AE$Gender=="Female",6],xlim=c(90,350),
ylim=c(0,50),main=main.gender,xlab=xlab,ylab=ylab,col="pink")
> # Overlay the male data points in sky blue with points()
> points(AE[AE$Gender=="Male",5],AE[AE$Gender=="Male",6],col="sky blue")

Figure 47. Gender groups distinguished by colors Figure 48. Region groups distinguished by symbols

> # Create a scatter plot of the Southwest region data
> plot(AE[AE$Region=="Southwest",5],AE[AE$Region=="Southwest",6],xlim=c(90,350),
ylim=c(0,50),main=main.region,xlab=xlab,ylab=ylab)
> # Overlay data from other regions using new symbol types: pch
> points(AE[AE$Region=="Northwest",5],AE[AE$Region=="Northwest",6],pch=2)
> points(AE[AE$Region=="Midwest",5],AE[AE$Region=="Midwest",6],pch=3)
> points(AE[AE$Region=="Northeast",5],AE[AE$Region=="Northeast",6],pch=4)
> points(AE[AE$Region=="Southeast",5],AE[AE$Region=="Southeast",6],pch=5)

Just like the abline() command in section 3.4.3, the points() command adds new data points to the existing graphic. To separate the data by gender, plot the female data in pink using the col parameter in the plot() command and add the male data in sky blue using the col parameter in the points() command (Figure 47). To separate the data by region, first plot the data from the Southwest region using the default symbol and then plot the remaining regions with new symbol types using the pch parameter (Figure 48). The symbol size can also be adjusted with the cex parameter, but use it carefully because it also adjusts the size of any other characters added to the plot.
Groups can also be separated on a line graph using multiple colors or line types with the lines() or abline() commands. Similar to the points() command, the lines() command can be used wherever plot(type="l") would be used, e.g. to plot probability density functions (Figure 49), nonlinear regressions, etc. Similarly, abline() can be called multiple times to plot several regression lines (Figure 50).

Figure 49. Seven inverse gamma density curves Figure 50. Two-color scatter plot with two line fits

> # Load the required package library
> library(MCMCpack)
> # Create a sequence of numbers from 0 to 4
> x <- seq(from = 0, to = 4, length = 8000)
> # Plot the first probability density function
> plot(x,dinvgamma(x,1.0,0.1),type="l",col="red",xlab="Gene Expression Variance",
ylab="Density",main="Probability Density Functions of Inverse Gamma
Distributions")
> # Plot additional probability density functions
> lines(x,dinvgamma(x,1.5,0.3),col="dark orange")
> lines(x,dinvgamma(x,3.0,1.0),col="yellow")
> lines(x,dinvgamma(x,8.0,5.0),col="light green")
> lines(x,dinvgamma(x,10.0,10.0),col="dark green")
> lines(x,dinvgamma(x,12.0,18.0),col="blue")
> lines(x,dinvgamma(x,14.0,30.0),col="purple")
>
> # Fit two linear regression models
> AE.female <- AE[AE$Gender == "Female",]
> AE.male <- AE[AE$Gender == "Male",]
> reg.female <- lm(Percent.Body.Fat ~ Weight,data=AE.female)
> reg.male <- lm(Percent.Body.Fat ~ Weight,data=AE.male)
> # Create a scatter plot of the female data in pink
> plot(AE[AE$Gender=="Female",5],AE[AE$Gender=="Female",6],xlim=c(90,350),
ylim=c(0,50),main=main,xlab=xlab,ylab=ylab,col="pink")
> # Overlay the male data points in sky blue with points()
> points(AE[AE$Gender=="Male",5],AE[AE$Gender=="Male",6],col="sky blue")
> # Overlay two linear regression lines
> abline(reg.female,col="red")
> abline(reg.male, col="blue")

3.5 Figure Legends and Overlaid Text

Many of the complicated graphics from subchapter 3.1 would require a figure legend to identify multiple
groups on the same plot. Figure legends can be added to almost any figure and groups can be separated by


color, line type and symbol type features from the graph. Additionally, text labels can be added to specific
locations on any graph to label individual data points, such as outliers or specific patients.

3.5.1 Figure legends

In the previous sections,
barplot()was used to create a bar chart of mean male and female BMI with
standard deviation error whiskers. Both the
plot() and points() commands were used to create scatter plot
figures with separate colors and symbols for the gender and region variable, respectively. Figure legends could
be used in both graphs to help readers distinguish between the groups on the plots.

Figure 51. Bar chart with figure legend
Figure 52. Scatter plot with three figure legends

> # Call the barplot() command for a bar chart of BMI
> mp <- barplot(height=mean.BMI,ylab="BMI",ylim=c(0,50),main=main,col=colors)
> # Define and plot the error whiskers
> arrows(X0,Y0,X1,Y1,code=3,angle=90)
> # Draw the figure legend
> legend(x="topleft",legend=c("female","male"),fill=c("pink","sky blue"))
>
> # Create indices for each plotted symbol
> indx.FSW <- AE$Gender == "Female" & AE$Region == "Southwest"
> indx.MSW <- AE$Gender == "Male" & AE$Region == "Southwest"
> indx.FNW <- AE$Gender == "Female" & AE$Region == "Northwest"
> indx.MNW <- AE$Gender == "Male" & AE$Region == "Northwest"
> indx.FMW <- AE$Gender == "Female" & AE$Region == "Midwest"
> indx.MMW <- AE$Gender == "Male" & AE$Region == "Midwest"
> indx.FNE <- AE$Gender == "Female" & AE$Region == "Northeast"
> indx.MNE <- AE$Gender == "Male" & AE$Region == "Northeast"
> indx.FSE <- AE$Gender == "Female" & AE$Region == "Southeast"
> indx.MSE <- AE$Gender == "Male" & AE$Region == "Southeast"
> # Call the plot() and points() functions to generate figure
> plot(AE[indx.FSW,5],AE[indx.FSW,6],xlim=c(90,350), ylim=c(0,50),main=main,
xlab=xlab,ylab=ylab,col="pink")
> points(AE[indx.MSW,5],AE[indx.MSW,6],col="sky blue")
> points(AE[indx.FNW,5],AE[indx.FNW,6],col="pink",pch=2)
> points(AE[indx.MNW,5],AE[indx.MNW,6],col="sky blue",pch=2)


> points(AE[indx.FMW,5],AE[indx.FMW,6],col="pink",pch=3)
> points(AE[indx.MMW,5],AE[indx.MMW,6],col="sky blue",pch=3)
> points(AE[indx.FNE,5],AE[indx.FNE,6],col="pink",pch=4)
> points(AE[indx.MNE,5],AE[indx.MNE,6],col="sky blue",pch=4)
> points(AE[indx.FSE,5],AE[indx.FSE,6],col="pink",pch=5)
> points(AE[indx.MSE,5],AE[indx.MSE,6],col="sky blue",pch=5)
> legend(x="topleft",legend=c("female","male"),fill=c("pink","sky blue"),
title="Gender")
> legend(x="topright",legend=c("SW","NW","MW","NE","SE"),pch=1:5,title="Region")
> legend(x="bottomright",legend=c("Female","Male","SW","NW","MW","NE","SE"),
col=c("pink","sky blue",rep("black",5)),pch=c(15,15,1:5),title="Legend")

The figure legend for the bar chart is easy to create using the
legend() command (Figure 51). The
legend() command has parameters x and y that are used to specify the exact location of the legend within the
plotting region (e.g.
x = 15 and y = 200 would place the figure legend at (x,y) = (15,200) in the graph, if it
exists). While this precise level of control can be useful, many users will prefer to use shortcut locations like
x =
“topleft”
or x = “bottomright” to place the legend in one of nine standard locations. See the Details
section of
help(legend) in R for more information. The legend parameter of the legend() command is used
to specify the group labels in the legend. Note that you can choose to leave certain groups off the legend, or
you can add notations for groups that are not present in the figure, if needed. Finally, the
fill parameter is
used to create filled boxes representing the colors of the male and female groups in this graph.
Note this new scatter plot graph is a little more complicated than the example in section 3.1.4 above
(Figure 52). The first step was to create a series of ten index variables to identify individuals from each region
and each gender, so these groups can be separated by symbol types and colors on the final plot. Each index was
created using two inequalities evaluated simultaneously using the ampersand symbol
&. The initial plot()
command was only called for the female patients from the Southwest region, plotting with the default symbol
and pink coloring. Remember, it is important to set reasonable x-axis and y-axis limits for this first plot with
the
xlim and ylim parameters, because all of the remaining groups will be plotted over this first figure. If xlim
and
ylim are chosen poorly, the remaining groups might not be visible. The next nine points() commands are
used to add the remaining data to the plot, one-group-at-a-time. Male and female patients are plotted in sky
blue and pink colors, respectively, and each region is plotted with its own symbol type using the
pch parameter.
Three separate figure legends were added to this complicated scatter plot figure to show multiple
concepts. The first
legend() command is identical to the command used for the bar chart in the previous
example, except a
title parameter was added to show the top left legend identifies points by gender. The
second
legend() command uses a pch parameter instead of the fill parameter, so each group can be identified
by its symbol type. A researcher might choose to display both of these two figure legends, so readers could
properly distinguish between the genders separated by color and the regions separated by symbol type. It is
important to note that two or more legends can be applied to the same figure. But what if you want all of the
information in a single legend? The third
legend() command is redundant with the first two legends, but it
demonstrates one creative way to combine both the gender and region elements into a single legend. The
fill
parameter is not used in the third legend, but a careful choice of
pch = 15 for the “Female” and “Male” entries
yields a filled square symbol that can be colored with the
col parameter. The remaining region symbols are
colored black for simplicity.

3.5.2 Adding text labels inside a graph

Just as the
abline(), points(), lines() and legend() commands can be used to add new
features to an existing graph, there is also a
text() command that can be used to add new text labels to specific
locations within a graph. This can be useful to identify outliers on a bar chart or regression plot, to identify
specific patients of interest in a medical study or just to add general comments for your readers.


Figure 53. General comments added to a scatter plot
Figure 54. Outlier identified in a bar chart

> # Create multicolor, multisymbol scatter plot as above
> plot(AE[indx.FSW,5],AE[indx.FSW,6],xlim=c(90,350),ylim=c(0,50),main=main,
xlab=xlab,ylab=ylab,col="pink")
> points(AE[indx.MSW,5],AE[indx.MSW,6],col="sky blue")
> points(AE[indx.FNW,5],AE[indx.FNW,6],col="pink",pch=2)
> points(AE[indx.MNW,5],AE[indx.MNW,6],col="sky blue",pch=2)
> points(AE[indx.FMW,5],AE[indx.FMW,6],col="pink",pch=3)
> points(AE[indx.MMW,5],AE[indx.MMW,6],col="sky blue",pch=3)
> points(AE[indx.FNE,5],AE[indx.FNE,6],col="pink",pch=4)
> points(AE[indx.MNE,5],AE[indx.MNE,6],col="sky blue",pch=4)
> points(AE[indx.FSE,5],AE[indx.FSE,6],col="pink",pch=5)
> points(AE[indx.MSE,5],AE[indx.MSE,6],col="sky blue",pch=5)
> legend(x="topleft",legend=c("female","male"),fill=c("pink","sky blue"),
title="Gender")
> legend(x="topright",legend=c("SW","NW","MW","NE","SE"),
pch=1:5,title="Region")
> # Add large gray cross symbols to demonstrate centering of text
> points(c(300,280),c(18,6),pch=3,cex=4,col="light gray")
> # Add default text annotations to graph
> text(300, 18, "the text is CENTERED around (x,y) = (300,18)
by default",cex=0.8)
> # Add justification adjustments to text annotation
> text(280, 6, "or Right/Bottom JUSTIFIED at (x,y) = (280,6)
by 'adj = c(1,0)'",adj=c(1,0),cex=0.8)
>
> # call the barplot() command for a bar chart of BMI
> mp <- barplot(height=mean.BMI,ylab="BMI",ylim=c(0,50),
main=main,col=colors)
> # call the arrows command to overlay error bars over barplot
> arrows(X0,Y0,X1,Y1,code=3,angle=90)
> legend(x="topright",legend=c("female","male"),
fill=c("pink","sky blue"))
> # Add outliers to bar chart graph with points() command
> points(mp[1],41,col="black",pch=16)
> # Add text label to identify outlier point on graph
> text(mp[1],42,"Outlier!",adj=c(0,0),cex=1.5)


The first example uses the multi-color, multi-symbol scatter plot to show how general comments can be
added to any plot (Figure 53). In this special example, two large gray “cross” symbols were added in the same
location as the text comments to demonstrate how the text command uses justification. Obviously, you would
want to remove these points from the final figure, but this can be a helpful trick for lining up your text. The
second example shows how you might use the
points() and text() commands to add annotations to outlier
points on a bar chart or other graphic (Figure 54). While outliers may be plotted automatically in a box plot
figure, it may be necessary to plot the outliers manually in a bar chart or other graphics.

3.6 Multi-panel Layouts

Now that you have created some individual graphs in R, it may be necessary to organize the graphs into
consistently formatted multi-panel figures and export the results for publication. You can usually copy-and-
paste R graphics directly from the graphics window into popular software applications like MS PowerPoint or
other software, which could be used to organize and customize the graphs. You can also save graphic images to
your computer in a handful of image formats (e.g. PDF, PNG, JPG, ...) from the graphics window by clicking
File > Save As... in the MS Windows or Mac OSX R GUI. These images can also be opened in PowerPoint
and other software. However, it may produce better results if you edit these figures within R. The
par()
command can be used to help organize multiple graphs into paneled figures. Additionally, R includes a variety
of different graphics commands to save your graphs in popular photo, drawing and document formats (e.g.
.JPG, .PDF, …). These commands can provide better control over the final look of your published figures.

3.6.1 Using par() commands to create multi-panel figures

We have already seen that
par() commands are shared among most graphics procedures (e.g. main
defines main titles,
col defines colors, cex defines font and symbol magnification, …). We have also seen the
par(new=TRUE) command used in between two graphics commands to create overlay figures. The par()
command also includes the parameters
mfcol and mfrow to create multi-panel figures from multiple R graphics.


Figure 55. A multi-panel figure created with
par(mfrow=c(1,3))


> # Use par(mfrow) to create a multi-panel figure
> par(mfrow = c(1,3))
> # Call three graphics commands to fill the multi-panel figure
> hist(x=rnorm(100),xlab="Normal Data",main="Histogram",
col="wheat")
> boxplot(x=rnorm(100),xlab="Normal Data",main="Boxplot",
col="purple")
> plot(1:25,rnorm(25),xlab="",ylab="Normal Data",col="red",
type="l",main="Line Graph")

The
mfrow parameter was used to specify that the figure should have 3 panels, arranged in one row with
three columns (Figure 55). The
mfrow parameter will fill the panels with each new graphic from left-to-right by
rows, while the equivalent
mfcol parameter will fill the panels with each new graphic from top-to-bottom by
columns. Individual graphs can be assigned to specific rows and columns of the multi-panel figure using the
mfg parameter.
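As a sketch of the mfg parameter (the panel choices and random data here are purely illustrative), the following fills a 2 x 2 grid out of row order by setting the position of each new graph explicitly:

```r
# Sketch: placing graphs in specific panels with par(mfg)
# (panel positions and random data are illustrative choices)
out.file <- tempfile(fileext = ".pdf")
pdf(out.file)                      # open a device for the example
par(mfrow = c(2, 2))               # 2 x 2 grid, normally filled by rows
hist(rnorm(100), main = "Row 1, Column 1")
par(mfg = c(2, 2))                 # jump ahead to row 2, column 2
plot(rnorm(25), type = "l", main = "Row 2, Column 2")
par(mfg = c(1, 2))                 # return to fill row 1, column 2
boxplot(rnorm(100), main = "Row 1, Column 2")
dev.off()
```

Each par(mfg = c(i, j)) call positions the next graph in row i and column j of the grid, regardless of the default fill order.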

3.6.2 Applying consistent graphics parameters among multiple graphs

When
par() is used outside of several graphs, it can also be used to organize multiple graphs in a single
figure, add borders and background colors, apply consistent fonts and labeling, etc. (Figure 56).


Figure 56. A paneled figure with 3 graphs with consistent parameters using par()

> # Use par(mfrow) to create a multi-panel figure
> par(mfrow = c(1,3), bg="sky blue",fg="blue",family="serif")
> # Call three graphics commands to fill the multi-panel figure
> hist(x=rnorm(100),xlab="Normal Data",main="Histogram",
col="wheat")
> boxplot(x=rnorm(100),xlab="Normal Data",main="Boxplot",
col="purple")
> plot(1:25,rnorm(25),xlab="",ylab="Normal Data",col="red",
type="l",main="Line Graph")

The
par() command is very powerful, but only three new parameters were used in this simple example. The
bg parameter is used to set the background color of the paneled figure to "sky blue", the fg parameter sets the
foreground color to
"blue" and the family parameter specifies a serif font.



3.7 Exporting R Graphics

3.7.1 Exporting R graphics as .PDF and photo format output

After you create several graphics, you will want to save them for publication or for a presentation. It is
possible to save your R graphics as a .PDF or .PS document using the
pdf() and postscript() commands
found in the
grDevices package library. The grDevices package also includes many popular photo image
formats (e.g. .BMP, .JPEG, .PNG, .TIFF or .SVG), using the
bmp(), jpeg(), png(), tiff() or svg() commands. A
blank image file is opened by the chosen graphic device command (e.g. the
jpeg() command), then the open
image file is filled by a plot or graphic command (e.g. the
hist() or par() commands) and the image file is
completed with the command
dev.off(), which turns off the open graphic device. Most of these image format
commands will have their own unique parameters to control the final size of the image, compression and other
features.

> # Open a .JPG image to save a histogram graph
> jpeg(filename="histogram.jpeg",quality=80)
> # Plot the histogram in the open jpeg file
> hist(rnorm(100),col="wheat")
> # Turn off the open jpeg graphic device
> dev.off()
3.7.2 Choosing an image format

If you are going to export a graphic, then think carefully about the appropriate image format. Different
formats have different purposes. For example, the .PNG format is typically used for images that will be shared
over the internet. The .PNG format is good for images that have large blocks of monochrome color (e.g. a bar
chart or histogram) and it may be preferred over .JPG for figures with text and sharp transitions. The .TIFF
format is typically the standard for publication in journals, so you may want to export to .TIFF unless instructed
otherwise by the journal. The .PNG, .JPG and .TIFF formats are all raster graphics formats, which store images
as a rectangular grid of pixels or colors. One disadvantage of raster graphics is that they can produce pixilated
images when enlarged. An alternative to raster graphics would be a vector graphic image, like a .SVG file used
in the Inkscape software system. Vector graphics are easy to edit, shrink and enlarge, but they may not produce
shadings and continuous color tones as well as raster graphics.
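As a sketch of format-specific device parameters (the file names, pixel dimensions and resolution below are illustrative choices, not requirements), the same histogram can be exported once for the web and once at print resolution:

```r
# Sketch: exporting one plot as .PNG for the web and .TIFF for print
# (file names, sizes and resolution are illustrative choices)
png(filename = "histogram.png", width = 1600, height = 1200, res = 300)
hist(rnorm(100), col = "wheat")
dev.off()

tiff(filename = "histogram.tiff", width = 1600, height = 1200, res = 300,
     compression = "lzw")   # LZW compression keeps the file size manageable
hist(rnorm(100), col = "wheat")
dev.off()
```

The res parameter sets the nominal resolution in pixels per inch, which journals often require to be 300 or higher for raster figures.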

3.8 Sample Problems for Students

#4. {Fisher’s iris data} Sir Ronald A. Fisher famously used this set of iris flower data as an example to test
his new linear discriminant statistical model. Now, the iris data set is used as a historical example for new
statistical classification models.

A) Make a boxplot of all four measurements from Fisher’s iris data

B) Create a multi-panel figure with histograms of all four measurements. Do you notice anything
that could not be seen from the boxplot?

C) Create a multi-panel figure with boxplots of all four measurements, paneled by the three
different species. Do you notice any differences among species?


#5. {AFP data} Suppose alpha-fetoprotein (AFP) is a potential biomarker for liver cancer and other cancer
types. A researcher might be interested in AFP levels before and after taking a new drug in one of four
concentrations.

A) In section 3.2.1, the
barplot() and arrows() commands were used to create a barchart of
mean(BMI) by gender with error bars. Install the
sciplot package library and use the
bargraph.CI() command to replicate that graph.

B) Use the bargraph.CI() command to create a bar chart that compares AFP difference over all five
drug concentrations.

C) Create an interleaved bar chart to plot mean AFP difference by drug concentration and gender

#6. {AE data} Doctors, epidemiologists and other researchers look at adverse events to explore the
symptoms and medical conditions affecting patients. A researcher might choose to look for associations
between adverse events and diet.

A) Create a histogram of Percent Body Fat (or your choice of continuous response variable), then
overlay a normal curve.

B) Install the lattice package and use the
barchart() command to graph the AEtable data table
created for question #3. C) in the previous chapter. What kind of plot is this? Add the
appropriate figure legend.


Ch. 4. Basic Statistical Tests and Analyses in R

4.1 Student’s T-test

We have already explored many features of R, but most R users are primarily interested in statistical
tests. The obvious place to start is with the Student’s t-test. The t-test can be used to compare a single mean
against a constant null value, for example to determine whether an average is significantly different from zero.
More commonly, a two-sample t-test is used to compare the means of two independent groups. Frequently, a
matched pairs t-test is used to compare two treatments applied to the same sample of subjects or objects in an
experiment. All three types of t-tests can be computed in R. Furthermore, the t-tests can be computed as one-
sided or two-sided tests, with or without the assumption of equal variances among groups.

4.1.1 One-sample t-tests

The command
t.test() is used to compute most Student’s t-tests in R. The parameters x and y
identify the one or two samples used in the test. This first example shows a one-sample Student’s t-test, where
only the
x parameter is used. A one-sample Student’s t-test compares the mean of a sample against a constant
null value, specified by the
mu parameter. Most often, the null value of a one-sample test is zero, but the null
value can be set to a custom value. In this example, we might want to know if the average sepal length of an
Iris setosa plant is greater than 5 cm. This one-sample t-test is also a one-sided t-test, because the
alternative
hypothesis is that the sample mean is greater than 5 cm. The test will not produce a statistically significant
result if the sample mean is less than 5 cm. The test results for the one-sample test provide a t-statistic value
t,
the degrees of freedom
df and p-value with a confidence interval for the single sample mean.

> # Define a vector of Iris setosa sepal lengths from iris data
> sepal.length <- iris$Sepal.Length[iris$Species == "setosa"]
> # Compute a one-sided, one-sample t-test with iris data
> t.test(x=sepal.length,alternative="greater",mu=5.0)

One Sample t-test

data: sepal.length
t = 0.1204, df = 49, p-value = 0.4523
alternative hypothesis: true mean is greater than 5
95 percent confidence interval:
4.922425 Inf
sample estimates:
mean of x
5.006

4.1.2 Two-sample t-tests

Two-sample t-tests are produced when both parameters
x and y are used. The two-sample test uses an
additional parameter
var.equal, which indicates whether the test is computed with the usual assumption that
variances are approximately equal between the two samples or whether the test should be computed with a
Welch correction for unequal variances among the two samples. The test results for the two-sample t-test
provide a t-statistic value
t, the degrees of freedom df and p-value with a confidence interval for the difference
between the two sample means.

> # Define a vector of % body fat data for men from AE data
> bfat.m <- AE[AE$Gender == "Male",6]
> # Define a vector of % body fat data for women from AE data
> bfat.f <- AE[AE$Gender == "Female",6]


> # Compute a two-sided, two-sample t.test with AE data
> t.test(bfat.m,bfat.f,alternative="two.sided",var.equal=TRUE)

Two Sample t-test

data: bfat.m and bfat.f
t = 0.1095, df = 62, p-value = 0.9132
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.132776 4.611759
sample estimates:
mean of x mean of y
22.78788 22.54839
> # Compute a one-sided, two-sample t.test with AE data
> t.test(bfat.m,bfat.f,alternative="greater",var.equal=FALSE)

Welch Two Sample t-test

data: bfat.m and bfat.f
t = 0.1096, df = 61.968, p-value = 0.4565
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-3.408104 Inf
sample estimates:
mean of x mean of y
22.78788 22.54839
4.1.3 Matched pairs t-tests
Matched pairs t-tests are produced when two parameters
x and y represent two matched pair responses and the paired = TRUE parameter is set.
Matched pairs responses are often two repeated measurements taken from the same subjects (e.g. heart rate in
beats per minute, before and after light exercise). Even though responses from two samples are used, the
parameter
var.equal is not used, because the paired t-test is equivalent to the one-sample t-test computed on
the difference between the two samples. The test results for the matched pairs t-test include a t-statistic value
t,
the degrees of freedom
df and p-value with a confidence interval for the mean of the differences between the
two samples.

> afp.pre <- afp.data$AFP.before
> afp.post <- afp.data$AFP.after
> t.test(afp.pre,afp.post,alternative="greater",paired=TRUE)

Paired t-test

data: afp.pre and afp.post
t = 24.7612, df = 19, p-value = 3.180e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
1.106445 Inf
sample estimates:
mean of the differences
1.189512
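The equivalence between the paired t-test and a one-sample t-test on the differences can be checked directly. This sketch uses simulated before/after measurements rather than the AFP data from the text:

```r
# Sketch: a paired t-test equals a one-sample t-test on the differences
# (the before/after values here are simulated, not the AFP data)
set.seed(42)
before <- rnorm(20, mean = 5.0, sd = 1.0)
after  <- before - rnorm(20, mean = 1.2, sd = 0.2)
paired.test  <- t.test(before, after, alternative = "greater", paired = TRUE)
onesamp.test <- t.test(before - after, alternative = "greater", mu = 0)
# The t-statistics, degrees of freedom and p-values match exactly
all.equal(paired.test$p.value, onesamp.test$p.value)
```

This is why var.equal is irrelevant for a paired test: only the single vector of within-subject differences enters the calculation.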

4.2 Linear Regression and ANOVA
Many statistical tests are easy to use, like the Student’s t-tests, but other functions have more
difficult options and input requirements. The next two examples will demonstrate some statistical tests that
require more careful preparation and analysis than the Student’s t-test. This manual was written to teach readers


how to use R; it was not written to teach readers about basic statistics concepts. If you are unfamiliar with
linear regression and analysis of variance (ANOVA), then just focus your attention on how the subtle changes
in the R commands can produce totally different results. If you are more familiar with these methods, then you
may want to focus on how these examples could be extended into more complicated linear models like multiple
regression, multifactor ANOVA or analysis of covariance (ANCOVA). You may even notice that these two
examples do not just highlight the need for careful use of R commands, but they also highlight the conceptual
link between linear regression and ANOVA.

4.2.1 Linear regression

Very briefly, simple linear regression analyses are used to model the relationship between two
continuous variables with one straight line. The model estimates the unknown slope and y-intercept of the line
that best summarizes the relationship between the two continuous variables. Finally, a statistical test is
computed to determine if the slope is significantly different from zero. If the slope is equal to zero, then there is
no relationship between the two variables. If the slope is significantly different from zero, then there may be a
meaningful relationship between the two variables. Linear regression is part of the larger class of linear models,
which also includes the ANOVA model. The linear models share a common set of assumptions. The fit of a
linear model is assessed using these model assumptions. The model errors, or residuals, are often examined to
determine if these model assumptions were met. The regression model is often visualized with a scatterplot of
the continuous response variable Y plotted against the continuous predictor variable X, while the model
assumptions are often evaluated with plots of the model residuals (Figure 57).


Figure 57. Regression plot and residual plot from R.

> # Define afp.data data frame with stringsAsFactors FALSE
> afp.data <- data.frame(subject,gender,height,weight,BMI,drug,
AFP.before,AFP.after,AFP.diff,
stringsAsFactors=FALSE)
> # Call the lm() procedure to fit regression
> afp.reg <- lm(formula = AFP.diff ~ drug, data = afp.data)
> afp.reg

Call:
lm(formula = AFP.diff ~ drug, data = afp.data)


Coefficients:
(Intercept) drug
-1.113835 -0.004408

> anova(afp.reg)
Analysis of Variance Table

Response: AFP.diff
Df Sum Sq Mean Sq F value Pr(>F)
drug 1 0.00230 0.00230 0.1225 0.7304
Residuals 18 0.33734 0.01874
> summary(afp.reg)

Call:
lm(formula = AFP.diff ~ drug, data = afp.data)

Residuals:
Min 1Q Median 3Q Max
-0.332846 -0.010882 0.008867 0.060927 0.228871

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.217901 0.053020 -22.97 8.71e-15 ***
drug 0.001515 0.004329 0.35 0.73
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1369 on 18 degrees of freedom
Multiple R-squared: 0.00676, Adjusted R-squared: -0.04842
F-statistic: 0.1225 on 1 and 18 DF, p-value: 0.7304

> # Open a PDF device for plots
> pdf("Regression.pdf")
> # Plot AFP difference vs. drug (Regression plot)
> plot(drug,AFP.diff,ylab="difference",main="regression plot")
> abline(coef=afp.reg$coefficients)
> # Plot residuals vs. fitted values (Residual plot)
> plot(afp.reg$fitted.values,afp.reg$residuals,xlab="fitted",
ylab="residual",main="residual plot")
> dev.off()
> # Open PDF file with graphs
> browseURL("Regression.pdf")

The
lm() function is used to fit both linear regression and ANOVA, so the most critical step in this
example is to redefine
afp.data with stringsAsFactors = FALSE. In the next example, we will see that
lm() fits an ANOVA if drug is a factor effect. The formula parameter is used to define the linear regression
model. So,
AFP.diff ~ drug indicates that AFP.diff is the response variable and drug is the predictor
variable. Since drug is a continuous variable that has not been identified as a factor,
lm() will fit the linear
regression model. The data parameter specifies a data set for the
lm() procedure, so the variables do not need
to be entered as
AE$drug and AE$AFP.diff in the formula.
The output of the
lm() procedure is stored as a list. The lm() procedure will print its call and
coefficients arguments by default, but we need to store the complete results in the variable
afp.reg if we want
to view hidden results like the residuals and fitted values. The F-test to determine if the slope is significantly
different from zero can be computed by using the
anova() or summary() function on the stored result of the
lm() command. The anova() function constructs the usual ANOVA table, while summary() provides
individual t-tests of the model coefficients and statistics like R-squared. In this example, a PDF device was
opened to display a regression plot of AFP differences vs. drug concentration and a residual plot of the model


residuals vs. fitted values. The
abline() command was used to add a regression line to the regression plot,
using the best-fit coefficients for the slope and y-intercept stored in the
afp.reg variable.
If you are familiar with linear regression, you will probably notice the linear regression model does not
fit this data very well. There is an obvious curved trend between the two variables, so the straight line estimated
by linear regression will never provide an accurate model for this relationship. One way to improve model fit
would be to use a polynomial regression model, which is easy to fit with
lm(). However, the next example will
show that the relationship can also be modeled with ANOVA.
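A quadratic fit with lm() might look like the following sketch. Since the afp.data values are not reproduced here, this example simulates a curved dose response; the coefficients and variable names are purely illustrative:

```r
# Sketch: polynomial regression with lm(), using I() to protect powers
# (the drug and response values below are simulated for illustration)
drug <- rep(c(0, 5, 10, 15, 20), each = 4)
set.seed(7)
response <- -1.1 - 0.04 * drug + 0.002 * drug^2 + rnorm(20, sd = 0.05)
poly.reg <- lm(response ~ drug + I(drug^2))
coef(poly.reg)       # intercept, linear and quadratic coefficients
summary(poly.reg)    # t-tests for each polynomial term
```

The I() wrapper is needed so that drug^2 is treated as arithmetic squaring rather than formula notation.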

4.2.2 ANOVA

One-way ANOVA is a statistical model that estimates and compares the means of three or more samples
or groups. Sometimes an ANOVA is visualized with a barchart to compare the means and standard errors
among three or more groups (Figure 58). The group means are compared with a F-test statistic that is the ratio
of the variance between the groups over the variance within the groups. The F-test provides a single statistical
test that can detect one or more differences among three or more groups, but it often does not identify the
specific groups that are different.


Figure 58. A bar chart plotting mean AFP difference for five drug concentration groups.

> # Redefine drug as a factor
> afp.data$drug <- as.factor(afp.data$drug)
> # Run LM ANOVA
> afp.aov <- lm(formula = AFP.diff ~ drug, data = afp.data)
> afp.aov

Call:
lm(formula = AFP.diff ~ drug, data = afp.data)

Coefficients:
(Intercept) drug5 drug10 drug15 drug20
-1.148482 0.050817 0.021928 -0.211923 -0.001521

> afp.anova <- anova(afp.aov)


Analysis of Variance Table
Response: AFP.diff
Df Sum Sq Mean Sq F value Pr(>F)
drug 4 0.17607 0.04402 0.8434 0.5191
Residuals 15 0.78285 0.05219

> afp.summary <- summary(afp.aov)

Call:
lm(formula = AFP.diff ~ drug, data = afp.data)

Residuals:
Min 1Q Median 3Q Max
-0.3663 -0.1271 -0.0560 0.1197 0.4467

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.148482 0.114226 -10.055 4.65e-08 ***
drug5 0.050817 0.161539 0.315 0.757
drug10 0.021928 0.161539 0.136 0.894
drug15 -0.211923 0.161539 -1.312 0.209
drug20 -0.001521 0.161539 -0.009 0.993
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2285 on 15 degrees of freedom
Multiple R-squared: 0.1836, Adjusted R-squared: -0.03409
F-statistic: 0.8434 on 4 and 15 DF, p-value: 0.5191

> # Open a PDF device for barplot()
> pdf("ANOVA.pdf")

> # Define plot labels and settings
> main = "One-way ANOVA"
> ylab = "AFP differences"
> xlab = "drug concentrations"
> colors = rainbow(5)

> # Compute drug concentration means
> means = afp.aov$fitted.values[1:5]
> names(means) = levels(afp.data$drug)

> # Generate a barplot
> mp <- barplot(height = means,main=main,xlab=xlab,ylab=ylab,
col=colors,ylim=c(0,-2))

> # Define the end points of each error bar
> X0 <- X1 <- mp
> Y0 <- means - afp.summary$sigma
> Y1 <- means + afp.summary$sigma

> # Plot error bars
> arrows(X0,Y0,X1,Y1,code=3,angle=90)

> dev.off()
> browseURL("ANOVA.pdf")

If the drug variable is stored as a factor, then lm() will fit an ANOVA rather than a linear regression, even though the formula parameter is the same. Now the default output from lm() will show coefficients for an intercept, the mean of the 0 ng/mL drug concentration group, and the ANOVA coefficients for the four remaining

drug concentration groups. These coefficients represent the adjustment you would need to make to the intercept to compute that group's mean. The anova() and summary() commands generate an ANOVA table and coefficient test results, just as in the linear regression example; the ANOVA table and coefficient t-test output will help you determine whether there are any significant differences among the group means. The output from lm() and summary() is stored as lists, so it includes components that can be retrieved later if the results are stored as a variable. For example, you could produce a barplot() graph with error bars using the fitted values stored by lm() and the error estimate stored by summary(). Notice the bar chart shows the ANOVA model fits a slightly curved trend to the data, which is more accurate than the straight-line trend fit by the linear regression model.
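As a quick check of this interpretation, each group mean can be recovered by adding that group's coefficient to the intercept. A minimal sketch using the coefficient values printed by summary() above:

```r
# Coefficients taken from the summary(afp.aov) output above
intercept <- -1.148482                       # mean of the 0 ng/mL group
adjustments <- c(drug5  =  0.050817, drug10 =  0.021928,
                 drug15 = -0.211923, drug20 = -0.001521)

# Each remaining group mean is the intercept plus that group's adjustment
group.means <- c(drug0 = intercept, intercept + adjustments)
round(group.means, 4)
```

For example, the mean AFP difference in the 5 ng/mL group is -1.148482 + 0.050817 = -1.097665.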
Other extensions of the linear model can be fit using R. More complicated formulas in the lm() function can produce multiple regression models, multifactor ANOVA models and ANCOVA models. Linear mixed models, or ANOVA models with random-effect variables, can be fit using the lme() function from the nlme package. Multivariate analysis of variance (MANOVA) can be fit with the manova() command from the stats package. Complicated ANOVA models for microarray data can be fit using the limma and maanova packages.
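To give a flavor of these extensions, a multifactor ANOVA only requires a richer formula. The sketch below invents a small simulated data frame with a hypothetical second factor, sex, purely for illustration:

```r
# Hypothetical example data; the 'sex' factor is invented for illustration
set.seed(1)
afp.sim <- data.frame(AFP.diff = rnorm(20),
                      drug = factor(rep(c(0, 5, 10, 15, 20), each = 4)),
                      sex  = factor(rep(c("F", "M"), times = 10)))

# Two-factor ANOVA with interaction: drug, sex and drug:sex terms
fit <- lm(AFP.diff ~ drug * sex, data = afp.sim)
anova(fit)
```

The `drug * sex` shorthand expands to the main effects plus their interaction, so the ANOVA table gains rows for sex and drug:sex.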

4.3 R Commander

Most statistical tests in R are called from the command line, but some R functions include helpful graphical interface elements. In fact, several groups of R users work on projects to create comprehensive GUIs for handling a variety of statistical problems with R. One popular GUI project is called R Commander (Figure 59). The R Commander GUI is an R package of GUI elements built with the Tcl/Tk toolkit. It provides point-and-click access to many simple R procedures, like importing data, computing simple t-tests and ANOVA, or plotting simple graphs. Install the Rcmdr package and load it with library(Rcmdr) to open the R Commander GUI.



Figure 59. The R Commander (Rcmdr) statistics GUI.



4.4 Sample Problems for Students

#7. Search the help menus to find the command(s) for a non-parametric statistical test analogous to the Student’s t-test (e.g. Mann-Whitney U-test, Wilcoxon rank sum test, ...). Repeat at least one of the Student’s t-test examples from section 4.1 with this non-parametric test.

#8. Add a second predictor variable to the formula parameter of the lm() procedure from the regression or
ANOVA example in section 4.2 to create a more complicated linear model.


Ch. 5. Writing Basic Scripts in R

5.1 Text Editors

Before you write your first R script, you should carefully choose a text editor to use while writing your code. You can write functions and scripts directly at the command line, but it is a bad idea because the console executes each new line of code as you enter it, which makes it very difficult to correct mistakes in your code. For the examples in this guide, you will only need the R editor from your R GUI or a simple text editor like MS Windows Notepad or Apple TextEdit.app in Mac OS X. However, advanced programmers should feel welcome to use their favorite text editor, like Emacs or Vim. Many programming text editors have special features to aid R programming, such as automated text coloring, indentation and bracket matching.

5.1.1 R GUI editors

The R Editor from the Windows R GUI is a very basic text editor that opens within the Windows R GUI
(Figure 60). You probably will not find many differences between editing your R code in this R editor and the
Windows Notepad editor. The R Editor has all the basic cut, copy and paste features of Windows Notepad, plus
find and replace text features. The one minor advantage of the R Editor is that you can submit code to R
directly from R Editor using the Run all and Run line or selection options from your edit menu.


Figure 60. The R Editor from the Windows R GUI.

The Editor from the Mac OS X R GUI is a slightly more advanced text editor than the R Editor for the Windows R GUI described above (Figure 61). The Mac OS X Editor features automated text coloring, indentation, bracket matching and text completion. You may find these features very helpful, or slightly bothersome, depending on your personal preferences. Additional features are available from the Edit menu of the R GUI; the Mac OS X Editor still includes options to undo and redo; cut, copy and paste; find and replace text; and source a document (i.e. Run All).



Figure 61. The Editor from the Mac OSX R GUI.

5.1.2 MS Notepad and Mac OSX TextEdit.app

You can also edit R scripts using general-purpose text editors, like MS Notepad or TextEdit.app in Mac OS X. These text editors will not have any advanced features like bracket matching or text completion. Also, you will not be able to submit your code to R with a point-and-click interface. However, using an independent editor like MS Notepad or TextEdit.app can prevent you from losing changes to your code in the event that R crashes during a lengthy computation. You could also use more advanced word processing software, like MS Word, to add coloring and formatting to your code while you are writing, then export the raw text to Notepad or R when you are ready to submit the code.

5.1.3 More advanced programming editors

Users with previous programming experience may prefer to use their favorite programming text editors,
like Emacs or Vim, to edit their R scripts. Many programming text editors will have special packages to help
debug R code. For example, the “Emacs Speaks Statistics” (ESS) package for the Emacs editor includes some
debugging tools for R code. The integrated development environment (IDE) Eclipse also includes packages to
debug R scripts. Even the basic indentation and bracket matching utilities of the Apple Script Editor.app and
XCode.app editors may help when writing R scripts.

5.2 Hello World!

5.2.1 My first R script

Writing scripts in R can be very easy or extremely complicated. I will begin by providing some sample code that you can manipulate and play with on your own. My first example is the familiar “Hello World!” example. Enter the following R code into one of the text editors described above and save the file with the name hello.txt or hello.R on your computer’s desktop or in an easy-to-find folder.


# This is my first R script

print("Hello World!")

The file you saved is now an R source file. Notice that the comment and the print() statement are not preceded by the > symbol, because the script text is saved in a text editor rather than entered at the command line. You can load this new script into R using the source() command in the R console, or possibly from your text editor (e.g. the Run all and Run line or selection options in the Windows R GUI editor).

> source("C:/Documents and Settings/.../Desktop/hello.R")
[1] "Hello World!"
>
5.2.2 Defining and calling functions

The example above demonstrates one of the simplest possible R scripts, because the script simply “calls” the function print() from the base R software. A second example will demonstrate the important distinction between “defining” and “calling” a function in an R script. Enter the following R code into your text editor and save the file with the name hello_base.txt or hello_base.R on your desktop or another location.

# This is my second R script

# Define the function "hello"

hello <- function(x){

if(is.numeric(x) == TRUE) print("Hello World!")
else print("ALL YOUR BASE ARE BELONG TO US!!!")}

# Call the function hello

hello(4.23)
This new script is still very simple, but it illustrates the two concepts of defining and calling a function in an R script. The first part of the source code defines a new function called hello(). The hello() function requires one input object and returns one of two different messages, depending on whether the input is numeric or non-numeric. The second part of the script calls the new function hello() for the numeric input 4.23. When we run the new source code, we immediately see the output from our called function. However, we can also call the new function hello() to evaluate a new user input at any point during our R session.

> source("C:/Documents and Settings/user/Desktop/hello_base.R")
[1] "Hello World!"
> hello("Fred Flintstone")
[1] "ALL YOUR BASE ARE BELONG TO US!!!"
>

5.3 Use Scripts to Automate and Save Workflows

In the previous chapters of this manual, every example presented R commands entered one at a time at the command line. For many objectives, working at the command line can be a quick and interactive way to compute statistical tests or create graphics. However, many analyses are complicated and require dozens of steps from start to finish. When an analysis workflow uses several commands for data import, data


manipulation, statistical tests and graphing, it is often a good idea to incorporate the procedures into a single R
script.

# Import a .csv text file data set

AE <- read.csv(file="~/AdverseEvents.csv",stringsAsFactors=TRUE)

# Generate a pie chart

counts <- summary(AE$Adverse.Event)

pie(x = counts,main = "Pie Chart of Adverse Events")

# Compute a two-sample t-test

x <- AE$Sodium.Intake[AE$Adverse.Event=="Headache"]
y <- AE$Sodium.Intake[AE$Adverse.Event=="Nausea"]

t <- t.test(x,y,var.equal=FALSE)

print(t)
In this very simple example, a script was created to import the adverse events data set, generate a pie chart of the adverse event factor variable and compute a two-sample t-test to compare the sodium intake of patients experiencing headache versus patients with nausea. The script might be saved with the file name AEworkflow.R in the current directory. When the script is sourced at the command line, the pie chart graphic will automatically appear and the output from t.test() will be printed at the console.

> source("~/AEworkflow.R")

Welch Two Sample t-test

data: x and y
t = 1.4444, df = 8.858, p-value = 0.1831
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-401.5423 1810.4329
sample estimates:
mean of x mean of y
2191.986 1487.540

5.4 Computation and Output Options

5.4.1 if() statements

The second “hello world” example in section 5.2 demonstrated the use of a simple if() statement to produce one of two different print() statements. That example may have been trivial, but if() statements can be very useful in programming new R scripts. An if() statement can be used to check the quality of incoming data or to provide multiple analysis options if several types of valid data are possible. Another if() statement could be used to determine if and when data should be saved or printed to the console.

# Define a function to process percentage (%) data

percent.fun <- function(x,char=FALSE){

# Use if() to verify inputs x and char


if(is.numeric(x) == FALSE)
stop("Input 'x' must be numeric")

if(is.logical(char) == FALSE)
stop("Input 'char' must be logical")

# Use if() to eliminate negative values

if(x < 0) x <- NA

# Process x data

x <- x*100

# Use if() to choose character or numeric output

if(is.na(x) == FALSE & char == TRUE) x <- paste(x,"%")

# Return result

x}
The goal of this script example is to convert positive decimals (e.g. 0.44) into percent measurements (e.g. 44%). The first two if() commands in this example determine whether the two function inputs, x and char, have the correct storage mode. If a variable has the incorrect storage mode, the stop() command halts the function and prints the quoted error message. The third if() statement eliminates any negative numbers, while the fourth and final if() command determines whether non-missing percents should be converted into text strings followed by % signs. Notice that the & operator indicates two conditions must both hold in the if() statement: the percent value x will only be converted to a text string if x is not missing (i.e. not NA) and the char parameter is TRUE. The paste() command connects two or more objects together in a text string; in this example the percent number is connected to a % sign.
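Putting the pieces together, here is a self-contained, runnable version of percent.fun() (written with the more idiomatic !is.numeric(x) in place of explicit comparisons to FALSE), together with a few example calls:

```r
percent.fun <- function(x, char = FALSE){
  if(!is.numeric(x)) stop("Input 'x' must be numeric")
  if(!is.logical(char)) stop("Input 'char' must be logical")
  if(x < 0) x <- NA                  # eliminate negative values
  x <- x * 100                       # convert decimal to percent
  if(!is.na(x) & char) x <- paste(x, "%")
  x
}

percent.fun(0.44)                # numeric result: 44
percent.fun(0.44, char = TRUE)   # text result: "44 %"
percent.fun(-0.10)               # negative input becomes NA
```

Note that paste() inserts a space by default, so the text result is "44 %" rather than "44%".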
This last scripting example and the two-color scatterplot example in section 3.2.3 both used the ampersand symbol & to evaluate two inequalities simultaneously. The & symbol is the “and” operator in R, indicating that two or more conditions must hold at the same time. The vertical bar, or “pipe”, symbol | is the “or” operator in R, indicating that at least one of two or more conditions must be met before continuing. For example, the script above could use an “or” operator to eliminate both negative values and values greater than 1. The script could also use an “and” operator to process only the x values between zero and 1.
The & and | operators evaluate two or more conditions before a result is applied. However, if() statements can also be nested to apply multiple conditions, when the if() statement is followed by {} brackets to identify the beginning and end of the conditional result. In fact, you can call multiple if() statements within the result of a previous if() statement. This can be a useful alternative to messy condition statements with multiple & and | operators.

# Use if() to eliminate negative values and process data

if(x < 0 | x > 1) x <- NA

if(x > 0 & x <= 1){

x <- x*100

if(char == TRUE) x <- paste(x,"%")

print(x)}
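One caveat worth noting: if() examines only a single TRUE/FALSE value, so the comparisons above assume x is a single number. To apply the same logic element by element across a whole vector, the vectorized ifelse() function is a convenient alternative; a minimal sketch:

```r
x <- c(0.44, -0.10, 0.92, 1.30)

# Replace out-of-range values with NA, element by element
x <- ifelse(x < 0 | x > 1, NA, x)

# Convert the surviving decimals to percents
pct <- x * 100
pct
```

Here the negative value and the value greater than 1 become NA, while the valid decimals are converted to percents.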



5.4.2 for() and while() loops

Sometimes a process needs to be repeated in a script. For example, you may need to compute the mean of a single row of data from a matrix, for each of twenty rows in the matrix. Or you may need to compute separate linear regressions of percent body fat versus sodium intake for five different regions. You may need to repeatedly apply a minimization algorithm to update an estimate in a statistical model until the estimates change by less than 1%. Or you may want to model the number of rolls of two six-sided dice it would take to add up to the number 100. Sometimes a process is repeated a fixed number of times; other times it repeats until some condition is met. The for() and while() commands in R can perform these kinds of repeated processes within a script.

# Define a function to compute row statistics with a for() loop

row.stats.loop <- function(x){

# Initialize vectors

row.means <- row.medians <- vector("numeric",length=nrow(x))

# Use a for() loop to compute means and medians for each row

for(i in 1:nrow(x)){

row.means[i] <- mean(x[i,])
row.medians[i] <- median(x[i,])}

# Create a list of output

output <- list()
output[["row means"]] <- row.means
output[["row medians"]] <- row.medians

# Call the output list to report final results

output}

# Define a function to count the number of dice rolls until 100

dice100 <- function(){

# Set rolls and tot equal to 0

tot <- rolls <- 0

# use a while() loop to count the number of rolls

while(tot < 100){

tot <- tot + sample(1:6,1) + sample(1:6,1)
rolls <- rolls + 1}

# Report the number of rolls needed to reach 100

rolls}

The for() loop command uses a variable (e.g. the letter i) that steps through a sequence of numbers (e.g. the sequence 1:nrow(x)). Note that the variable can be given any name, but the letter i is a popular choice, since the rows and columns of a matrix are often labeled i and j, respectively. Also notice that you can use


variables or functions (e.g. nrow(x)) to define the sequence of a for() loop. Often the loop variable (e.g. i) is used in the index of vectors, matrices, arrays or data frames within the loop. If the for() or while() loop must use more than one line of code, then {} brackets are used to open and close the loop.
The while() loop will evaluate each command within the loop, test the required condition within the parentheses, then repeat the commands within the loop until the condition is finally met. In the example, two variables are defined, rolls and tot, to count the number of dice rolls and the total result of all the rolls added together. At the first step of the while() loop, both the rolls and tot variables equal zero. Each die is simulated by drawing a random whole number from 1 to 6. Suppose the first die yields the number 4 and the second die yields the number 3. Both results are added to the tot variable, so tot equals 7. The script also adds the number 1 to the rolls variable, to indicate that one roll of the two dice has been used at this first iteration of the loop. The tot variable is equal to 7, which is less than 100, so the process repeats with a second pass through the loop commands.
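A short, self-contained sketch of both loop types, with the loop bodies trimmed to their essentials (sample(1:6, 1) is used here as one way to draw a fair die roll):

```r
# for() loop: mean of each row of a small example matrix
set.seed(42)
x <- matrix(rnorm(12), nrow = 4)
row.means <- vector("numeric", length = nrow(x))
for(i in 1:nrow(x)) row.means[i] <- mean(x[i, ])
stopifnot(all.equal(row.means, rowMeans(x)))   # agrees with the built-in

# while() loop: roll two dice until the running total reaches 100
tot <- rolls <- 0
while(tot < 100){
  tot <- tot + sample(1:6, 1) + sample(1:6, 1)
  rolls <- rolls + 1
}
rolls   # each pass adds 2-12, so between 9 and 50 rolls are needed
```

Setting the random seed with set.seed() makes a simulation like this reproducible from run to run.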

5.4.3 apply() functions

The for() loop is a very useful construct that can be used to modify the elements of a vector, matrix, array or data frame. Unfortunately, for() loop code can get lengthy, and computation with for() loops can be slow for large matrix, array and data frame objects (e.g. microarray data). The apply() function is an alternative to the for() loop that can simplify code and potentially speed up computation time.

# Define a function to compute row stats with apply()

row.stats.apply <- function(x){

# Initialize vectors

row.means <- row.medians <- vector("numeric",length=nrow(x))

# Use apply() to compute means and medians for each row

row.means <- apply(x,MARGIN=1,FUN=mean)
row.medians <- apply(x,MARGIN=1,FUN=median)

# Create a list of output

output <- list()
output[["row means"]] <- row.means
output[["row medians"]] <- row.medians

# Call the output list to report final results

output}

The apply() function has an X parameter to specify the matrix, array or data frame to be used in the calculations. The MARGIN parameter indicates whether computations should be applied over rows (1), columns (2) or both (c(1,2)), while the FUN parameter identifies the function that will be applied over the rows or columns. Any additional parameters of the function identified by FUN can be specified at the end of the apply() statement. For example, the mean() function includes a trim parameter for trimmed means, so apply(X=x,MARGIN=1,FUN=mean,trim=0.1) will compute 10% trimmed means over each row of the matrix x.


5.5 Sample Problems for Students

#9. Create a script to automate the creation and analysis of the AFP data set

#10. Create your own script to compute two new types of row statistic (e.g. standard deviation and
interquartile range) for a data frame or matrix. Be creative, add graphics or a statistical test (e.g. linear
regression).







Literature Cited

Becker, Richard A. and John M. Chambers. 1984. S: An Interactive Environment for Data Analysis and Graphics. Wadsworth & Brooks/Cole. Pacific Grove, CA. ISBN 053403313X

Becker, Richard A., John M. Chambers and Allan R. Wilks. 1988. The New S Language. Chapman & Hall. London. ISBN 053409192X.

Chambers, John M. 2004. Programming with Data: A Guide to the S Language. Springer. New York. ISBN 0387985034

Chambers, John M. 2008. Software for Data Analysis: Programming with R. Springer. ISBN 0387759352

Chambers, John M. and Trevor J. Hastie. 1992. Statistical Models in S. Chapman & Hall. London. ISBN 9780412830402

Hornik, Kurt. 2008. The R FAQ. http://CRAN.R-project.org/doc/FAQ/. ISBN 3-900051-08-9.

Ihaka, Ross and Robert Gentleman. 1996. R: A language for data analysis and graphics. Journal of
Computational and Graphical Statistics. 5(3): 299-314.

Sussman, Gerald and Guy Steele. 1975. SCHEME: An Interpreter for Extended Lambda Calculus. AI Memo 349. MIT Artificial Intelligence Laboratory. Cambridge, MA.

Tukey, John. 1977. Exploratory Data Analysis. Addison-Wesley. Reading, MA.