Type Name / Value | R Function to Check | Description |
---|---|---|
Numeric | is.numeric() | Numbers with decimals (e.g., 3.14 ) |
Integer | is.integer() | Whole numbers (e.g., 5L ) |
Character | is.character() | Text or string data (single or double quoted) |
Logical | is.logical() | Boolean values: TRUE or FALSE |
NA | is.na() | Missing value; used to represent absence of data |
NULL | is.null() | Empty object; represents “no object”/undefined |
NaN | is.nan() | “Not a Number”; result of undefined mathematical operations |
Inf / -Inf | is.infinite() | Positive or negative infinity (e.g., 1/0 or -1/0) |
Missing and undefined values:
NA is used for missing data and is type-aware (e.g., NA_integer_, NA_character_).
NULL is different from NA: it represents no value at all, often used in lists or empty objects.
NaN is a specific kind of missing value (i.e., is.na(NaN) is TRUE), whereas NA is not NaN (is.nan(NA) is FALSE).
Inf arises in operations like 1/0; it’s a valid numeric value.
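The distinctions above can be checked directly in the console. This is a minimal sketch using only base R; the variable name `m` is chosen for illustration.

```r
# Missing vs. undefined values
is.na(NA)             # TRUE
is.na(NaN)            # TRUE: NaN also counts as missing
is.nan(NA)            # FALSE: NA is not NaN
is.null(NULL)         # TRUE
length(NULL)          # 0: NULL holds no value at all
length(NA)            # 1: NA is a placeholder of length one
class(NA_integer_)    # "integer"
class(NA_character_)  # "character"
0 / 0                 # NaN
1 / 0                 # Inf
is.infinite(-1 / 0)   # TRUE

# Most summary functions return NA unless missing values are removed
mean(c(1, NA, 3))                    # NA
m <- mean(c(1, NA, 3), na.rm = TRUE)
m                                    # 2
```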
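When comparing or sorting character strings in R, the comparison is done lexicographically, character by character, using the collating sequence of the current locale. In the C locale this is the order of the Unicode code points (ASCII for common characters); other locales can order characters differently, for example by placing a lowercase letter before its uppercase counterpart. This ordering can sometimes lead to surprising results, especially when comparing letters, numbers, and accented characters.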
Here’s a simplified table showing the position of some common characters based on their Unicode code points:
Character | Type | Unicode Code | Explanation |
---|---|---|---|
0 | Digit | 48 | Digits come before letters |
1 | Digit | 49 | |
9 | Digit | 57 | |
A | Uppercase letter | 65 | Uppercase letters come before lowercase letters |
B | Uppercase letter | 66 | |
Z | Uppercase letter | 90 | |
a | Lowercase letter | 97 | Lowercase letters come after uppercase ones |
b | Lowercase letter | 98 | |
z | Lowercase letter | 122 | |
Example:
## [1] "1" "10" "9" "a" "A" "z" "Z"
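In code-point order, digits come before uppercase letters, which come before lowercase letters (digits < uppercase < lowercase). The example above places "a" before "A" because sort() follows the locale’s collation rather than raw code points.
You can inspect the Unicode code points of characters with utf8ToInt():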
## [1] 65
## [1] 97
## [1] 49 48
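As a minimal sketch of the difference between locale collation and code-point order, the snippet below sorts the same vector twice. The input vector is an assumption chosen for illustration; `method = "radix"` makes sort() use the C locale for character vectors.

```r
x <- c("a", "A", "z", "Z", "1", "10", "9")

# Default: order depends on the locale of your session
sort(x)

# Radix method: always the C locale, i.e. code-point order
by_code_point <- sort(x, method = "radix")
by_code_point     # "1" "10" "9" "A" "Z" "a" "z"

# Code points behind this order
utf8ToInt("A")    # 65
utf8ToInt("a")    # 97
utf8ToInt("10")   # 49 48 (one code point per character)
```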
Use sort(), utf8ToInt() and class() to explore behavior.
Structure | R Function to Check | Description |
---|---|---|
Vector | is.vector() | A one-dimensional array of elements of the same basic type |
Factor | is.factor() | A special type of vector used to represent categorical data |
Matrix | is.matrix() | A 2D structure with rows and columns, all elements must be of same type |
Array | is.array() | A multi-dimensional (n ≥ 1) generalization of a matrix with same type data |
vector:
# build vectors
x <- c(3,2,0,5)
y <- c("85",8,9)
a <- seq(1,6,1)
rep(y,3) # all elements were changed to character
## [1] "85" "8" "9" "85" "8" "9" "85" "8" "9"
## [1] 2 0
## [1] 1 4 9 16 25 36
## [1] 2 4 6 8 10 12
factor:
# limited number of different values, encoded in integer
data <- factor(c(3,2,0,5,3,2,0,5,3))
as.integer(data)
## [1] 3 2 1 4 3 2 1 4 3
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
# sum(data) # Brings an error: 'sum' not meaningful for factors
levels(data) <- c("bleu","jaune","rouge","gris")
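A common pitfall follows from this integer encoding: as.integer() returns the level codes, not the original numbers. A minimal sketch of how to recover the original values (the names `codes` and `values` are chosen for illustration):

```r
data <- factor(c(3, 2, 0, 5, 3, 2, 0, 5, 3))

levels(data)                              # "0" "2" "3" "5"
codes <- as.integer(data)                 # level codes: 3 2 1 4 3 2 1 4 3

# Go through character to get the original numbers back
values <- as.numeric(as.character(data))  # 3 2 0 5 3 2 0 5 3
sum(values)                               # 23

# Count the observations in each level
table(data)
```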
matrix and array:
#In a general manner, to access the data:
x2[indexes.dim1,indexes.dim2]
x3[indexes.dim1,indexes.dim2,indexes.dim3]
xN[indexes.dim1,indexes.dim2,indexes.dim3,...,indexes.DimN]
## [1] 4 3
## NULL
## [1] 8
## [1] 2 5 8 11
## [,1] [,2] [,3]
## [1,] 2 4 6
## [2,] 8 10 12
## [3,] 14 16 18
## [4,] 20 22 24
# the vector has only 3 elements, so it is recycled and added element by element, going down each column
x2 + c(3,2,5)
## [,1] [,2] [,3]
## [1,] 4 4 8
## [2,] 6 10 9
## [3,] 12 11 11
## [4,] 13 13 17
# x2[,1] <- c(3,2,5) # Error: ! number of items to replace is not a multiple of replacement length
# create a 3 dimensional array
x3 <- array(1:12,dim=c(2,2,3))
# Extract elements
x3[1,2,2]
## [1] 7
## [1] 3 7 11
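The matrix outputs above are consistent with a 4 x 3 matrix filled row by row. Under that assumption (the original definition of x2 is not shown), a minimal sketch of indexing and recycling:

```r
# Assumption: x2 holds 1..12, filled row by row
x2 <- matrix(1:12, nrow = 4, byrow = TRUE)

dim(x2)     # 4 3
x2[3, 2]    # single element: 8
x2[, 2]     # second column: 2 5 8 11
x2 * 2      # element-wise multiplication

# Recycling walks down the columns (column-major order):
# column 1 receives 3, 2, 5, 3; column 2 receives 2, 5, 3, 2; and so on
recycled <- x2 + c(3, 2, 5)
recycled[, 1]   # 4 6 12 13

# Arrays follow the same logic with one index per dimension
x3 <- array(1:12, dim = c(2, 2, 3))
x3[1, 2, ]      # 3 7 11
```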
Structure | R Function to Check | Description |
---|---|---|
Data Frame | is.data.frame() | Like a table; each column is a vector of the same length, but types can vary |
List | is.list() | A generic container that can hold elements of different types and sizes |
data.frame:
# collection of vectors and/or factors constrained by column
# create a data.frame
firstNames <- c("Remy", "Lol", "Pierre", "Domi", "Ben")
IMC <- data.frame(sex=c("H", "F", "H", "F", "H"),
height=c(1.83,1.76,1.82,1.60,1.90),
weight=c(67,58,66,48,75),
row.names=firstNames)
# check dimensions
dim(IMC)
## [1] 5 3
## [1] "sex" "height" "weight"
## [1] 1.83 1.76 1.82 1.60 1.90
## [1] "H" "F" "H" "F" "H"
## sex height weight
## Remy H 1.83 67
## [1] 1.82
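The calls that produced the outputs above are not shown; the sketch below lists common ways to access a data frame that would give results of this kind. The object `tall` is introduced for illustration only.

```r
firstNames <- c("Remy", "Lol", "Pierre", "Domi", "Ben")
IMC <- data.frame(sex = c("H", "F", "H", "F", "H"),
                  height = c(1.83, 1.76, 1.82, 1.60, 1.90),
                  weight = c(67, 58, 66, 48, 75),
                  row.names = firstNames)

names(IMC)               # column names
IMC$height               # one column, returned as a vector
IMC[["sex"]]             # same idea, using the column name as a string
IMC["Remy", ]            # one row, selected by row name
IMC["Pierre", "height"]  # one cell: 1.82

# Rows that satisfy a condition
tall <- IMC[IMC$height > 1.8, ]
rownames(tall)           # "Remy" "Pierre" "Ben"
```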
list:
# very flexible, store everything
# create a list
list.ex <- list(one_vec=1:12,
one_name="Boo",
one_tab=matrix(1:4,nrow=2))
# Extract elements
list.ex$one_tab
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
## [1] "a"
Choosing a variable name is important. It is recommended to make it explicit, short, and unique.
It’s best to maintain a consistent naming style throughout the script: variableName, variable_name, or variable.name (note that in some other languages, the dot is not allowed as it is used to call functions).
Rule 1: A variable name cannot start with a number.
Rule 2: It cannot contain special characters such as & " ' / \ @ $ () [] {}, any mathematical operators, or punctuation marks other than the dot and the underscore.
Rule 3: Variable names are case-sensitive: name ≠ Name.
These rules apply to both objects and functions.
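One way to test candidate names against these rules is base R’s make.names(), which rewrites a string into a syntactically valid name; a name that comes back unchanged is already valid. The candidate names below are made up for illustration.

```r
candidates <- c("height_0", "plant.height", "2nd_try", "my var", "a&b")

# make.names() repairs invalid names: "2nd_try" becomes "X2nd_try",
# and invalid characters are replaced by a dot
make.names(candidates)

# A name is valid if make.names() leaves it untouched
valid <- candidates == make.names(candidates)
valid            # TRUE TRUE FALSE FALSE FALSE

# Case sensitivity: these are two different objects
name <- 1
Name <- 2
name == Name     # FALSE
```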
When working with data, you’ll often hear about “long” and “wide” data formats. These are two common ways to organize tabular data, and understanding the difference is important for analysis and visualization.
Example:
indiv | species | height_0 | height_10 | height_20 |
---|---|---|---|---|
A | class1 | 15 | 20 | 23 |
B | class1 | 10 | 15 | 24 |
Here, each time point (height_0, height_10, height_20) has its own column.
This format is often easier to read for humans and is common in spreadsheets.
In long format, each row is one measurement, and repeated variables (like time) are stored in a single column, with another column to indicate the context (e.g., time).
Same data in long format:
indiv | species | time | height |
---|---|---|---|
A | class1 | height_0 | 15 |
A | class1 | height_10 | 20 |
A | class1 | height_20 | 23 |
B | class1 | height_0 | 10 |
B | class1 | height_10 | 15 |
B | class1 | height_20 | 24 |
This format is especially useful for plotting and for many R functions that work better with tidy, long-form data.
Format | Description | Use Case |
---|---|---|
Wide | One row per observation, variables in columns | Easy to read, spreadsheet-style |
Long | One row per measurement, with key-value columns | Ideal for plots, grouped analysis |
You can use the pivot_longer() and pivot_wider() functions (from the tidyr package) to switch between formats.
pivot_longer() converts multiple columns into key-value pairs: the column names are stored in one new column, and their corresponding values are stacked in another. pivot_wider() spreads key-value pairs across multiple columns. In other words, pivot_longer() reshapes wide format to long format and pivot_wider() does the inverse.
Basic syntax
# load the library
library(tidyr)
# Convert from wide to long format
long_tab <- wide_tab %>%
pivot_longer(
cols, # The columns to gather into key-value pairs (e.g., height_0, height_10)
names_to, # The name of the new column that will store the former column names (e.g., "time")
values_to # The name of the new column that will store the values (e.g., "height")
)
# Convert from long back to wide format
wide_tab <- long_tab %>%
pivot_wider(
names_from, # The column whose values will become new column names (e.g., "time")
values_from # The column whose values will fill the new wide-format table (e.g., "height")
)
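As a minimal runnable sketch, the two-row table from the example above can be reshaped and restored as follows. It assumes tidyr is installed; the object name `wide_again` is chosen for illustration.

```r
library(tidyr)

wide_tab <- data.frame(
  indiv     = c("A", "B"),
  species   = c("class1", "class1"),
  height_0  = c(15, 10),
  height_10 = c(20, 15),
  height_20 = c(23, 24)
)

# Wide to long: one row per individual and time point
long_tab <- pivot_longer(
  wide_tab,
  cols      = c(height_0, height_10, height_20),
  names_to  = "time",
  values_to = "height"
)
long_tab

# Long back to wide: one column per time point
wide_again <- pivot_wider(long_tab, names_from = time, values_from = height)
wide_again
```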
In R, packages are collections of functions, data sets, and documentation that extend the functionality of base R. They are like add-ons or plugins that provide additional tools for performing specific tasks (data analysis, visualization, machine learning…).
Think of them as libraries or modules that you can install to get extra tools for your work. Some packages come pre-installed with R, while others can be installed from repositories.
In the context of R, a repository is a centralized location where software, code, data, or packages are stored, managed, and distributed.
A repository is essentially a storage space (either local or online) where software packages, source code, or project files are organized, tracked, and shared. It helps developers and users access the software they need, often with version control and documentation included. Several repositories exist.
CRAN is the primary repository for R packages. It is a collection of thousands of R packages that are contributed by developers around the world. Packages stored on CRAN go through a review process to ensure quality and compatibility. CRAN allows users to install packages directly in R either via the “Packages” panel or directly using a command line.
After installing a package, you need to activate it.
Installed packages are preserved each time you close R/RStudio, but they are not active when you start R/RStudio again. You need to activate a package either by checking the box next to its name (see image above) or by using a simple command line.
Note that for the library function, you can enter the package name without quotation marks.
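A minimal sketch of the install-then-activate cycle, using tidyr as an example package. The requireNamespace() check and the explicit `repos` mirror are additions for illustration; a plain install.packages("tidyr") is enough in an interactive session.

```r
# Install from CRAN once; the check skips the download if already installed
if (!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr", repos = "https://cloud.r-project.org")
}

# Activate the package in every new R session
library(tidyr)

# The package now appears on the search path
"package:tidyr" %in% search()
```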
Bioconductor is a repository specifically for bioinformatics and computational biology packages. It focuses on tools for analyzing omics data.
To fetch and use packages from this repository, you need to download… a package! This package, named “BiocManager”, installs normally because it’s available on CRAN.
In short, you need to download a package available on CRAN to access other packages that are hosted in a repository other than CRAN.
In some cases, if you want to perform certain types of analysis and you can’t find the package you need on CRAN, there’s a good chance you’ll find it in Bioconductor (if it involves biology-related methods or analyses).
After installing the “BiocManager” package, you can download the packages available in the repository.
Don’t forget to activate the installed package(s) with library().
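The steps above can be sketched as follows, taking GenomicRanges as an example Bioconductor package. The requireNamespace() checks and the `repos` mirror are assumptions added so that nothing is downloaded twice; an internet connection is needed on the first run.

```r
# Step 1: BiocManager itself comes from CRAN
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager", repos = "https://cloud.r-project.org")
}

# Step 2: Bioconductor packages are installed through BiocManager
if (!requireNamespace("GenomicRanges", quietly = TRUE)) {
  BiocManager::install("GenomicRanges")
}

# Step 3: activate the package as usual
library(GenomicRanges)
```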
Please check the Bioconductor website for more information about installation and the list of available packages.
Link : Bioconductor
GitHub is a platform for hosting code repositories, and it is widely used by R developers to share and collaborate on R packages and projects.
Developers can use GitHub to publish the latest versions of their R packages, even before they are available on CRAN. It also allows for collaboration, version control, and contribution from the open-source community.
Like Bioconductor, you’ll need to install a package to download packages from GitHub.
Once devtools is installed, you can download packages from the GitHub community.
Note that here we don’t use install.packages() but install_github() to fetch the package we’re interested in.
Don’t forget to activate the installed package(s) with library().
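A minimal sketch of a GitHub installation. The repository "tidyverse/tidyr" is used purely as an example of the "owner/repository" form; replace it with the package you need. An internet connection is required.

```r
# devtools comes from CRAN
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools", repos = "https://cloud.r-project.org")
}

# install_github() expects "owner/repository"
devtools::install_github("tidyverse/tidyr")

# Activate the package as usual
library(tidyr)
```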
saveRDS() and readRDS() are functions in R used to save and load single R objects. They are often used when you want to store an object (like a data frame, list, or function) to a file and retrieve it later, even under a different name.
saveRDS() writes a single R object to a file.
readRDS() reads that object back into R.
They use the .rds file format and they work with one unnamed object at a time.
Basic Syntax
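A minimal sketch of a save-and-restore round trip. The data frame and the temporary file path are made up for illustration; in practice you would give your own file name, such as "plants.rds".

```r
plants <- data.frame(indiv = c("A", "B"), height = c(15, 10))

# Hypothetical location: a temporary file that R removes at the end of the session
rds_file <- tempfile(fileext = ".rds")

# Write one object to disk
saveRDS(object = plants, file = rds_file)

# Read it back under any name you like
restored <- readRDS(file = rds_file)
identical(plants, restored)   # TRUE
```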
Key Arguments:
- object: The R object you want to save.
- file: File path or connection to write the object.
- compress: Whether to compress the file (TRUE, “gzip”, “bzip2”, etc.).
- ascii: Save in ASCII format (mainly for readability or portability).
- refhook: Advanced; handles reference objects.
Notes
saveRDS() is ideal when you want to save one object at a time.
Use readRDS() when you want to load the object into any variable name.
.rds files are specific to R and are not designed for sharing with other languages; use a plain-text format such as .csv if interoperability is needed.
Other functions are commonly used: save() and load()
save(object, tab, otherthing, file = "file.RData")
load("file.RData")
ls()
## [1] "object" "tab" "otherthing" # all saved objects are loaded with the names they were given
Function | Saves Multiple Objects | Retains Object Names | File Type |
---|---|---|---|
save() | Yes | Yes | .RData |
saveRDS() | No | No | .rds |
?: Search R documentation for a specific term. (You can also do this with the help() function.)
To find out more about a term in R’s documentation, simply type a single question mark followed by the term.
You can use your favorite web browser. Start your query with “R” and write it in English. You will often find forums (Stack Overflow, Biostars), blogs (Data Geek, bioinfo-fr) and other websites (GeeksForGeeks, RSeek - dedicated to R) that are full of already answered questions. Keywords are important and experience will help you to optimize your search.
You can use your favorite Large Language Model (LLM) to help with programming. These tools are very good at basic coding tasks like writing simple functions, explaining code, or helping you debug.
CAUTION: LLMs are not perfect. The way you ask your question can strongly influence the answer you get. Also, LLMs are designed to always give you a response — even when they’re unsure — so they won’t say “I don’t know” or “That doesn’t make sense.”
That means it’s your job to read the answer critically and test the code yourself. For complex or very specific questions, the LLM might make mistakes or give you code that looks correct but doesn’t actually work.
Think of LLMs as helpful assistants - not experts. They’re great for learning, but you should always verify their answers and ask for help when something doesn’t seem right.
In R, many functions are implemented in one package and then reused or re-exported in other packages. This allows developers to build on existing tools without rewriting the same code. However, it also means that:
Example: resize() in IRanges vs. GenomicRanges packages
The function resize() is defined in the IRanges package, but it is also used in GenomicRanges. Both packages make use of it, but the help pages differ in level of detail or examples.
To check where the function is coming from:
To access documentation from a specific package:
Why This Matters
Tip: You can also use getAnywhere(resize) to explore all versions of a function if it’s defined in multiple places.
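The same checks can be tried without any Bioconductor package. As a sketch, the snippet below uses sd() from the stats package as a stand-in for resize(); the pattern is identical once IRanges or GenomicRanges is loaded.

```r
# Which package does the function come from?
origin <- environmentName(environment(sd))
origin                      # "stats"

# Where is it found on the search path?
find("sd")                  # "package:stats"

# All places where an object with this name is defined
where_sd <- getAnywhere("sd")$where
where_sd

# Documentation from a specific package (opens the help viewer):
# ?stats::sd
# help("sd", package = "stats")
```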
Some function names in R are very common, like select() or merge(). These names are used in multiple packages, and sometimes they have different behavior depending on which package they come from.
Example: merge()
- merge() exists in base R, and is used to combine data frames by common columns or row names.
- merge() also exists in data.table, with enhanced performance and slightly different syntax.
- Other packages may also define their own merge() methods (especially for S4 or S3 classes).
When you load data.table, its version of merge() overrides the one from base (or masks similar names from other packages).
You can check which version is currently active with:
And you can explicitly call the version you want like this:
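A minimal sketch of both steps using only base R. The two small data frames are made up for illustration.

```r
df1 <- data.frame(id = c(1, 2, 3), height = c(15, 10, 12))
df2 <- data.frame(id = c(2, 3, 4), weight = c(58, 66, 48))

# Which merge() would be called right now?
environmentName(environment(merge))   # "base" unless another package masks it

# Call one version explicitly with the package::function syntax
merged <- base::merge(df1, df2, by = "id")
merged                                # rows for id 2 and id 3
```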
Summary
Function Name | Common Packages | How to Disambiguate |
---|---|---|
select() | dplyr, MASS, others | Use dplyr::select() explicitly |
merge() | base, data.table | Use base::merge() or data.table::merge() |
filter() | dplyr, stats | Use dplyr::filter() or stats::filter() |
Tip: Use conflicts() to see which functions are masked (overridden) when you load packages:
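A short sketch of the call; the result depends on which packages are loaded in your session, so no output is shown here.

```r
# Names that exist in more than one place on the search path
masked_names <- conflicts()
head(masked_names)

# Same information, grouped by the package that provides each name
masked_by_pkg <- conflicts(detail = TRUE)
str(masked_by_pkg)
```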
This will help you understand why a function might behave differently than expected.
Functions are essential components of the R programming language. They allow you to encapsulate code into reusable blocks, making your scripts more modular, readable, and easier to maintain.
Defining Functions:
A function in R has three main parts: the function name, arguments, and the function body.
Function name: A label that identifies the function and is used to call it (see the naming rules).
Arguments: The inputs passed to the function, defined within parentheses.
Function body: The block of code that executes when the function is called, enclosed in curly braces {}.
Basic Syntax:
Example:
Here’s a simple example of how to define and use a function in R that adds two numbers:
# Define the function
add_numbers <- function(num1, num2) {
result <- num1 + num2
return(result)
}
# Call the function
sum_result <- add_numbers(num1 = 5, num2 = 10)
# Display the result
sum_result
## [1] 15
Your turn: Define a function that returns the average of two values and displays a text such as: “The result of this treatment is 49”.
Tips: Look at the help of the function paste() and its arguments. To display a text from within a function, you can use the function print() or message().
# Define the function
av_numbers <- function(val1, val2){
  res <- (val1 + val2)/2
  print(paste("The result of this treatment is",res))
  message("The result of this treatment is ",res)
  return(res)
}
# Call the function
mean_result <- av_numbers(val1 = 10, val2 = 20)
## [1] "The result of this treatment is 15"
## The result of this treatment is 15
## [1] 15
To go further you can check this blog.
In data manipulation, comparisons help you filter and select specific data based on conditions. Some common comparison operators include: == (equal to), != (not equal to), > (greater than), < (less than), >= (greater than or equal to), and <= (less than or equal to). Additionally, %in% is useful for checking if a value belongs to a set of values. These comparisons allow you to create logical statements that you can use to filter the elements that meet your criteria.
Here is a table summarizing the most common operators:
Operator | Description | Example |
---|---|---|
== | Equal to | x == "yes", x == 6 |
!= | Not equal to | x != "no", x != 5 |
>, < | Greater than, Less than | x > 5, x < 5 |
>=, <= | Greater/Less than or equal to | x >= 5, x <= 5 |
%in% | Checks if value is in a set | x %in% c("A", "C") |
& | Logical AND | x < 6 & x > 3 |
\| | Logical OR | x > 6 \| x < 3 |
! | Logical NOT | !x %in% c("A", "C") |
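These operators return logical vectors, which can be placed inside square brackets to keep only the matching elements. A minimal sketch with made-up vectors `x` and `grp`:

```r
x <- c(2, 4, 5, 7, 9)

x > 3                      # FALSE TRUE TRUE TRUE TRUE
middle <- x[x > 3 & x < 6]
middle                     # 4 5
outer <- x[x > 6 | x < 3]
outer                      # 2 7 9

grp <- c("A", "B", "C", "A")
in_set <- grp %in% c("A", "C")
in_set                     # TRUE FALSE TRUE TRUE
# %in% is evaluated before !, so this negates the whole test
not_in_set <- !grp %in% c("A", "C")
not_in_set                 # FALSE TRUE FALSE FALSE
grp[not_in_set]            # "B"
```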
The apply family
In this section, we will learn how to use the apply family of functions in R. These functions help you perform repetitive operations (e.g., sum(), mean()) on rows, columns, or elements of data structures like vectors, matrices, or data frames.
We will start with simple built-in functions like colSums() and gradually explore more flexible tools like apply, lapply, sapply, and tapply.
Finally, we’ll compare these base R tools with functions from the tidyverse, such as group_by() and summarise().
First, we’ll create an example data set named plant_height.
This data set describes the heights of ten individuals in centimeters at three different time points (0, 10, and 20 days). The first column contains the IDs for each individual, the second its species, and each successive column describes their heights at time points 0, 10, and 20 in that order.
plant_height <- data.frame(indiv = LETTERS[1:10],
species = rep(c("class1","class2"),each=5),
height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21))
Imagine you want to know the average height at each time point. How would you do it?
mean()
average_0 <- mean(plant_height$height_0)
average_10 <- mean(plant_height$height_10)
average_20 <- mean(plant_height$height_20)
This is fine with three columns, but imagine how tedious it would be with 100 features.
The functions colMeans(), rowMeans(), colSums() and rowSums() compute the mean or sum of values for all rows or columns at once.
## height_0 height_10 height_20
## 12.5 18.8 21.8
Now your turn, compute the average height of each individual.
## [1] 19.33333 16.33333 14.66667 13.66667 20.66667 19.33333 15.66667 18.33333
## [9] 21.00000 18.00000
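The calls behind the two outputs above are not shown; a sketch that reproduces them, keeping only the numeric columns (the names `heights`, `col_avg` and `row_avg` are chosen for illustration):

```r
plant_height <- data.frame(indiv = LETTERS[1:10],
                           species = rep(c("class1", "class2"), each = 5),
                           height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
                           height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
                           height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21))

# Keep the numeric columns only (columns 3 to 5)
heights <- plant_height[, 3:5]

# Average height at each time point (one value per column)
col_avg <- colMeans(heights)
col_avg    # 12.5 18.8 21.8

# Average height of each individual (one value per row)
row_avg <- rowMeans(heights)
row_avg
```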
These functions are fast and convenient, but they work only in specific situations. What if we want to apply a different function, or calculate values by group?
apply() function
The apply() command applies any function across the rows or columns of an array, matrix or data frame.
Basic Syntax:
apply(X, # Array, matrix or data frame
MARGIN, # 1: rows, 2: columns, c(1, 2): rows and columns
FUN, # Function to be applied
...) # Additional arguments to FUN
Example:
Compute the maximum height reached at each time point.
# We need to select for numeric columns
max_height_tp <- apply(X=plant_height[,-c(1,2)], MARGIN=2, FUN=max)
# Display the result
max_height_tp
## height_0 height_10 height_20
## 17 24 26
Your turn: How much have the plants grown between day 0 and day 20?
Tips: You may need to build a function, see section How to build a function?.
# build the function
height_amplitude <- function(x){
growth <- max(x) - min(x)
return(growth)
}
# We need to select for numeric columns
growth_height <- apply(X=plant_height[,-c(1,2)], MARGIN=1, FUN=height_amplitude)
# Display the result
growth_height
## [1] 8 14 6 8 9 10 9 12 9 8
The argument x takes each row (MARGIN=1) of plant_height[,-c(1,2)] and computes the difference between the maximum and minimum values.
Important: What if the function needs additional arguments?
What if the function you want to apply requires more than one argument?
For example, the mean() function has multiple arguments:
- x: the data to calculate the mean (this is what apply() will pass)
- trim: to remove a fraction of extreme values
- na.rm: to ignore NA values when computing the mean
You can check the full list using ?mean.
Let’s modify our data to include some NA values:
# Add NA values to simulate missing data
plant_height <- data.frame(
indiv = LETTERS[1:10],
species = rep(c("class1", "class2"), each = 5),
height_0 = c(15, NA, 12, 9, 17, NA, 10, 11, 15, 13),
height_10 = c(20, 15, 14, NA, 19, 22, 18, 21, 24, 20),
height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)
Now, if we want to calculate the mean for each time point while ignoring the NAs, we need to pass the na.rm = TRUE argument to the mean() function. Here’s how to do that with apply():
# Apply mean to each column, skipping NA values
mean_height_tp <- apply(
X = plant_height[, -c(1, 2)], # select only numeric columns
MARGIN = 2, # apply function to each column
FUN = mean, # function to apply
na.rm = TRUE # extra argument passed to mean()
)
The first argument of the function you’re applying (e.g., x in mean(x, ...)) receives the content of X in apply(). Any additional arguments should be named and provided after the FUN parameter.
This rule applies to all functions in the apply family (apply(), lapply(), sapply(), tapply(), etc.).
If you’re using your own custom function, make sure its first argument matches what apply() or similar functions will provide. You can always add other named arguments afterward.
lapply() function
lapply() is a function in R used to apply another function to each element of a list. It returns a list of the same length, with the results of applying the function to each element.
Basic Syntax:
Example:
Calculate the means of list elements. Here we use runif() to produce 10 random values that follow a uniform distribution within a defined range (min and max arguments) and sample() to randomly pick 10 values from a vector (1:10).
# Create a reproducible random list
set.seed(123) # set the seed from where to start the random operation
plants <- list(
height = runif(10, min = 10, max = 20),
mass = runif(10, min = 5, max = 10),
flowers = sample(1:10, 10)
)
# Display the list
plants
## $height
## [1] 12.87578 17.88305 14.08977 18.83017 19.40467 10.45556 15.28105 18.92419
## [9] 15.51435 14.56615
##
## $mass
## [1] 9.784167 7.266671 8.387853 7.863167 5.514623 9.499125 6.230439 5.210298
## [9] 6.639604 9.772518
##
## $flowers
## [1] 9 10 1 5 3 2 6 7 8 4
## $height
## [1] 15.78248
##
## $mass
## [1] 7.616846
##
## $flowers
## [1] 5.5
As you can see, lapply() returns a list where each value is the mean of a corresponding element in the original list.
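The call that produces a list of means like the one above is not shown; a minimal sketch (the name `plant_means` is chosen for illustration):

```r
set.seed(123)
plants <- list(
  height  = runif(10, min = 10, max = 20),
  mass    = runif(10, min = 5, max = 10),
  flowers = sample(1:10, 10)
)

# Apply mean() to each element; the result is a list with the same names
plant_means <- lapply(X = plants, FUN = mean)
plant_means

class(plant_means)    # "list"
plant_means$flowers   # 5.5 (the mean of a permutation of 1:10)
```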
Your turn: What is the variable type (the class) of the elements of the list?
Create a list called my_list containing:
A vector containing integer values from 1 to 10
A vector containing the letters A, B and C
A vector containing the boolean values TRUE, FALSE, TRUE, FALSE
# Create the list with proposed elements
my_list <- list(
numbers = 1:10, # integer vector
letters = c("A", "B", "C"), # character vector
flags = c(TRUE, FALSE, TRUE, FALSE) # logical vector
)
# Use lapply to find the class of each element
element_classes <- lapply(X=my_list, FUN=class)
# Display the result
element_classes
## $numbers
## [1] "integer"
##
## $letters
## [1] "character"
##
## $flags
## [1] "logical"
sapply() function
sapply() is a simplified version of lapply(). Like lapply(), it applies a function to each element of a list (or list-like object). But instead of always returning a list, sapply() tries to return the simplest possible result: a vector, matrix, or list, depending on what makes the most sense.
Basic Syntax:
Example:
Calculate the means of list elements. Here we use runif() to produce 10 random values that follow a uniform distribution within a defined range (min and max arguments) and sample() to randomly pick 10 values from a vector (1:10). If you have already built the list in the section above, running the code again returns the same object: set.seed() makes the generation of random elements reproducible.
# Create a reproducible random list
set.seed(123) # set the seed from where to start the random operation
plants <- list(
height = runif(10, min = 10, max = 20),
mass = runif(10, min = 5, max = 10),
flowers = sample(1:10, 10)
)
# Use sapply to calculate the mean of each list element
sapply(plants, mean)
## height mass flowers
## 15.782475 7.616846 5.500000
Compared to lapply(), this is easier to work with because it returns a named numeric vector instead of a list.
Note: It works with data.frame!
sapply() can also be used on data frames, since data frames are essentially lists of columns.
## indiv species height_0 height_10 height_20
## "character" "character" "numeric" "numeric" "numeric"
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## indiv species height_0 height_10 height_20
## NA NA NA NA 21.8
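You can observe a warning message; it is not an error, so the code still runs. It tells you that for 2 elements (indiv and species), the mean cannot be computed because the elements are neither numeric nor logical (logical values are accepted because TRUE and FALSE are encoded as 1 and 0 respectively). The height_0 and height_10 columns return NA for a different reason: they contain missing values and na.rm = TRUE was not passed to mean().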
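The outputs above are consistent with calls such as sapply(plant_height, class) and sapply(plant_height, mean) on the data containing NA values. A sketch that avoids both the warnings and the NA results by keeping only numeric columns and ignoring missing values (the names `num_cols` and `col_means` are chosen for illustration):

```r
plant_height <- data.frame(
  indiv = LETTERS[1:10],
  species = rep(c("class1", "class2"), each = 5),
  height_0 = c(15, NA, 12, 9, 17, NA, 10, 11, 15, 13),
  height_10 = c(20, 15, 14, NA, 19, 22, 18, 21, 24, 20),
  height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)

# Class of each column
sapply(plant_height, class)

# Logical vector flagging the numeric columns
num_cols <- sapply(plant_height, is.numeric)

# Mean of the numeric columns only, ignoring missing values
col_means <- sapply(plant_height[num_cols], mean, na.rm = TRUE)
col_means
```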
Your turn: How many plants reach at least 15 cm at each time point ?
Tips: You may need to build a function, see section How to build a function?.
# build the function
at_least_15 <- function(x){
res <- sum(x>=15)
return(res)
}
# We need to select for numeric columns
nb_15 <- sapply(X=plant_height, FUN=at_least_15)
# Display the result
nb_15
## indiv species height_0 height_10 height_20
## 10 10 NA NA 10
In R, when you compare character values (like "A" or "class1") to a number using >=, R doesn’t give an error. Instead, it quietly converts the number into a character and compares the two as text, not as numbers. For example, "A" >= 15 is actually treated as "A" >= "15", which compares the two strings alphabetically. Since "A" comes after "1" in alphabetical order, the result is TRUE. So when you apply a function like sum(x >= 15) to a character column, R may return unexpected values, like 10 if all comparisons return TRUE. That’s why it’s important to check that you’re only applying numeric operations to numeric columns (more details).
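A sketch of the pitfall and of one possible fix: restrict sapply() to the numeric columns and ignore missing values with na.rm = TRUE. It uses the version of the data that contains NA values.

```r
# Text comparison in action
"A" >= 15   # TRUE: 15 becomes "15" and the comparison is done on text
"2" > 10    # TRUE as well: "2" is compared with "10" character by character

plant_height <- data.frame(
  indiv = LETTERS[1:10],
  species = rep(c("class1", "class2"), each = 5),
  height_0 = c(15, NA, 12, 9, 17, NA, 10, 11, 15, 13),
  height_10 = c(20, 15, 14, NA, 19, 22, 18, 21, 24, 20),
  height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)

at_least_15 <- function(x) {
  sum(x >= 15, na.rm = TRUE)   # na.rm = TRUE skips missing heights
}

# Only the numeric columns (3 to 5) are passed to the function
nb_15 <- sapply(X = plant_height[, 3:5], FUN = at_least_15)
nb_15   # height_0: 3, height_10: 8, height_20: 10
```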
tapply() function
tapply() is a function used to apply a function to subsets of a vector, based on one or more grouping variables. It’s especially useful when you want to compute summary statistics (like the mean, sum, or count) within groups, such as calculating the average height per time point or species.
Basic Syntax:
tapply(X, # the numeric vector you want to analyze
INDEX, # a factor or vector that defines the groups
FUN # the function to apply to each group
)
Example:
Compute the average height of plants at day 0 by species.
# The data frame
plant_height <- data.frame(
indiv = LETTERS[1:10],
species = rep(c("class1", "class2"), each = 5),
height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)
# Average height at t = 0 by species
tapply(X = plant_height$height_0,
INDEX = plant_height$species,
FUN = mean)
## class1 class2
## 12.6 12.4
Note:
The tapply() function in R is designed to apply a function to a single vector, grouped by one or more categorical variables (INDEX). It only works with one variable at a time.
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## class1 class2
## NA NA
Here, plant_height[, 3:5] is a data frame, not a vector. tapply() expects X to be a single vector with the same length as the grouping variable.
How could you do it?
Among the other apply family functions, we have seen sapply(), which applies a function to each column of a data frame.
sapply() and tapply()
You can combine sapply() and tapply() to apply a function (like mean()) across multiple numeric columns, grouped by a categorical variable.
# build a function
myFun <- function(col,myData) {
tapply(col, myData$species, mean)
}
sapply(plant_height,myFun, myData=plant_height)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## indiv species height_0 height_10 height_20
## class1 NA NA 12.6 16.6 21.6
## class2 NA NA 12.4 21.0 22.0
In our case, we built a function that takes two arguments, col and myData (More details). The result is a matrix that keeps the columns of the original data and has one row per group of the species column.
Let’s use pivot_longer() to reshape our plant height data into long format, where each row represents one observation (one individual at one time point), and then use tapply() to calculate the average height at each time step.
# load library
library(tidyr)
# from wide to long format
plant_height_long <- plant_height %>% pivot_longer(where(is.numeric),
names_to="time",
values_to="height")
Now you can easily use tapply() with the newly generated columns to compute the average height as a function of time.
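A sketch of the corresponding call, assuming tidyr is installed. The columns are listed explicitly in pivot_longer() here, and the name `mean_by_time` is chosen for illustration.

```r
library(tidyr)

plant_height <- data.frame(
  indiv = LETTERS[1:10],
  species = rep(c("class1", "class2"), each = 5),
  height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
  height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
  height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)

# Wide to long: 10 individuals x 3 time points = 30 rows
plant_height_long <- pivot_longer(plant_height,
                                  cols = c(height_0, height_10, height_20),
                                  names_to = "time",
                                  values_to = "height")

# One vector (height), one grouping variable (time)
mean_by_time <- tapply(X = plant_height_long$height,
                       INDEX = plant_height_long$time,
                       FUN = mean)
mean_by_time
```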
## height_0 height_10 height_20
## 12.5 18.8 21.8
group_by() and summarise() from dplyr package
group_by() is used to group data based on one or more variables (columns). This function is often used in conjunction with other tidyverse functions.
One function that works perfectly with group_by() is summarise().
We can, for example, compute the mean height of all plants. Indeed, summarise() works with basic summary functions like mean(), median(), max()…
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# The data frame
plant_height <- data.frame(
indiv = LETTERS[1:10],
species = rep(c("class1", "class2"), each = 5),
height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)
# from wide to long format
plant_height_long <- plant_height %>% pivot_longer(where(is.numeric),
names_to="time",
values_to="height")
# summarize plant height
plant_height_long %>%
summarise(mean_height = mean(height))
## # A tibble: 1 × 1
## mean_height
## <dbl>
## 1 17.7
But combined with group_by(), we can be more precise and obtain the average height by time.
To carry out this operation, we chain two functions, using the %>% symbol each time. The group_by() function first groups the data by the variable we’re interested in, in this case time. There are only three possible values for this variable: height_0, height_10 or height_20.
Secondly, the summarise() function takes the grouping by time into account and calculates the average for each value of the time variable.
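A sketch of the chained call described above, assuming dplyr and tidyr are installed. The name `mean_by_time` is chosen for illustration.

```r
library(dplyr)
library(tidyr)

plant_height <- data.frame(
  indiv = LETTERS[1:10],
  species = rep(c("class1", "class2"), each = 5),
  height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
  height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
  height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)

plant_height_long <- plant_height %>%
  pivot_longer(c(height_0, height_10, height_20),
               names_to = "time",
               values_to = "height")

# Group by time point, then compute one mean per group
mean_by_time <- plant_height_long %>%
  group_by(time) %>%
  summarise(mean_height = mean(height))
mean_by_time
```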
## # A tibble: 3 × 2
## time mean_height
## <chr> <dbl>
## 1 height_0 12.5
## 2 height_10 18.8
## 3 height_20 21.8