1 Variable types

1.1 Primitive data types

Type name / value   R function to check   Description
Numeric             is.numeric()          Numbers with decimals (e.g., 3.14); is.numeric() is also TRUE for integers
Integer             is.integer()          Whole numbers (e.g., 5L)
Character           is.character()        Text or string data (single or double quoted)
Logical             is.logical()          Boolean values: TRUE or FALSE
NA                  is.na()               Missing value; used to represent absence of data
NULL                is.null()             Empty object; represents “no object”/undefined
NaN                 is.nan()              “Not a Number”; result of undefined mathematical operations (e.g., 0/0)
Inf / -Inf          is.infinite()         Positive or negative infinity (e.g., 1/0 or -1/0)

Missing and undefined values:

NA is used for missing data and is type-aware (e.g., NA_integer_, NA_character_).

NULL is different from NA: it represents no value at all, often used in lists or empty objects.

NaN is treated as a missing value by is.na() (i.e., is.na(NaN) is TRUE), but the reverse does not hold: is.nan(NA) is FALSE.

Inf arises in operations like 1/0; it’s a valid numeric value.
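The distinctions above are easy to check in the console. The following is a minimal sketch; the variable names (na_types, with_null, with_na) are only illustrative.

```r
# NA has typed variants; typeof() shows the underlying type
na_types <- c(typeof(NA), typeof(NA_integer_), typeof(NA_character_))
na_types
## [1] "logical"   "integer"   "character"

# NaN counts as missing for is.na(), but NA is not NaN
is.na(NaN)   # TRUE
is.nan(NA)   # FALSE

# NULL has length zero and vanishes inside c(); NA keeps its position
with_null <- c(1, NULL, 3)
with_na   <- c(1, NA, 3)
length(with_null)   # 2
length(with_na)     # 3

# Inf is an ordinary numeric value, but some operations on it give NaN
is.numeric(1 / 0)   # TRUE
is.nan(Inf - Inf)   # TRUE
```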

1.2 Character ordering

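When comparing or sorting character strings in R, the comparison is lexicographic: strings are compared character by character. The order of the individual characters depends on the collation rules of your locale (see ?Comparison). In the C locale it follows Unicode code points (ASCII for common characters), while most other locales (e.g., en_US) sort letters alphabetically and place each lowercase letter before its uppercase counterpart. This ordering can sometimes lead to surprising results, especially when comparing letters, numbers, and accented characters.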

Here’s a simplified table showing the position of some common characters based on their Unicode code points:

Character   Type               Unicode code   Explanation
0           Digit              48             Digits come before letters
1           Digit              49
9           Digit              57
A           Uppercase letter   65             Uppercase letters come before lowercase letters
B           Uppercase letter   66
Z           Uppercase letter   90
a           Lowercase letter   97             Lowercase letters come after uppercase ones
b           Lowercase letter   98
z           Lowercase letter   122

Example:

# Create a vector (see the section below); the numbers are coerced to character
chars <- c("a", "Z", 1, 9, 10, "A", "z")
sort(chars)
## [1] "1"  "10" "9"  "a"  "A"  "z"  "Z"

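In this output (produced in a typical English locale), digits come before letters, and each lowercase letter comes before its uppercase counterpart. In pure code point order (C locale), the order would be: digits < uppercase letters < lowercase letters.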

You can inspect the Unicode code points of characters with utf8ToInt():

utf8ToInt("A")   
## [1] 65
utf8ToInt("a")   
## [1] 97
utf8ToInt("10")  # one code point is returned per character of the string
## [1] 49 48
  • R compares strings as text, character by character and following the collation order of the locale, not their numeric meaning.
  • Be careful when comparing letters to numbers, or strings containing digits.
  • Always check types and use sort(), utf8ToInt() and class() to explore behavior.
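To see how much the result depends on the collation rules, you can compare the default sort() with a sort forced into C-locale (code point) order. This is a small sketch: method = "radix" is used here because, for character vectors, it always sorts in C-locale order; the variable names are illustrative.

```r
x <- c("a", "Z", "A", "z")

# Default: the order follows the current locale, so it may differ between machines
sort(x)

# method = "radix" always uses C-locale order (code points) for character vectors
sort_c <- sort(x, method = "radix")
sort_c
## [1] "A" "Z" "a" "z"

# Numbers stored as text are compared character by character
text_cmp <- "10" < "9"                          # TRUE: "1" sorts before "9"
num_cmp  <- as.numeric("10") < as.numeric("9")  # FALSE: numeric comparison
```

Converting with as.numeric() before comparing avoids the text-based ordering altogether.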

1.3 Homogeneous data structures (Combine same type elements)

Structure   R function to check   Description
Vector      is.vector()           A one-dimensional array of elements of the same basic type
Factor      is.factor()           A special type of vector used to represent categorical data
Matrix      is.matrix()           A 2D structure with rows and columns; all elements must be of the same type
Array       is.array()            A multi-dimensional (n ≥ 1) generalization of a matrix with data of the same type

Examples

vector:

# build vectors
x <- c(3,2,0,5)
y <- c("85",8,9)
a <- seq(1,6,1)
rep(y,3) # all elements were changed to character
## [1] "85" "8"  "9"  "85" "8"  "9"  "85" "8"  "9"
c <- 1:6
# Extract elements
x[2:3]
## [1] 2 0
# Manipulate
a^2
## [1]  1  4  9 16 25 36
a + c # add values of elements at the same index
## [1]  2  4  6  8 10 12

factor:

# a limited number of distinct values (levels), stored internally as integer codes
data <- factor(c(3,2,0,5,3,2,0,5,3))
as.integer(data)
## [1] 3 2 1 4 3 2 1 4 3
data==5
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
data=="5"
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
# sum(data) # raises an error: 'sum' not meaningful for factors
# rename the levels (in order: "0", "2", "3", "5")
levels(data) <- c("blue","yellow","red","grey")
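A common pitfall follows from this integer encoding: as.integer() returns the internal codes, not the original numbers. Here is a minimal sketch of how to recover the original values, using a fresh factor f (an illustrative name) so that it does not depend on the renamed levels above.

```r
f <- factor(c(3, 2, 0, 5, 3))

# Internal codes: positions in levels(f), which are "0" "2" "3" "5"
codes <- as.integer(f)
codes
## [1] 3 2 1 4 3

# Original values: go through the labels first
values <- as.numeric(as.character(f))
values
## [1] 3 2 0 5 3

# Equivalent: convert the levels once, then index them with the codes
values2 <- as.numeric(levels(f))[f]
```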

matrix and array:

# In general, data are accessed with one index (or vector of indexes) per dimension:
# x2[indexes.dim1, indexes.dim2]
# x3[indexes.dim1, indexes.dim2, indexes.dim3]
# xN[indexes.dim1, indexes.dim2, indexes.dim3, ..., indexes.dimN]
# create a matrix
x2 <- matrix(1:12,nrow=4,ncol=3,byrow=TRUE)
# check dimensions
dim(x2) 
## [1] 4 3
row.names(x2)
## NULL
# Extract elements
x2[3,2]
## [1] 8
x2[,2]
## [1]  2  5  8 11
# Manipulate
x2 + x2
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    8   10   12
## [3,]   14   16   18
## [4,]   20   22   24
# the vector has only 3 elements, so it is recycled; recycling runs down the columns (column-major order), not along the rows
x2 + c(3,2,5) 
##      [,1] [,2] [,3]
## [1,]    4    4    8
## [2,]    6   10    9
## [3,]   12   11   11
## [4,]   13   13   17
# x2[,1] <- c(3,2,5) # Error: ! number of items to replace is not a multiple of replacement length

# create a 3 dimensional array
x3 <- array(1:12,dim=c(2,2,3))
# Extract elements
x3[1,2,2]
## [1] 7
x3[1,2,] 
## [1]  3  7 11
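Because recycling runs down the columns, x2 + c(3,2,5) does not add the vector to each row. If a row-wise addition is what you want, one possible approach (a sketch, not the only way) is sweep(), which applies an operation along a chosen margin. The names m and v are illustrative.

```r
m <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)
v <- c(3, 2, 5)

# Plain addition: v is recycled down the columns
recycled <- m + v
recycled[1, ]
## [1] 4 4 8

# sweep() over margin 2 adds v[j] to column j, i.e. v to every row
by_row <- sweep(m, 2, v, "+")
by_row[4, ]
## [1] 13 13 17

# Same result with a double transpose
by_row2 <- t(t(m) + v)
```

The first rows happen to agree (4 4 8), but the following rows differ: recycled[2, ] is 6 10 9 whereas by_row[2, ] is 7 7 11.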

1.4 Heterogeneous data structures (Can combine different type elements)

Structure    R function to check   Description
Data Frame   is.data.frame()       Like a table; each column is a vector of the same length, but types can vary
List         is.list()             A generic container that can hold elements of different types and sizes

Examples

data.frame:

# collection of vectors and/or factors constrained by column
# create a data.frame
firstNames <- c("Remy", "Lol", "Pierre", "Domi", "Ben")
IMC <- data.frame(sex=c("H", "F", "H", "F", "H"),
                  height=c(1.83,1.76,1.82,1.60,1.90),
                  weight=c(67,58,66,48,75),
                  row.names=firstNames)

# check dimensions
dim(IMC)
## [1] 5 3
colnames(IMC)
## [1] "sex"    "height" "weight"
# Extract elements
IMC$height
## [1] 1.83 1.76 1.82 1.60 1.90
IMC[,"sex"]
## [1] "H" "F" "H" "F" "H"
IMC["Remy",]
##      sex height weight
## Remy   H   1.83     67
IMC[3,2]
## [1] 1.82
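Rows can also be selected with a logical condition, and new columns can be computed from existing ones. The sketch below rebuilds the same data frame; the column name bmi and the object tall are illustrative choices, not part of the original example.

```r
firstNames <- c("Remy", "Lol", "Pierre", "Domi", "Ben")
IMC <- data.frame(sex = c("H", "F", "H", "F", "H"),
                  height = c(1.83, 1.76, 1.82, 1.60, 1.90),
                  weight = c(67, 58, 66, 48, 75),
                  row.names = firstNames)

# Keep the rows where the condition is TRUE (all columns)
tall <- IMC[IMC$height > 1.80, ]
rownames(tall)
## [1] "Remy"   "Pierre" "Ben"

# Add a column computed from two others (weight in kg, height in m)
IMC$bmi <- IMC$weight / IMC$height^2
round(IMC["Remy", "bmi"], 1)
## [1] 20
```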

list:

# very flexible, store everything
# create a list
list.ex <- list(one_vec=1:12,
                one_name="Boo",
                one_tab=matrix(1:4,nrow=2))

# Extract elements
list.ex$one_tab
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
list.ex[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
# add new elements
list.ex$new <- list(a="a",b="b")
list.ex$new$a
## [1] "a"
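Single and double brackets behave differently on lists: [ ] returns a (smaller) list, whereas [[ ]] and $ return the element itself. A short sketch, using a copy of the list above:

```r
list.ex <- list(one_vec = 1:12,
                one_name = "Boo",
                one_tab = matrix(1:4, nrow = 2))

sub_list <- list.ex["one_name"]    # still a list, of length 1
element  <- list.ex[["one_name"]]  # the character vector itself

class(sub_list)
## [1] "list"
class(element)
## [1] "character"

# Assigning NULL removes an element
list.ex$one_name <- NULL
names(list.ex)
## [1] "one_vec" "one_tab"
```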

1.5 Naming a variable

Choosing a variable name is important. It is recommended to make it explicit, short, and unique.

It’s best to maintain a consistent naming style throughout the script: variableName, variable_name, or variable.name (note that in some other languages, the dot is not allowed as it is used to call functions).

Rule 1: A variable name cannot start with a number.

Rule 2: It cannot contain special characters such as: & " ' / \ @ $ () [] {}, any mathematical operators, or punctuation marks.

Rule 3: Variable names are case-sensitive: name ≠ Name.

These rules apply to both objects and functions.
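If you are unsure whether a name is valid, base R can tell you how it would be repaired. The sketch below uses make.names(), which turns arbitrary strings into syntactically valid names; the example strings are made up for illustration.

```r
candidates <- c("height_0", "2nd value", "a-b")

# Invalid characters become "." and an "X" is prepended when needed
fixed <- make.names(candidates)
fixed
## [1] "height_0"   "X2nd.value" "a.b"

# Names are case-sensitive: these are two different objects
name <- 1
Name <- 2
same_value <- name == Name   # FALSE
```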

2 Data frame formats

When working with data, you’ll often hear about “long” and “wide” data formats. These are two common ways to organize tabular data, and understanding the difference is important for analysis and visualization.

2.1 Wide format

Example:

indiv species height_0 height_10 height_20
A class1 15 20 23
B class1 10 15 24

Here, each time point (height_0, height_10, height_20) has its own column.

This format is often easier to read for humans and is common in spreadsheets.

2.2 Long format

In long format, each row is one measurement, and repeated variables (like time) are stored in a single column, with another column to indicate the context (e.g., time).

Same data in long format:

indiv species time height
A class1 height_0 15
A class1 height_10 20
A class1 height_20 23
B class1 height_0 10
B class1 height_10 15
B class1 height_20 24

This format is especially useful for plotting and for many R functions that work better with tidy, long-form data.

2.3 Summary

Format   Description                                       Use case
Wide     One row per observation, variables in columns     Easy to read, spreadsheet-style
Long     One row per measurement, with key-value columns   Ideal for plots, grouped analysis

2.4 The tidyverse solution

You can use the pivot_longer() and pivot_wider() functions (from the tidyr package) to switch between formats.

pivot_longer() converts multiple columns into key-value pairs: the former column names are stored in one new column, and their corresponding values are stacked in another. pivot_wider() spreads key-value pairs across multiple columns. In other words, pivot_longer() reshapes wide format to long format and pivot_wider() does the inverse.

Basic syntax

# load the library
library(tidyr)
# Convert from wide to long format
long_tab <- wide_tab %>% 
  pivot_longer(
    cols,        # The columns to gather into key-value pairs (e.g., height_0, height_10)
    names_to,    # The name of the new column that will store the former column names (e.g., "time")
    values_to    # The name of the new column that will store the values (e.g., "height")
  )

# Convert from long back to wide format
wide_tab <- long_tab %>% 
  pivot_wider(
    names_from,   # The column whose values will become new column names (e.g., "time")
    values_from   # The column whose values will fill the new wide-format table (e.g., "height")
  )
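Here is a runnable sketch of this round trip on the two-row table from sections 2.1 and 2.2. It assumes the tidyr package is installed; wide_again is an illustrative name, and the functions are called directly rather than through a pipe.

```r
library(tidyr)

wide_tab <- data.frame(
  indiv = c("A", "B"),
  species = c("class1", "class1"),
  height_0 = c(15, 10),
  height_10 = c(20, 15),
  height_20 = c(23, 24)
)

# Wide to long: the three height columns are stacked
long_tab <- pivot_longer(
  wide_tab,
  cols = c(height_0, height_10, height_20),
  names_to = "time",
  values_to = "height"
)
dim(long_tab)
## [1] 6 4

# Long back to wide: one column per value of "time"
wide_again <- pivot_wider(long_tab, names_from = time, values_from = height)
dim(wide_again)
## [1] 2 5
```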

3 How to install packages?

3.1 What is a package?

In R, packages are collections of functions, data sets, and documentation that extend the functionality of base R. They are like add-ons or plugins that provide additional tools for performing specific tasks (data analysis, visualization, machine learning…).

Think of them as libraries or modules that you can install to get extra tools for your work. Some packages come pre-installed with R, while others can be installed from repositories.

3.2 What is a repository?

In the context of R, a repository is a centralized location where software, code, data, or packages are stored, managed, and distributed.

A repository is essentially a storage space (either local or online) where software packages, source code, or project files are organized, tracked, and shared. It helps developers and users access the software they need, often with version control and documentation included. Several repositories exist.

  • CRAN (Comprehensive R Archive Network)

CRAN is the primary repository for R packages. It is a collection of thousands of R packages that are contributed by developers around the world. Packages stored on CRAN go through a review process to ensure quality and compatibility. CRAN allows users to install packages directly in R either via the “Packages” panel or directly using a command line.

install.packages("package_name")

After installing a package, you need to activate it.
Installed packages remain on your computer when you close R/RStudio, but they are not active when you start a new session. You need to activate a package either by checking the box next to its name in the “Packages” panel or by using a simple command line.

library(package_name)

Note that for the library() function, the package name can be entered without quotation marks (a quoted name is also accepted).
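In scripts that you share, it can be convenient to install a package only when it is missing. The helper below is a hypothetical sketch (load_or_install is not an existing R function); it relies on requireNamespace(), which returns FALSE when a package is not installed.

```r
# Hypothetical helper: install a CRAN package only if needed, then attach it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE because pkg holds the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so nothing is downloaded here
load_or_install("stats")
```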


  • Bioconductor

Bioconductor is a repository specifically for bioinformatics and computational biology packages. It focuses on tools for analyzing omics data.

To fetch and use packages from this repository, you need to download… a package! This package, named “BiocManager”, installs in the usual way because it is available on CRAN.

In short, you need to download a package available on CRAN to access packages hosted in a repository other than CRAN.

If you want to perform a certain type of analysis and you can’t find the package you need on CRAN, there’s a good chance you’ll find it in Bioconductor (if it involves biology-related methods or analyses).

install.packages("BiocManager")

After installing the “BiocManager” package, you can download the packages available in the repository.

BiocManager::install("package1")

BiocManager::install(c("package1","package2","package3"))

Don’t forget to activate the installed package(s) with library().

Please check the Bioconductor website for more information about installation and the list of available packages.

Link : Bioconductor


  • GitHub

GitHub is a platform for hosting code repositories, and it is widely used by R developers to share and collaborate on R packages and projects.

Developers can use GitHub to publish the latest versions of their R packages, even before they are available on CRAN. It also allows for collaboration, version control, and contribution from the open-source community.

Like Bioconductor, you’ll need to install a package to download packages from GitHub.

install.packages("devtools")

Once devtools is installed, you can download packages from the GitHub community.

devtools::install_github("github_username/repository_name")

Note that here we don’t use install.packages() but install_github() to fetch the package we’re interested in.

Don’t forget to activate the installed package(s) with library().

4 How to save and load R objects?

saveRDS() and readRDS() are functions in R used to save and load single R objects. They are often used when you want to store an object (like a data frame, list, or function) to a file and retrieve it later — even under a different name.

saveRDS() writes a single R object to a file.

readRDS() reads that object back into R.

They use the .rds file format and they work with one unnamed object at a time.

Basic Syntax

saveRDS(object, file = "file.rds", compress = TRUE)
readRDS("file.rds")

Key Arguments:

object: The R object you want to save.

file: File path or connection to write the object.

compress: Whether to compress the file (TRUE, “gzip”, “bzip2”, etc.).

ascii: Save in ASCII format (mainly for readability or portability).

refhook: Advanced; handles reference objects.

Notes

saveRDS() is ideal when you want to save one object at a time.

Use readRDS() when you want to load the object into any variable name.

.rds files are specific to R: other languages cannot read them natively. If interoperability is needed, prefer a plain-text format such as .csv.

Other functions are classically used: save() and load()

save(object, tab, otherthing, file = "file.RData")
load("file.RData") 
ls()
## [1] "object"     "otherthing" "tab"
# all saved objects are restored under the names they were given when saved

Function    Saves multiple objects   Retains object names   File type
save()      Yes                      Yes                    .RData
saveRDS()   No                       No                     .rds
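The difference between the two approaches can be checked with a round trip through temporary files. This is a minimal sketch; the object scores and the file paths created with tempfile() are illustrative.

```r
scores <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# saveRDS()/readRDS(): one object, restored under any name you choose
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, file = rds_path)
scores_copy <- readRDS(rds_path)
identical(scores, scores_copy)
## [1] TRUE

# save()/load(): the object comes back under its original name
rdata_path <- tempfile(fileext = ".RData")
save(scores, file = rdata_path)
rm(scores)
loaded_names <- load(rdata_path)  # load() returns the restored names invisibly
loaded_names
## [1] "scores"
```

Because load() silently overwrites any existing object with the same name, readRDS() is often the safer choice when a single object is involved.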

5 How to find help?

5.1 In RStudio

  1. ?: Search R documentation for a specific term.

(You can also do this with the help() function.)

To find out more about a function from R’s documentation, simply type its name preceded by a single question mark.

# the question mark before the function name
?mean
# help(function_name) function
help(mean)

  2. Through the interface: type the term in the search field of the Help pane.

5.2 Online

  1. Browsers

You can use your favorite web browser. Start any query with “R” and write it in English. You will often find forums (Stack Overflow, Biostars), blogs (Data Geek, bioinfo-fr) and websites (GeeksForGeeks, or RSeek, a search engine dedicated to R) that are full of already answered questions. Keywords are important, and experience will help you optimize your searches.

  2. Large Language Model

You can use your favorite Large Language Model (LLM) to help with programming. These tools are very good at basic coding tasks like writing simple functions, explaining code, or helping you debug.

CAUTION: LLMs are not perfect. The way you ask your question can strongly influence the answer you get. Also, LLMs are designed to always give you a response — even when they’re unsure — so they won’t say “I don’t know” or “That doesn’t make sense.”

That means it’s your job to read the answer critically and test the code yourself. For complex or very specific questions, the LLM might make mistakes or give you code that looks correct but doesn’t actually work.

Think of LLMs as helpful assistants - not experts. They’re great for learning, but you should always verify their answers and ask for help when something doesn’t seem right.

5.3 Reused Functions Across Packages

In R, many functions are implemented in one package and then reused or re-exported in other packages. This allows developers to build on existing tools without rewriting the same code. However, it also means that:

  • The same function name can appear in multiple packages.
  • The help page you see may depend on which package is currently loaded.
  • Some help pages are more detailed than others, even if the function is essentially the same.

Example: resize() in IRanges vs. GenomicRanges packages

The function resize() is defined in the IRanges package, but it is also used in GenomicRanges. Both packages make use of it, but the help pages differ in level of detail or examples.

To check where the function is coming from:

find("resize")

To access documentation from a specific package:

?IRanges::resize
?GenomicRanges::resize

Why This Matters

  • You may get different documentation depending on your loaded packages.
  • Understanding which package defines or reuses a function helps avoid confusion.
  • This is especially important when working in complex environments (e.g., bioinformatics, tidyverse).

Tip: You can also use getAnywhere(resize) to explore all versions of a function if it’s defined in multiple places.

5.4 Common Function Names and Conflicts Between Packages

Some function names in R are very common, like select() or merge(). These names are used in multiple packages, and sometimes they have different behavior depending on which package they come from.

Example: merge()

  • merge() exists in base R, and is used to combine data frames by common columns or row names.
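  • data.table provides its own merge() method for data.table objects, with enhanced performance and slightly different arguments.
  • Other packages may define their own merge() methods (especially for S4 or S3 classes).

Loading data.table does not replace base::merge(). merge() is a generic function: R picks the method that matches the class of the first argument, so the data.table method is used for data.table objects and the base method for plain data frames. True masking happens when two attached packages export the same name, as with dplyr::filter() and stats::filter(): the most recently attached package takes precedence.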

You can check which version is currently active with:

merge
# or
find("merge")

And you can open the help page of a specific version like this:

?base::merge
?data.table::merge

Summary

Function name   Common packages       How to disambiguate
select()        dplyr, MASS, others   Use dplyr::select() explicitly
merge()         base, data.table      The method is chosen from the class of the first argument; see ?base::merge and ?data.table::merge
filter()        dplyr, stats          Use dplyr::filter() or stats::filter()

Tip: Use conflicts() to see which functions are masked (overridden) when you load packages:

conflicts()

This will help you understand why a function might behave differently than expected.
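The checks described in this section can be tried with functions that ship with R. The sketch below assumes a fresh session where only the default packages are attached; the names x and moving_avg are illustrative.

```r
# Which attached packages provide a function called filter()?
where_filter <- find("filter")
where_filter   # includes "package:stats" (plus "package:dplyr" if dplyr is attached)

# The pkg::fun() form always calls the version you mean
x <- c(1, 2, 3, 4, 5)
moving_avg <- stats::filter(x, rep(1 / 3, 3))  # centered moving average of width 3
as.numeric(moving_avg)
## [1] NA  2  3  4 NA

# The environment of a function tells you where it is defined
environmentName(environment(merge))
## [1] "base"
```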

6 How to build a function?

Functions are essential components of the R programming language. They allow you to encapsulate code into reusable blocks, making your scripts more modular, readable, and easier to maintain.

Defining Functions:

A function in R has three main parts: the function name, arguments, and the function body.

Function name: A label that identifies the function and is used to call it (see the rules in section 1.5 Naming a variable).

Arguments: The inputs passed to the function, defined within parentheses.

Function body: The block of code that executes when the function is called, enclosed in curly braces {}.

Basic Syntax:

myFunction <- function(arg) {
  # Function body
  result <- max(arg) - min(arg)
  return(result)
}

Example:

Here’s a simple example of how to define and use a function in R that adds two numbers:

# Define the function
add_numbers <- function(num1, num2) {
  result <- num1 + num2
  return(result)
}

# Call the function
sum_result <- add_numbers(num1 = 5, num2 = 10)

# Display the result
sum_result
## [1] 15

Your turn: Define a function that returns the average of two values and displays a text such as: “The result of this treatment is 49”.

Tips: Look at the help of the function paste() and its arguments. To display a text from within a function, you can use the function print() or message().

Solution
# Define the function
av_numbers <- function(val1, val2){
  res <- (val1 + val2)/2
  # two ways to display the text: print() and message()
  print(paste("The result of this treatment is",res))
  message("The result of this treatment is ",res)
  return(res)
}
# Call the function
mean_result <- av_numbers(val1 = 10, val2 = 20)
## [1] "The result of this treatment is 15"
## The result of this treatment is 15
# Display the result
mean_result
## [1] 15
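Arguments can also have default values, and a function can check its inputs before computing anything. The following is a sketch under these assumptions; height_range is an illustrative name, and the value of the last evaluated expression is what the function returns when return() is not called.

```r
height_range <- function(x, na.rm = FALSE) {
  # Stop early with a clear message if the input is not numeric
  if (!is.numeric(x)) {
    stop("x must be numeric")
  }
  # The last evaluated expression is the return value
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}

height_range(c(15, 20, 23))
## [1] 8
height_range(c(15, NA, 23))                # default na.rm = FALSE
## [1] NA
height_range(c(15, NA, 23), na.rm = TRUE)  # default overridden by name
## [1] 8
```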


To go further you can check this blog.

7 How to compare elements?

In data manipulation, comparisons help you filter and select specific data based on conditions. Some common comparison operators include: == (equal to), != (not equal to), > (greater than), < (less than), >= (greater than or equal to), and <= (less than or equal to). Additionally, %in% is useful for checking if a value belongs to a set of values. These comparisons allow you to create logical statements that you can use to filter the elements that meet your criteria.

Here is a table summarizing the most common operators:

Operator   Description                     Example
==         Equal to                        x == "yes" , x == 6
!=         Not equal to                    x != "no" , x != 5
> , <      Greater than, less than         x > 5 , x < 5
>= , <=    Greater/less than or equal to   x >= 5 , x <= 5
%in%       Checks if value is in a set     x %in% c("A", "C")
&          Logical AND                     x < 6 & x > 3
|          Logical OR                      x > 6 | x < 3
!          Logical NOT                     !x %in% c("A", "C")
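These operators return logical vectors, which can then be used to filter elements. A short sketch with made-up vectors (x and species are illustrative names); note how a missing value propagates through the comparison.

```r
x <- c(2, 5, 7, NA, 4)

# Element-wise test: NA stays NA
in_range <- x > 3 & x < 6
in_range
## [1] FALSE  TRUE FALSE    NA  TRUE

# which() keeps only the positions that are TRUE, dropping the NA
kept <- x[which(in_range)]
kept
## [1] 5 4

species <- c("A", "B", "C", "A")
is_ac  <- species %in% c("A", "C")
not_ac <- !species %in% c("A", "C")
not_ac
## [1] FALSE  TRUE FALSE FALSE
```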

8 The apply family

In this section, we will learn how to use the apply family of functions in R. These functions help you perform repetitive operations (e.g., sum(), mean()) on rows, columns, or elements of data structures like vectors, matrices, or data frames.

We will start with simple built-in functions like colSums() and gradually explore more flexible tools like apply, lapply, sapply, and tapply. Finally, we’ll compare these base R tools with functions from the tidyverse, such as group_by() and summarise().

First, we’ll create an example data set named plant_height. This data set describes the heights of ten individuals in centimeters at three different time points (0, 10, and 20 days). The first column contains the IDs for each individual, the second its species, and each successive column describes their heights at time points 0, 10, and 20 in that order.

plant_height <- data.frame(indiv = LETTERS[1:10],
                      species = rep(c("class1","class2"),each=5),
                      height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
                      height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
                      height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21))

8.1 Functions from base

Imagine you want to know the average height at each time point. How would you do it?

  • You could compute the average value of each column sequentially using mean()
average_0 <- mean(plant_height$height_0)
average_10 <- mean(plant_height$height_10)
average_20 <- mean(plant_height$height_20)

This is fine with three columns, but imagine how tedious it would be with 100 features.

  • Base R comes with a few simple built-in functions: colMeans(), rowMeans(), colSums() and rowSums(), which compute the mean or the sum of the values of all columns or all rows at once.
# We need to select for numeric columns
average_tp <- colMeans(plant_height[,-c(1:2)])
average_tp
##  height_0 height_10 height_20 
##      12.5      18.8      21.8

Now your turn, compute the average height of each individual.

Solution
# We need to select for numeric columns
average_ind <- rowMeans(plant_height[,-c(1:2)])
average_ind
##  [1] 19.33333 16.33333 14.66667 13.66667 20.66667 19.33333 15.66667 18.33333
##  [9] 21.00000 18.00000

These functions are fast and convenient, but they work only in specific situations. What if we want to apply a different function, or calculate values by group?

8.2 The apply() function

The apply() function applies any function across the rows or columns of an array, matrix or data frame.

Basic Syntax:

apply(X,       # Array, matrix or data frame
      MARGIN,  # 1: rows, 2: columns, c(1, 2): rows and columns
      FUN,     # Function to be applied
      ...)     # Additional arguments to FUN

Example:

Compute the maximum height reached at each time point.

# We need to select for numeric columns
max_height_tp <- apply(X=plant_height[,-c(1,2)], MARGIN=2, FUN=max)

# Display the result
max_height_tp
##  height_0 height_10 height_20 
##        17        24        26

Your turn: How much has each plant grown between day 0 and day 20?

Tips: You may need to build a function, see section How to build a function?.

Solution
# build the function
height_amplitude <- function(x){
  growth <- max(x) - min(x)
  return(growth)
}

# We need to select for numeric columns
growth_height <- apply(X=plant_height[,-c(1,2)], MARGIN=1, FUN=height_amplitude)

# Display the result
growth_height
##  [1]  8 14  6  8  9 10  9 12  9  8

The argument x receives each row (MARGIN=1) of plant_height[,-c(1,2)] in turn, and the function computes the difference between the maximum and minimum values of that row.

Important: What if the function needs additional arguments?

What if the function you want to apply requires more than one argument ?

For example, the mean() function has multiple arguments:
- x: the data to calculate the mean (this is what apply() will pass)
- trim: to remove a fraction of extreme values
- na.rm: to ignore NA values when computing the mean

You can check the full list using ?mean.

Let’s modify our data to include some NA values:

# Add NA values to simulate missing data
plant_height <- data.frame(
  indiv = LETTERS[1:10],
  species = rep(c("class1", "class2"), each = 5),
  height_0 = c(15, NA, 12, 9, 17, NA, 10, 11, 15, 13),
  height_10 = c(20, 15, 14, NA, 19, 22, 18, 21, 24, 20),
  height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)

Now, if we want to calculate the mean for each time point while ignoring the NAs, we need to pass the na.rm = TRUE argument to the mean() function. Here’s how to do that with apply():

# Apply mean to each column, skipping NA values
mean_height_tp <- apply(
  X = plant_height[, -c(1, 2)],  # select only numeric columns
  MARGIN = 2,                    # apply function to each column
  FUN = mean,                    # function to apply
  na.rm = TRUE                   # extra argument passed to mean()
)

The first argument of the function you’re applying (e.g., x in mean(x, ...)) receives the content of X in apply(). Any additional arguments must be named and explicitly provided after the FUN parameter.

This rule applies to all functions in the apply family (apply(), lapply(), sapply(), tapply(), etc.).

If you’re using your own custom function, make sure its first argument matches what apply() or similar functions will provide. You can always add other named arguments afterward.
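The same mechanism works with a custom function. In the sketch below, count_above and its threshold argument are hypothetical names introduced for illustration, and heights is a small made-up matrix with one missing value.

```r
heights <- matrix(c(15, NA, 12,
                    20, 15, 14,
                    23, 24, 18),
                  nrow = 3,
                  dimnames = list(c("A", "B", "C"),
                                  c("height_0", "height_10", "height_20")))

# First argument: the data supplied by apply(); the others are passed by name
count_above <- function(x, threshold, na.rm = TRUE) {
  sum(x > threshold, na.rm = na.rm)
}

# Number of values above 14 in each column
n_above <- apply(X = heights, MARGIN = 2, FUN = count_above, threshold = 14)
n_above
##  height_0 height_10 height_20 
##         1         2         3

# An anonymous function can be written directly in the call
amplitude <- apply(heights, 1, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
amplitude
## A B C 
## 8 9 6
```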

8.3 The lapply() function

lapply() is a function in R used to apply another function to each element of a list. It returns a list of the same length, with the results of applying the function to each element.

Basic Syntax:

lapply(X,   # a list or a vector
       FUN  # the function you want to apply to each element
       )

Example:

Calculate the means of list elements. Here we use runif() to produce 10 random values that follow a uniform distribution within a defined range (min and max arguments) and sample() to randomly draw 10 values, without replacement, from a vector (1:10).

# Create a reproducible random list
set.seed(123) # set the seed from where to start the random operation
plants <- list(
  height = runif(10, min = 10, max = 20), 
  mass = runif(10, min = 5, max = 10),
  flowers = sample(1:10, 10)
)

# Display the list
plants
## $height
##  [1] 12.87578 17.88305 14.08977 18.83017 19.40467 10.45556 15.28105 18.92419
##  [9] 15.51435 14.56615
## 
## $mass
##  [1] 9.784167 7.266671 8.387853 7.863167 5.514623 9.499125 6.230439 5.210298
##  [9] 6.639604 9.772518
## 
## $flowers
##  [1]  9 10  1  5  3  2  6  7  8  4
# Use lapply to calculate the mean of each list element
lapply(plants, mean)
## $height
## [1] 15.78248
## 
## $mass
## [1] 7.616846
## 
## $flowers
## [1] 5.5

As you can see, lapply() returns a list where each value is the mean of a corresponding element in the original list.

Your turn: What is the variable type (the class) of each element of a list?

Create a list called my_list containing:

  • A vector containing integer values from 1 to 10

  • A vector containing the letters A, B and C

  • A vector containing the boolean values TRUE, FALSE, TRUE, FALSE

Solution
# Create the list with proposed elements
my_list <- list(
  numbers = 1:10, # integer vector
  letters = c("A", "B", "C"), # character vector
  flags = c(TRUE, FALSE, TRUE, FALSE) # logical vector
)

# Use lapply to find the class of each element
element_classes <- lapply(X=my_list, FUN=class)

# Display the result
element_classes
## $numbers
## [1] "integer"
## 
## $letters
## [1] "character"
## 
## $flags
## [1] "logical"
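As with apply(), extra arguments can be passed after FUN, and an anonymous function can be used for one-off computations. A minimal sketch with a small hand-made list (the values are illustrative, and plants_small is an assumed name chosen to avoid overwriting the list above):

```r
plants_small <- list(
  height = c(12, 18, NA, 15),
  mass = c(9, 7, 8),
  flowers = c(1L, 5L, 3L)
)

# na.rm = TRUE is forwarded to mean() for every element
means <- lapply(plants_small, mean, na.rm = TRUE)
means$height
## [1] 15

# Anonymous function: count the missing values in each element
n_missing <- lapply(plants_small, function(x) sum(is.na(x)))

# unlist() flattens the resulting list into a named vector
unlist(n_missing)
##  height    mass flowers 
##       1       0       0
```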

8.4 The sapply() function

sapply() is a simplified version of lapply(). Like lapply(), it applies a function to each element of a list (or list-like object). But instead of always returning a list, sapply() tries to return the simplest possible result: a vector, matrix, or list—depending on what makes the most sense.

Basic Syntax:

sapply(X,   # a list, a data frame or a vector
       FUN  # the function you want to apply to each element
       )

Example:

Calculate the means of list elements. Here we use runif() to produce 10 random values that follow a uniform distribution within a defined range (min and max arguments) and sample() to randomly draw 10 values, without replacement, from a vector (1:10). If you have already built the list in the section above, you can see that running the code again returns the same object: set.seed() makes the generation of random elements reproducible.

# Create a reproducible random list
set.seed(123) # set the seed from where to start the random operation
plants <- list(
  height = runif(10, min = 10, max = 20), 
  mass = runif(10, min = 5, max = 10),
  flowers = sample(1:10, 10)
)

# Use sapply to calculate the mean of each list element
sapply(plants, mean)
##    height      mass   flowers 
## 15.782475  7.616846  5.500000

Compared to lapply(), this is easier to work with because it returns a named numeric vector instead of a list.
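Because sapply() guesses the output shape, its result type can change with the input. When you know what each call should return, vapply() is a stricter base R alternative: you declare the expected type and length, and it fails if a result does not match. A sketch with a small illustrative list:

```r
plants_small <- list(height = c(12, 18, 15), mass = c(9, 7, 8))

# FUN.VALUE = numeric(1): each call must return exactly one number
means <- vapply(plants_small, mean, FUN.VALUE = numeric(1))
means
## height   mass 
##     15      8

# On empty input, sapply() returns a list while vapply() keeps the declared type
empty_s <- sapply(list(), mean)
empty_v <- vapply(list(), mean, FUN.VALUE = numeric(1))
class(empty_s)
## [1] "list"
class(empty_v)
## [1] "numeric"
```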

Note: It works with data.frame!

sapply() can also be used on data frames, since data frames are essentially lists of columns.

sapply(plant_height,class)
##       indiv     species    height_0   height_10   height_20 
## "character" "character"   "numeric"   "numeric"   "numeric"
sapply(plant_height,mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##     indiv   species  height_0 height_10 height_20 
##        NA        NA        NA        NA      21.8

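You can observe warning messages. A warning is not an error, so the code still runs. The warnings tell you that for two elements (indiv and species) the mean cannot be computed because they are neither numeric nor logical (logical values can be averaged because TRUE and FALSE are encoded as 1 and 0 respectively). height_0 and height_10 also return NA, but for a different reason: they contain missing values and na.rm = TRUE was not set.

Your turn: How many plants reach at least 15 cm at each time point?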

Tips: You may need to build a function, see section How to build a function?.

Solution
# build the function
at_least_15 <- function(x){
  res <- sum(x>=15)
  return(res)
}

# Here the function is applied to all columns, including the non-numeric ones (see the explanation below)
nb_15 <- sapply(X=plant_height, FUN=at_least_15)

# Display the result
nb_15
##     indiv   species  height_0 height_10 height_20 
##        10        10        NA        NA        10

In R, when you compare character values (like "A" or "class1") to a number using >=, R doesn’t give an error. Instead, it quietly converts the number into a character and compares the two as text, not as numbers. For example, "A" >= 15 is actually treated as "A" >= "15", which compares the two strings alphabetically. Since "A" comes after "1" in alphabetical order, the result is TRUE. So when you apply a function like sum(x >= 15) to a character column, R may return unexpected values, such as 10 when all comparisons return TRUE. The NA results for height_0 and height_10 come from the missing values in these columns. That’s why it’s important to check that you’re only applying numeric operations to numeric columns (see section 1.2 Character ordering).
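A stricter variant of this solution restricts the computation to numeric columns and ignores missing values. The sketch below rebuilds the version of plant_height that contains NA values; is_num is an illustrative helper name.

```r
plant_height <- data.frame(
  indiv = LETTERS[1:10],
  species = rep(c("class1", "class2"), each = 5),
  height_0 = c(15, NA, 12, 9, 17, NA, 10, 11, 15, 13),
  height_10 = c(20, 15, 14, NA, 19, 22, 18, 21, 24, 20),
  height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)

# Logical vector flagging the numeric columns
is_num <- sapply(plant_height, is.numeric)

# Count values >= 15 in numeric columns only, skipping the NAs
nb_15 <- sapply(plant_height[, is_num], function(x) sum(x >= 15, na.rm = TRUE))
nb_15
##  height_0 height_10 height_20 
##         3         8        10
```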

8.5 The tapply() function

tapply() is a function used to apply a function to subsets of a vector, based on one or more grouping variables. It’s especially useful when you want to compute summary statistics (like the mean, sum, or count) within groups, such as calculating the average height per time point or species.

Basic Syntax:

tapply(X,     # the numeric vector you want to analyze
       INDEX, # a factor or vector that defines the groups
       FUN    # the function to apply to each group
       )

Example:

Compute the average height of plants at 0 min by species.

# The data frame
plant_height <- data.frame(
  indiv = LETTERS[1:10],
  species = rep(c("class1", "class2"), each = 5),
  height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
  height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
  height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)

# Average height at t = 0 by species
tapply(X = plant_height$height_0, 
       INDEX = plant_height$species, 
       FUN = mean)
## class1 class2 
##   12.6   12.4

Note:

The tapply() function in R is designed to apply a function to a single vector, grouped by one or more categorical variables (INDEX). It only works with one variable at a time.

tapply(plant_height[, 3:5], plant_height$species, mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## class1 class2 
##     NA     NA

Here, plant_height[, 3:5] is a data frame, not a vector. tapply() expects X to be a single vector with the same length as the grouping variable.

How can you work around this?

Among the other apply-family functions, we have seen sapply(), which applies a function to each column of a data frame.

  1. Combine sapply() and tapply()

You can combine sapply() and tapply() to apply a function (like mean()) across multiple numeric columns, grouped by a categorical variable.

# build a function 

myFun <- function(col,myData) {
  tapply(col, myData$species, mean)
}

sapply(plant_height,myFun, myData=plant_height) 
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##        indiv species height_0 height_10 height_20
## class1    NA      NA     12.6      16.6      21.6
## class2    NA      NA     12.4      21.0      22.0
# it may be cleaner to select for numerical values :)

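In our case, we built a function that takes two arguments, col and myData (More details). sapply() passes each column as col and forwards the extra argument myData=plant_height to every call. The result is a matrix with one column per column of the original data and one row per group of the species column.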
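As the comment in the code suggests, selecting the numeric columns first removes the warnings and the NA columns. This is a minimal sketch of that idea, not the only way to do it; num_cols and mean_by_species are names chosen here for illustration.

```r
# The data frame
plant_height <- data.frame(
  indiv = LETTERS[1:10],
  species = rep(c("class1", "class2"), each = 5),
  height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
  height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
  height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)

# Keep only the numeric columns
num_cols <- plant_height[sapply(plant_height, is.numeric)]

# For each numeric column, compute the mean by species
mean_by_species <- sapply(num_cols, function(col) {
  tapply(col, plant_height$species, mean)
})
mean_by_species
##        height_0 height_10 height_20
## class1     12.6      16.6      21.6
## class2     12.4      21.0      22.0
```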

  2. Use the long format

Let’s use pivot_longer() to reshape our plant height data into long format, where each row represents one observation (one individual at one time point), and then use tapply() to calculate the average height at each time point.

# load library
library(tidyr)
# from wide to long format
plant_height_long <- plant_height %>% pivot_longer(where(is.numeric),
                                                  names_to="time",
                                                  values_to="height")

Now you can easily use tapply() on the newly generated columns to compute the average height as a function of time.

tapply(X = plant_height_long$height,
       INDEX = plant_height_long$time,
       FUN = mean)
##  height_0 height_10 height_20 
##      12.5      18.8      21.8
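Since INDEX accepts one or more grouping variables, the long format also allows grouping by species and time in a single call by passing a list. To stay self-contained, the sketch below rebuilds the long table by hand in base R as a stand-in for the pivot_longer() result (the row order differs, the groups are the same).

```r
# The data frame
plant_height <- data.frame(
  indiv = LETTERS[1:10],
  species = rep(c("class1", "class2"), each = 5),
  height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
  height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
  height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)

# Base R stand-in for the long table: one row per individual and time point
plant_height_long <- data.frame(
  species = rep(plant_height$species, times = 3),
  time = rep(c("height_0", "height_10", "height_20"), each = 10),
  height = c(plant_height$height_0,
             plant_height$height_10,
             plant_height$height_20)
)

# Two grouping variables: rows are species, columns are time points
mean_species_time <- tapply(X = plant_height_long$height,
                            INDEX = list(plant_height_long$species,
                                         plant_height_long$time),
                            FUN = mean)
mean_species_time
##        height_0 height_10 height_20
## class1     12.6      16.6      21.6
## class2     12.4      21.0      22.0
```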

8.6 Group and summarize the information with group_by() and summarise() from dplyr package

group_by() is used to group data based on one or more variables (columns). This function is often used in conjunction with other tidyverse functions.

?dplyr::group_by

One function that works perfectly with group_by() is summarise().

?dplyr::summarise

We can, for example, compute the mean height over all plants and time points. summarise() works with summary functions such as mean(), median() or max().

# load library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# The data frame
plant_height <- data.frame(
  indiv = LETTERS[1:10],
  species = rep(c("class1", "class2"), each = 5),
  height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
  height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
  height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)

# from wide to long format
plant_height_long <- plant_height %>% pivot_longer(where(is.numeric),
                                                  names_to="time",
                                                  values_to="height")

# summarize plant height
plant_height_long %>%
  summarise(mean_height = mean(height))
## # A tibble: 1 × 1
##   mean_height
##         <dbl>
## 1        17.7

But combined with group_by(), we can be more precise and obtain the average height by time.

To carry out this operation, we chain two functions with the %>% symbol. First, group_by() groups the rows by the variable we are interested in, in this case time. This variable has only three possible values: height_0, height_10 and height_20. Second, summarise() computes the average height separately for each value of time.

plant_height_long %>%
  group_by(time) %>%
  summarise(mean_height = mean(height))
## # A tibble: 3 × 2
##   time      mean_height
##   <chr>           <dbl>
## 1 height_0         12.5
## 2 height_10        18.8
## 3 height_20        21.8
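group_by() also accepts several variables. As a hedged sketch, the example below groups by species and time and adds the group size with n(); the column names mean_height and n_plants are chosen here for illustration, and .groups = "drop" simply returns an ungrouped table.

```r
# load libraries
library(dplyr)
library(tidyr)

# The data frame
plant_height <- data.frame(
  indiv = LETTERS[1:10],
  species = rep(c("class1", "class2"), each = 5),
  height_0 = c(15, 10, 12, 9, 17, 13, 10, 11, 15, 13),
  height_10 = c(20, 15, 14, 15, 19, 22, 18, 21, 24, 20),
  height_20 = c(23, 24, 18, 17, 26, 23, 19, 23, 24, 21)
)

# from wide to long format (height columns listed explicitly)
plant_height_long <- plant_height %>%
  pivot_longer(cols = c(height_0, height_10, height_20),
               names_to = "time",
               values_to = "height")

# mean height and number of plants for each species at each time point
mean_by_group <- plant_height_long %>%
  group_by(species, time) %>%
  summarise(mean_height = mean(height),
            n_plants = n(),
            .groups = "drop")

# one row per species and time combination (2 species x 3 time points)
mean_by_group
```

The means should match the matrix obtained earlier with sapply() and tapply(), for example 12.6 for class1 at height_0.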