*Starting from a bash script, the idea of this exercise is to add another branch to your previous pipeline and consolidate what you have learnt so far.*

# Where to start?

## Understand the bash script

Let's understand the bash script and identify all the steps and tools we will need. Let's split the bash script into parts:

- The first part is applied once, at the beginning of the script, to index the reference genome that the reads will be mapped onto:

```bash
# line 30
bowtie2-build --threads ${ncpus_build} ${genome} Bwt2_index/tauri > Bwt2_index/Bwt2_index.log 2>&1

# input: reference genome fasta
# tool: bowtie2-build (bowtie/1.3.1)
# aim: index the reference genome
# output: Bwt2_index/ folder with the indexed genome files
```

- In the `for` loop section, we apply a set of tools to each sample independently:

```bash
# line 43
fastqc --outdir FastQC ${rnadir}/${sample}.fastq.gz > FastQC/${sample}.log 2>&1

# input: each RNAseq fastq
# tool: fastqc (fastqc/0.12.1)
# aim: run QC analysis (per sample)
# output: a zip & html report in the FastQC/ folder
```

```bash
# line 47
bowtie2 --threads ${ncpus_map} -x Bwt2_index/tauri -U ${rnadir}/${sample}.fastq.gz -S SAMs/${sample}.sam > LOGs/${sample}_bowtie2.log 2>&1

# input: each RNAseq fastq + the indexed reference genome files
# tool: bowtie2 (bowtie/1.3.1)
# aim: map the reads onto the reference genome (per sample)
# output: mapped reads in SAM format within the SAMs/ folder
```

```bash
# lines 51-53
samtools view -b SAMs/${sample}.sam -o BAMs/${sample}.bam
samtools sort --threads ${ncpus_sort} BAMs/${sample}.bam -o BAMs/${sample}_sort.bam
samtools index -@ ${ncpus_index} BAMs/${sample}_sort.bam

# input: mapped reads in SAM format
# tool: samtools (samtools/1.18)
# aim: 1. convert the format SAM -> BAM (per sample)
#      2. sort the reads (per sample)
#      3. index the sorted BAM (per sample)
# general output: sorted and indexed reads in BAM format within the BAMs/ folder
```

```bash
# line 57
featureCounts -T ${ncpus_count} -t gene -g ID -a ${annots} -s 2 -o Counts/${sample}_ftc.txt BAMs/${sample}_sort.bam > LOGs/${sample}_ftc.log 2>&1

# input: sorted & indexed reads in BAM format + reference genome annotation
# tool: featureCounts (subread/2.0.6)
# aim: count the reads per gene of the reference genome (per sample)
# output: a gene count table within the Counts/ folder
```

- Outside the for loop, at the end of the bash pipeline, we concatenate all the count tables and aggregate all the FastQC reports:

```bash
# lines 61-62
paste Counts/*_ftc.txt > Counts/ftc_tmp.txt
awk -v nb=${nbs} -v col=7 'BEGIN {FS ="\t"}{ctmp=$1; for (i=col; i<=nb*col; i=i+col) {count=sprintf("%s\t%s", ctmp, $i); ctmp=count}; print count}' Counts/ftc_tmp.txt | sed '1d' > Counts/counts.txt

# input: the gene count tables of all samples
# tools: paste & awk (standard in bash)
# aim: concatenate the gene count tables of all samples and reformat them into a clean tab-separated file
# output: a single clean tab-separated gene count table containing all samples
```

```bash
# line 65
multiqc FastQC/ --filename report_multiqc.html --outdir FastQC/

# input: all the FastQC reports
# tool: multiqc (multiqc/1.13)
# aim: aggregate all the FastQC reports into a single report
# output: a single HTML report (report_multiqc.html) in the FastQC/ folder
```

- The final pipeline should look like this (without going into detail about the individual input/output files):

![Workflow example](images/tp_workflow_extension.png)
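Each box in the bottom branch of this diagram will become a rule in your Snakefile. To give a feel for what that translation looks like, here is a minimal sketch of the mapping step written as a Snakemake rule; the rule name, the way the index dependency is declared, and the thread count are illustrative assumptions that you should adapt to the conventions of your existing Snakefile:

```python
# Illustrative Snakefile fragment -- a sketch, not the expected solution
rule bowtie2_mapping:
    input:
        fastq = "Data/{sample}.fastq.gz",
        # bowtie2-build writes several index files sharing the "Bwt2_index/tauri" prefix;
        # listing one of them is enough to express the dependency in this sketch
        index = "Bwt2_index/tauri.1.bt2"
    output:
        sam = "SAMs/{sample}.sam"
    log:
        "LOGs/{sample}_bowtie2.log"
    threads: 4
    shell:
        "bowtie2 --threads {threads} -x Bwt2_index/tauri "
        "-U {input.fastq} -S {output.sam} > {log} 2>&1"
```

The `{sample}` wildcard plays the role of the bash `for` loop: Snakemake instantiates the rule once for every sample it needs.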
## Try running the bash script as is

If you try running the bash script, make sure you load the required modules first:

```bash
module load multiqc/1.13 fastqc/0.12.1 subread/2.0.6 bowtie/1.3.1 samtools/1.18
```

NB: subread is the package that provides featureCounts.

Execution example (copy the script to your directory first):

```bash
bash bash_pipeline.sh -g Data/O.tauri_genome.fna -a Data/O.tauri_annotation.gff -d Data/ SRR3099585_chr18 SRR3099586_chr18 SRR3099587_chr18 SRR3105697_chr18 SRR3105698_chr18 SRR3105699_chr18
```

## Your mission

The FastQC/MultiQC branch is already implemented in the Snakefile from the previous exercises. Your mission is to start from that Snakefile and add the bottom branch (bowtie2, samtools, featureCounts) to it.
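If you are unsure where to begin, one common approach is to work backwards from the final targets: extend the inputs of your top-level rule so that Snakemake also requests the outputs of the new branch, then add the missing rules one by one until the workflow graph is complete. Below is a hypothetical sketch, assuming (as in the previous exercises) that your Snakefile defines a sample list and a `rule all`; the target paths are taken from the bash script's output folders and may differ from your own naming:

```python
# Illustrative sketch -- adapt the sample list and target paths to your own Snakefile
SAMPLES = ["SRR3099585_chr18", "SRR3099586_chr18"]  # ...plus the remaining samples

rule all:
    input:
        # existing target of the FastQC/MultiQC branch (name assumed from the previous exercise)
        "FastQC/report_multiqc.html",
        # new bottom-branch targets: one sorted, indexed BAM per sample plus the merged count table
        expand("BAMs/{sample}_sort.bam.bai", sample=SAMPLES),
        "Counts/counts.txt"
```

A dry run (`snakemake -n`) after adding each rule is a quick way to check that the new dependencies resolve as expected.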