{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "676a7a2f-6578-4b9d-a288-3c338e705f70",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Croisement des données "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c1b28ac-acda-466a-9d75-c1157907afc5",
   "metadata": {},
   "source": [
    "##### *Claire Toffano-Nioche, I2BC, Paris-Saclay*\n",
    "\n",
    "##### *Pauline Francois, Anses, Lyon*\n",
    "\n",
    "##### *Selon les ressources de Matthias Zytnicki, INRAE, Toulouse*\n",
    "\n",
    "##### *et toute la team Roscoff*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "232ac2ed-00aa-4652-9a70-9c944bf00f81",
   "metadata": {},
   "source": [
    "## Objectifs du cours\n",
    "\n",
    "- Permettre de croiser des données génomiques de types différents\n",
    "- Renforcer les notions vues précedemment\n",
    "- Se familiariser avec <code>bedtools</code>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c11fd4d0-4c89-4287-8cf4-e1af3cc6bfba",
   "metadata": {},
   "source": [
    "Ce que ce cours n'est pas :\n",
    "- Une réponse à une vraie question biologique\n",
    "- Une méthode statistiquement valide"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3797b525-90a4-4126-8afe-817b8acfe8bc",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Croisement de données\n",
    "\n",
    "#### Qu’est-ce c'est ?\n",
    "- Une comparaison des positions ou intervalles génomiques: \n",
    "    - un variant par rapport à des gènes d’intérêt, \n",
    "    - des pics de ChIP-Seq par rapport à des gènes différentiellement exprimés, etc.\n",
    "    \n",
    "#### À quel moment est-ce valide ?\n",
    "- Lorsque vous cherchez des co-occurrences\n",
    "- Lorsque vous donnez des distributions (de distance)\n",
    "\n",
    "#### À quel moment est-ce douteux ?\n",
    "- Lorsque les résultats sont présentés comme significatifs.\n",
    "\n",
    "#### Données disponibles\n",
    "- Une liste de gènes différentiellement exprimés (Cellules avec un KO du gène beta-2-microglobulin, infectées par l'herpes simplex virus type 1 vs non infectées) (TXT)\n",
    "- Une liste des sites de fixation de H3K4me3 déterminés par Chip-seq (BED)\n",
    "- Une liste de variants (VCF)\n",
    "- Un fichier d'annotation humain (GFF3)\n",
    "- La liste des chromosomes humains et leur taille (LEN)\n",
    "\n",
    "## Avant de commencer\n",
    "\n",
    "Vérifier que vous êtes dans votre espace projet, sous-dossier TP_croisement"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7a0bd48a-14bd-47d5-ade3-03ec8a10a06c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/shared/ifbstor1/projects/2336_nanopore/TP_croisement\n"
     ]
    }
   ],
   "source": [
    "pwd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bc23cff-130e-4e35-a7ae-7103ac98f04a",
   "metadata": {},
   "source": [
    "## Accès aux données du TP"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca66dbeb-82ea-4fe8-baf2-9a3bbddb62ae",
   "metadata": {},
   "source": [
    "1. Rendez vous dans le dossier <code>data/</code> et listez en le contenu"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5c7e5721-822f-4e1f-9802-dbc5244ec631",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0m\u001b[01;32mchrs.len\u001b[0m  \u001b[01;32mDEgenes.txt\u001b[0m  \u001b[01;32mhsGRCh38.genes.gff3\u001b[0m  \u001b[01;32mpicChip.bed\u001b[0m  \u001b[01;32mvariant.vcf\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "cd data\n",
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3d3e4f9-e9ad-45e1-b981-92a55e47d1af",
   "metadata": {},
   "source": [
    "Ce TP repose sur 4 jeux de données, issues d'analyses précédentes ainsi que sur les annotations du génome humain : \n",
    "\n",
    "- la longueur des chrs (et des contigs) de la version GRCh27 du génome humain (<code>chrs.len</code>)\n",
    "- une liste de SNP variants (<code>variant.vcf</code>)\n",
    "- une liste des identifiants de gènes différentiellement exprimés (<code>DEgenes.txt</code>)\n",
    "- une liste des pics issus d'une analyse de ChIP-seq (<code>picChip.bed</code>), ici une marque d'histone h3k4me3\n",
    "- les annotations correspondant aux gènes uniquement du génome humain GRCh38 (<code>hsGRCh38.genes.gff3</code>)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c69e3d9b-8e71-4e83-a0f2-ee6da9090549",
   "metadata": {},
   "source": [
    "2. Afin de se familiariser avec le contenu des données, on affiche les 2 premières lignes de l'ensemble des fichiers :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "bf626e3c-fc81-4e24-b07c-4a92fe62e9f8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==> chrs.len <==\n",
      "1\t248956422\n",
      "10\t133797422\n",
      "\n",
      "==> DEgenes.txt <==\n",
      "ENSG00000004846\n",
      "ENSG00000005981\n",
      "\n",
      "==> hsGRCh38.genes.gff3 <==\n",
      "1\tensembl_havana\tgene\t65419\t71585\t.\t+\t.\tID=gene:ENSG00000186092;Name=OR4F5;biotype=protein_coding;description=olfactory receptor family 4 subfamily F member 5 [Source:HGNC Symbol%3BAcc:HGNC:14825];gene_id=ENSG00000186092;logic_name=ensembl_havana_gene;version=6\n",
      "1\tensembl_havana\tgene\t450703\t451697\t.\t-\t.\tID=gene:ENSG00000284733;Name=OR4F29;biotype=protein_coding;description=olfactory receptor family 4 subfamily F member 29 [Source:HGNC Symbol%3BAcc:HGNC:31275];gene_id=ENSG00000284733;logic_name=ensembl_havana_gene;version=1\n",
      "\n",
      "==> picChip.bed <==\n",
      "22\t16192349\t16192565\tregion_1\n",
      "22\t16846630\t16870710\tregion_2\n",
      "\n",
      "==> variant.vcf <==\n",
      "##fileformat=VCFv4.0\n",
      "##fileDate=20180418\n"
     ]
    }
   ],
   "source": [
    "head -n 2 *"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0d60187-8a3e-4755-bc04-4cba30bafb27",
   "metadata": {},
   "source": [
    "3. Pour le fichier variant.vcf on ne voit que des commentaires (souvent des lignes qui commencent par \"#\") ; afficher les deux dernières lignes de ce fichier :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b330a69b-f820-4176-87d5-88d9028c5617",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Y\t26614668\trs376925794\tTC\tT\t.\t.\tRS=376925794;RSPOS=26614671;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000000005000002000200;WGT=1;VC=DIV;ASP;CAF=0.9968,0.003244;COMMON=1\n",
      "Y\t26624486\trs771929773\tT\tTA\t.\t.\tRS=771929773;RSPOS=26624486;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000000005040024000200;WGT=1;VC=DIV;ASP;VLD;KGPhase3;CAF=0.9968,0.003244;COMMON=1\n"
     ]
    }
   ],
   "source": [
    "tail -n 2 variant.vcf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92acd376-3647-4e9e-89b2-00b5b4b860bf",
   "metadata": {},
   "source": [
    "4. Activer les outils bedtools (v2.30.0) et bc (1.07.1) dont nous aurons besoin plus loin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "49e89bca-a3da-4b88-9699-666dd55ce5a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "module load bedtools/2.30.0 bc/1.07.1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "107cabf7-3ed2-4fdd-8fa5-c1c24d1d43bb",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Problème\n",
    "\n",
    "##### *Question scientifique*\n",
    "\n",
    "Mes gènes différentiellement exprimés contiennent-ils plus de pics de chip qu'attendu ? (oui/non, pas de quantification)\n",
    "\n",
    "En d'autres termes, y a-t-il plus ou moins de pics de chip-seq dans les gènes différentiellement exprimés par rapport aux gènes non différentiellement exprimés ?\n",
    "\n",
    "##### *Données*\n",
    "- DEgenes.txt\n",
    "- picChip.bed\n",
    "- hsGRCh38.genes.gff3\n",
    "\n",
    "Afficher les cinq premières lignes des fichiers <code>DEgenes.txt</code>, <code>picChip.bed</code> et <code>hsGRCh38.genes.gff3</code> :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "bb52b1d0-7643-40cd-9db3-263d568ec9f9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==> DEgenes.txt <==\n",
      "ENSG00000004846\n",
      "ENSG00000005981\n",
      "ENSG00000006747\n",
      "ENSG00000015568\n",
      "ENSG00000047457\n",
      "\n",
      "==> picChip.bed <==\n",
      "22\t16192349\t16192565\tregion_1\n",
      "22\t16846630\t16870710\tregion_2\n",
      "22\t17067019\t17067283\tregion_3\n",
      "22\t17070620\t17114004\tregion_4\n",
      "22\t17128021\t17136091\tregion_5\n",
      "\n",
      "==> hsGRCh38.genes.gff3 <==\n",
      "1\tensembl_havana\tgene\t65419\t71585\t.\t+\t.\tID=gene:ENSG00000186092;Name=OR4F5;biotype=protein_coding;description=olfactory receptor family 4 subfamily F member 5 [Source:HGNC Symbol%3BAcc:HGNC:14825];gene_id=ENSG00000186092;logic_name=ensembl_havana_gene;version=6\n",
      "1\tensembl_havana\tgene\t450703\t451697\t.\t-\t.\tID=gene:ENSG00000284733;Name=OR4F29;biotype=protein_coding;description=olfactory receptor family 4 subfamily F member 29 [Source:HGNC Symbol%3BAcc:HGNC:31275];gene_id=ENSG00000284733;logic_name=ensembl_havana_gene;version=1\n",
      "1\tensembl_havana\tgene\t685679\t686673\t.\t-\t.\tID=gene:ENSG00000284662;Name=OR4F16;biotype=protein_coding;description=olfactory receptor family 4 subfamily F member 16 [Source:HGNC Symbol%3BAcc:HGNC:15079];gene_id=ENSG00000284662;logic_name=ensembl_havana_gene;version=1\n",
      "1\tensembl_havana\tgene\t923928\t944581\t.\t+\t.\tID=gene:ENSG00000187634;Name=SAMD11;biotype=protein_coding;description=sterile alpha motif domain containing 11 [Source:HGNC Symbol%3BAcc:HGNC:28706];gene_id=ENSG00000187634;logic_name=ensembl_havana_gene;version=11\n",
      "1\tensembl_havana\tgene\t944204\t959309\t.\t-\t.\tID=gene:ENSG00000188976;Name=NOC2L;biotype=protein_coding;description=NOC2 like nucleolar associated transcriptional repressor [Source:HGNC Symbol%3BAcc:HGNC:24517];gene_id=ENSG00000188976;logic_name=ensembl_havana_gene;version=10\n"
     ]
    }
   ],
   "source": [
    "head -n 5 DEgenes.txt picChip.bed hsGRCh38.genes.gff3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95e6c422-9dc5-4e3e-9239-81c0e179f8b5",
   "metadata": {},
   "source": [
    "<code>DEgenes.txt</code> est notre liste de gènes DE :\n",
    "\n",
    "![](images/liste_geneDE.png)\n",
    "\n",
    ".\n",
    "\n",
    ".\n",
    "\n",
    "<code>picChip.bed</code> est notre liste de pics issus du chip-seq et positionnés sur le génome :\n",
    "\n",
    "![](images/liste_chip.png)\n",
    "\n",
    ".\n",
    "\n",
    ".\n",
    "\n",
    "Pour croiser ces données il faut positionner les gènes DE sur le génome, on fait cela grâce au fichier d'annotation (gènes non DE plus clairs) :\n",
    "\n",
    "![](images/genes_position_genome.png)\n",
    "\n",
    ".\n",
    "\n",
    ".\n",
    "\n",
    "Le positionnement des données sur le génome permet maintenant de les croiser :\n",
    "\n",
    "![](images/associationGeneDE_chip.png)\n",
    "\n",
    ".\n",
    "\n",
    ".\n",
    "\n",
    "Le but sera de compter le nombre gène dans chacune des 4 classes possibles :\n",
    "\n",
    "![](images/geneDE_Chip.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "baac3578-e77e-42e3-9775-95771c05bb0e",
   "metadata": {
    "tags": []
   },
   "source": [
    "##### *Protocole envisagé*\n",
    "<details><summary></summary>\n",
    "\n",
    "  A. Extraire les intervalles génomiques des gènes différentiellement exprimés (DE)\n",
    "    \n",
    "  B. Comparer les intervalles génomiques des gènes DE avec les régions H3K4me3 (ie. l'emplacement des pics de chip-seq)\n",
    "  \n",
    "  C. Compter le nombre de chevauchements entre les gènes DE et les pics\n",
    "\n",
    "  D. Trouver les intervalles génomiques des gènes non-différentiellement exprimés\n",
    "\n",
    "  E. Comparer ces intervalles génomiques avec les régions H3K4me3\n",
    "\n",
    "  F. Compter le nombre de chevauchements entre les gènes non DE et les pics\n",
    "\n",
    "  G. Comparer les nombres de chevauchements\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf068431-71f8-463c-bdb1-9d97525ec1d8",
   "metadata": {
    "tags": []
   },
   "source": [
    ".\n",
    "\n",
    ".\n",
    "\n",
    ".\n",
    "\n",
    "#### A. Extraire les intervalles génomiques des gènes différentiellement exprimés (DE)\n",
    "\n",
    "1. Chercher (<code>grep</code>) une liste de mots (<code>-f</code>) contenue dans le fichier DEgenes.txt dans le fichier <code>hsGRCh38.genes.gff3</code> et rediriger la sortie standard dans un fichier nommé <code>DEgenes.gff3</code>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "52d15cbf-191b-44dd-a14c-b0a1df960035",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "162 DEgenes.txt\n",
      "162 DEgenes.gff3\n",
      "21492 hsGRCh38.genes.gff3\n",
      "1\tensembl_havana\tgene\t33081104\t33120530\t.\t+\t.\tID=gene:ENSG00000142920;Name=AZIN2;biotype=protein_coding;description=antizyme inhibitor 2 [Source:HGNC Symbol%3BAcc:HGNC:29957];gene_id=ENSG00000142920;logic_name=ensembl_havana_gene;version=16\n",
      "1\tensembl_havana\tgene\t70852353\t71047808\t.\t-\t.\tID=gene:ENSG00000050628;Name=PTGER3;biotype=protein_coding;description=prostaglandin E receptor 3 [Source:HGNC Symbol%3BAcc:HGNC:9595];gene_id=ENSG00000050628;logic_name=ensembl_havana_gene;version=20\n",
      "1\tensembl_havana\tgene\t74198235\t74544393\t.\t+\t.\tID=gene:ENSG00000259030;Name=FPGT-TNNI3K;biotype=protein_coding;description=FPGT-TNNI3K readthrough [Source:HGNC Symbol%3BAcc:HGNC:42952];gene_id=ENSG00000259030;logic_name=ensembl_havana_gene;version=7\n",
      "1\tensembl_havana\tgene\t111722064\t111755824\t.\t-\t.\tID=gene:ENSG00000284755;Name=AL049557.1;biotype=protein_coding;description=inka box actin regulator 2 [Source:NCBI gene%3BAcc:55924];gene_id=ENSG00000284755;logic_name=ensembl_havana_gene;version=1\n",
      "1\tensembl_havana\tgene\t147756199\t147773362\t.\t-\t.\tID=gene:ENSG00000265107;Name=GJA5;biotype=protein_coding;description=gap junction protein alpha 5 [Source:HGNC Symbol%3BAcc:HGNC:4279];gene_id=ENSG00000265107;logic_name=ensembl_havana_gene;version=2\n",
      "1\tensembl_havana\tgene\t149886975\t149887364\t.\t+\t.\tID=gene:ENSG00000184260;Name=HIST2H2AC;biotype=protein_coding;description=histone cluster 2 H2A family member c [Source:HGNC Symbol%3BAcc:HGNC:4738];gene_id=ENSG00000184260;logic_name=ensembl_havana_gene;version=5\n",
      "1\tensembl_havana\tgene\t173903804\t173917378\t.\t-\t.\tID=gene:ENSG00000117601;Name=SERPINC1;biotype=protein_coding;description=serpin family C member 1 [Source:HGNC Symbol%3BAcc:HGNC:775];gene_id=ENSG00000117601;logic_name=ensembl_havana_gene;version=13\n",
      "1\tensembl_havana\tgene\t183186238\t183244900\t.\t+\t.\tID=gene:ENSG00000058085;Name=LAMC2;biotype=protein_coding;description=laminin subunit gamma 2 [Source:HGNC Symbol%3BAcc:HGNC:6493];gene_id=ENSG00000058085;logic_name=ensembl_havana_gene;version=14\n",
      "1\tensembl_havana\tgene\t203026498\t203078740\t.\t+\t.\tID=gene:ENSG00000143847;Name=PPFIA4;biotype=protein_coding;description=PTPRF interacting protein alpha 4 [Source:HGNC Symbol%3BAcc:HGNC:9248];gene_id=ENSG00000143847;logic_name=ensembl_havana_gene;version=15\n",
      "1\tensembl_havana\tgene\t205913048\t205943460\t.\t-\t.\tID=gene:ENSG00000174502;Name=SLC26A9;biotype=protein_coding;description=solute carrier family 26 member 9 [Source:HGNC Symbol%3BAcc:HGNC:14469];gene_id=ENSG00000174502;logic_name=ensembl_havana_gene;version=18\n"
     ]
    }
   ],
   "source": [
    "grep -f DEgenes.txt hsGRCh38.genes.gff3 > DEgenes.gff3\n",
    "wc -l DEgenes.txt\n",
    "wc -l DEgenes.gff3\n",
    "wc -l hsGRCh38.genes.gff3\n",
    "head DEgenes.gff3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1696cd01-46de-440f-9492-77c89ce369d5",
   "metadata": {
    "tags": []
   },
   "source": [
    ".\n",
    "\n",
    "\n",
    "#### B. Comparer les intervalles génomiques des gènes DE avec les régions H3K4me3 (ie. l'emplacement des pics de chip-seq)\n",
    "\n",
    "Comme on va utiliser la suite d'outils que propose [Bedtools](https://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html), avant tout, faisons un tour sur la documentation.\n",
    "\n",
    "Chaque sous-commande fonctionne sur le même principe :\n",
    "\n",
    "<code>bedtools [sub-command] [options]</code>\n",
    "\n",
    "Les options varient selon la commande. Il suffit de cliquer sur son nom sur le site pour avoir accès à la fonction de la commande, à ses options et à des exemples.\n",
    "\n",
    "Les trois qui nous intéresserons dans un premier temps sont [<code>intersect</code>](https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html), [<code>flank</code>](https://bedtools.readthedocs.io/en/latest/content/tools/flank.html) et [<code>closest</code>](https://bedtools.readthedocs.io/en/latest/content/tools/closest.html)\n",
    "\n",
    "Bedtools accepte quasiment tous les formats de fichiers tabulés (GFF3, BED, etc)\n",
    "\n",
    "Dans notre cas de figure, nous souhaitons savoir si des gènes DE contiennent des pics et obtenir le nom de ces gènes. Nous utilisons donc la sous-commande \"bedtools .....\" ?\n",
    "\n",
    "Le fichier A sera celui de l'annotation des gènes DE (<code>DEgenes.gff3</code>) et le B correspondra aux pics (<code>picChip.bed</code>). Quelle option choisir pour obtenir uniquement les annotations du fichier A qui croisent avec le fichier B ?\n",
    "\n",
    "Rediriger la sortie standard dans un fichier nommé <code>intersect_DEgenes_picChip.gff3</code>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "05adfa85-3fc9-41ef-8130-5fa279446f60",
   "metadata": {},
   "outputs": [],
   "source": [
    "bedtools intersect -a DEgenes.gff3 -b picChip.bed -wa  -u > intersect_DEgenes_picChip.gff3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffff467e-cdc8-4e25-9eb7-100b1e9feba0",
   "metadata": {},
   "source": [
    ".\n",
    "\n",
    "#### C. Compter le nombre de chevauchements entre les gènes DE et les pics\n",
    "\n",
    "Nous avons obtenu la liste des gènes dans lesquels se trouvent des pics (<code>intersect_DEgenes_picChip.gff3</code>). Nous avons maintenant besoin de les compter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "c18ab42b-aed0-424d-81ec-be40be39ea9d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "77 intersect_DEgenes_picChip.gff3\n"
     ]
    }
   ],
   "source": [
    "wc -l intersect_DEgenes_picChip.gff3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b21d36e1-75b0-4026-b73a-116946280ac3",
   "metadata": {
    "tags": []
   },
   "source": [
    ".\n",
    "\n",
    "\n",
    "#### D. Trouver les intervalles génomiques des gènes non-différentiellement exprimés\n",
    "\n",
    "grep -v : Chercher l'inverse d'une liste de gènes dans un texte et rediriger la sortie standard dans un fichier nommé <code>notDEgenes.gff3</code>\n",
    "\n",
    "On récupère donc tous les gènes dont le nom n'est pas dans la liste des gènes DE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b80b3af9-7cb5-43bd-b8d5-8bd12e9542f9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "21330 notDEgenes.gff3\n"
     ]
    }
   ],
   "source": [
    "grep -v -f DEgenes.txt hsGRCh38.genes.gff3 > notDEgenes.gff3\n",
    "wc -l notDEgenes.gff3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31dd0b7a-fd24-4430-8f27-5670b78c57bd",
   "metadata": {},
   "source": [
    "#### E.  Comparer ces intervalles génomiques avec les régions H3K4me3\n",
    "Faire l'intersection entre le fichier gff3 des gènes non DE et les pics de Chip-seq et écrire la sortie dans un fichier <code>intersect_notDEgenes_picChip.gff3</code>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "54320590-5175-455b-bca4-c8bb7dde1823",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9438 intersect_notDEgenes_picChip.gff3\n"
     ]
    }
   ],
   "source": [
    "bedtools intersect -a notDEgenes.gff3 -b picChip.bed -wa -u > intersect_notDEgenes_picChip.gff3\n",
    "wc -l intersect_notDEgenes_picChip.gff3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd6aeff0-ff25-4ea5-bb8e-ac0154b34e10",
   "metadata": {},
   "source": [
    ".\n",
    "\n",
    "\n",
    "#### F. Compter le nombre de chevauchements entre les gènes non DE et les pics\n",
    "\n",
    "Afin de pouvoir ensuite faire des ratios \"nb de gène avec au moins un chevauchements/nb de gène\" nous devons connaitre les valeurs suivantes :\n",
    "- Nombre de gènes DE\n",
    "- Nombre de gènes DE avec chevauchement\n",
    "- Nombre de gènes non DE\n",
    "- Nombre de gènes non DE avec chevauchement\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "1d212113-1ab5-416f-b735-a1be6cd51c96",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "162 DEgenes.gff3\n",
      "77 intersect_DEgenes_picChip.gff3\n",
      "21330 notDEgenes.gff3\n",
      "9438 intersect_notDEgenes_picChip.gff3\n"
     ]
    }
   ],
   "source": [
    "wc -l DEgenes.gff3\n",
    "wc -l intersect_DEgenes_picChip.gff3\n",
    "wc -l notDEgenes.gff3\n",
    "wc -l intersect_notDEgenes_picChip.gff3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2385cd90-6b6a-4613-8587-6f146a9cc21a",
   "metadata": {},
   "source": [
    ".\n",
    "\n",
    "#### G. Comparer les nombres de chevauchements\n",
    "\n",
    "Résumons nos informations :\n",
    " \n",
    " Ratios | avec pics | sans pic | total\n",
    "-----------|----------|----------|-------\n",
    "gene DE    |77        |162-77    |162\n",
    "gene non DE|9438      |21330-9438|21330    \n",
    "total      |          |          |21492\n",
    "\n",
    "Par simple calcul de pourcentage nous obtenons le tableau suivant :\n",
    "\n",
    "Ratios | avec pics | sans pic | total\n",
    "-----------|----------|----------|-------\n",
    "gene DE    |47.53%        | 52.47%    |162\n",
    "gene non DE|44.24%      |55.76%|21330    \n",
    "total      |          |          |21492\n",
    "\n",
    "En conclusion : \n",
    "On peut noter une légère hausse du nombre de pics dans les gènes DE vs non DE.\n",
    "\n",
    "/!\\ En revanche on ne peut pas affirmer ou infirmer que cela soit significatif /!\\\n",
    "\n",
    ".\n",
    "\n",
    ".\n",
    "\n",
    ".\n",
    "\n",
    "## Bonus n°1\n",
    "\n",
    "##### *Question scientifique*\n",
    "\n",
    "Quels sont les variants qui sont dans les promoteurs (2kb en amont du gène) des gènes différentiellement exprimés ?\n",
    "\n",
    "##### *Données*\n",
    "- DEgenes.gff3 (généré précédemment)\n",
    "- variant.vcf\n",
    "- chrs.len\n",
    "\n",
    "Afficher les cinq dernières lignes des fichiers <code>DEgenes.gff3</code> et <code>variant.vcf</code> :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "f6382f41-47cc-4f7f-9204-963cb4b60c29",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==> DEgenes.gff3 <==\n",
      "9\tensembl_havana\tgene\t127738317\t127778741\t.\t-\t.\tID=gene:ENSG00000095370;Name=SH2D3C;biotype=protein_coding;description=SH2 domain containing 3C [Source:HGNC Symbol%3BAcc:HGNC:16884];gene_id=ENSG00000095370;logic_name=ensembl_havana_gene;version=19\n",
      "9\tensembl_havana\tgene\t128149071\t128153455\t.\t+\t.\tID=gene:ENSG00000148346;Name=LCN2;biotype=protein_coding;description=lipocalin 2 [Source:HGNC Symbol%3BAcc:HGNC:6526];gene_id=ENSG00000148346;logic_name=ensembl_havana_gene;version=11\n",
      "9\tensembl_havana\tgene\t132725578\t132878777\t.\t-\t.\tID=gene:ENSG00000165695;Name=AK8;biotype=protein_coding;description=adenylate kinase 8 [Source:HGNC Symbol%3BAcc:HGNC:26526];gene_id=ENSG00000165695;logic_name=ensembl_havana_gene;version=9\n",
      "9\thavana\tgene\t133098121\t133163914\t.\t-\t.\tID=gene:ENSG00000285245;Name=AL162417.1;biotype=protein_coding;description=novel protein;gene_id=ENSG00000285245;logic_name=havana;version=1\n",
      "X\tensembl_havana\tgene\t71107404\t71112108\t.\t-\t.\tID=gene:ENSG00000147168;Name=IL2RG;biotype=protein_coding;description=interleukin 2 receptor subunit gamma [Source:HGNC Symbol%3BAcc:HGNC:6010];gene_id=ENSG00000147168;logic_name=ensembl_havana_gene;version=12\n",
      "\n",
      "==> variant.vcf <==\n",
      "Y\t26608388\trs761263486\tT\tTA\t.\t.\tRS=761263486;RSPOS=26608388;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000000005140026000200;WGT=1;VC=DIV;ASP;VLD;KGPhase3;CAF=0.9862,0.01379;COMMON=1\n",
      "Y\t26608388\trs77074025\tT\tTA\t.\t.\tRS=77074025;RSPOS=26608394;dbSNPBuildID=131;SSR=0;SAO=0;VP=0x050000000005000002000200;WGT=1;VC=DIV;ASP;CAF=0.9862,0.01379;COMMON=1\n",
      "Y\t26614668\trs760366233\tTC\tT\t.\t.\tRS=760366233;RSPOS=26614669;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000000005040024000200;WGT=1;VC=DIV;ASP;VLD;KGPhase3;CAF=0.9968,0.003244;COMMON=1\n",
      "Y\t26614668\trs376925794\tTC\tT\t.\t.\tRS=376925794;RSPOS=26614671;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000000005000002000200;WGT=1;VC=DIV;ASP;CAF=0.9968,0.003244;COMMON=1\n",
      "Y\t26624486\trs771929773\tT\tTA\t.\t.\tRS=771929773;RSPOS=26624486;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000000005040024000200;WGT=1;VC=DIV;ASP;VLD;KGPhase3;CAF=0.9968,0.003244;COMMON=1\n"
     ]
    }
   ],
   "source": [
    "tail -n5 DEgenes.gff3 variant.vcf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fb529e4-73e4-4f52-b291-0acc3ea934c2",
   "metadata": {},
   "source": [
    "##### *Protocole envisagé*\n",
    "<details><summary></summary>\n",
    "\n",
    "  A. Extraire la région située à 2kb en amont des gènes différentiellement exprimés (DE)\n",
    "    \n",
    "  B. Trouver tous les variants chevauchant les régions trouvées précedemment\n",
    "\n",
    "</details>\n",
    "\n",
    ".\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe524b39-abbb-4241-aaed-567bfea84066",
   "metadata": {},
   "source": [
    "Nous allons avoir besoin d'une autre sous-commande de bedtools : [<code>bedtools flank</code>](https://bedtools.readthedocs.io/en/latest/content/tools/flank.html) Regardez bien ses options.\n",
    "\n",
    "Notons qu'il y a une option obligatoire, <code>-g</code> pour communiquer à l'outil la taille des chromosomes. Pourquoi ?\n",
    "\n",
    "Lancer la commande et écrire la sortie standard dans un fichier <code>flank2kDEgenes.gff3</code>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "be749754-969d-47ff-b393-a45076292ce5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\tensembl_havana\tgene\t33079104\t33081103\t.\t+\t.\tID=gene:ENSG00000142920;Name=AZIN2;biotype=protein_coding;description=antizyme inhibitor 2 [Source:HGNC Symbol%3BAcc:HGNC:29957];gene_id=ENSG00000142920;logic_name=ensembl_havana_gene;version=16\n",
      "1\tensembl_havana\tgene\t71047809\t71049808\t.\t-\t.\tID=gene:ENSG00000050628;Name=PTGER3;biotype=protein_coding;description=prostaglandin E receptor 3 [Source:HGNC Symbol%3BAcc:HGNC:9595];gene_id=ENSG00000050628;logic_name=ensembl_havana_gene;version=20\n",
      "1\tensembl_havana\tgene\t74196235\t74198234\t.\t+\t.\tID=gene:ENSG00000259030;Name=FPGT-TNNI3K;biotype=protein_coding;description=FPGT-TNNI3K readthrough [Source:HGNC Symbol%3BAcc:HGNC:42952];gene_id=ENSG00000259030;logic_name=ensembl_havana_gene;version=7\n",
      "1\tensembl_havana\tgene\t111755825\t111757824\t.\t-\t.\tID=gene:ENSG00000284755;Name=AL049557.1;biotype=protein_coding;description=inka box actin regulator 2 [Source:NCBI gene%3BAcc:55924];gene_id=ENSG00000284755;logic_name=ensembl_havana_gene;version=1\n",
      "1\tensembl_havana\tgene\t147773363\t147775362\t.\t-\t.\tID=gene:ENSG00000265107;Name=GJA5;biotype=protein_coding;description=gap junction protein alpha 5 [Source:HGNC Symbol%3BAcc:HGNC:4279];gene_id=ENSG00000265107;logic_name=ensembl_havana_gene;version=2\n"
     ]
    }
   ],
   "source": [
    "bedtools flank -i DEgenes.gff3 -g chrs.len -l 2000 -r 0 -s > flank2kDEgenes.gff3\n",
    "head -n5 flank2kDEgenes.gff3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e946f02b-c74c-4ba2-bd70-58d7a7df7a60",
   "metadata": {},
   "source": [
    "Maintenant, c'est du déjà-vu : Trouver tous les variants chevauchant les régions trouvées précedemment. Renvoyer la sortie standard dans un fichier nommé <code>intersect_flank2kDEgenes_variant.txt</code>\n",
    "\n",
    "N'oubliez pas de regarder les options qui nous intéressent [ici](https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html) et visualisez les 5 premières lignes du fichier quand celui ci est généré"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "5b9268f8-637c-4a57-91ba-ccf48b2e6419",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\t33079702\trs10712676\tGA\tG\t.\t.\tRS=10712676;RSPOS=33079703;dbSNPBuildID=137;SSR=0;SAO=0;VP=0x0501000a0005170126000200;GENEINFO=AZIN2:113451|LOC105378635:105378635;WGT=1;VC=DIV;SLO;INT;R5;ASP;VLD;G5A;G5;GNO;KGPhase3;CAF=0.8706,0.1294;COMMON=1;TOPMED=0.96872610856269113,0.03127389143730886\n",
      "1\t74197393\trs201749320\tTTC\tT\t.\t.\tRS=201749320;RSPOS=74197394;dbSNPBuildID=137;SSR=0;SAO=0;VP=0x0500000a000504003e000200;GENEINFO=LRRIQ3:127255|FPGT-TNNI3K:100526835|FPGT:8790;WGT=1;VC=DIV;INT;R5;ASP;VLD;KGPhase1;KGPhase3;CAF=0.9958,0.004193;COMMON=1;TOPMED=0.99698967889908256,0.00301032110091743\n",
      "1\t147773875\trs72546659\tC\tCT\t.\t.\tRS=72546659;RSPOS=147773875;dbSNPBuildID=130;SSR=0;SAO=0;VP=0x0501100a000504003e000200;GENEINFO=GJA5:2702|LOC102723321:102723321;WGT=1;VC=DIV;TPA;SLO;INT;R5;ASP;VLD;KGPhase1;KGPhase3;CAF=0.9968,0.003195;COMMON=1;TOPMED=0.99143890163098878,0.00856109836901121\n",
      "1\t147775166\trs10641806\tA\tAAT\t.\t.\tRS=10641806;RSPOS=147775166;dbSNPBuildID=135;SSR=0;SAO=0;VP=0x0500000a000515003e000200;GENEINFO=GJA5:2702|LOC102723321:102723321;WGT=1;VC=DIV;INT;R5;ASP;VLD;G5;KGPhase1;KGPhase3;CAF=0.05811,0.9419;COMMON=1\n",
      "1\t173918736\trs577103838\tG\tGGAGGCA\t.\t.\tRS=577103838;RSPOS=173918736;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000020005040026000200;GENEINFO=SERPINC1:462;WGT=1;VC=DIV;R5;ASP;VLD;KGPhase3;CAF=0.9988,0.001198;COMMON=1;TOPMED=0.99892488532110091,0.00107511467889908\n",
      "405 intersect_flank2kDEgenes_variant.txt\n"
     ]
    }
   ],
   "source": [
    "bedtools intersect -a variant.vcf -b flank2kDEgenes.gff3 -wa -u > intersect_flank2kDEgenes_variant.txt\n",
    "head -n 5 intersect_flank2kDEgenes_variant.txt\n",
    "wc -l intersect_flank2kDEgenes_variant.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c68bcce3-1e44-4502-b93c-2b60636f476f",
   "metadata": {},
   "source": [
    "Pour aller plus loin avec bedtools <code>intersect</code>, on peut se demander combien y a-t-il de SNP par promoteur ? \n",
    "\n",
    "Stocker l'information dans un fichier <code>intersect_flanck2kDEgenes_n_variant.txt</code> et visualiser les 5 premières lignes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "4050259b-607e-4d70-9b46-ccd6044399e9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\tensembl_havana\tgene\t33079104\t33081103\t.\t+\t.\tID=gene:ENSG00000142920;Name=AZIN2;biotype=protein_coding;description=antizyme inhibitor 2 [Source:HGNC Symbol%3BAcc:HGNC:29957];gene_id=ENSG00000142920;logic_name=ensembl_havana_gene;version=16\t1\n",
      "1\tensembl_havana\tgene\t71047809\t71049808\t.\t-\t.\tID=gene:ENSG00000050628;Name=PTGER3;biotype=protein_coding;description=prostaglandin E receptor 3 [Source:HGNC Symbol%3BAcc:HGNC:9595];gene_id=ENSG00000050628;logic_name=ensembl_havana_gene;version=20\t0\n",
      "1\tensembl_havana\tgene\t74196235\t74198234\t.\t+\t.\tID=gene:ENSG00000259030;Name=FPGT-TNNI3K;biotype=protein_coding;description=FPGT-TNNI3K readthrough [Source:HGNC Symbol%3BAcc:HGNC:42952];gene_id=ENSG00000259030;logic_name=ensembl_havana_gene;version=7\t1\n",
      "1\tensembl_havana\tgene\t111755825\t111757824\t.\t-\t.\tID=gene:ENSG00000284755;Name=AL049557.1;biotype=protein_coding;description=inka box actin regulator 2 [Source:NCBI gene%3BAcc:55924];gene_id=ENSG00000284755;logic_name=ensembl_havana_gene;version=1\t0\n",
      "1\tensembl_havana\tgene\t147773363\t147775362\t.\t-\t.\tID=gene:ENSG00000265107;Name=GJA5;biotype=protein_coding;description=gap junction protein alpha 5 [Source:HGNC Symbol%3BAcc:HGNC:4279];gene_id=ENSG00000265107;logic_name=ensembl_havana_gene;version=2\t2\n"
     ]
    }
   ],
   "source": [
    "bedtools intersect -a flank2kDEgenes.gff3 -b variant.vcf -c > intersect_flanck2kDEgenes_n_variant.txt\n",
    "head -n 5 intersect_flanck2kDEgenes_n_variant.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6473abf6-e41a-42c6-bdca-769091a32ed9",
   "metadata": {},
   "source": [
    ".\n",
    "\n",
    ".\n",
    "\n",
    ".\n",
    "\n",
    "## Bonus n°2\n",
    "\n",
    "##### *Question scientifique*\n",
    "\n",
    "Quels sont les gènes différentiellement exprimés les plus proches des pics ChIP ET qui contiennent une mutation ?\n",
    "\n",
    "##### *Données*\n",
    "- DEgenes.gff3 (généré dans le problème 1)\n",
    "- variant.vcf\n",
    "- picChip.bed\n",
    "\n",
    "Afficher les cinq dernières lignes des fichiers <code>DEgenes.gff3</code>, <code>variant.vcf</code> et <code>picChip.bed</code> :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "29e103eb-5651-4e87-b581-ac1ad7967ed2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==> DEgenes.gff3 <==\n",
      "9\tensembl_havana\tgene\t127738317\t127778741\t.\t-\t.\tID=gene:ENSG00000095370;Name=SH2D3C;biotype=protein_coding;description=SH2 domain containing 3C [Source:HGNC Symbol%3BAcc:HGNC:16884];gene_id=ENSG00000095370;logic_name=ensembl_havana_gene;version=19\n",
      "9\tensembl_havana\tgene\t128149071\t128153455\t.\t+\t.\tID=gene:ENSG00000148346;Name=LCN2;biotype=protein_coding;description=lipocalin 2 [Source:HGNC Symbol%3BAcc:HGNC:6526];gene_id=ENSG00000148346;logic_name=ensembl_havana_gene;version=11\n",
      "9\tensembl_havana\tgene\t132725578\t132878777\t.\t-\t.\tID=gene:ENSG00000165695;Name=AK8;biotype=protein_coding;description=adenylate kinase 8 [Source:HGNC Symbol%3BAcc:HGNC:26526];gene_id=ENSG00000165695;logic_name=ensembl_havana_gene;version=9\n",
      "9\thavana\tgene\t133098121\t133163914\t.\t-\t.\tID=gene:ENSG00000285245;Name=AL162417.1;biotype=protein_coding;description=novel protein;gene_id=ENSG00000285245;logic_name=havana;version=1\n",
      "X\tensembl_havana\tgene\t71107404\t71112108\t.\t-\t.\tID=gene:ENSG00000147168;Name=IL2RG;biotype=protein_coding;description=interleukin 2 receptor subunit gamma [Source:HGNC Symbol%3BAcc:HGNC:6010];gene_id=ENSG00000147168;logic_name=ensembl_havana_gene;version=12\n",
      "\n",
      "==> variant.vcf <==\n",
      "Y\t26608388\trs761263486\tT\tTA\t.\t.\tRS=761263486;RSPOS=26608388;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000000005140026000200;WGT=1;VC=DIV;ASP;VLD;KGPhase3;CAF=0.9862,0.01379;COMMON=1\n",
      "Y\t26608388\trs77074025\tT\tTA\t.\t.\tRS=77074025;RSPOS=26608394;dbSNPBuildID=131;SSR=0;SAO=0;VP=0x050000000005000002000200;WGT=1;VC=DIV;ASP;CAF=0.9862,0.01379;COMMON=1\n",
      "Y\t26614668\trs760366233\tTC\tT\t.\t.\tRS=760366233;RSPOS=26614669;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000000005040024000200;WGT=1;VC=DIV;ASP;VLD;KGPhase3;CAF=0.9968,0.003244;COMMON=1\n",
      "Y\t26614668\trs376925794\tTC\tT\t.\t.\tRS=376925794;RSPOS=26614671;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000000005000002000200;WGT=1;VC=DIV;ASP;CAF=0.9968,0.003244;COMMON=1\n",
      "Y\t26624486\trs771929773\tT\tTA\t.\t.\tRS=771929773;RSPOS=26624486;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000000005040024000200;WGT=1;VC=DIV;ASP;VLD;KGPhase3;CAF=0.9968,0.003244;COMMON=1\n",
      "\n",
      "==> picChip.bed <==\n",
      "18\t77792172\t77797567\tregion_52418\n",
      "18\t77865917\t77869460\tregion_52419\n",
      "18\t78004811\t78005421\tregion_52420\n",
      "18\t78005736\t78005857\tregion_52421\n",
      "18\t78016442\t78016675\tregion_52422\n"
     ]
    }
   ],
   "source": [
    "tail -n5 DEgenes.gff3 variant.vcf picChip.bed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73162ff3-57f3-4e84-b31b-df7e3b10cea3",
   "metadata": {},
   "source": [
    "##### *Protocole envisagé*\n",
    "<details><summary></summary>\n",
    "\n",
    "  1. Trouver les pics qui contiennent une mutation\n",
    "    \n",
    "  2. Trouver le gène le plus proche de chaque région précédemment trouvée\n",
    "\n",
    "</details>\n",
    "\n",
    ".\n",
    "\n",
    ".\n",
    "\n",
    ".\n",
    "\n",
    "#### Trouver les pics qui contiennent une mutation\n",
    "\n",
    "Rien de neuf sous le soleil, nous stockons l'output dans un fichier <code>picVariant.bed</code>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "34797d92-7e3c-4c8d-b3ee-b7c5443c6e9e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "22\t16846630\t16870710\tregion_2\n",
      "22\t17070620\t17114004\tregion_4\n",
      "22\t17128021\t17136091\tregion_5\n",
      "22\t17147545\t17173488\tregion_6\n",
      "22\t17192673\t17206732\tregion_7\n"
     ]
    }
   ],
   "source": [
    "bedtools intersect -a picChip.bed -b variant.vcf -u > picVariant.bed\n",
    "\n",
    "## L'option -u permet de garder l'ensemble de l'information des pics sans ajouter celle des variants\n",
    "head -n 5 picVariant.bed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c21c87b8-5e9a-49e0-8832-f3bfb767b524",
   "metadata": {},
   "source": [
    ".\n",
    "\n",
    "#### Trouver le gène le plus proche de chaque région précédemment trouvée\n",
    "\n",
    "Nous avons besoin d'une nouvelle sous-commande : [<code>bedtools closest</code>](https://bedtools.readthedocs.io/en/latest/content/tools/closest.html)\n",
    "\n",
    "Attention, bien prendre note du message suivant :\n",
    "![](images/exo3_graph1.png)\n",
    "\n",
    "L'output est à stocker dans un fichier nommé <code>picVariant_sorted.bed</code>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "e374d7be-9faa-45d9-8d36-edff635648d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "sort -k1,1 -k2,2n picVariant.bed > picVariant_sorted.bed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "402d7339-b2da-45c3-993d-7be80bd5d036",
   "metadata": {},
   "source": [
    "Nous tentons un essai en paramètre par défaut. Rediriger la sortie dans un fichier <code>DEgenes_closest_picVariant.bed</code>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "dc3313c2-0e89-40ef-b14a-3bdefd71897b",
   "metadata": {},
   "outputs": [],
   "source": [
    "bedtools closest -a picVariant_sorted.bed -b DEgenes.gff3 > DEgenes_closest_picVariant.bed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21505fce-e754-4466-96fa-72a687d20d3c",
   "metadata": {},
   "source": [
    "Et si on prenait en compte le brin ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "1ab6e978-4f88-4d09-b5c4-6ba55e89e2e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "bedtools closest -a picVariant_sorted.bed -b DEgenes.gff3 -s > DEgenes_closest_picVariant.bed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42148205-cee4-455a-b481-666d2028fb09",
   "metadata": {},
   "source": [
    "Afficher le nombre de ligne"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "9c68d092-019c-4f06-b6c7-afff61222b81",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "28080 DEgenes_closest_picVariant.bed\n"
     ]
    }
   ],
   "source": [
    "wc -l DEgenes_closest_picVariant.bed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "139d3273-3590-4b47-99d2-ea398fab39a8",
   "metadata": {},
   "source": [
    ".\n",
    "\n",
    ".\n",
    "\n",
    ".\n",
    "\n",
    "## Bonus n°3\n",
    "\n",
    "#### *Devine la commande*\n",
    "\n",
    "Ce dernier bonus est l'occasion de voir des commandes qui peuvent être utiles mais que nous n'avons pas pu aborder précédemment \n",
    "\n",
    "#### *Awk*\n",
    "\n",
    "1. Pour commencer, un peu de Awk. Vous trouverez un fichier nommé .DEgenes_all.gff3 (ls -la pour le voir). Récuperez uniquement les annotations de gènes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "92e54fa8-0bd3-4d8d-bebb-23c6eced3f1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "awk 'BEGIN {OFS=FS=\"\\t\"} $3 == \"gene\" {print}' \".DEgenes_all.gff3\" > DEgenes.gff3 "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "983c4c3d-376f-425a-9525-1650cd762059",
   "metadata": {},
   "source": [
    "- <code>FS</code> et <code>OFS</code> : Permettrent de définir le séparateur de champ du fichier\n",
    "- <code>$3 == \"gene\"</code> : Permet de vérifier si la colonne 3 est une annotation de gène\n",
    "- <code>print</code> : Imprime à l'écran la ligne quand la condition est vraie"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d74ef3ea-e768-45da-8559-ddb6d0c1fb56",
   "metadata": {},
   "source": [
    "2. La dernière colonne du fichier intersect_flanck2kDEgenes_n_variant.txt contient le nombre de variants par promoteur. Comment s'assurer que nous avons bien pris en compte les 405 variants initialement trouvé à la question bonus 1 ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "f59b0641-1620-426b-912b-8571b127ca5d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "405\n"
     ]
    }
   ],
   "source": [
    "awk -F '\\t' 'BEGIN{s=0}{s+=$NF}END{print s}' intersect_flanck2kDEgenes_n_variant.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2e4c7c4-6942-4548-8801-237053bdf7e8",
   "metadata": {},
   "source": [
    "- <code>-F</code> : Permet de définir le séparateur de champ dans l'input\n",
    "- <code>s=0</code> : Initialisation d'une variable s qui vaut 0 avant de lire le fichier\n",
    "- <code>s+=$NF</code> : Pour chaque ligne, on ajoute la valeur de la dernière colonne à la somme s\n",
    "- <code>print s</code> : Une fois l'ensemble du fichier parcouru, on imprime la dernière valeur de la somme s"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e69a0772-f05e-47f1-af42-96a0902652be",
   "metadata": {},
   "source": [
    "#### *Alternative à grep -v*\n",
    "\n",
    "Dans l'exercice, point D, on utilise <code>grep -v</code> pour extraire les coordonnées génomiques des gènes non DE.\n",
    "\n",
    "Il était aussi possible de le faire avec la commande <code>intersect</code>.\n",
    "\n",
    "Trouverez vous comment ?\n",
    "\n",
    "Attention aux options !"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "eea1f27a-4a66-4a40-aeb7-ef48ac7c770e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "21330 notDEgenes.gff3\n"
     ]
    }
   ],
   "source": [
    "bedtools intersect -a hsGRCh38.genes.gff3 -b DEgenes.gff3 -v -f 1 -F 1 -s -wa > notDEgenes.gff3\n",
    "wc -l notDEgenes.gff3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db332ffb-c07e-4140-8183-d44f3876f5a1",
   "metadata": {},
   "source": [
    "L'option la plus importante pour cette question est donc l'option <code>-v</code>. \n",
    "\n",
    "Mais par défaut, <code>intersect</code> valide une intersection dès qu'il y a une base en commun entre les 2 régions. Dans ce cas d'usage où l'on traite des gènes en entier, il suffit que quelques gènes se chevauchent pour qu'ils soient exclus à tort. La façon correcte est donc de préciser que l'on veut valider les chevauchements que si les gènes se chevauchent à 100% entre eux. Les options pour spécifier un % de chavauchement (exprimés entre <code>[0-1]</code>, 1 pour 100%) sont <code>-F</code> pour le fichier A et <code>-f</code> pour le fichier B. Il faut aussi que les 2 gènes soient sur le même brin (option <code>-s</code>)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}