{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Combining graphs from multiple data sources\n", "In this notebook you will query protein interactions (UniProt) and gene expression in tissues (Bgee). Finally, you will learn how to assemble a new graph of co-expressed genes. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import rdflib\n", "from SPARQLWrapper import SPARQLWrapper, JSON, TURTLE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Find proteins interacting with SCN5A (from UniProt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By browsing http://uniprot.org, find the web page describing SCN5A_HUMAN. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Look at the RDF graph describing SCN5A \n", "\n", "You can access the RDF directly at https://www.uniprot.org/uniprotkb/Q14524.ttl . " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the following graph pattern, write a SPARQL query to find the proteins interacting with SCN5A. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find all protein-protein interactions (PPI) in which Q14524 is a participant.\n", "You can use the following graph pattern: \n", "```\n", "VALUES ?P1 {uniprot:Q14524}\n", "?interaction up:participant ?P1 . 
\n", "?interaction up:participant ?P2 .\n", "?P1 up:mnemonic ?P1_label .\n", "?P2 up:mnemonic ?P2_label .\n", "FILTER (?P2 != ?P1)\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uniprot_query = \"\"\"\n", "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n", "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n", "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n", "PREFIX dcterms: <http://purl.org/dc/terms/>\n", "\n", "PREFIX taxon: <http://purl.uniprot.org/taxonomy/>\n", "PREFIX uniprot: <http://purl.uniprot.org/uniprot/>\n", "PREFIX up: <http://purl.uniprot.org/core/>\n", "\n", "SELECT * WHERE {\n", "\n", "} \n", "\n", "\"\"\"\n", "\n", "sparql = SPARQLWrapper(\"http://sparql.uniprot.org/sparql/\")\n", "sparql.setQuery(uniprot_query)\n", "sparql.setReturnFormat(JSON)\n", "results = sparql.query().convert()\n", "print(results['results']['bindings'])\n", "\n", "list_of_genes = [\"SCN5A\"]\n", "for r in results['results']['bindings']:\n", "    # ?nb_expe is only bound if your query selects it; fall back to 'n/a' otherwise\n", "    nb_expe = r.get('nb_expe', {}).get('value', 'n/a')\n", "    print(f\"{r['P1_label']['value']} <-> {r['P2_label']['value']} in {nb_expe} experiments.\")\n", "    list_of_genes.append(r['P2_label']['value'].split(\"_HUMAN\")[0])\n", "\n", "list_of_genes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Post-process your results to get a list of gene identifiers (remove the \"_HUMAN\" suffix)\n", "\n", "You should get something like `['SCN5A', 'KCC2D', 'FGF12', 'ZMY19', 'EMC9', 'BANP', 'Q49AR9', 'TEKT4', 'PTN3']`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "genes = '\\\"'+\"\\\" \\\"\".join(list_of_genes)+'\\\"'\n", "genes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Filter tissues in which genes are expressed (from Bgee)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1. Find tissues in which a gene of interest is expressed \n", "\n", "Bgee is an RDF gene expression dataset that integrates GTEx. \n", "\n", "Based on the following graph patterns, assemble a SPARQL query to retrieve the tissues in which TEKT4 is expressed in humans (http://purl.uniprot.org/taxonomy/9606). 
\n", "\n", "*Anatomical entities*\n", "```\n", "?anatEntity a genex:AnatomicalEntity ;\n", "    rdfs:label ?anatName .\n", "```\n", "\n", "*Exact (case-sensitive) matching of a string value*\n", "```\n", "FILTER (?geneName = 'TEKT4')\n", "```\n", "\n", "*Human organisms*\n", "```\n", "?organism obo:RO_0002162 <http://purl.uniprot.org/taxonomy/9606> . \n", "```\n", "\n", "*Some genes from some organisms*\n", "```\n", "?seq a orth:Gene;\n", "    orth:organism ?organism ;\n", "    rdfs:label ?geneName .\n", "```\n", "\n", "*Some genes expressed in some tissues*\n", "```\n", "?seq genex:isExpressedIn ?anatEntity.\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bgee_query = \"\"\"\n", "PREFIX orth: <http://purl.org/net/orth#>\n", "PREFIX genex: <http://purl.org/genex#>\n", "PREFIX obo: <http://purl.obolibrary.org/obo/>\n", "\n", "SELECT DISTINCT ?anatEntity ?anatName WHERE {\n", " \n", "}\n", "\"\"\"\n", "\n", "sparql = SPARQLWrapper(\"http://bgee.org/sparql\")\n", "sparql.setQuery(bgee_query)\n", "sparql.setReturnFormat(JSON)\n", "res = sparql.query().convert()\n", "#print(res[\"results\"][\"bindings\"])\n", "for r in res[\"results\"][\"bindings\"]:\n", "    print(f\"{r['anatEntity']['value']}: {r['anatName']['value']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2. Build a subgraph with a CONSTRUCT ... WHERE query \n", "A \"CONSTRUCT\" query builds a subgraph from the graph pattern matched in the \"WHERE\" clause. \n", "\n", "Structure of a \"CONSTRUCT\" query: \n", "```\n", "CONSTRUCT {\n", "... subgraph pattern ...\n", "} WHERE {\n", "... graph pattern ...\n", "}\n", "```\n", "\n", "Reuse the \"WHERE\" clause of the previous query to build a subgraph with only `genex:isExpressedIn` and `rdfs:label` relations. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bgee_subgraph = \"\"\"\n", "PREFIX orth: \n", "PREFIX genex: \n", "PREFIX obo: \n", "\n", "CONSTRUCT {\n", " \n", "} WHERE {\n", " \n", "}\n", "\"\"\"\n", "sparql = SPARQLWrapper(\"http://bgee.org/sparql\")\n", "sparql.setQuery(bgee_subgraph)\n", "results = sparql.query().convert()\n", "print(len(results))\n", "#print(results)\n", "print(results.serialize(format=\"turtle\"))\n", "\n", "KG = results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can use a VALUES clause to inject the results of the previous query `VALUES ?x { v1 v2 v3 ... vN }` in this new query. \n", "\n", "Modify the query to inject `\"SCN5A\" \"KCC2D\" \"FGF12\" \"ZMY19\" \"EMC9\" \"BANP\" \"Q49AR9\" \"TEKT4\" \"PTN3\"` as gene of interest. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bgee_subgraph = \"\"\"\n", "PREFIX orth: \n", "PREFIX genex: \n", "PREFIX obo: \n", "\n", "CONSTRUCT {\n", "\n", "} WHERE {\n", "\n", "}\n", "\"\"\"\n", "sparql = SPARQLWrapper(\"http://bgee.org/sparql\")\n", "sparql.setQuery(bgee_subgraph)\n", "results = sparql.query().convert()\n", "print(len(results))\n", "#print(results)\n", "print(results.serialize(format=\"turtle\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5 Build a network of genes that are co-expressed in the same tissue\n", "\n", "The result of a \"CONSTRUCT\" is a graph object. You can now write and execute a CONSTRUCT query on this graph to create new `coExpressedWith` between genes. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "q2 = \"\"\"\n", "PREFIX genex: \n", "PREFIX etbii:\n", "\n", "CONSTRUCT {\n", " \n", " ?gene1 etbii:coExpressedWith ?gene2 .\n", " \n", "} WHERE {\n", " \n", "}\n", "\"\"\"\n", "\n", "coExNet = KG.query(q2)\n", "print(coExNet.serialize(format=\"turtle\").decode())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }