{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Practical introduction to RDF and SPARQL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Version:** 1.3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Objective"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this notebook is to make you comfortable with representing (simple) knowledge graphs in RDF, and then write simple SPARQL queries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reminder: IRIs and Literals\n",
    "**Resource** refer two complex objects identified by an **IRI** (International Resource Identifier == URI allowing international characters). Note that URLs are IRIs pointing to web accessible documents/data. URIs can be shortened with **PREFIX**. As an example `<http://my/super/vocab/my_term>` can be shortened as `ns:my_term` if `ns` is defined as a prefix for `http://my/super/vocab/`. \n",
    "\n",
    "**Literals** refer two simple values (numercial values, strings, boolean, dates)\n",
    "\n",
    "## Reminder: RDF, triples\n",
    "1. an RDF **statement** represents a **relationship** between two resources: a **subject** and an **object**\n",
    "1. relationships are directional and are called a .red[predicates] (or RDF properties)\n",
    "1. (logical) statements are called **triple** : {`subject`, `predicate`, `object`}\n",
    "1. a set of triples form a **directed labelled graph** : subject nodes are IRIs, edges are predicate (IRIs only),  object nodes are IRIs or Literals. \n",
    "\n",
    "Go through https://www.w3.org/TR/rdf11-primer/ to have more details on RDF. \n",
    "\n",
    "## Reminder: Turtle syntax\n",
    "- header to define prefix\n",
    "   - example: with `@prefix ns: http://my_voc# .`, `http://my_voc#term` can be written as `ns:term` \n",
    "- generally one line per triple with a `.` at the end: `<subject> <predicate> <object> .`\n",
    "- possible shortcuts to share the same subject: `;` \n",
    "```\n",
    "s p1 o1 ; \n",
    "    p2 o2 .\n",
    "```\n",
    "- possible shortcuts to share the same subject-predicate: `,` \n",
    "```\n",
    "s p o1, o2, o3 .\n",
    "```\n",
    "\n",
    "## Example\n",
    "turtle syntax: \n",
    "```ruby\n",
    "<http://HG37> rdf:type <http://human_genome> .\n",
    "<http://sample1> <http://is_aligned_with> <http://HG37> .\n",
    "<http://sample1> rdfs:comment \"Sample 1 from Study X [...]\"^^xsd:string .\n",
    "```\n",
    "\n",
    "or \n",
    "\n",
    "```turtle\n",
    "<http://HG37> rdf:type <http://human_genome> .\n",
    "<http://sample1> <http://is_aligned_with> <http://HG37> ;\n",
    "              rdfs:comment \"Sample 1 from Study X [...]\"^^xsd:string .\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 1\n",
    " \n",
    "1. Consider the following RDF properties `family:has_mother`, `family:has_father`, `family:has_sister`\n",
    "2. Represent with RDF triples the following family:\n",
    "    -  *The mother of John is Mary*,\n",
    "    -  *Mickael is the son of Mark*,\n",
    "    -  *Mickael and John are cousins*,\n",
    "    -  *Mark is the uncle of John*.\n",
    "3. Go to https://www.ldf.fi/service/rdf-grapher \n",
    "4. Generate a graphical representation of the RDF graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_rdf_data = \"\"\"\n",
    "# Enter your prefixes and your triples here\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SPARQL hands-on"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SPARQL is the standards language to query multiple data sources expressed in RDF. The principle consists in defining a **graph pattern** to be matched against an RDF graph."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <font color='red'>Definition</font>\n",
    "**Triple Patterns** (TPs) are like RDF triples except that each of the *subject*, *predicate* and *object* may be a **variable**. Variables are prefixed with a `?` . \n",
    "\n",
    "## <font color='green'>Example</font>\n",
    "Triple pattern\n",
    "```ruby\n",
    "?x <is_a_variant_of> <RAC1> .\n",
    "```\n",
    "\n",
    "RDF graph\n",
    "```ruby\n",
    "<SNP:123> <is_a_variant_of> <NEMO> .\n",
    "<SNP:rs527330002> <is_a_variant_of> <RAC1> .\n",
    "<SNP:rs527330002> <refers_to_organism> <http://www.uniprot.org/taxonomy/9606> .\n",
    "<SNP:rs61753123> <is_a_variant_of> <RAC1> .\n",
    "```\n",
    "\n",
    "Bindings of variables `?x`\n",
    "```ruby\n",
    "?x = <SNP:rs527330002>\n",
    "?x = <SNP:rs61753123>\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <font color='red'>Definition</font>\n",
    "**Basic Graph Patterns** (BGPs) consist in a set of triple patterns to be matched against an RDF graph.\n",
    "## <font color='green'>Example</font>\n",
    "Basic graph pattern\n",
    "```ruby\n",
    "?x <is_a_variant_of> <RAC1> .\n",
    "?x <is_a_risk_factor_for> ?z\n",
    "```\n",
    "![:scale 60%](fig/bgp.png)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4 Types of SPARQL queries\n",
    "- **SELECT** : returns the variables values (i.e. bound variables) for each graph pattern match ;\n",
    "- **CONSTRUCT** : returns an RDF graph constructed by substituting variables in a set of triple patterns ;\n",
    "- **ASK** : returns a boolean (true/false) indicating whether a query pattern matches or not ;\n",
    "- **DESCRIBE** : returns an RDF graph that describes the resources found (resources neighborhood).\n",
    "\n",
    "<br>\n",
    "<br>\n",
    "Additional features: Optional BGPs, union, filters, aggregate functions, negation, service, *etc.*\n",
    "\n",
    "# Anatomy of a SPARQL query\n",
    "\n",
    "![:scale 95%](fig/anat.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`DESCRIBE <http://identifiers.org/hgnc.symbol/RAC1>`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 2\n",
    "We will now use the RDFlib package to parse RDF Data and do some very basic SPARQL queries. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the knowledge graph contains 4 triples\n",
      "\n",
      "@prefix ns: <http://my_voc/> .\n",
      "@prefix snp: <http://my_snps/> .\n",
      "\n",
      "snp:123 ns:is_a_variant_of \"NEMO\" .\n",
      "\n",
      "snp:rs527330002 ns:is_a_variant_of \"RAC1\" ;\n",
      "    ns:refers_to_organism <http://www.uniprot.org/taxonomy/9606> .\n",
      "\n",
      "snp:rs61753123 ns:is_a_variant_of \"RAC1\" .\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from rdflib import Graph\n",
    "\n",
    "# RDF graph, in turtle syntax, stored in a string\n",
    "my_rdf_data = \"\"\"\n",
    "@prefix ns: <http://my_voc/> .\n",
    "@prefix snp: <http://my_snps/> .\n",
    "\n",
    "snp:123 ns:is_a_variant_of \"NEMO\" .\n",
    "snp:rs527330002 ns:is_a_variant_of \"RAC1\" .\n",
    "snp:rs527330002 ns:refers_to_organism <http://www.uniprot.org/taxonomy/9606> .\n",
    "snp:rs61753123 ns:is_a_variant_of \"RAC1\" .\n",
    "\"\"\"\n",
    "\n",
    "# Initialization of the in-memory RDF graph, RDFlib Graph object\n",
    "kg = Graph()\n",
    "\n",
    "# Parsing of the RDF data\n",
    "kg.parse(data=my_rdf_data, format='turtle')\n",
    "\n",
    "# Printing the size of the graph and serializing it again. \n",
    "print(f'the knowledge graph contains {len(kg)} triples\\n')\n",
    "print(kg.serialize(format=\"turtle\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now execute a simple query to search for all \"variants\" of `RAC1`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "http://my_snps/rs527330002 is a variant of RAC1\n",
      "http://my_snps/rs61753123 is a variant of RAC1\n"
     ]
    }
   ],
   "source": [
    "q = \"\"\"\n",
    "SELECT ?x WHERE {\n",
    "    ?x ns:is_a_variant_of \"RAC1\" .\n",
    "}\n",
    "\"\"\"\n",
    "\n",
    "res = kg.query(q)\n",
    "for row in res:\n",
    "    print(f\"{row['x']} is a variant of RAC1\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 3 \n",
    "Generalize this query to show all *is a variant of* relations. You can use two variables `?x` and `?y`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "q = \"\"\"\n",
    "\"\"\"\n",
    "\n",
    "#res = kg.query(q)\n",
    "#for row in res:\n",
    "#    print(f\"{row['x']} is ...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 4\n",
    "Search for the name of the gene who has a variant refering to the `http://www.uniprot.org/taxonomy/9606` organism"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "q = \"\"\"\n",
    "\"\"\"\n",
    "\n",
    "#res = kg.query(q)\n",
    "#for row in res:\n",
    "#    print(row)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}