{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Practical introduction to RDF and SPARQL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reminder: IRIs and Literals\n", "\n", "**Literals** refer to simple values (numbers, strings, booleans, dates, etc.).\n", "\n", "**Resources** are complex objects identified by an **IRI** (Internationalized Resource Identifier, i.e. a URI that allows international characters). Note that URLs are IRIs pointing to web-accessible documents or data. IRIs can be shortened with a **PREFIX**. For example, `http://my/super/vocab/my_term` can be shortened to `ns:my_term` if `ns` is defined as a prefix for `http://my/super/vocab/`. \n", "\n", "\n", "## Reminder: RDF, triples\n", "1. an RDF **statement** represents a **relationship** between two resources: a **subject** and an **object**\n", "1. relationships are directional and are called **predicates** (or RDF properties)\n", "1. (logical) statements are called **triples**: {`subject`, `predicate`, `object`}\n", "1. a set of triples forms a **directed labelled graph**: subject nodes are IRIs, edges are predicates (IRIs only), object nodes are IRIs or Literals. \n", "\n", "See https://www.w3.org/TR/rdf11-primer/ for more details on RDF. 
\n", "\n", "## Reminder: Turtle syntax\n", "- a header defines the prefixes\n", " - example: with `@prefix ns: <http://my_voc#> .`, `<http://my_voc#term>` can be written as `ns:term` \n", "- generally one triple per line, ending with a `.`: `s p o .`\n", "- shortcut to share the same subject: `;` \n", "```\n", "s p1 o1 ; \n", " p2 o2 .\n", "```\n", "- shortcut to share the same subject and predicate: `,` \n", "```\n", "s p o1, o2, o3 .\n", "```\n", "\n", "## Example\n", "Turtle syntax: \n", "```ruby\n", " rdf:type .\n", " .\n", " rdfs:comment \"Sample 1 from Study X [...]\"^^xsd:string .\n", "```\n", "\n", "or \n", "\n", "```turtle\n", " rdf:type .\n", " ;\n", " rdfs:comment \"Sample 1 from Study X [...]\"^^xsd:string .\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1\n", " \n", "1. Consider the following RDF properties: `family:has_mother`, `family:has_father`, `family:has_sister`\n", "2. Using only these predicates, represent the following family with RDF triples:\n", " - *The mother of John is Mary*,\n", " - *Mickael is the son of Mark*,\n", " - *John and Mickael are cousins (because Mark and Mary are siblings)*.\n", "3. Go to https://www.ldf.fi/service/rdf-grapher \n", "4. Generate a graphical representation of the RDF graph." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Answer\n", "\n", "```turtle\n", "@prefix family: .\n", "\n", " family:has_mother .\n", " family:has_father .\n", " family:has_sister .\n", "```\n", "\n", "![:scale 50%](fig/family.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# SPARQL hands-on" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SPARQL is the standard language for querying multiple data sources expressed in RDF. The principle is to define a **graph pattern** to be matched against an RDF graph." 
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Definition\n", "**Triple Patterns** (TPs) are like RDF triples except that each of the *subject*, *predicate* and *object* may be a **variable**. Variables are prefixed with `?`. \n", "\n", "## Example\n", "Triple pattern\n", "```ruby\n", "?x .\n", "```\n", "\n", "RDF graph\n", "```ruby\n", " .\n", " .\n", " .\n", " .\n", "```\n", "\n", "Bindings of variable `?x`\n", "```ruby\n", "?x = \n", "?x = \n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Definition\n", "**Basic Graph Patterns** (BGPs) consist of a set of triple patterns to be matched against an RDF graph.\n", "## Example\n", "Basic graph pattern\n", "```ruby\n", "?x .\n", "?x ?z\n", "```\n", "![:scale 60%](fig/bgp.png)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4 Types of SPARQL queries\n", "- **SELECT**: returns the variable values (i.e. the bindings) for each graph pattern match;\n", "- **CONSTRUCT**: returns an RDF graph constructed by substituting variables in a set of triple patterns;\n", "- **ASK**: returns a boolean (true/false) indicating whether a query pattern matches or not;\n", "- **DESCRIBE**: returns an RDF graph that describes the resources found (the neighborhood of each resource).\n", "\n", "
\n", "
\n", "Additional features: optional BGPs, unions, filters, aggregate functions, negation, federated queries (SERVICE), *etc.*\n", "\n", "# Anatomy of a SPARQL query\n", "\n", "![:scale 95%](fig/anat.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2\n", "We will now use the RDFLib package to parse RDF data and run some very basic SPARQL queries. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rdflib import Graph\n", "\n", "# RDF graph, in Turtle syntax, stored in a string\n", "my_rdf_data = \"\"\"\n", "@prefix ns: .\n", "@prefix snp: .\n", "\n", "snp:123 ns:is_a_variant_of \"NEMO\" .\n", "snp:rs527330002 ns:is_a_variant_of \"RAC1\" .\n", "snp:rs527330002 ns:refers_to_organism <http://www.uniprot.org/taxonomy/9606> .\n", "snp:rs61753123 ns:is_a_variant_of \"RAC1\" .\n", "\"\"\"\n", "\n", "# Initialize the in-memory RDF graph (an RDFLib Graph object)\n", "kg = Graph()\n", "\n", "# Parse the RDF data\n", "kg.parse(data=my_rdf_data, format='turtle')\n", "\n", "# Print the size of the graph and serialize it again\n", "print(f'the knowledge graph contains {len(kg)} triples\\n')\n", "print(kg.serialize(format=\"turtle\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now execute a simple query to search for all \"variants\" of `RAC1`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "q = \"\"\"\n", "\n", "\"\"\"\n", "\n", "res = kg.query(q)\n", "for row in res:\n", " print(f\"{row['x']} is a variant of RAC1\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# SOLUTION:\n", "!echo \"ClNFTEVDVCA/eCBXSEVSRSB7CiAgICA/eCBuczppc19hX3ZhcmlhbnRfb2YgIlJBQzEiIC4KfQo=\" | base64 --decode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 3 \n", "Generalize this query to show all *is a variant of* relations. You can use two variables `?x` and `?y`. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "q = \"\"\"\n", "\n", "\"\"\"\n", "\n", "res = kg.query(q)\n", "for row in res:\n", " print(f\"{row['x']} is a variant of {row['y']}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# SOLUTION:\n", "!echo \"ClNFTEVDVCA/eCA/eSBXSEVSRSB7CiAgICA/eCBuczppc19hX3ZhcmlhbnRfb2YgP3kgLgp9Cg==\" | base64 --decode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 4\n", "Search for the name of the gene that has a variant referring to the `http://www.uniprot.org/taxonomy/9606` organism." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "q = \"\"\"\n", "\n", "\"\"\"\n", "\n", "res = kg.query(q)\n", "for row in res:\n", " print(row['y'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# SOLUTION:\n", "!echo \"ClNFTEVDVCA/eSBXSEVSRSB7CiAgICA/eCBuczpyZWZlcnNfdG9fb3JnYW5pc20gPGh0dHA6Ly93d3cudW5pcHJvdC5vcmcvdGF4b25vbXkvOTYwNj4gLgogICAgP3ggbnM6aXNfYV92YXJpYW50X29mID95IC4KfQo=\" | base64 --decode" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 4 }