{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Practical introduction to RDF and SPARQL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Objective" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this notebook is to make you comfortable with representing (simple) knowledge graphs in RDF, and then with writing simple SPARQL queries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reminder: IRIs and Literals\n", "**Resources** refer to complex objects identified by an **IRI** (Internationalized Resource Identifier, i.e. a URI allowing international characters). Note that URLs are IRIs pointing to web-accessible documents/data. IRIs can be shortened with a **PREFIX**. As an example, `http://my/super/vocab/my_term` can be shortened to `ns:my_term` if `ns` is defined as a prefix for `http://my/super/vocab/`. \n", "\n", "**Literals** refer to simple values (numerical values, strings, booleans, dates).\n", "\n", "## Reminder: RDF, triples\n", "1. an RDF **statement** represents a **relationship** between two resources: a **subject** and an **object**\n", "1. relationships are directional and are called **predicates** (or RDF properties)\n", "1. (logical) statements are called **triples**: {`subject`, `predicate`, `object`}\n", "1. a set of triples forms a **directed labelled graph**: subject nodes are IRIs, edges are predicates (IRIs only), object nodes are IRIs or Literals. \n", "\n", "Go through https://www.w3.org/TR/rdf11-primer/ for more details on RDF. 
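
As a minimal sketch of the two reminders above, plain Python (no RDF library) is enough to show how a prefixed name expands to a full IRI and how a triple can be stored as a `(subject, predicate, object)` tuple. The `ns` prefix is the one used in the text; the helper `expand` and the names `ns:sample_1`, `ns:has_label`, `ns:part_of` and `ns:study_x` are hypothetical and only serve the illustration.

```python
# Prefix table: the ns prefix and its IRI come from the reminder above.
PREFIXES = {'ns': 'http://my/super/vocab/'}


def expand(name, prefixes=PREFIXES):
    '''Expand a prefixed name such as ns:my_term into a full IRI (hypothetical helper).'''
    prefix, sep, local = name.partition(':')
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name  # unknown prefix or already a full IRI: returned unchanged


# A triple is a (subject, predicate, object) tuple; a graph is a set of triples.
# Subjects and predicates are IRIs; the object is an IRI or a literal value.
# Both are plain strings here, whereas an RDF library uses distinct types.
graph = {
    (expand('ns:sample_1'), expand('ns:has_label'), 'Sample 1'),          # literal object
    (expand('ns:sample_1'), expand('ns:part_of'), expand('ns:study_x')),  # IRI object
}

print(expand('ns:my_term'))
for s, p, o in sorted(graph):
    print(s, p, o)
```

Storing the graph as a set mirrors the fact that an RDF graph is a set of triples: adding the same statement twice does not change it.
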
\n", "\n", "## Reminder: Turtle syntax\n", "- header to define prefixes\n", " - example: with `@prefix ns: <http://my_voc#> .`, `http://my_voc#term` can be written as `ns:term` \n", "- generally one line per triple with a `.` at the end: `s p o .`\n", "- possible shortcut to share the same subject: `;` \n", "```\n", "s p1 o1 ; \n", " p2 o2 .\n", "```\n", "- possible shortcut to share the same subject and predicate: `,` \n", "```\n", "s p o1, o2, o3 .\n", "```\n", "\n", "## Example\n", "turtle syntax: \n", "```ruby\n", " rdf:type .\n", " .\n", " rdfs:comment \"Sample 1 from Study X [...]\"^^xsd:string .\n", "```\n", "\n", "or \n", "\n", "```turtle\n", " rdf:type .\n", " ;\n", " rdfs:comment \"Sample 1 from Study X [...]\"^^xsd:string .\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1\n", " \n", "1. Consider the following RDF properties: `family:has_mother`, `family:has_father`, `family:has_sister`\n", "2. Represent the following family with RDF triples:\n", " - *The mother of John is Mary*,\n", " - *Mickael is the son of Mark*,\n", " - *Mickael and John are cousins*,\n", " - *Mark is the uncle of John*.\n", "3. Go to https://www.ldf.fi/service/rdf-grapher \n", "4. Generate a graphical representation of the RDF graph." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "my_rdf_data = \"\"\"\n", "# Enter your prefixes and your triples here\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# SPARQL hands-on" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SPARQL is the standard language for querying multiple data sources expressed in RDF. The principle is to define a **graph pattern** to be matched against an RDF graph." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Definition\n", "**Triple Patterns** (TPs) are like RDF triples except that each of the *subject*, *predicate* and *object* may be a **variable**. Variables are prefixed with `?`. \n", "\n", "## Example\n", "Triple pattern\n", "```ruby\n", "?x .\n", "```\n", "\n", "RDF graph\n", "```ruby\n", " .\n", " .\n", " .\n", " .\n", "```\n", "\n", "Bindings of variable `?x`\n", "```ruby\n", "?x = \n", "?x = \n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Definition\n", "**Basic Graph Patterns** (BGPs) consist of a set of triple patterns to be matched against an RDF graph.\n", "## Example\n", "Basic graph pattern\n", "```ruby\n", "?x .\n", "?x ?z\n", "```\n", "![:scale 60%](fig/bgp.png)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4 Types of SPARQL queries\n", "- **SELECT**: returns the variable values (i.e. bound variables) for each graph pattern match;\n", "- **CONSTRUCT**: returns an RDF graph constructed by substituting variables in a set of triple patterns;\n", "- **ASK**: returns a boolean (true/false) indicating whether a query pattern matches or not;\n", "- **DESCRIBE**: returns an RDF graph that describes the resources found (the neighborhood of each resource).\n", "\n", "
\n", "
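
The query forms listed above can be mimicked with a toy matcher in plain Python. This is only a sketch of the semantics (bindings for SELECT, a boolean for ASK, substituted triples for CONSTRUCT), not how RDFLib or any SPARQL engine is implemented; every function name, the toy graph and the `has_variant` predicate are hypothetical. DESCRIBE is left out because the description it returns is chosen by the query service.

```python
def is_var(term):
    '''Variables are strings starting with ?, as in SPARQL.'''
    return isinstance(term, str) and term.startswith('?')


def match_triple(pattern, triple, binding):
    '''Try to extend binding so that pattern equals triple; return None on failure.'''
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if result.setdefault(p_term, t_term) != t_term:
                return None  # variable already bound to another value
        elif p_term != t_term:
            return None  # constant term does not match
    return result


def match_bgp(patterns, graph):
    '''Return every binding that satisfies all triple patterns (a naive join).'''
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in sorted(graph)
            for extended in [match_triple(pattern, triple, binding)]
            if extended is not None
        ]
    return bindings


def substitute(template, binding):
    '''Replace the variables of a triple template by their bound values.'''
    return tuple(binding.get(term, term) for term in template)


# Toy graph with illustrative names
graph = {
    ('snp:1', 'is_a_variant_of', 'RAC1'),
    ('snp:2', 'is_a_variant_of', 'RAC1'),
    ('snp:3', 'is_a_variant_of', 'NEMO'),
    ('snp:2', 'refers_to_organism', 'taxon:9606'),
}
# Basic graph pattern: two triple patterns sharing the variable ?x
bgp = [('?x', 'is_a_variant_of', '?gene'), ('?x', 'refers_to_organism', 'taxon:9606')]
solutions = match_bgp(bgp, graph)

# SELECT ?x ?gene: one row of bound values per match
select_result = [(b['?x'], b['?gene']) for b in solutions]
# ASK: does the pattern match at least once?
ask_result = len(solutions) > 0
# CONSTRUCT: build new triples from a template and each match
construct_result = {substitute(('?gene', 'has_variant', '?x'), b) for b in solutions}

print(select_result)
print(ask_result)
print(construct_result)
```

All three query forms share the same matching step; they differ only in what they do with the resulting bindings.
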
\n", "Additional features: Optional BGPs, union, filters, aggregate functions, negation, service, *etc.*\n", "\n", "# Anatomy of a SPARQL query\n", "\n", "![:scale 95%](fig/anat.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DESCRIBE `" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2\n", "We will now use the RDFLib package to parse RDF data and run some very basic SPARQL queries. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "the knowledge graph contains 4 triples\n", "\n", "@prefix ns: .\n", "@prefix snp: <http://my_snps/> .\n", "\n", "snp:123 ns:is_a_variant_of \"NEMO\" .\n", "\n", "snp:rs527330002 ns:is_a_variant_of \"RAC1\" ;\n", "    ns:refers_to_organism <http://www.uniprot.org/taxonomy/9606> .\n", "\n", "snp:rs61753123 ns:is_a_variant_of \"RAC1\" .\n", "\n", "\n" ] } ], "source": [ "from rdflib import Graph\n", "\n", "# RDF graph, in Turtle syntax, stored in a string\n", "my_rdf_data = \"\"\"\n", "@prefix ns: .\n", "@prefix snp: <http://my_snps/> .\n", "\n", "snp:123 ns:is_a_variant_of \"NEMO\" .\n", "snp:rs527330002 ns:is_a_variant_of \"RAC1\" .\n", "snp:rs527330002 ns:refers_to_organism <http://www.uniprot.org/taxonomy/9606> .\n", "snp:rs61753123 ns:is_a_variant_of \"RAC1\" .\n", "\"\"\"\n", "\n", "# Initialization of the in-memory RDF graph (RDFLib Graph object)\n", "kg = Graph()\n", "\n", "# Parsing of the RDF data\n", "kg.parse(data=my_rdf_data, format='turtle')\n", "\n", "# Printing the size of the graph and serializing it again. \n", "print(f'the knowledge graph contains {len(kg)} triples\\n')\n", "print(kg.serialize(format=\"turtle\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now execute a simple query to search for all \"variants\" of `RAC1`. 
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "http://my_snps/rs527330002 is a variant of RAC1\n", "http://my_snps/rs61753123 is a variant of RAC1\n" ] } ], "source": [ "q = \"\"\"\n", "SELECT ?x WHERE {\n", " ?x ns:is_a_variant_of \"RAC1\" .\n", "}\n", "\"\"\"\n", "\n", "res = kg.query(q)\n", "for row in res:\n", " print(f\"{row['x']} is a variant of RAC1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 3 \n", "Generalize this query to show all *is a variant of* relations. You can use two variables `?x` and `?y`. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "q = \"\"\"\n", "\"\"\"\n", "\n", "#res = kg.query(q)\n", "#for row in res:\n", "# print(f\"{row['x']} is ...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 4\n", "Search for the name of the gene who has a variant refering to the `http://www.uniprot.org/taxonomy/9606` organism" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "q = \"\"\"\n", "\"\"\"\n", "\n", "#res = kg.query(q)\n", "#for row in res:\n", "# print(row)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }