SPARQL

SPARQL is a query language for RDF data, that is, data expressed as relations (triples) between things, such as the data we created last time using the ontology. The syntax of the language is quite simple, yet queries over such data can be very powerful.

In today's basic examples, we will first use the SPARQL playground; later we will move on to more interesting data from Wikidata. The online version of the SPARQL playground does not seem to work at the moment, but you can download it and run it locally. It runs on Windows, Linux and macOS and needs only Java; it did not work for me with JDK 16 on Windows, but JDK 11 on Linux in WSL seems to work fine.

The simplest queries in SPARQL are SELECT queries. The basic form of the query is a number of triples, each of the form <subject> <relation> <object>. For example, the SPARQL playground contains a number of things, some of which have the class dbo:Person. We can select all persons with the query

SELECT ?person WHERE {
    ?person rdf:type dbo:Person
}

The identifiers starting with ? are variables. A variable can be used in any position of a triple, and it can be used multiple times in the same query to require that the value is the same in all those places.
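
For instance, a variable can also stand in the relation position. The following sketch lists every relation and value recorded for ttr:Eve (a resource used later in this text); which triples actually come back depends on the playground data.

```sparql
# All triples that have ttr:Eve as their subject:
# both the relation and the value are left open.
SELECT ?relation ?value WHERE {
    ttr:Eve ?relation ?value
}
```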

Both dbo and rdf are prefixes: each is a short name for the IRI (URL) of the ontology that defines the given relation or class. We do not need to declare the prefixes in the playground, as they are declared for us, but a complete query contains them. For example, we can put the following line before the SELECT keyword to define the rdf prefix.

PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
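
Put together, a query that does not rely on predefined prefixes might look as follows. The dbo IRI used here is the standard DBpedia ontology namespace; that the playground binds dbo to the same IRI is an assumption.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# Assumption: dbo is bound to the DBpedia ontology namespace.
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person WHERE {
    ?person rdf:type dbo:Person
}
```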

We can also create more complex queries in SPARQL, for example by following more than one relation. If we check the schema picture in the playground, we can see that some objects have a property tto:sex, which can be either “male” or “female”. To select only persons that are female, we can write the following query:

SELECT ?person WHERE {
    ?person rdf:type dbo:Person .
    ?person tto:sex "female"
}

The query specifies that we are interested in persons (first line) that are also female (second line). Notice the . that separates the two triple patterns.
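
The same query can also be written more compactly. This is a sketch using two standard SPARQL abbreviations that are not used elsewhere in this text: a stands for rdf:type, and ; starts another relation for the same subject.

```sparql
SELECT ?person WHERE {
    ?person a dbo:Person ;       # "a" abbreviates rdf:type
            tto:sex "female" .   # ";" keeps ?person as the subject
}
```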

We can also select multiple things at the same time. In the next query, we will select persons and their pets. The relation tto:pet specifies that a person has a pet. The query is then:

SELECT ?person ?pet WHERE {
    ?person tto:pet ?pet
}

What if we wanted to select all persons and, if they have a pet, also the pet? We can do that by wrapping the corresponding lines in the optional keyword:

SELECT ?person ?pet WHERE {
    ?person rdf:type dbo:Person .
    optional {?person tto:pet ?pet}
}

We can also use a more general filter expression to select only some results. For example, if we are interested only in persons who do not have any pet, we can write the query

SELECT ?person WHERE {
    ?person rdf:type dbo:Person .
    filter not exists {?person tto:pet ?pet}
}

The filter is more general: apart from not exists, we can write a number of different conditions inside a filter expression. Generally, we can write any expression over constants and the variables used in the query.
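
As an illustration of a filter that compares a variable with a constant, the following sketch looks for everybody who shares a parent with ttr:Eve and then removes Eve herself from the results. It assumes that the playground data records parents with dbo:parent; whether it returns any rows depends on the data.

```sparql
SELECT DISTINCT ?sibling WHERE {
    ttr:Eve  dbo:parent ?parent .    # parents of Eve
    ?sibling dbo:parent ?parent .    # anybody with the same parent
    FILTER (?sibling != ttr:Eve)     # Eve also matches the line above, drop her
}
```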

The middle part of the triple (the relation between objects) can also be more complex; for example, we can use / to chain relations. If we wanted to find Eve’s grandparents, we can write either

SELECT ?grandparent WHERE {
    ttr:Eve dbo:parent ?parent .
    ?parent dbo:parent ?grandparent
}

or we can write the same with a more complex relation as:

SELECT ?grandparent WHERE {
    ttr:Eve dbo:parent/dbo:parent ?grandparent
}

There are even more ways to combine properties. For example, we can select all things that are a direct subclass (rdfs:subClassOf) of tto:Creature with

select ?subclass where {
    ?subclass rdfs:subClassOf tto:Creature
}

and if we want even indirect subclasses (subclasses of subclasses) we can use the + notation (it means at least one repetition of the relation)

select ?subclass where {
    ?subclass rdfs:subClassOf+ tto:Creature
}

Apart from + there is also *, which means any number of repetitions, including zero.
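
The * operator combines well with chaining. As a sketch, the following query selects everything that is an instance of tto:Creature or of any of its subclasses; whether the playground data types its things this way is an assumption.

```sparql
# rdf:type takes us from a thing to its class, then zero or more
# rdfs:subClassOf steps take us up the hierarchy to tto:Creature.
# Zero steps are allowed, so direct instances of tto:Creature match too.
SELECT ?thing WHERE {
    ?thing rdf:type/rdfs:subClassOf* tto:Creature
}
```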

We can also use ^ to invert the relation – ?a ^rel ?b is equivalent to ?b rel ?a. This is extremely useful together with the possibility to specify a disjunction of two relations with |. We can specify all people that are either parents or offspring of William (ttr:William) with

SELECT ?relative WHERE {
    ttr:William (dbo:parent | ^dbo:parent) ?relative
}

SPARQL also allows for other things, like limiting the number of results using the LIMIT keyword, or ordering them using the ORDER BY keyword. It is also possible to use GROUP BY and aggregate functions as in SQL.
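
A minimal sketch of ORDER BY and LIMIT on the persons query from the beginning of this text; both keywords come after the closing brace of the WHERE part.

```sparql
SELECT ?person WHERE {
    ?person rdf:type dbo:Person
}
ORDER BY ?person   # sort by the person identifier, ascending
LIMIT 5            # return at most five rows
```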

For example, if we want to count the number of persons by their sex, we can write the following query:

select ?sex (COUNT(?people) as ?peopleCount) where {
  ?people rdf:type dbo:Person .
  ?people tto:sex ?sex .
}
GROUP BY ?sex

Apart from COUNT there is a number of other aggregate functions that can be used in queries.
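
For example, GROUP_CONCAT joins all values in a group into a single string, and HAVING filters whole groups. The following sketch lists persons with more than one pet together with their pets; it reuses the tto:pet relation from above, and whether any person in the playground data has several pets is an assumption.

```sparql
SELECT ?person
       (COUNT(?pet) AS ?petCount)
       (GROUP_CONCAT(STR(?pet); separator=", ") AS ?pets)
WHERE {
    ?person tto:pet ?pet
}
GROUP BY ?person
HAVING (COUNT(?pet) > 1)    # keep only persons with more than one pet
ORDER BY DESC(?petCount)
```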

If you are more interested in SPARQL, I recommend checking the other examples in the playground.

Let us now try a more interesting knowledge base: Wikidata. Wikidata has a query interface at query.wikidata.org, and you can find a number of examples there in the examples tab. As an example, let us try to find all the rivers in the Czech Republic together with the area of their basin. What is important to know about Wikidata is that finding the names of the relations can be quite hard, because they are identified by codes rather than by names. The relations are all in the namespace (prefix) wdt:, while objects have the prefix wd:. The query editor has auto-complete for these codes that you can use. It is also quite helpful to click on some of the results and check manually which relations the result is in.

So, for our query (all rivers in the Czech Republic), we need to find the name of the river class and a way to check whether something is a river. Once we start writing wdt: in the query, we can notice that there is a property wdt:P31 that means “instance of”; then we can start writing wd: and find that there is a class for river (wd:Q4022). Our query thus is

SELECT ?river ?riverLabel WHERE
{
  ?river wdt:P31 wd:Q4022
         
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 10

We used LIMIT to restrict the number of results to 10 (so that we can inspect the results and see which relations rivers have). We also used a SERVICE that gives us human-readable names for the objects we queried. For example, if we ask for ?river, the service also defines ?riverLabel, which contains the human-readable label. By default, the name of the label variable is the name of the original variable with “Label” appended. This name can be changed in the service configuration by appending ?river rdfs:label ?otherRiverLabel after the . in the service description. Such a definition can also be used if we want to obtain a label for something that is not directly returned.
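
As a sketch of the explicit form described above, the following query names the label variable ?name instead of relying on the default naming. The variable name ?name is chosen only for illustration.

```sparql
SELECT ?river ?name WHERE
{
  ?river wdt:P31 wd:Q4022 .

  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
    ?river rdfs:label ?name .    # bind the label of ?river to ?name
  }
}
LIMIT 10
```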

Once we click on one of the results, we can see that rivers have the “country” property, which we can use to find the countries the river flows through. If we click on the property, we can see that it is property P17, so we can use it as wdt:P17. (I actually prefer using the auto-complete to find the codes of the properties, but the full listing is useful to see what is available.) In the same way, we find that the Czech Republic is wd:Q213 and that the basin area is stored in the property wdt:P2053 (watershed area). The complete query is then

SELECT DISTINCT ?river ?riverLabel ?area WHERE
{
  ?river wdt:P31 wd:Q4022 .
  ?river wdt:P17 wd:Q213 .
  ?river wdt:P2053 ?area
         
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?area)
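
Note that this query returns only rivers that have a basin area recorded. A possible variant, sketched here with the optional keyword introduced earlier, also keeps the rivers without an area; by the SPARQL ordering rules, rows with an unbound ?area should end up last when sorting in descending order.

```sparql
SELECT DISTINCT ?river ?riverLabel ?area WHERE
{
  ?river wdt:P31 wd:Q4022 .
  ?river wdt:P17 wd:Q213 .
  OPTIONAL { ?river wdt:P2053 ?area }   # keep rivers without a recorded area

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?area)
```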

There are many other queries you can try in Wikidata; I also recommend checking the examples that are available on the query page.

Wikidata is not the only large knowledge base available through SPARQL. Another one is, for example, DBpedia (dbpedia.org), which you can query at dbpedia.org/sparql.
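
A rough DBpedia counterpart of the river query could look as follows. This is only a sketch: the class dbo:River, the property dbo:country and the resource dbr:Czech_Republic are assumptions about how DBpedia models rivers, so check the results before relying on them.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Assumed modelling: rivers are typed dbo:River and linked
# to their country with dbo:country.
SELECT ?river WHERE {
    ?river rdf:type dbo:River .
    ?river dbo:country dbr:Czech_Republic
}
LIMIT 10
```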

Assignment

Today’s assignment consists of two independent parts. One part asks you to design an ontology using Protégé, the other asks you to write some SPARQL queries and find information in Wikidata.

Points: 5 points for each part
Deadline: 26 May 2024

Ontology design

Design an ontology usable for representing a library catalogue. In the ontology, you want to have classes for books and authors. Books should have properties like title, author, genre, etc. Authors should have properties like date of birth, country, etc. Consider also classes and a hierarchy at least for the countries.

After you design this basic ontology, define the relations between books and authors and between authors and countries, and also add at least some defined classes (for example, books by European authors, books written before 1900, etc.).

I intentionally do not specify the ontology precisely. The goal is also not to define a large number of classes; a few examples in each category are enough.

Queries over Wikidata

Write SPARQL queries over Wikidata that return the following information:

  1. Actors that received the Oscar award (Academy Award) sorted by the total number of (any) awards they received in decreasing order together with the list of all their awards.
  2. Find all the rivers that flow into the Vltava river and all rivers that flow to them etc. ordered by their basin area.
  3. Write one non-trivial Wikidata query about anything that is interesting to you. The complexity of this query should be roughly comparable to the queries above.