Linked Data Basics for Techies

From OpenOrg

Revision as of 12:20, 4 April 2012 by WikiSysop (Talk | contribs)



Introduction

Intended Audience

This is intended to be a crash course for a techie/programmer who needs to learn the basics ASAP. It is not intended as an introduction for managers or policy makers (I suggest looking at Tim Berners-Lee's TED talks if you want the executive summary).

It's primarily aimed at people who're tasked with creating RDF and don't have time to faff around. It will also be useful to people who want to work with RDF data. RDF is a data structure perfect for people creating mash-ups!

Please send feedback, especially if something doesn't make sense!

If you are new to RDF/Linked Data then you can help me!

I put a fair bit of effort into writing this, but I am too familiar with the field!

If you are learning for the first time and something in this guide isn't explained very well, please drop me a line so I can improve it. cjg@ecs.soton.ac.uk

Warning

Some things in this guide are deliberately over-simplified. It is intended to get you started really fast, rather than cover every facet of the subject.

Alternatives

If you don't like my way of explaining things, there are other introductions out there.

(suggest more!)

Why Should I Care about RDF?

At first glance it's not very easy to see why we might want yet another data format on top of XML, JSON, CSV etc. RDF is different and is better in some situations.

Structure

Tabular data: (SQL databases, Excel, CSV, etc.) Information is arranged in a strict grid. Adding and removing data is easy, but changing the shape of the table costs much more.

Tree data: (JSON, XML.) Tree-data is easy to get your head around as you can worry about little bits at a time. It can still be tricky to modify the structure and merge data from multiple sources, especially if those sources were not designed with the merger in mind.

Graph data: (RDF.) A graph is a list of relationships between things. It can be any shape. This can be a bit more work to get your head around when coding, compared to the more limited structures, but ultimately it's more flexible. A set of relationships merged with a set of relationships is just a bigger set of relationships, so merging two RDF documents is trivial.

Merging

RDF uses globally unique identifiers for everything: the things we're talking about, the relationships, and the datatypes. This means two RDF graphs can be merged with no danger of confusion. For example, one dataset might have a 'pitch' value for the frequency of a sound, and another a 'pitch' value for the angle of a roof. That clash doesn't happen in RDF because the identifiers are unambiguous. Of course, that makes it far more verbose, but TANSTAAFL.

Because all RDF really is is a list of unambiguous relationships, you can combine two RDF graphs and just get a bigger list of relationships between things. No other formats in common use allow this.
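To make the merging point concrete, here is a minimal Python sketch. It assumes a graph is held as a plain set of (subject, predicate, object) tuples, and every example.org URI in it is made up for illustration. It is not how any particular RDF library stores data.

```python
# A graph is modelled here as a set of (subject, predicate, object) tuples.
# All the example.org URIs are hypothetical.
sound_data = {
    ("http://example.org/sound/1", "http://example.org/acoustics#pitch", "440"),
}
roof_data = {
    ("http://example.org/roof/1", "http://example.org/building#pitch", "35"),
}

# Merging two graphs is just set union: a bigger set of relationships.
merged = sound_data | roof_data

for subject, predicate, obj in sorted(merged):
    print(subject, predicate, obj)
```

The two 'pitch' properties have different URIs, so they stay distinct after the merge.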

Why don't we use it for everything?

It's just not suited for all cases. Its verbosity can be a point against it in some situations, as can the flexibility. More flexibility means a higher cognitive load when trying to understand data in RDF.

SPARQL databases are much less mature than SQL databases. They don't (yet) have all the little niceties we've become accustomed to, but they are improving. I still don't believe they are suited for managing records with access controls etc. This will certainly change in the next few years.

RDF Data

RDF & Triples

RDF is a way of structuring information to be very interoperable and extendable. The simplest unit of information is called a 'triple'. A triple consists of three parts:

  1. the ID of a thing,
  2. the ID of a property that relates that thing to another thing or a value
  3. the ID of the thing it relates to OR a value, like text, a number or date.

For example:

<Person23> <hasDateOfBirth> "1969-05-23" .
<Person23> <name> "Marvin Fenderson" .
<Person23> <memberOf> <Group003> .

The first part is called the Subject. The second is sometimes called a Predicate, Property or Relation. The last part is the Object. If the last part is a value rather than the ID of a thing, it's called a Literal. IDs may represent absolutely anything, but we use web addresses for them. These are called URIs (note that URI and URL are slightly different). It can be confusing at first because http://webscience.org/person/6 refers to a person, not a webpage, but it's a very handy way to ensure that these IDs are globally unique.

One little caveat: a literal can have a datatype (like integer or string), also represented by a URI, but we still call this a "Triple" (yes, that's dumb).

The neat thing about this structure is that you can represent almost any other kind of data using it. It's not great at doing ordered lists of values, though.
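As a rough illustration of representing other kinds of data as triples, the Python sketch below turns one row of a table into triples. The function name row_to_triples and the example.org URIs are assumptions made up for this example.

```python
def row_to_triples(subject_uri, row, property_base):
    """Turn one table row (a dict of column -> value) into a list of triples."""
    return [
        (subject_uri, property_base + column, value)
        for column, value in row.items()
        if value is not None  # an empty cell simply produces no triple
    ]

row = {"name": "Marvin Fenderson", "hasDateOfBirth": "1969-05-23", "nickname": None}
triples = row_to_triples(
    "http://example.org/person/23",  # hypothetical ID for the thing this row describes
    row,
    "http://example.org/vocab#",     # hypothetical namespace for the column names
)

for triple in triples:
    print(triple)
```

Each cell becomes one triple: the row gives the subject, the column gives the property and the cell gives the value.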

URI vs URL

A URI represents a single concept or thing, but many URIs can represent the same thing.

If you resolve a URI it's considered good practice to return some useful triples about the concept the URI represents, but don't lose sleep over doing that -- it's an optional bonus feature of RDF.

All URLs are URIs. Not all URIs are URLs.

URI: Uniform Resource Identifier - this identifies something uniquely.

URL: Uniform Resource Locator - this not only identifies something, but also describes where it is located.

Example

<http://dbpedia.org/resource/Julius_Caesar> is a URI for Julius Caesar.

<http://en.wikipedia.org/wiki/Julius_Caesar> is a URL (and therefore also a URI) for a web page about Julius Caesar.

There is no URL for Julius Caesar as you can't download him via the web as he's dead and also not a string of ones and zeros!

Two URIs may indicate the same concept, just like two URLs may return exactly the same document, or you can have more than one name to address the same person. <http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/Caesar_Julius_100_BC-44_BC> is another URI for Julius Caesar, created by a different organisation. You can choose if you treat two URIs as referring to the same concept or not. It depends on the problem you're trying to solve. There are no objective truths!
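If you do decide to treat two URIs as the same concept, one simple approach is to map every alternative to a single URI of your choosing before you work with the data. The sketch below assumes a hand-maintained Python dictionary. The names SAME_AS and canonical are made up for this example.

```python
# Which URIs count as "the same thing" is a decision you make for your own problem.
SAME_AS = {
    "http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/Caesar_Julius_100_BC-44_BC":
        "http://dbpedia.org/resource/Julius_Caesar",
}

def canonical(uri):
    """Return the URI we have chosen to stand for this concept."""
    return SAME_AS.get(uri, uri)

print(canonical("http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/Caesar_Julius_100_BC-44_BC"))
print(canonical("http://dbpedia.org/resource/Julius_Caesar"))
```

URIs with no entry in the mapping are returned unchanged.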

RDF Documents

RDF+XML

There are several ways of writing RDF triples into a file. The most common is called RDF+XML (which people often just call RDF). It usually looks something like this:

 <foaf:Person rdf:about='http://webscience.org/person/7'>
    <foaf:name>Christopher Gutteridge</foaf:name>
 </foaf:Person>

If you want to produce RDF+XML, see this Guide. To parse RDF+XML just find and use a library; there's one in most popular languages!

This wiki uses a simple subset of RDF+XML for examples.

Turtle (and N3)

N3 is quite complicated so some bright person defined a cut-down version called Turtle which is really easy to read and write, but is sadly not as widely supported as RDF+XML.

Turtle looks something like this:

 <http://webscience.org/person/7> a foaf:Person ;
    foaf:name "Christopher Gutteridge" .

Note that this is NOT XML. The angle brackets just go around URIs which are not abbreviated using a prefix. (see later in this guide)

RDFa

RDFa is a way to embed triples into an HTML document. It can be confusing for beginners. Using software tools that generate valid RDFa is fine, but don't try to create it by hand until you have some experience!

To decode RDFa in an HTML page, just put the URL into http://graphite.ecs.soton.ac.uk/browser/

Other Serialisations

There's one for JSON, and N-Triples which just writes out triples, one per line (and is a subset of Turtle).
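Because N-Triples is just one triple per line, it is simple enough to write out by hand. The Python sketch below is a simplified illustration, not a complete serialiser: it handles only URIs and plain string literals (no datatypes, language tags or blank nodes), and the function names are made up for this example.

```python
def escape_literal(text):
    """Escape the characters that must not appear raw inside a quoted literal."""
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes are not doubled
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
    )

def to_ntriples(triples):
    """Each triple is (subject_uri, predicate_uri, object, object_is_uri)."""
    lines = []
    for subject, predicate, obj, object_is_uri in triples:
        obj_text = f"<{obj}>" if object_is_uri else f'"{escape_literal(obj)}"'
        lines.append(f"<{subject}> <{predicate}> {obj_text} .")
    return "\n".join(lines) + "\n"

data = [
    ("http://webscience.org/person/7",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://xmlns.com/foaf/0.1/Person", True),
    ("http://webscience.org/person/7",
     "http://xmlns.com/foaf/0.1/name",
     "Christopher Gutteridge", False),
]
output = to_ntriples(data)
print(output, end="")
```

For anything beyond a quick export, a proper RDF library is the safer choice.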

Namespaces

For URIs it's common to define a bunch of related concepts in the same "namespace". A namespace is a bit like a directory on a filesystem; it usually ends with either "/" or "#", and the IDs in the namespace generally don't have "/" or "#" in them as that confuses things.

You will probably define your own namespace for your own concepts, such as your organisation's members, or the bus stops nearby, but for classes and predicates you'll often use other people's namespaces. In RDF+XML and Turtle it's common to use a namespace prefix to make things more readable. In RDF+XML you must use namespace prefixes for predicates. The following examples mean exactly the same thing:

Example 1: RDF+XML

(in RDF+XML you have to use a namespace for the predicates to make them legal XML tags)

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://webscience.org/person/7">
     <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person" />
     <foaf:name>Christopher Gutteridge</foaf:name>
  </rdf:Description>
</rdf:RDF>

Example 2: Turtle

Same data as Example 1. Turtle has a built-in shorthand, 'a', for rdf:type, but the 'rdf:' prefix itself has to be declared with @prefix like any other. A quirk of older Turtle parsers is that the value to the right of the prefix (eg. the 'bar' in foo:bar) must start with a letter, not a numerical digit (the current Turtle specification allows a leading digit). If in doubt, remember the prefix is only a convenience: you can always put the full URI in angle brackets instead, as is done for the subject here.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://webscience.org/person/7> rdf:type foaf:Person .
<http://webscience.org/person/7> foaf:name "Christopher Gutteridge" .

Example 3: N-Triples

Same data as Example 1 and 2.

<http://webscience.org/person/7> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://webscience.org/person/7> <http://xmlns.com/foaf/0.1/name> "Christopher Gutteridge" .
  • See also the prefix.cc service listed in the "Tools and Services" section of this guide.
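A prefix is nothing more than string substitution, which the following Python sketch tries to show. The PREFIXES table and the expand function are made up for this example, and a real parser does rather more checking.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'foaf:name' into the full URI."""
    prefix, separator, local = name.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {name!r}")
    # The full URI is simply the namespace followed by the local part.
    return prefixes[prefix] + local

print(expand("rdf:type"))
print(expand("foaf:Person"))
```

This is why all three examples above describe the same data: once the prefixes are expanded, the URIs are identical.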

Some Common Namespaces

Here's a quick summary of the most common namespaces.

rdf - has the core parts of RDF, but usually you'll only see rdf:type.

rdfs - used for making statements about predicates and classes, also has rdfs:label and rdfs:comment which are good basic ways of giving something a label and making comments about it.

owl - used for making much more complex statements about predicates and classes. This is cool, but don't worry about it too much when you're just getting started. Also defines owl:sameAs to indicate two URIs represent the same thing (in your opinion).

dcterms - Dublin Core terms. Very useful generic properties for making statements about resources -- who created them, when, who published it, title, description etc. An older version is called 'Dublin Core Elements' and this can be confusing. In general always use dcterms. Some people use "dc" as the abbreviation, but this is confusing as it's not obvious if it's dc-terms or dc-elements, so don't do that.

foaf - Friend of a Friend. This is good for describing the facts from a person's or organisation's 'profile', things like their email address, phone numbers, names, what groups they are a member of etc.

geo - Allows you to specify a latitude & longitude of something. Even if it's a big thing, you can still give a useful reference point to navigate to.

dbpedia - DBPedia defines a URI for the primary topic of every page of Wikipedia. So http://dbpedia.org/resource/Southampton is a URI representing the city of Southampton.

Also: skos, sioc, void, dcat, geonames, doap, vcard, org, event, prog, bibo, aiiso, vann, scovo

For a list of the namespaces on prefix.cc see; http://prefix.cc/popular/all


Two relationships you must know

Almost every URI you define should have rdf:type and rdfs:label defined. Note that those are different namespaces. It's very easy to type "rdf:label" by accident.

rdfs:label

Reading URIs can be hard. Use rdfs:label to give a human readable label to a URI. eg.

<http://id.southampton.ac.uk/building/59> rdfs:label "Building 59" .

If you want to be more international-friendly you can add a language code;

<http://id.southampton.ac.uk/building/59> rdfs:label "Building 59"@en .
<http://id.southampton.ac.uk/building/59> rdfs:label "Edificio 59"@es .

Actually, this is one of the real benefits of RDF; you can easily provide alternate labels in several languages. Or someone else can provide triples labelling a set of existing URIs in a new language.
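When consuming data with several labels, you need some rule for choosing which one to show. Here is one possible rule, sketched in Python. It assumes the labels have already been parsed into (text, language) pairs, and the function name pick_label is made up for this example.

```python
def pick_label(labels, preferred_languages):
    """labels is a list of (text, language_code_or_None) pairs."""
    for wanted in preferred_languages:
        for text, language in labels:
            if language == wanted:
                return text
    # Fall back to any label at all rather than showing a bare URI.
    return labels[0][0] if labels else None

labels = [("Building 59", "en"), ("Edificio 59", "es")]

print(pick_label(labels, ["es", "en"]))
print(pick_label(labels, ["fr", "en"]))
```

The order of preferred_languages decides which label wins when more than one matches.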

rdf:type

The full URI for this relationship is http://www.w3.org/1999/02/22-rdf-syntax-ns#type

This relates a URI for a thing to a URI for a set of things. For example;

<http://id.ecs.soton.ac.uk/person/1248> rdf:type foaf:Person .
<http://id.southampton.ac.uk/building/59> rdf:type <http://vocab.deri.ie/rooms#Building> .

The rdf:type of something gives you a broad indication of what properties something will have and how to work with it.

The rdf:type of something is usually referred to as "class" not "type". If you've done any set theory you can think of it as a set.

a

Turtle and SPARQL have an abbreviation of rdf:type which is just "a". As you use it so often this saves a fair bit of typing, eg.

<http://id.ecs.soton.ac.uk/person/1248> a foaf:Person .
<http://id.southampton.ac.uk/building/59> a <http://vocab.deri.ie/rooms#Building> .

Defining your own types

When designing a system I find it useful to try to give everything very broad, common types. This makes it possible for other people's tools to understand your data. If you do want to define more specific classes that's OK, but I suggest you also give each thing a broad class so tools can understand it. eg. If you were making a document about swimming you might have swim:Race and swim:Swimmer as types. I would strongly suggest using event:Event and foaf:Person as additional types.

In theory you could use the semantic definitions to say:

swim:Race rdfs:subClassOf event:Event .
swim:Swimmer rdfs:subClassOf foaf:Person .

It's helpful to do so, but most apps don't yet bother trying to resolve such semantics.

Don't use types to encode data

It's tempting to use rdf:type to indicate any set the thing belongs in, eg.

<http://id.ecs.soton.ac.uk/person/1248> rdf:type foaf:Person .
<http://id.ecs.soton.ac.uk/person/1248> rdf:type myns:PeopleWhoDontEatCheese .
<http://id.ecs.soton.ac.uk/person/1248> rdf:type myns:PeopleOverSixFoot .

In my experience this is a mistake. It makes it harder for someone else to get to grips with your data. Use rdf:type to indicate what properties something is likely to have. People probably have names, email addresses and parents. Buildings have a postcode, number of floors and occupants. PeopleWhoDontEatCheese are unlikely to have any properties that foaf:Person won't. The above data is better encoded as:

<http://id.ecs.soton.ac.uk/person/1248> rdf:type foaf:Person .
<http://id.ecs.soton.ac.uk/person/1248> myns:eatsCheese "false"^^xsd:boolean .
<http://id.ecs.soton.ac.uk/person/1248> myns:overSixFoot "true"^^xsd:boolean .

nb. I've used 'eatsCheese' as I find negative properties are always a bit more confusing. You could have had 'lactoseIntolerant' if that's what you meant, of course.

Don't use all the things of a type in a document to infer something

For example, you might know that all the Buildings in http://data.southampton.ac.uk/dumps/places/2012-01-09/places.rdf happen to be buildings occupied by the University of Southampton. This is a potential source of problems, as the data may later contain buildings which are not occupied by the university.

Worse still, if you merged it with data from other sources, there's now no way to tell which buildings belong to the University of Southampton.

It's much better to use explicit than implicit data.

If it's data you are creating, add a triple to every building to *say* it's a building occupied by the University of Southampton. That way there can be no confusion later if it happens to also contain buildings which the university used to host conferences.

Note that the example file given doesn't (yet) follow my own advice, I'm still learning too!

Semantics (Schemas, OWL, Ontologies etc)

The semantic bit of the semantic web is that if you resolve the URI for a class or predicate you often get back some rdfs: and/or owl: triples describing what it means, and some semantics about it. This lets you do clever reasoning, like knowing that 'ancestor' is a transitive property, so an ancestor of an ancestor of X is also an ancestor of X.
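To give a flavour of what reasoning over a transitive property involves, here is a naive Python sketch. It assumes the 'ancestor' statements have been reduced to (person, ancestor) pairs, and the function name is made up for this example. Real reasoners are far more general and far more efficient.

```python
def transitive_closure(pairs):
    """Given (person, ancestor) pairs, add every pair implied by transitivity."""
    closure = set(pairs)
    changed = True
    while changed:  # keep going until a full pass adds nothing new
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

stated = {("child", "parent"), ("parent", "grandparent")}
inferred = transitive_closure(stated)

for pair in sorted(inferred):
    print(pair)
```

The pair ("child", "grandparent") was never stated, but it follows from the two that were.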

This is complicated, and not required to get started. If it confuses you, don't worry too much about it.

Purists insist you should write schemas. Just ignore them until YOU find a need to write a schema (eventually it turns out to be useful, but there's no hurry).

Lists

You can do lists by saying something like,

<Person> <hasToDoList> <List0001> .
<List0001> <label> "Marvin's TODO List" .
<List0001> rdf:type rdf:Seq .
<List0001> rdf:_1 "Buy Milk" .
<List0001> rdf:_2 "Walk Dog" .
<List0001> rdf:_3 "Drink Milk" .
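Triples have no inherent order, so code reading such a list has to sort by the number in the rdf:_1, rdf:_2, ... predicates. The Python sketch below shows one way to do that. It assumes the triples are held as tuples with full URIs, and the example.org list URI and the function name seq_items are made up for this example.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def seq_items(triples, list_uri):
    """Return the members of an rdf:Seq in order of their rdf:_N predicates."""
    numbered = []
    for subject, predicate, obj in triples:
        if subject != list_uri or not predicate.startswith(RDF + "_"):
            continue
        index = predicate[len(RDF) + 1:]  # the part after "_"
        if index.isdecimal():
            numbered.append((int(index), obj))
    numbered.sort(key=lambda pair: pair[0])
    return [obj for _, obj in numbered]

todo = "http://example.org/List0001"  # hypothetical URI for the list
triples = [
    (todo, RDF + "_3", "Drink Milk"),
    (todo, RDF + "type", RDF + "Seq"),
    (todo, RDF + "_1", "Buy Milk"),
    (todo, RDF + "_2", "Walk Dog"),
]

print(seq_items(triples, todo))
```

The triples are deliberately given out of order here to show that the order has to come from the predicate, not from the file.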

Tools and Services

prefix.cc

You can get a list of the standard abbreviations for common namespaces using http://prefix.cc/

If you're really lazy, you can get the stub of an XML document out of it, for example http://prefix.cc/foaf,dcterms,geo.rdf

t-d-b.org

This is described by http://thing-described-by.org/ and provides a quick and simple way to generate a URI for something by using a web page about that thing.

http://www.williamshatner.com/ - A URI which is a URL for a webpage about William Shatner

http://t-d-b.org/?http://www.williamshatner.com/ - A URI for William Shatner himself (the thing-described-by the URL after the ? mark)
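Going by the example above, building such a URI is just a matter of putting the page's URL after the "?". A tiny Python sketch, with a made-up function name:

```python
def thing_described_by(page_url):
    """Build a t-d-b.org URI for the thing that the given web page describes."""
    return "http://t-d-b.org/?" + page_url

print(thing_described_by("http://www.williamshatner.com/"))
```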

Triple Stores and SPARQL

A triple store is a bit like an SQL database, but optimised to just import, store, and query a huge pile of triples. Triple stores are queried using a language called SPARQL.

They are funky because rather than deal with triples document by document you can query over any facets of the data in the "SPARQL Endpoint". If you have the staff resources to do so, it's good practice to provide a SPARQL endpoint, but don't lose sleep if you don't.

These are useful and powerful but not required to produce and work with RDF and Open Linked Data.
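To give a feel for the kind of question a triple store answers, here is a toy Python sketch of matching a single pattern against a pile of triples. It is not SPARQL and not how a real triple store works inside. The match function and the example.org URIs are made up for this example.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

EX = "http://example.org/vocab#"  # hypothetical namespace
triples = [
    ("http://example.org/person/23", EX + "name", "Marvin Fenderson"),
    ("http://example.org/person/23", EX + "memberOf", "http://example.org/group/3"),
    ("http://example.org/person/24", EX + "name", "Another Person"),
]

names = match(triples, predicate=EX + "name")
about_23 = match(triples, subject="http://example.org/person/23")

print(names)
print(about_23)
```

A SPARQL query is, roughly speaking, several such patterns joined together by shared variables.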

Validation

It's quite easy to write valid RDF serialisations, but you can check them using the W3C Validator Service. It is also useful to eyeball your data to make sure it looks sane. I use my own RDF Browser but you may prefer others. The Graphite Browser will show triples from any of N3, RDF+XML and RDFa.

Command Line Tools

There's a C-Library and unix/linux command line tool called "rapper" which can parse and validate various formats.

Guide to how to produce datasets

Well, this website is for that!

Other sites you should know about

Semantic Overflow

http://semanticoverflow.com/

Allows you to ask questions about this stuff, and see existing questions and answers.

CKAN

http://ckan.org/

This site is a comprehensive index of sources of data sets on the web (many, but not all, in RDF). You should consider registering your datasets if you want other people to find them.

GetTheData.org

This site is used by people to ask where to find open data: http://getthedata.org/



Open Linked Data

Tim BL, inventor of the web, says this is cool and you should be doing it. See if you agree...

A nice explanation of "5 star linked data" is available at http://5stardata.info/

Linked Data

What is Linked Data? It's when you either:

  • Use URIs for Subjects and/or Objects from other people's datasets.
  • Use owl:sameAs to link your identifiers to other people's.

This lets people do really cool mashups using data from multiple sources. There are over 100 known sites in the world that link to other datasets. See The Linked Data Cloud to see how they interlink.

It doesn't have to be RDF, but it usually is.

Open Data

Open data is data with an open license which makes it easy for people to reuse it with confidence, and available online, to all, without restrictions.

Open Linked Data

Is the combination of the two, obviously.

How does that relate to RDF?

RDF is designed to be returned when resolving a URI, so it's ideal for linked data. It doesn't have to be Open, any more than XML does.

Closed Linked Data

It can be potentially useful to use the linked data techniques on confidential data. You can use it to pull information from many databases in a company and perform queries for "business intelligence" tasks.

It can also be useful to link private data to public data. eg. a student lecture timetable is not a public document, but each lecture is associated with a module, a course, a lecturer and a room, all of which may well have open data available about them. An app consuming the private timetable can augment it with links to this open data.
