Linked Data Basics for Techies

From OpenOrg
Revision as of 00:06, 17 February 2012 by WikiSysop (URI vs URL)


Introduction

Intended Audience

This is intended to be a crash course for a techie/programmer who needs to learn the basics ASAP. It is not intended as an introduction for managers or policy makers (I suggest looking at Tim Berners-Lee's TED talks if you want the executive summary).

It's primarily aimed at people who're tasked with creating RDF and don't have time to faff around. It will also be useful to people who want to work with RDF data. RDF is a data structure perfect for people creating mash-ups!

Please send feedback, especially if something doesn't make sense!

If you are new to RDF/Linked Data then you can help me!

I put a fair bit of effort into writing this, but I am too familiar with the field!

If you are learning for the first time and something in this guide isn't explained very well, please drop me a line so I can improve it. cjg@ecs.soton.ac.uk

Warning

Some things in this guide are deliberately over-simplified. It is intended to get you started really fast, rather than cover every facet of the subject.


Why Should I Care?

At first glance it's not very easy to see why we might want yet another data format on top of XML, JSON, CSV etc. RDF is different and is better in some situations.

Structure

Tabular data: SQL databases, Excel, CSV, etc. Information is arranged in a strict grid. Adding data is easy, but changing the grid is more costly.

Tree data: JSON, XML. Easy to get your head around as you can worry about little bits at a time. Can still be tricky to modify the structure and merge data from multiple sources.

Graph data: RDF. A graph is a list of relationships between things. It can be any shape. This can be a bit more work to get your head around when coding, compared to the more limited structures, but ultimately it's more flexible.

Merging

RDF uses globally unique identifiers for everything: the things we're talking about, the relationships and the datatypes. This means two RDF graphs can be merged with no danger of confusion. For example, one dataset might have a 'pitch' value for the frequency of a sound, and another a 'pitch' value for the angle of a roof. That clash doesn't happen in RDF because every identifier is unambiguous. Of course, that makes it far more verbose, but TANSTAAFL.

Because all RDF really is is a list of unambiguous relationships, you can combine two RDF graphs and just get a bigger list of relationships between things. No other formats in common use allow this.
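Here's a minimal sketch of why merging is so painless, assuming each graph is held in Python as a set of (subject, predicate, object) tuples. The example.org URIs and both 'pitch' properties are made up for illustration.

```python
# Two small graphs, each a set of (subject, predicate, object) triples.
# All example.org URIs are hypothetical.
sound_graph = {
    ("http://example.org/thing/note-a4",
     "http://example.org/acoustics/pitch", "440"),
}
roof_graph = {
    ("http://example.org/thing/shed-roof",
     "http://example.org/building/pitch", "30"),
}

# Merging is just set union: the result is a bigger set of triples.
merged = sound_graph | roof_graph

# The two 'pitch' properties stay distinct because their full URIs differ.
pitch_properties = {p for (_, p, _) in merged if p.endswith("/pitch")}
print(len(merged), len(pitch_properties))
```

Nothing needed renaming or restructuring before the merge, which is the whole point.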

Why don't we use it for everything?

It's just not suited for all cases. Its verbosity can be a point against it, as can the flexibility. More flexibility means a higher cognitive load when trying to understand the data.

RDF Data

RDF & Triples

RDF is a way of structuring information to be very interoperable and extendable. The simplest unit of information is called a 'triple'. A triple consists of three parts:

  1. the ID of a thing,
  2. the ID of a property that relates that thing to another thing or a value
  3. the ID of the thing it relates to OR a value, like text, a number or date.

For example:

<Person23> <hasDateOfBirth> "1969-05-23" .
<Person23> <name> "Marvin Fenderson" .
<Person23> <memberOf> <Group003> .

The first part is called the Subject. The second is sometimes called a Predicate, Property or Relation, and the last part is the Object. If the last part is a value rather than the ID of a thing, it's called a Literal. IDs may represent absolutely anything, but we use web addresses for them. These are called URIs (note that URI and URL are slightly different; see "URI vs URL" below). It can be confusing at first because http://webscience.org/person/6 refers to a person, not a webpage, but it's a very handy way to ensure that these IDs are globally unique.

One little caveat: a literal can have a datatype (like integer or string), which is also represented by a URI, but we still call this a "Triple" (yes, that's dumb).
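One way to picture this in code is a sketch like the following. The class names Triple and Literal, the field names and the example.org URIs are all my own choices for illustration; the xsd:date datatype URI is the standard XML Schema one.

```python
from typing import NamedTuple, Optional, Union

XSD = "http://www.w3.org/2001/XMLSchema#"


class Literal(NamedTuple):
    """A plain value plus an optional datatype URI."""
    value: str
    datatype: Optional[str] = None


class Triple(NamedTuple):
    subject: str                # URI of the thing
    predicate: str              # URI of the property
    obj: Union[str, Literal]    # URI of another thing, or a Literal


# Hypothetical full URIs standing in for <Person23> and friends.
EX = "http://example.org/"
triples = [
    Triple(EX + "Person23", EX + "hasDateOfBirth",
           Literal("1969-05-23", XSD + "date")),
    Triple(EX + "Person23", EX + "name", Literal("Marvin Fenderson")),
    Triple(EX + "Person23", EX + "memberOf", EX + "Group003"),
]

# Count how many triples end in a value rather than the ID of a thing.
literal_count = sum(isinstance(t.obj, Literal) for t in triples)
print(literal_count)
```

Even with the datatype tucked inside the Literal, each statement still has exactly three parts.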

The neat thing about this structure is that you can represent almost any other kind of data using it. The one thing it's not great at is ordered lists of values.
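As a rough illustration of representing other kinds of data, here's a sketch that turns one row of a table into triples. The function name row_to_triples and the example.org URIs are hypothetical; it assumes every column simply becomes a property.

```python
def row_to_triples(subject_uri, row, property_base):
    """Turn one table row (a dict of column -> value) into triples."""
    return [(subject_uri, property_base + column, value)
            for column, value in row.items()]


row = {"name": "Marvin Fenderson", "hasDateOfBirth": "1969-05-23"}
triples = row_to_triples("http://example.org/Person23", row,
                         "http://example.org/prop/")
for triple in triples:
    print(triple)
```

Each cell in the row ends up as its own statement about the same subject.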

URI vs URL

A URI represents a single concept or thing, but many URIs can represent the same thing.

If you resolve a URI it's considered good practice to return some useful triples about the concept the URI represents, but don't lose sleep over doing that -- it's an optional bonus feature of RDF.

All URLs are URIs. Not all URIs are URLs.

URI: Uniform Resource Identifier - this identifies something uniquely.

URL: Uniform Resource Locator - this not only identifies something, but also describes where it is located.

Example

<http://dbpedia.org/resource/Julius_Caesar> is a URI for Julius Caesar.

<http://en.wikipedia.org/wiki/Julius_Caesar> is a URL (and therefore also a URI) for a web page about Julius Caesar.

There is no URL for Julius Caesar as you can't download him via the web as he's dead and also not a string of ones and zeros!

Two URIs may indicate the same concept, just like two URLs may return exactly the same document, or you can have more than one name to address the same person. <http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/Caesar_Julius_100_BC-44_BC> is another URI for Julius Caesar, created by a different organisation. You can choose if you treat two URIs as referring to the same concept or not. It depends on the problem you're trying to solve. There are no objective truths!

RDF Documents

RDF+XML

There are several ways of writing RDF triples into a file. The most common is called RDF+XML (which people often just call RDF). It usually looks something like this:

 <foaf:Person rdf:about='http://webscience.org/person/7'>
    <foaf:name>Christopher Gutteridge</foaf:name>
 </foaf:Person>

If you want to produce RDF+XML, see this guide. To parse RDF+XML just find and use a library; there's one in most popular languages!

This wiki uses a simple subset of RDF+XML for examples.
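If you just want to see how a small RDF+XML document like the one above could be produced, here's a sketch using Python's standard xml.etree.ElementTree module. It's an assumption on my part that a general XML library is good enough for a simple document like this; for anything serious, a dedicated RDF library is the safer choice.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Ask ElementTree to use the conventional prefixes when serialising.
ET.register_namespace("rdf", RDF)
ET.register_namespace("foaf", FOAF)

# <rdf:RDF> wraps everything; the <foaf:Person> element gives the type.
root = ET.Element("{%s}RDF" % RDF)
person = ET.SubElement(root, "{%s}Person" % FOAF,
                       {"{%s}about" % RDF: "http://webscience.org/person/7"})
name = ET.SubElement(person, "{%s}name" % FOAF)
name.text = "Christopher Gutteridge"

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

ElementTree writes the xmlns declarations on the root element for you, so the prefixes only need registering once.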

Turtle (and N3)

N3 is quite complicated so some bright person defined a cut-down version called Turtle which is really easy to read and write, but is sadly not as widely supported as RDF+XML.

Turtle looks something like this:

 <http://webscience.org/person/7> a foaf:Person ;
    foaf:name "Christopher Gutteridge" .

Note that this is NOT XML. The angle brackets just go around URIs which are not abbreviated using a prefix. (see later in this guide)

RDFa

RDFa is a way to embed triples into an HTML document. It can be confusing for beginners. Some software tools generate valid RDFa, which is fine, but don't try to create it by hand until you have some experience!

Other Serialisations

There's one for JSON, and there's N-Triples, which just writes out triples, one per line (and is a subset of Turtle).
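Because N-Triples is so simple, a serialiser for it fits in a few lines. The following is only a sketch: the function names are hypothetical, it assumes each triple carries a flag saying whether its object is a URI, and it ignores datatypes, language tags and non-ASCII escaping.

```python
def escape_literal(text):
    """Escape characters that may not appear raw inside an N-Triples string."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(triples):
    """Serialise (subject, predicate, object, object_is_uri) tuples."""
    lines = []
    for s, p, o, o_is_uri in triples:
        obj = "<%s>" % o if o_is_uri else '"%s"' % escape_literal(o)
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines) + "\n"


triples = [
    ("http://webscience.org/person/7",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://xmlns.com/foaf/0.1/Person", True),
    ("http://webscience.org/person/7",
     "http://xmlns.com/foaf/0.1/name",
     "Christopher Gutteridge", False),
]
output = to_ntriples(triples)
print(output, end="")
```

One statement per line, with no prefixes and no nesting, is what makes the format easy to generate and to process with line-based tools.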

rdf:type and classes

The most common predicate (property) is 'rdf:type', which relates a thing to a class. For example, a triple can state that I have the rdf:type foaf:Person. Things can have any number of types.

The 'object' of rdf:type is often referred to as a class.
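Looking up the classes of a thing is then just a filter over the triples. This is a sketch, assuming triples are plain (subject, predicate, object) tuples; types_of is a made-up helper name, and giving the person a second type of foaf:Agent is only there to show that several types are allowed.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

triples = [
    ("http://webscience.org/person/7", RDF_TYPE, FOAF + "Person"),
    ("http://webscience.org/person/7", RDF_TYPE, FOAF + "Agent"),
    ("http://webscience.org/person/7", FOAF + "name", "Christopher Gutteridge"),
]


def types_of(thing, triples):
    """Return the set of class URIs that thing is declared to have."""
    return {o for (s, p, o) in triples if s == thing and p == RDF_TYPE}


classes = types_of("http://webscience.org/person/7", triples)
print(sorted(classes))
```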

Namespaces

For URIs it's common to define a bunch of related concepts in the same "namespace". A namespace is a bit like a directory on a filesystem; it usually ends with either "/" or "#", and the IDs in the namespace generally don't contain "/" or "#" as that confuses things.

You will probably define your own namespace for your own concepts, such as your organisation's members, or the bus stops nearby, but for classes and predicates you'll often use other people's namespaces. In RDF+XML and Turtle it's common to use a namespace prefix to make things more readable. In RDF+XML you must use namespace prefixes for predicates. The following examples mean exactly the same thing:

Example 1: RDF+XML

(in RDF+XML you have to use a namespace for the predicates to make them legal XML tags)

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://webscience.org/person/7">
     <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person" />
     <foaf:name>Christopher Gutteridge</foaf:name>
  </rdf:Description>
</rdf:RDF>

Example 2: Turtle

Same data as Example 1. Turtle's 'a' keyword is shorthand for rdf:type; if you write rdf:type out in full, as here, you need to declare the 'rdf:' prefix just like any other. An annoying quirk of older Turtle parsers is that the value to the right of the prefix (e.g. the 'bar' in foo:bar) must start with a letter, not a numerical digit (Turtle 1.1 relaxed this), so this example writes the person's URI in full. If in doubt, remember the prefix is only a convenience: you can always just put full URIs in angle brackets instead.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://webscience.org/person/7> rdf:type foaf:Person .
<http://webscience.org/person/7> foaf:name "Christopher Gutteridge" .

Example 3: N-Triples

Same data as Example 1 and 2.

<http://webscience.org/person/7> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://webscience.org/person/7> <http://xmlns.com/foaf/0.1/name> "Christopher Gutteridge" .
  • See also the prefix.cc service listed in the "Tools and Services" section of this guide.
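The prefixed names in the Turtle example and the full URIs in the N-Triples example are related by nothing more than string concatenation. Here's a sketch of the expansion; the function name expand and the PREFIXES table are made up for illustration.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a name like 'foaf:Person' into its full URI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError("unknown prefix: " + prefix)
    # The namespace already ends in '/' or '#', so just append the local part.
    return prefixes[prefix] + local


print(expand("foaf:Person"))
print(expand("rdf:type"))
```

Note that this sketch only handles prefixed names; a full URI passed in by mistake would be rejected as having an unknown prefix.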

Some Common Namespaces

Here's a quick summary of the most common namespaces.

rdf - has the core parts of RDF, but usually you'll only see rdf:type.

rdfs - used for making statements about predicates and classes, also has rdfs:label and rdfs:comment which are good basic ways of giving something a label and making comments about it.

owl - used for making much more complex statements about predicates and classes. This is cool, but don't worry about it too much when you're just getting started. Also defines owl:sameAs to indicate two URIs represent the same thing (in your opinion).

dcterms - Dublin Core terms. Very useful generic properties for making statements about resources -- who created them, when, who published it, title, description etc. An older version is called 'Dublin Core Elements' and this can be confusing. In general always use dcterms. Some people use "dc" as the abbreviation, but this is confusing as it's not obvious if it's dc-terms or dc-elements, so don't do that.

foaf - Friend of a Friend. This is good for describing the facts from a person's or organisation's 'profile', things like their email address, phone numbers, names, what groups they are a member of etc.

geo - Allows you to specify the latitude & longitude of something. Even if it's a big thing, you can still give a useful reference point to navigate to.

dbpedia - DBpedia defines a URI for the primary topic of every page of Wikipedia. So http://dbpedia.org/resource/Southampton is a URI representing the city of Southampton.

Also: skos, sioc, void, dcat, geonames, doap, vcard, org, event, prog, bibo, aiiso, vann, scovo

For a list of the namespaces on prefix.cc see: http://prefix.cc/popular/all

Semantics (Schemas, OWL, Ontologies etc)

The semantic bit of the semantic web is that if you resolve the URI for a class or predicate you often get back some rdfs: and/or owl: triples describing what it means, and some semantics about it. This lets you do clever reasoning like knowing that 'ancestor' is a transitive property, so the ancestor of an ancestor of X is therefore also an ancestor of X.
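To make the 'ancestor' example concrete, here's a sketch of the inference a reasoner could draw once it knows a property is transitive. It's a naive fixed-point loop written for clarity, not how real reasoners work, and the example.org people are hypothetical.

```python
def transitive_closure(pairs):
    """Return every (x, y) implied by treating the relation as transitive."""
    closure = set(pairs)
    while True:
        # If a -> b and b -> d are known, then a -> d follows.
        new_pairs = {(a, d)
                     for (a, b) in closure
                     for (c, d) in closure
                     if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


# (person, ancestor) pairs as stated in the data.
EX = "http://example.org/people/"
stated = {
    (EX + "x", EX + "parent"),
    (EX + "parent", EX + "grandparent"),
}
inferred = transitive_closure(stated) - stated
print(inferred)
```

The data only stated two facts; the third comes purely from knowing the property is transitive.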

This is complicated, and not required to get started. If it confuses you, don't worry too much about it.

Purists insist you should write schemas. Just ignore them until YOU find a need to write a schema (eventually it turns out to be useful, but there's no hurry).

Lists

You can do lists by saying something like:

<Person> <hasToDoList> <List0001> .
<List0001> <label> "Marvin's TODO List" .
<List0001> rdf:type rdf:Seq .
<List0001> rdf:_1 "Buy Milk" .
<List0001> rdf:_2 "Walk Dog" .
<List0001> rdf:_3 "Drink Milk" .
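Since triples themselves have no order, code that reads such a list has to sort by the number in the rdf:_N property. Here's a sketch, assuming triples are plain tuples; seq_items and the example.org list URI are made up for illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LIST_URI = "http://example.org/List0001"  # hypothetical full URI

# Deliberately out of order: triples come back in no particular sequence.
triples = [
    (LIST_URI, RDF + "_2", "Walk Dog"),
    (LIST_URI, RDF + "_3", "Drink Milk"),
    (LIST_URI, RDF + "_1", "Buy Milk"),
]


def seq_items(seq_uri, triples):
    """Return the members of an rdf:Seq in numeric order."""
    members = []
    for s, p, o in triples:
        suffix = p[len(RDF):] if p.startswith(RDF) else ""
        if s == seq_uri and suffix.startswith("_") and suffix[1:].isdigit():
            members.append((int(suffix[1:]), o))
    return [item for _, item in sorted(members)]


todo = seq_items(LIST_URI, triples)
print(todo)
```

Sorting on the integer rather than the string matters once a list has ten or more entries.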

Tools and Services

prefix.cc

You can get a list of the standard abbreviations for common namespaces using http://prefix.cc/

If you're really lazy, you can get the stub of an XML document out of it, for example http://prefix.cc/foaf,dcterms,geo.rdf
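The URL in that example is just the prefixes joined with commas, so it's easy to build for whichever namespaces you need. This sketch only constructs the URL string; the function name is made up, and the pattern is taken from the single example above rather than from any prefix.cc documentation.

```python
def prefix_cc_stub_url(prefixes):
    """Build a prefix.cc URL for an RDF+XML stub declaring these prefixes."""
    return "http://prefix.cc/%s.rdf" % ",".join(prefixes)


url = prefix_cc_stub_url(["foaf", "dcterms", "geo"])
print(url)
```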

Triple Stores and SPARQL

A triple store is a bit like an SQL database, but optimised to just import, store, and query a huge pile of triples. Triple stores are queried using a language called SPARQL.

They are funky because rather than deal with triples document by document you can query over any facets of the data in the "SPARQL Endpoint". If you have the staff resources to do so, it's good practice to provide a SPARQL endpoint, but don't lose sleep if you don't.

These are useful and powerful but not required to produce and work with RDF and Open Linked Data.
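To give a feel for what querying a pile of triples means, here's a toy in-memory matcher where None acts as a wildcard. This is not SPARQL and not how a real triple store is built (those use indexes and a proper query language); the match function and the example.org URIs are purely illustrative.

```python
EX = "http://example.org/"  # hypothetical namespace

triples = [
    (EX + "Person23", EX + "name", "Marvin Fenderson"),
    (EX + "Person23", EX + "memberOf", EX + "Group003"),
    (EX + "Person42", EX + "memberOf", EX + "Group003"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None matches anything."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# Everything known about Person23.
about_person = match(triples, s=EX + "Person23")

# Who is a member of Group003?
members = [t[0] for t in match(triples, p=EX + "memberOf", o=EX + "Group003")]
print(len(about_person), members)
```

The same pattern-with-blanks idea is what a SPARQL query expresses, just over far more data and with joins between patterns.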

Validation

It's quite easy to write valid RDF serialisations, but you can check them using the W3C Validator Service. It's also useful to eyeball your data to make sure it looks sane. I use my own RDF Browser but you may prefer others. The Graphite Browser will show triples from any of N3, RDF+XML and RDFa.

Command Line Tools

There's a C library and Unix/Linux command line tool called "rapper" which can parse and validate various formats.

Guide to how to produce datasets

Well, this website is for that!

Other sites you should know about

Semantic Overflow

http://semanticoverflow.com/

Allows you to ask questions about this stuff, and see existing questions and answers.

CKAN

http://ckan.org/

This site is a comprehensive index of sources of data sets on the web (many, but not all, in RDF). You should consider registering your datasets if you want other people to find them.

GetTheData.org

This site is used by people to ask where to find open data: http://getthedata.org/



Open Linked Data

Tim BL, inventor of the web, says this is cool and you should be doing it. See if you agree...

Linked Data

What is Linked Data? It's when you either:

  • Use URIs for Subjects and/or Objects from other people's datasets.
  • Use owl:sameAs to link your identifiers to other people's.

This lets people do really cool mashups using data from multiple sources. There are over 100 known sites in the world that link to other datasets. See The Linked Data Cloud to see how they interlink.
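Here's a sketch of the kind of mashup an owl:sameAs link makes possible: facts about a local identifier are combined with facts someone else published about their identifier for the same thing. The local URI, the shelfmark property and the describe helper are all hypothetical, and the sketch follows sameAs one hop in one direction only.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

local = "http://example.org/people/julius-caesar"   # hypothetical local URI
dbpedia = "http://dbpedia.org/resource/Julius_Caesar"

my_data = [
    (local, "http://example.org/prop/shelfmark", "CAE-001"),
    (local, OWL_SAME_AS, dbpedia),
]
their_data = [
    (dbpedia, RDFS_LABEL, "Julius Caesar"),
]


def describe(uri, triples):
    """Collect facts about uri and about anything it is declared sameAs."""
    same = {uri} | {o for (s, p, o) in triples
                    if s == uri and p == OWL_SAME_AS}
    return [(p, o) for (s, p, o) in triples
            if s in same and p != OWL_SAME_AS]


facts = describe(local, my_data + their_data)
for fact in facts:
    print(fact)
```

Whether to follow a sameAs link at all is your call; it's only the publisher's opinion that the two URIs mean the same thing.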

It doesn't have to be RDF, but it usually is.

Open Data

Open data is data with an open license which makes it easy for people to reuse it with confidence, and available online, to all, without restrictions.

Open Linked Data

This is the combination of the two, obviously.