Copyright © 2007 Creative Commons Corp. Licensed under a Creative Commons Attribution 3.0 License except where otherwise noted.
This document is a request to the community of producers and consumers of RDF to follow certain practices around the use and resolution of URIs. The advice is formulated with the goal of promoting meaningful exchange and recombination of RDF artifacts and to help protect the meanings of these artifacts against the ravages of time.
The intended audience of this note is workers in the areas of life science research and health care who are building semantic web resources and tools, but it is hoped that it will be useful to others as well.
This is an editor's draft with no official standing.
The title attached to this draft is provisional.
This is a draft of a document written in response to needs expressed by the W3C Semantic Web Health Care and Life Sciences Interest Group (HCLS). It is intended for publication as an Interest Group Note on w3.org. Before publication there, it will be made to conform to standard W3C document conventions and policies.
Recent changes:
10/31 (35) change 'URI owner' to 'naming authority'
10/30 (33) more explanation of why you want resolution rules
10/30 (33) further de-emphasize resolution rules in main text
10/27 (31) replace figure series with single omnigraffle figure
10/27 (31) rework resolution appendix
10/27 (28) rework section on documents
10/25 (26) put all resolution stuff in appendix
10/25 (25) whimsical title (was "Note on Choosing and Using URIs")
10/25 (25) "term" changed to "name"
10/25 (25) fixes to terminology around "resolve" and "dereference"
10/25 (25) "nose-follow" changed to "meta-dereference"
URI Note home:
http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Tasks/URI_Best_Practices/Recommendations
How to comment on this draft:
Please put your comments on the DraftTalk wiki page, or if commenting on a "major issue," join the fray at one of the pages reserved for this purpose - see the list here. I will attempt to address all concerns and record dissenting views fairly.
This document is a request to the community of producers and consumers of RDF (see note: {what is RDF}) to follow certain practices around the use and resolution of URIs. The advice is formulated with the goal of promoting meaningful exchange and recombination of RDF, and the proposed solutions are meant to protect investments made in composing RDF by enabling a sustainable Semantic Web infrastructure.
[JW: give a non-technical intro here. how science is done, how semantic web/RDF can help.]
The focus is the use of URIs, as names of particular things or as otherwise used technically. Names that denote individuals, relationships, and classes may be used sensibly in statements. This is in contrast with the conventional use of many URIs as specifying communication endpoints. Sometimes a URI is used both ways - it dereferences to what it denotes, presumably a document-like thing - in which case there is a pun. [consider reworking last sentence - not quite right]
Failure to handle URIs wisely leads to errors, inconsistencies, and lost opportunities. The specific sources of these problems include:
Most of the advice given here may be followed without the addition of new technical infrastructure. However, obtaining adequate generality and durability requires that publishers provide resolution information and that applications understand and make use of it. One approach to doing this is described in an appendix.
The document starts by describing an approach to specifying the intended usage of names that builds on current practice. The next two sections give a sort of "protocol" for finding usage specs for names and for establishing usage specs for new names. Finally, a treatment of the case of web documents is given. To make the outline of the argument easier to follow, details on a number of topics are relegated to endnotes.
The organizing principle proposed here is that for each name in use, there is a document [was: an RDF graph] designated as specifying correct usage of the name, somewhere.
A usage spec for a name is simply a graph that is designated as one that specifies when the name should and shouldn't be used. The usage spec contains descriptive statements that use the name to refer to the name's intended referent. The description is given in prose and/or RDF assertions. If someone uses the name, their use should be consistent with what its usage spec says.
Example:
    specimen:S05-100_A_1_2.3
        a dicom:Specimen ;
        dicom:patient patient:65536 ;
        dicom:machine dicom:AVUTRIX_MULTIPLE_B7792 ;
        dicom:date_collected "2007-08-07"^^xsd:date .
could serve as a usage spec for the name specimen:S05-100_A_1_2.3: we would be saying that the name should be used to denote the intended specimen.
Prose description and RDF description are intermixed by placing the prose in a literal string related via rdfs:comment.
Underspecification - a usage spec that is not very specific - is to be discouraged, as it may easily lead to confusion.
A graph that merely uses a name to describe the name's referent is not necessarily a usage spec. Whether a graph is a usage spec depends on whether the naming authority has said that it is. [but see note: {neo-specifications}]
A usage spec will itself use a number of other names, and a full specification of the name would in principle require an understanding of those names.
Because the name's usage spec is the arbiter of what the name means and how to use it, the problem of finding usage specs, which are distributed around the network, is quite important. Stability of a usage spec is also important: changing a usage spec is a recipe for confusion, as different users of a name may rely on different versions of the usage spec without being aware that the change has occurred.
In order to enable RDF graphs that are meaningful and useful to both humans and machines, and that can be meaningfully combined with other, independently developed, graphs, the names that the graph uses must meet certain quality benchmarks. We define a name to be sustainable when it obeys the following principles:
A new name should be established for a new meaning. Establishing a name consists of 'minting' it (deciding what its spelling should be), composing a usage spec, and publishing the usage spec. The overall requirement is to establish a name that satisfies the sustainability principles (above [section?]).
Should a new name be needed, establishing a new one requires these steps:
In what follows, by "the URI" I mean "the URI that is the spelling of the name". [?? is this too labored ?? "spelling" is awkward ??]
[examples subclass, union, restriction -- not all agents can deduce via these -- but we should insist] [special note on sameAs -- extremely strong -- note: {equivalence discussion}]
http: or https: URIs. See note: {minting nonlocators}.
Doing so is not to be taken to imply that the name denotes a document.
http://example.org/mbl-lillie-building refers to the Lillie Building at MBL.
rdf:type for the name's referent using appropriate RDF statements. (E.g. owl:Thing is not informative.)
Loosely speaking, the "web of documents" is grandfathered into the semantic web, as described here, by considering web documents to be named by their URIs. If a document is obtained when interpreting a name, then by convention the name is taken to refer to the document. The document is not a usage spec for the name: a usage spec is part of the discourse being conducted in RDF, while a document is merely one more thing that one might talk about. In contrast to usage specs, mentioning a document carries no expectation that its contents are to be believed.
While establishment of usage specs for document-denoting names would often be helpful - one could state type, title, authorship, revision status and so on ("metadata") - this is difficult, at least using HTTP, and one might be forgiven for not providing one. See appendix A, below, for a hack for simultaneously publishing data and metadata using HTTP.
In order to be considered true, a statement involving the name must apply not just to what has been observed at the time the statement was written, but also to what will be observed when someone else is trying to understand or use the statement. Server responses vary with request details such as the requested language, as well as with time.
Excluding time variation, any particular server response should be taken to say that the name denotes what the response communicates, not any particular byte sequence or sequences. That is, statements about documents (or at least those not varying in time) talk about what the document says irrespective of representation details.
To articulate the allowable inferences about documents that can be made based on server response, it is proposed here to classify documents according to consistency criteria:
If one of these types is given as the domain and/or range of properties, then not only can one state the assumptions under which statements are made, but having communicated those assumptions, further inference is enabled from inspection of the document. For example, if we know that x has a length, and that only fixed documents have a length, then we can infer that x is a fixed document and therefore that it also has a checksum, which can be computed by reading the octets of x.
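As a rough illustration of the last step of that inference, the sketch below computes a length and a checksum from the octets of a document already known to be fixed. The function name `describe_fixed_document` and the choice of SHA-256 are assumptions made for the example; this note does not prescribe a checksum algorithm.

```python
import hashlib


def describe_fixed_document(octets: bytes) -> dict:
    """Derive properties that only make sense for a fixed document.

    Assumption: SHA-256 is used as the checksum; any agreed-upon
    digest would serve equally well.
    """
    return {
        "length": len(octets),  # number of octets read
        "checksum": hashlib.sha256(octets).hexdigest(),
    }
```

Because the octets of a fixed document do not vary, two agents reading the same document at different times should arrive at the same pair of values.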
Advice around documents:
http: or https: URI that has no fragment id (#); defined technically and syntactically for the purposes of this document
Assume that anything that can go wrong, will.
We've suggested various ways to find usage specs (Appendix A). If a naming authority has published more than one usage spec over time or in different places, or if someone has taken it upon themselves to change a usage spec and pass it off as correct (perhaps using a resolution rule), then there is a possibility of disagreement between usage specs.
Usage specs that differ in inconsequential ways - that is, that neither
broaden nor narrow the applicability of the name - are not in conflict.
For example, a newer usage spec may give more examples or explanation than
another, or provide statements (such as rdfs:seeAlso statements) that would
not affect the correct use of the name.
There is no formulaic way to solve true conflicts. Other things being equal, priority should go to the first usage spec for the name published by whoever was the naming authority at the time, or to a usage spec compatible with it; other published usage specs may be inconsistent with published use and should be examined with skepticism.
However, there may be rare circumstances in which it is preferable to use a revised usage spec - for example, a usage spec may be internally inconsistent in a nonobvious but easily fixed way. On the other hand, if there is inconsistency, a usage spec found via a resolution rule included in an author's RDF may be more likely to reflect the author's intent than one that was not so cited. Ultimately it is up to the community of users of the name to determine how to resolve conflicts.
When you establish a name (or after), some of what you say is meant to be constraining on all uses, while some of what you say is incidental: either advisory, hypothetical, or unimportant. A discovery that incidental information was incorrect would not force a retraction of a usage spec.
Where do we put RDF statements about a name's referent that are not supposed to be constraining on the use of the name? We have no syntactic marker in RDF that can separate one set of statements from another.
One solution is to grandfather existing ontologies by saying that this separation is an informal process or is simply not specified by this note; look elsewhere for guidance.
Another approach is to take all statements as constraining. The non-constraining statements should be placed in a separate document, and a relation placed in the usage spec relating the usage spec to the secondary description via a predicate such as rdfs:seeAlso. [This is roughly the answer given here. See issue DefinitionDelineation.]
A way to help protect against accidental collisions over time
(publication of an inconsistent usage spec or other document by a
future site administrator or naming authority)
is to have the path component of the URI contain "site version"
information in the form YYYY, YYYY-MM, or YYYY-MM-DD (example:
http://www.w3.org/2001/XMLSchema).
Future administrators following this convention will either use no date or
will put a different date in the URIs they mint.
See [cite RFC 4151 tag: URI] for further information on this convention.
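A minimal sketch of minting a URI that follows this dated-path convention is given below. The function name `mint_dated_uri`, its parameters, and the layout host/date/local-name are assumptions for illustration; the convention itself only asks that the path carry a YYYY, YYYY-MM, or YYYY-MM-DD component.

```python
from datetime import date


def mint_dated_uri(host: str, minted_on: date, local_name: str,
                   precision: str = "year") -> str:
    """Build an http: URI whose path starts with a 'site version' date."""
    formats = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}
    stamp = minted_on.strftime(formats[precision])  # KeyError on unknown precision
    return f"http://{host}/{stamp}/{local_name}"


# Hypothetical host and local name, for illustration only.
print(mint_dated_uri("example.org", date(2007, 10, 31), "mbl-lillie-building"))
```

A later administrator who mints names under a different date cannot collide with names minted this way, which is the point of the convention.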
[this practice must be detailed somewhere, but where?]
[no reference in text]
Tools that care about accessing things (endpoints, usage specs, etc.) should understand use of resolution rules, so that they can properly implement relocation and redundant sourcing.
In particular, there is often occasion to present names in a web browser
or other user-facing interface. When arranging this, be prepared to
link to a usage spec or other appropriate document using
a browser-friendly URI, e.g. by routing through a proxy. The name's
spelling may be an inadequate locator for many browsers
(e.g. urn:lsid:, info:) or it may not lead to the correct usage spec. Observe
resolution rules that will help generate a locator that can be used for
hyperlinking. [details - presentation is not same as usage spec]
[Compare ARKs, handle proxies, etc.]
[TBD. Not linked from text yet. When to use/not use sameAs,
equivalentClass, etc. Use in constraining/nonconstraining situations.
When one of these constitutes a correct usage spec. The idea of
hypothetical sameAses as a way to modulate precision and recall. blah.]
Note: {how to get persistence}
By persistence we mean the ability for a name to resolve to its referent (if a document) and meta-resolve to a usage spec over the potential lifetime of the name. This could be anywhere from seconds to decades, although it is the latter that we usually have in mind.
Persistence has two aspects:
Because persistence implies possibly outliving any individual or organization involved in establishing the name, and perhaps even their interest in keeping it resolvable, persistence requires long-term institutional commitment to identifiers and accessibility.
The importance of persistence hinges on your attitude toward
mechanisms such as resolution rules. If you believe
that peer-to-peer resolution rules (or any similar mechanism) will
be understood, then a persistence service becomes less important.
If you believe that consumers you care about either will not
understand resolution rules or will not have adequate rules,
then a persistence service is more important than it would be otherwise.
[See AttitudeTowardMigration.]
Note: {minting nonlocators}
Locators have the advantage over nonlocators in that they are more likely to lead to documents. Clients that do not understand a nonlocator natively, and that either do not understand resolution rules or lack resolution rules that lead to a usage spec or other document, may still be able to access the document if the URI is a locator. (Of course this is of no help if the link is broken.)
Rather than mint a non-locator URI, you can use a proxy service prefix to create a locator from the non-locator URI. Arrange, somehow, for everyone performing this transformation to use the same proxy prefix. State an equivalence (e.g. owl:sameAs) in case anyone uses the bare non-locator URI in RDF - or as a way of specifying what you mean by the proxy-relative form.
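The proxy-prefix transformation can be sketched as plain string concatenation, as below. The prefix `http://proxy.example.org/resolve/` and the example URN are hypothetical; a real deployment would use whatever single prefix the community has agreed on, and might need to escape characters that are not allowed in a URI path.

```python
# Hypothetical proxy prefix; everyone performing the transformation
# would need to agree on the same one.
PROXY_PREFIX = "http://proxy.example.org/resolve/"


def to_locator(uri: str) -> str:
    """Return a locator for the URI, prefixing non-locators with the proxy."""
    if uri.startswith(("http:", "https:")):
        return uri  # already a locator
    return PROXY_PREFIX + uri


def from_locator(locator: str) -> str:
    """Recover the bare non-locator URI from its proxy-relative form."""
    if locator.startswith(PROXY_PREFIX):
        return locator[len(PROXY_PREFIX):]
    return locator
```

Since both forms may appear in RDF, the equivalence between the bare URI and its proxy-relative form still has to be stated explicitly (e.g. with owl:sameAs); the code only shows how one spelling is derived from the other.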
For a concrete case study see TDWG Life Sciences Identifiers Applicability Statement.
[For summary of this issue see AttitudeTowardNonlocators; also see AttitudeTowardMigration.]
For purposes of this document the naming authority of a name is defined
to be the entity that has the "right" to
say how the name ought to be used. For locators, the naming
authority is the entity who is allowed to determine HTTP server
behavior at the designated location: 200 OK (names a document,
generally speaking), 301 (Moved Permanently), 303 (See Other), or some
other response.
Naming authority coincides with the concept of
"URI owner" as specified in section 2.2.2.1 of
Architecture of the WWW (which should be consulted in conjunction with other
URI schemes), but "URI owner" has bred some confusion around exactly what
rights are conferred and how permanent those rights are.
Note: {NCName pragmatics}
We encourage NCName suffixes, or at least SPARQL-liberalized-NCName suffixes, for all names. This helps make Turtle and SPARQL queries more concise. [explain] [explain bug in the RDF/XML spec, SPARQL's extension, etc.]
In the unlikely event a name is in wide use but its usage spec is unpublished, lost, or only ephemerally published - for example, if it is known only from use - and the naming authority cannot or will not establish a new usage spec, an expert might compose and publish a graph that they believe to correspond to community practice, and attempt to get the community to accept the graph as the specification to be followed. This neo-usage spec has no naming authority, but may be of use to the community. The neo-usage spec might be publicized using a resolution rule.
[hypothetical situation, should I flush this note? important illustration of how community process should trump priority in extreme circumstances & how there are no rigorous rules governing this process. Better approach: mint a new term, then assert that the new one and old one are equivalent as names. ]
TBD: A versioning story: database records, databases, ontologies, usage specs. Why this is critical:
Look at continuant/occurrent theory, DAV, etc.
For the purposes of this document, "RDF" means either Turtle or an established RDF standard.
Do RDFa documents qualify as RDF documents? I.e. should we recommend using them as usage specs? Problems: (1) they don't have their own MIME types, so can't be recognized or requested, and (2) they don't work with # URIs.
The location for finding a usage spec is a problem because the HTTP protocol has no native way to provide it. Often the usage spec (or similar document) is made available via simple dereference, and while this may be OK for access by humans, it leaves open the question of whether what you get when you get an OK is a usage spec or the denotation (the document) and makes reliable processing by machine difficult.
Two solutions have emerged for use with HTTP, and we recommend their use. In both cases one obtains a second URI that we call here the usage-spec-name for the name; the usage-spec-name may then be resolved to the usage spec.
Location: header in a See Other HTTP response.
Although these conventions are not in universal or exclusive use,
they are of value when you know that one of the conventions is
in use, or when the agent is forgiving enough to tolerate situations
where the putative usage-spec-name doesn't, or isn't known to, lead to a usage spec.
Note: {why new terminology?}
Here are some excuses for not reusing certain terms from RFC 2616 or Architecture of the WWW.
thing (instead of "resource") - I really mean anything, not just the resources considered in RFC 2616. Alternative: "entity" (this is being argued on the www-tag list)
locator (instead of "URL") - "URL" is defined quite broadly in various RFCs; I mean to restrict it to the least common denominator among deployed web agents
usage spec - I'm still searching for a term for definition-like things that I feel comfortable with. I have used "definition," "defining description," "defining document," "declaration," "declaration document," "correct use specification" (CUSP), "correct use recommendation", "normative description", "agreement for use", "license to use", "deed", "statement of applicability", "recommendation for use", and many variations. The idea is almost the same as "declaration page" in Booth's article, except that here it is required to be RDF, and in the terms of Architecture of the WWW it is really more of an information-resource-essence than a "page".
[Not on this note, that is - work to be left until after the note is done.]
A name may be associated with either of two kinds of document:
To resolve a name, a set of applicable asserted resolution rules is found (perhaps via query). Rules are meant to resolve names to their referents or names to their usage specs. Often this is done by replacing the name with another name: either a synonym, or, in the case of meta-resolution, a second name that denotes the first name's usage spec (a "usage-spec-name").
One standard resolution rule expresses the common treatment of # URIs: The URI's racine (the part before the #) is specified to be a usage-spec-name (a name for its usage spec).
The default (when no rule applies) is to attempt to dereference (or
meta-dereference) the URI. This
means using standard protocols
(cf. IANA URI scheme registry) guided by the spelling of the name.
Some URI schemes, such as ftp: and data:,
only specify how to dereference,
while others may give separate methods for dereference and meta-dereference.
An important third case is that of the HTTP protocol, where the distinction has
been overlaid on existing practices. (A protocol designed with the
usage spec/denotation distinction in mind would have simply provided two
different access methods for the two cases. You know who you are.)
With HTTP you can't say ahead of time which document you're
looking for; you have to use the single operation
(GET) provided to retrieve one of the two, and the HTTP response code
lets you check to see whether what you got is what you
wanted [cite httpRange-14]:
(A document denoted by two names can both resolve and meta-resolve: one name dereferences to the document (200) while the other meta-resolves. The synonym relationship can be established using a resolution rule. -- enough to make you want to invent a new protocol that fixes this problem, huh?)
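The response-code check described above might be sketched as follows. This is only an illustration of the 200 versus 303 distinction as used in this note; the function name `interpret_get` and the returned labels are invented for the example, and the caller is assumed to have already performed the GET without following redirects.

```python
def interpret_get(status: int, headers: dict) -> tuple:
    """Classify the outcome of a single HTTP GET on a name's URI.

    Returns (kind, uri) where kind is one of:
      "referent"        - 200: what was retrieved is the document the name denotes
      "usage-spec-name" - 303: the Location header gives the usage-spec-name
      "unknown"         - anything else; no conclusion is drawn here
    """
    # Header names are case-insensitive in HTTP, so normalize before lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    if status == 200:
        return ("referent", None)
    if status == 303:
        location = normalized.get("location")
        if location is not None:
            return ("usage-spec-name", location)
    return ("unknown", None)
```

A client that wanted the usage spec but received "referent" (or the reverse) knows that a single GET on this name cannot give it what it asked for, and has to fall back on resolution rules or other tactics.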
These two strategies failing, a search (manual or automated) might be mounted using a search engine or a plea sent to an individual or community that might know how to resolve the name. As this is likely to be a bit of work, any resolution information that turns up ought to be passed along to anyone receiving communication from you that uses the name.
Summary of resolution tactics:
| Situation | To get usage-spec-name (meta-resolve) | To get usage spec (meta-resolve) | To get referent (resolve) |
| 1. resolution rules | | | |
| redirection | redirect rules, then usage-spec-name rules | get usage-spec-name, then resolve | redirect rules, then resolve |
| other tactics | TBD | | |
| 2. dereference | | | |
| # URI | #-truncate | get usage-spec-name, then resolve | N/A |
| http, https schemes | GET to 303 | get usage-spec-name, then resolve | GET to 200 |
| other schemes | per protocol | | |
| 3. cast a wide net | | | |
The purpose of resolution rules is essentially to deal with the "broken link" problem on the client side. They act as an insurance policy that protects against a situation where a document (including a usage spec) is available, but not by presenting the name to an HTTP client module. This can happen when content moves, when mirrored content is unavailable at its primary location, or when someone has decided (against the advice of this document) to mint a non-HTTP URI.
A broken link on the "document web" leads to inconvenience to the human reader during navigation. Broken links are generally repaired quickly because the server operator is usually motivated to make the site content work well for visitors. The operator learns of a broken link either automatically through validation and error reporting, or through complaints lodged by readers.
With the expansion of the use of URIs from navigation to use in meaningful assertions, a broken link becomes a threat to any kind of interpretation of the page, and therefore jeopardizes the value of the document per se. At the same time, the demand for shared meaningful names will lead to the use of names whose accessibility is not highly reliable or durable.
The worry is not so much over content loss as over loss of opportunity: the failure to connect a name in use with information that will make it meaningful. It is essential therefore that uses of an unresolvable name be connected somehow to documents found in secondary locations. This must be done in a way that does not require involvement of the original publisher, who may be defunct or may simply not care.
A related purpose of resolution rules is to allow the use of
RDF written using non-locators by "low-tech" client software that only
understands HTTP. This problem reduces to the first, as we may treat
challenging URIs such as tag: URIs as we would broken links.
Resolution rules are used simply by providing assertions giving the locations of usage specs and referent documents either specifically (one URI at a time) or generically (by URI string match and replacement). A producer of RDF includes in an RDF document a resolution rule for any names whose usage spec may be difficult for a consumer to find, and a consumer makes use of resolution rules using logic inserted at the point where any name is to be dereferenced.
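On the consumer side, the logic inserted at the point of dereference might look roughly like the sketch below. It assumes the rules have already been extracted from RDF into two plain Python structures (a dictionary for specific rules and a list of prefix pairs for generic ones); those structures, the function name `candidate_locators`, and the example URIs are all hypothetical.

```python
def candidate_locators(name: str, specific_rules: dict, prefix_rules: list) -> list:
    """List the URIs worth trying, in order, when dereferencing a name."""
    candidates = []
    # Specific rules: one URI at a time.
    if name in specific_rules:
        candidates.append(specific_rules[name])
    # Generic rules: string match and replacement on a leading prefix.
    for old_prefix, new_prefix in prefix_rules:
        if name.startswith(old_prefix):
            candidates.append(new_prefix + name[len(old_prefix):])
    # Default when no rule helps: try the name's own spelling.
    candidates.append(name)
    # Drop duplicates while keeping the first occurrence of each.
    return list(dict.fromkeys(candidates))


specific = {"urn:example:thing": "http://mirror.example.org/thing"}
generic = [("http://stale.example.com/", "http://current.example.com/")]
print(candidate_locators("http://stale.example.com/bland.png", specific, generic))
```

Trying rule-derived locations before the name's own spelling is a design choice made for this sketch; a consumer could just as well try the original URI first and consult rules only on failure.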
We seek answers to the following questions:
Answers are written using the relations
Trivial examples:
    <http://www.w3.org/TR/rdf-concepts/> tns:isDenotedBy "http://www.w3.org/TR/rdf-concepts/"^^xsd:anyURI .
    <http://www.w3.org/TR/rdf-schema/> tns:specifiesUsageFor "http://www.w3.org/TR/rdf-schema/type"^^xsd:anyURI .
The first just says that the document denoted by the name may be resolved by way of the URI that is the name. The second says that the /type URI usage is specified in the indicated graph; it doesn't say how that graph is to be found, which would require resolution.
Schematic rewrite rules permit the expression of rules that map names to names. There are two kinds of rewrite rules:
A rule is an instance of one of the classes above, with two string-valued properties giving the input pattern and output template for the rule. Deductions about new ways to find denotations and usage specs can be made by instantiating the pattern and template at a particular URI.
For example, the rule
    [] a tns:RedirectRule ;
        tns:hasPattern "http://stale.example.com/{more}" ;
        tns:hasTemplate "http://current.example.com/{more}" .
says that any URI matching the pattern denotes the same thing as the corresponding URI matching the template (assuming either denotes something), and permits the inference
    <http://stale.example.com/bland.png> tns:isDenotedBy "http://current.example.com/bland.png"^^xsd:anyURI .
which justifies the use of HTTP with the 'current' URI to obtain the document (or whatever) named by the 'stale' URI.
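One way to instantiate a pattern and template at a particular URI is sketched below. The note does not define the matching semantics of `{name}` placeholders, so this sketch assumes that each placeholder matches any run of characters and that the whole URI must match; the function names are invented for the example.

```python
import re


def compile_pattern(pattern: str) -> re.Pattern:
    """Turn a pattern such as 'http://stale.example.com/{more}' into a regex."""
    # Splitting on a capturing group yields literal text and placeholder
    # names alternately: [literal, name, literal, name, ..., literal].
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)      # literal text, matched exactly
        else:
            regex += f"(?P<{part}>.*)"    # placeholder, matches anything
    return re.compile(regex)


def apply_rule(pattern: str, template: str, uri: str):
    """Return the rewritten URI, or None if the rule does not apply."""
    match = compile_pattern(pattern).fullmatch(uri)
    if match is None:
        return None
    return template.format(**match.groupdict())


print(apply_rule("http://stale.example.com/{more}",
                 "http://current.example.com/{more}",
                 "http://stale.example.com/bland.png"))
```

A template that mentions a placeholder absent from the pattern raises `KeyError` here; a fuller implementation would probably reject such a rule when it is loaded.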
As an example of name to usage-spec-name conversion,
    [] a tns:MetaTermRule ;
        tns:hasPattern "{schemepath}#{frag}" ;
        tns:hasTemplate "{schemepath}" .
permits inference (assuming the URI denotes anything at all) of
    <http://example.com/hashola> tns:specifiesUsageFor "http://example.com/hashola#"^^xsd:anyURI .
which justifies the use of http://example.com/hashola as a name for the usage spec of http://example.com/hashola#. (This name can then be further resolved to get the usage spec itself.) [Probably inaccurate syntactically since the # might be in a query string, etc. - do we need more powerful matching?]
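For # URIs specifically, the same conversion can be sketched without pattern matching by cutting the URI at its first #. The function name `usage_spec_name_for` is hypothetical. In the generic URI syntax of RFC 3986 the first # begins the fragment, so an unescaped # cannot occur inside a query string; whether that settles the matching question raised above is left open here.

```python
def usage_spec_name_for(uri: str):
    """Return the racine (the part before the first '#') of a # URI.

    Returns None when the URI contains no '#', since the #-truncation
    rule does not apply in that case.
    """
    racine, separator, _fragment = uri.partition("#")
    if not separator:
        return None
    return racine


print(usage_spec_name_for("http://example.com/hashola#"))
```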
Figure legend:
owl:sameAs, which has no imputation
that the terms have the
same usage spec. However owl:sameAs can be used for alternative
resolution by denoting the thing in the owl:sameAs assertion
using two different names.
Special thanks to David Booth for help with document organization and technical issues.
The following people commented on drafts: (reverse chronologically) Dan Corwin, David Booth, Alan Bawden, Sankar Virdhagriswaran, Gerald Jay Sussman, Jake Beal, Eric Prud'hommeaux, Bijan Parsia, Mark Tobenken, Chimezie Ogbuji, Kaitlin Thaney. Thank you.