HTTP vs. HTTPS in resource identification

Angelegt von Hentschke, Jana, zuletzt geändert am 2019-02-01

Sie zeigen eine alte Version dieser Seite an. Zeigen Sie die aktuelle Version an.

Unterschiede anzeigen Seitenhistorie anzeigen

In November 2018 at SWIB18 conference a breakout session "Cool URIs might be insecure if they don't change" took place. Participants wished to find a place to record the results and follow up on the discussion. This wiki page is meant to serve this purpose.

It is hosted by the German DINI-AG Competence Centre Interoperable Metadata (KIM). Further exchange of any parties interested in the topic could be initiated through comments on this page or through the mailing list of KIM's Working Group "Identifiers" (mailing list registration, maling list public archive).

SWIB18 Breakout Session: Cool URIs might be insecure if they don't change

2018-11-27. Bonn, Germany. Friedrich-Ebert-Stiftung.

Participants: Lars G. Svensson (Deutsche Nationalbibliothek) (Initiator), Jana Hentschke (Deutsche Nationalbibliothek) (Initiator), Raphaëlle Lapôtre (Bibliothèque nationale de France), Pascal Christoph (Hochschulbibliothekszentrum NRW (hbz)), Michele Casalini (Casalini Libri), Carsten Klee (Staatsbibliothek zu Berlin - Preußischer Kulturbesitz), Tom Baker (Dublin Core Metadata Initiative), Alexander Jahnke (Niedersächsische Staats- und Universitätsbibliothek Göttingen), Martin Scholz (Friedrich-Alexander-Universität Erlangen-Nürnberg), Joachim Laczny (Staatsbibliothek zu Berlin – Preußischer Kulturbesitz)

Teaser:

Many (most?) providers of linked data publish their resources using http URIs as identifiers. http, however, is a very insecure protocol and there is a movement towards making the web – and thus the Semantic Web – a more secure and trusted place by moving from http to https. For the cosmos of linked RDF data, the seemingly small addition of one character changes a lot: the URIs of a resource. And cool URIs don't change ...

Redirects from http to https URIs seem to solve the problem at first glance. But do they really? As long as an initial request for a resource is directed at its cool http-URI, there is unprotected exchange of data that could be intercepted (or even altered). On the other hand, changing http://example.org/resource123 to https://example.org/resource123 might cause some discomfort at the data consumer side as data stores need to be updated and queries amended.

How are "Cool URIs" to be weighed out against trustworthiness of data providers, privacy of users ... ?

Introductory slides: Should the Semantic Web switch to HTTPS.pptx (by Jana)

Discussion notes (arranged by subtopics by Jana, compiled from memory and collective etherpad notes):

Why can identification and data transfer not be considered two different things?

Statement #1 of the Linked Data Principles says "Use URIs as names for things" and statement #2 says "Use http URIs to that people can look up those names". This stresses that URIs are used for the purpose of identification as well as to initiate data transfer.

Why do 301 redirects not solve the problem?

The problem is that http traffic can be intercepted, so redirects aren't a solution, since the initial http request can be intercepted and e.g. a 301 redirect can be replaced by a redirect to another resource

Why do owl:sameAs relations not solve the problem?

On the data level they do solve the problem - when inferencing is applied (which is the idea about linking anyway - the linked data cosmos is principally able to handle different URIs for the same things).

On the protocol level, however, it doesn't help. When you have an http:xxx sameAs https:xxx you have to exploit the (insecure) http data first, before going to https. Any http transfer makes the data exchange insecure. So owl:sameAs relations between http and https URIs just blow up the data volume and doesn't solve the problem. According to Halpin there is no way in RDF to say that one URI is equivalent to another URI (owl:sameAs links individual/entities but not the identifiers).

Is it really so important for our kind of data to make them interception-prove?

Noone has heard of a case of library data being manipulated in bad faith. However, this argument doesn't work out because we are talking about a matter of principle. Trustworthy data exchange is needed (no matter if libraries should be the first to demand it).

Why are identifiers not independent from protocol?

That currently is the way it is defined in the RDF standard.

Linked Data is about the web. Deploying something like a resolver in between breaks the web approach: you have to educate users to use the resolver instead (which btw has other disadvantages - bad experience with resolvers with DOI because the resolvers behave so differently. BnF experience: It's not sufficient to set an ARK on everything)

Halpin thinks that it would be a logical step in the future for the RDF standard to declare https und http URIs equivalent. Recently there is much discussion on the semweb mailing list regarding features for RDF2 (incl. other topics such as literals as subjects, b-nodes as predicates etc.)

Why not simply change to https? What happens, exactly, if URIs are not cool and change?

The problem is not your own data, but everyone else that uses it. Even with long-term 301 redirects from the http to the succeeding https URIs the clients/other systems still need to look up the data they already have to find out that there was a change and that two URIs refer to the same thing. The problem is when you have a mixture of legacy and new data, e.g. SPARQL queries will be much harder to write. It can be argued that this, today, affects a relatively small group which would need to adapt only this one time.

Why not start assigning https identifiers to new resources while keeping http for the existing ones?

We could assign https to new data and keep http for "old" data. However, that would be awkwardly inconsistent and hard to explain to non-LD people. And still involves http traffic.

How do search engines react?

Websites get downgraded by search engines if they don't serve https. Also, if you have mixed content, browsers will flag that.

How to prevent mixed content warnings?

When http resources are linked from https resources, modern browsers will display "mixed content" warnings. Assuming there will always remain linked data providers that don't use https-URIs, mixed content warnings are likely to occur when data is being linked. This is the chicken or the egg problem. It can be considered another argument for doing nothing: if not everyone does it, the ecosystem will be http-based anyway. (This attitude, however, will not enable improvement ...)

Can the problem with http identifiers be solved on the client side?

Why can't e.g. SPARQL clients secretly update to HTTPS before sending a query - just as modern browsers do.

Another idea: applying a checksum upon data transfer to make sure that data can not be corrupted? To prevent those checksums from being manipulated over http, the checksum would be the same in http and https, so you would be able to see that the checksum isn't the same over http and https.

Relying on technical advancement and client side is also what is recommended in a 2016 W3C Blog post: (summarised by Sandro Hawke) '[...] keep writing “http:” and trust that the infrastructure will quietly switch over to TLS (https) whenever both client and server can handle it. Meanwhile, let’s try to get SemWeb software to be doing TLS+UIR+HSTS and be as secure as modern browsers.'

The question is: why wait? Harry Halpin sees no reason to wait for RDF standard or client side changes because the (semantic) web is young enough to change to https.

What about URIs of RDF Element Sets|Vocabularies

Does it make sense to change them to https as long as rdf, owl, skos, xsl ... don't change? Policy of these is to wait and hope that HSTS and UIR will still become good.
DCMI has had only one request for https so far and therefore currently doesn't see urgent need for action.

Who does (or thinks) what?

Bibliothèque nationale de France has upgraded to https in the data on the web pages, but not in the SPARQL endpoint. Switching the website to https was a decision made by the directorate aimed at the positive image of the institution when maintaining a secure website. And BnF projects are about bringing traffic from search engines (see above).
The unchanged data at the SPARQL endpoint was not so much in focus because the endpoint is not much used and also they are looking at new API on top of that endpoint. Perhaps the http/https issue can be addressed in that layer.

Since there are several active users of the DNB Linked Data Service in the room they are asked: What happens to your system if DNB changed to HTTPS tomorrow? One wouldn't mind at all. Two would need advanced warning, but general attitude "changes are ok as technology advances. Just need to be announced". Users of the Zeitschriftendatenbank Linked Data Service are assumed to have a similar attitude but would need to be asked.

Conclusion (emotion/tendency) of the group

Leans towards switching to https URIs as early as possible with advanced warning to data users.

Materials

W3C Blog post "HTTPS and the Semantic Web/Linked Data", 20 May 2016 (gist: keep writing “http:” and trust in infrastructure and software to advance)
Harry Halpin: "Semantic Insecurity: Security and the Semantic Web", October 2017 (gist: switch now to https it's not too late yet and the web will be more secure immediately)
2017 "The use of https: IRIs on the semantic web" thread on semantic-web@w3.org (about https in vocabulary URI schemes in particular, gist: strong tendency to go for https)
Discussion 2015 in the W3C Web Annotation Working Group (gist: some statements for "don't use https URIs as identifiers" but discussion closed down for other, formal reasons)
W3C Design Issues, Tim Berners-Lee "Web Security - TLS Everywhere, not https: URIs", 2015 (gist: the "s" breaks the web, use HTTPS but don't change the URI prefix)
Discussion page of the Wikidata community 2017-2018 (gist: no change to https URIs. Possibly more due to the fact that the discussion subsided rather than an active decision)

More observations/inquiries results

Wikidata has a 301 redirect from http to https URIs in place but retains http URIs as "Concept URIs", i.e. in RDF data.
schema.org vocabulary retains http in the URIs, has a 301 redirect from http to https URIs and a triple in the RDF data connecting the two with schema:sameAs, e.g. schema:CreativeWork schema:sameAs <https://schema.org/CreativeWork> . (https://schema.org/CreativeWork.ttl, Status 2019-02-01)

Keine Stichwörter

Seitenhierarchie