Friday 26 October 2012

Doing RSS right (2) - including content

In addition to the issues I described in 'Doing RSS right', there's another problem with RSS feeds, though at least this one doesn't apply to Atom.

The problem is that there's nothing in RSS to say whether the various blocks of text are allowed to contain markup, and if so which. Apparently (see here):
"Userland's RSS reader—generally considered as the reference implementation—did not originally filter out HTML markup from feeds. As a result, publishers began placing HTML markup into the titles and descriptions of items in their RSS feeds. This behavior has become expected of readers, to the point of becoming a de facto standard"
This isn't just difficult, it's unresolvable. If you find

<strong>Boo!</strong>

in feed data you simply can't know whether the author intended it as an example of HTML markup, in which case you should escape the brackets before including them in your page, or as 'Boo!', in which case you are probably expected to include the data as it stands.
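To make the difference concrete, here's a tiny Python sketch (standard library only) showing the two possible treatments of that snippet:

    import html

    snippet = "<strong>Boo!</strong>"

    # Interpretation 1: the author meant literal text, so escape before output.
    print(html.escape(snippet))   # &lt;strong&gt;Boo!&lt;/strong&gt;

    # Interpretation 2: the author meant markup, so include it as it stands
    # (the browser then renders a bold 'Boo!').
    print(snippet)                # <strong>Boo!</strong>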

And if you are expected to include the data as it stands you have the added problem that including HTML authored by third parties in your pages is dangerous. If they get their HTML wrong they could wreck the layout of your page (think of a missing close tag) and, worse, they could inject JavaScript into your pages or open you up to cross-site scripting attacks by others. As I wrote here and here, if you let other people add arbitrary content to your pages then you are essentially giving them editing rights to the entire page, and perhaps the entire site.

However, given how things are, unless you know from agreements or documentation that a feed will only ever contain plain text you are going to have to assume that the content includes HTML. Stripping out all the tags would be fairly easy, but probably isn't going to be useful because it will turn the text into nonsense - think of a post that includes a list.

The only safe way to deal with this is to parse the content and then only allow through the subset of HTML tags and/or attributes that you believe to be safe. Don't fall into the trap of trying to filter out only what you consider to be dangerous, because that's almost impossible to get right, and don't let all attributes through because they can be dangerous too - consider <a href="javascript:...">.

What should you let through? Well, that's hard to say. Most of the inline elements, like <b>, <strong>, <a> (carefully), etc., will probably be needed. Also at least some block-level elements - <p>, <div>, <ul>, <ol>, etc. And note that you will have to think carefully about the character encoding both of the RSS feed and of the page you are substituting it into, otherwise you might not realise that +ADw-script+AD4- could be dangerous (hint: take a look at UTF-7).
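To see why, here's a minimal Python sketch of what that innocuous-looking string turns into if something downstream decides the data is UTF-7:

    # These bytes look harmless if you treat them as ASCII or UTF-8...
    data = b"+ADw-script+AD4-"

    print(data.decode("ascii"))   # +ADw-script+AD4-

    # ...but decoded as UTF-7 they become real markup.
    print(data.decode("utf-7"))   # <script>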

If at all possible I'd try to avoid doing this yourself and use a reputable library for the purpose. Selecting such a library is left as an exercise for the reader.
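If it helps to see the shape of such code, here's a sketch using the Python bleach library as one possible choice; the tag and attribute allow-lists below are illustrative assumptions, not recommendations:

    import bleach

    # Allow-lists of tags and attributes believed to be safe - these are
    # assumptions for illustration; adjust them for your own pages.
    ALLOWED_TAGS = {"p", "div", "ul", "ol", "li", "b", "strong", "i", "em", "a"}
    ALLOWED_ATTRIBUTES = {"a": ["href", "title"]}

    def sanitise(feed_html):
        # Anything not on the allow-lists is dropped, and href values are
        # restricted to http/https/mailto, which blocks javascript: URLs.
        return bleach.clean(
            feed_html,
            tags=ALLOWED_TAGS,
            attributes=ALLOWED_ATTRIBUTES,
            protocols=["http", "https", "mailto"],
            strip=True,
        )

    print(sanitise('<p onclick="evil()">Hi <a href="javascript:evil()">there</a></p>'))
    # -> <p>Hi <a>there</a></p>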

See also Doing RSS right (3) - character encoding.

Doing RSS right - retrieving content

Feeds, usually RSS but sometimes Atom or other formats, are a convenient way of including syndicated content in web pages - indeed the last 'S' of 'RSS' stands for 'syndication' in one of the two possible ways of expanding the acronym.

The obvious way to include the content of a feed in a dynamically-generated web page (such as the 'News' box on the University's current home page) is to include, in the code that generates the page, something that retrieves the feed data, parses it, and then marks it up and includes it in the page.
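For concreteness, the obvious approach might look something like this in Python, using the widely-used feedparser library (the URL and the markup are placeholders):

    import feedparser

    FEED_URL = "https://example.org/news/feed.rss"   # placeholder

    def render_news_box():
        # Fetched and parsed afresh on every single page load - which is
        # exactly where the problems described below come from.
        feed = feedparser.parse(FEED_URL)
        items = [
            '<li><a href="%s">%s</a></li>' % (entry.link, entry.title)
            for entry in feed.entries[:5]
        ]
        # (In real code the titles and links would need the sanitising
        # described in part 2 before being dropped into the page.)
        return "<ul>\n%s\n</ul>" % "\n".join(items)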

But this obvious approach comes with some drawbacks. Firstly, the process of retrieving and parsing the feed may be slow and resource-intensive. Doing this on every page load may slow down page rendering and will increase the load on the web server doing the work - it's easy to forget that multiple page renderings can easily run in parallel if several people look at the same page at about the same time.

Secondly, fetching the feed on every page load could also place an excessive load on the server providing the feed - this is at least impolite and could trigger some sort of throttling or blacklisting behaviour.

And thirdly, what happens if the source of the feed becomes unreachable? Unless it's very carefully written, the retrieval code will probably hang waiting for the feed to arrive, probably preventing the entire page from rendering and giving the impression that your site is down, or at least very slow. And even if the fetching code can quickly detect that the feed really isn't going to be available (and doing that is harder than it sounds), what do you then display in your news box (or equivalent)?

A better solution is to separate the fetching part of the process from the page-rendering part. Get a background process (a cron job, say, or a long-running background thread) to periodically fetch the feed and cache it somewhere local, say in a file, in a database, or in memory for real speed. While it's doing this it might as well check the feed for validity and only replace the cached copy if it passes. This process can use standard HTTP mechanisms to check for changes in the feed and so only transfer it when actually needed - it's likely to need to remember the feed's last modification timestamp from each fetch to make this work.
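Here's a hedged sketch of such a background fetcher in Python, using the requests library and a local cache file; the URL, the file names and the reliance on the Last-Modified header are assumptions, and real code would also want logging and validity checking:

    import os
    import requests

    FEED_URL = "https://example.org/news/feed.rss"    # placeholder
    CACHE_FILE = "/var/cache/feeds/news.rss"          # placeholder
    STAMP_FILE = CACHE_FILE + ".last-modified"

    def refresh_feed():
        headers = {}
        # Send back the Last-Modified value from the previous fetch so the
        # origin server can reply "304 Not Modified" instead of resending.
        if os.path.exists(STAMP_FILE):
            with open(STAMP_FILE) as f:
                headers["If-Modified-Since"] = f.read().strip()

        response = requests.get(FEED_URL, headers=headers, timeout=10)
        if response.status_code == 304:
            return  # the cached copy is still current

        response.raise_for_status()

        # Only replace the cached copy once the new one has arrived in full;
        # feed validity checking would go here too.
        with open(CACHE_FILE, "wb") as f:
            f.write(response.content)
        if "Last-Modified" in response.headers:
            with open(STAMP_FILE, "w") as f:
                f.write(response.headers["Last-Modified"])

Run from cron every few minutes, something like this keeps the cache reasonably fresh without touching the origin server more than necessary.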

That way, once you've retrieved it once you'll always have something to display even if the feed becomes unavailable or the content you retrieve is corrupt. It would be a good idea to alert someone if this situation persists, otherwise the failure might go unnoticed, but don't do so immediately or on every failure since it seems common for some feeds to be at least temporarily unavailable. Since the fetching job is parsing the feed anyway, it could store the parsed result in some easily digestible format to further reduce the cost of rendering the content into the relevant pages.
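For that last point, here's a small sketch of what 'easily digestible' might mean in practice, again assuming feedparser and using JSON (the file names are placeholders):

    import json
    import feedparser

    def cache_parsed(cache_file="/var/cache/feeds/news.rss",      # placeholder
                     parsed_file="/var/cache/feeds/news.json"):   # placeholder
        feed = feedparser.parse(cache_file)
        # Keep just the fields the news box needs; rendering the page then
        # becomes a cheap JSON load rather than a full feed parse.
        entries = [
            {"title": e.get("title", ""), "link": e.get("link", "")}
            for e in feed.entries
        ]
        with open(parsed_file, "w") as f:
            json.dump(entries, f)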

Of course this, like most caching strategies, has the drawback that there will now be a delay between the feed updating and the change appearing on your pages - in some circumstances the originators of feeds seem very keen that any changes are visible immediately. In practice, as long as they know what's going on, they seem happy to accept a short delay. There's also the danger that you will be fetching (or at least checking) a feed that is no longer used or is very rarely viewed. Automatically keeping statistics on how often a particular feed is actually included in a page would allow you to tune the fetching process (automatically or manually) to do the right thing.

If you can't do this, perhaps because you are stuck with a content management system that insists on doing things its way, then one option might be to arrange to fetch all feeds via a local caching proxy. That way the network connections made for each page view will be local and should succeed. Suitable configuration of the cache should let you avoid hitting the origin server too often, and you may even be able to get it to continue to serve stale content if the origin server becomes unavailable for a period of time.

See also Doing RSS right (2) - including content and Doing RSS right (3) - character encodings.