debuggable

 

Update to the RSS feed parser Model

Posted on 6/9/06 by Felix Geisendörfer

A couple of days ago I was contacted by James Archer of Forty Media, who pointed out a little issue with the RSS Model I developed a while ago. Even though it works well for parsing blog feeds like the ones WordPress puts out, it had difficulties with podcast feeds and other feeds that use self-closing nodes, i.e. the <node ... /> notation.

This was due to the fact that I forgot to consider this node type in XML.

While trying to fix this issue, I noticed that some feeds use line breaks inside their tag attributes. For example, something like this:

<node attr1="valueC"
     attr2="valueB"

So I ended up remodelling the regex that matches the elements inside the <channel> and <item> elements of an RSS feed. What I came up with looks sort of nightmarish if you aren't pretty familiar with regex, but it has a lot of advantages over the original one. So for those interested in regex, here comes a little comparison:

The old Regex looked like this:
/\<(.+)(.*)\>(.*)\n?\<\/\\1\>/

It had two big weak points:

  • It assumes the dot (.) never has to match a line break, which is wrong
  • It does not match <node ... />-style nodes
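
To make the two weak points concrete, here is a small sketch in Python. The RSS model itself is not written in Python, so this is only an illustration. It assumes the old pattern carries over to Python's re module unchanged, with the \\1 of the PHP string becoming the backreference \1. The sample nodes and the helper name old_matches are made up for the demonstration.

```python
import re

# Assumed Python translation of the old pattern (greedy, no modifiers).
OLD = re.compile(r'<(.+)(.*)>(.*)\n?</\1>')

# Hypothetical sample nodes, not taken from a real feed.
simple = '<title>Hello</title>'
self_closing = '<enclosure url="http://example.com/a.mp3" />'
multi_line_attrs = '<node attr1="valueA"\n      attr2="valueB">text</node>'
multi_line_value = '<description>line one\nline two</description>'


def old_matches(xml):
    """Return True if the old pattern finds an element in xml."""
    return OLD.search(xml) is not None


for label, xml in [('simple', simple),
                   ('self-closing', self_closing),
                   ('multi-line attributes', multi_line_attrs),
                   ('multi-line value', multi_line_value)]:
    print(label, '->', old_matches(xml))
```

Only the simple single-line node is matched. The self-closing node fails because there is no closing tag for the backreference to hit, and the two multi-line nodes fail because the dot stops at the line break.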

The unoptimized version of the new Regex looks like this:
/\<(.+)( .*)?\>(.*)\<\/\\1\>|\<(.+)( .*)?\/\>/sU
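
For readers who want to try this pattern outside of PHP, here is a rough Python sketch. It rests on two assumptions: the /s modifier maps to re.DOTALL, and the /U (ungreedy) modifier, which Python's re does not have, can be emulated by making every quantifier lazy. The names NEW, SAMPLE and parse_elements are hypothetical, and the feed fragment is invented.

```python
import re

# Assumed translation: /s becomes re.DOTALL, and /U (ungreedy) is emulated
# by making every quantifier lazy, since Python's re has no such flag.
NEW = re.compile(
    r'<(.+?)( .*?)??>(.*?)</\1>'   # <name attrs>value</name>
    r'|<(.+?)( .*?)??/>',          # <name attrs />
    re.DOTALL,
)

# Hypothetical feed fragment for illustration only.
SAMPLE = '''<title>Episode 1</title>
<enclosure url="http://example.com/ep1.mp3"
           type="audio/mpeg" />
<description>First line
second line</description>'''


def parse_elements(xml):
    """Return (name, attributes, value) tuples; value is None for <node />."""
    elements = []
    for m in NEW.finditer(xml):
        if m.group(1) is not None:
            # First alternative: element with an explicit closing tag.
            elements.append((m.group(1), m.group(2), m.group(3)))
        else:
            # Second alternative: self-closing element.
            elements.append((m.group(4), m.group(5), None))
    return elements


for name, attrs, value in parse_elements(SAMPLE):
    print(repr(name), repr(attrs), repr(value))
```

In this sketch the self-closing enclosure node with its line-wrapped attributes and the description with a line break in its value are both picked up, alongside the plain title node.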

Now while this version fixes all the issues from the regex above, it still has one problem: it's slow. For a normal-sized RSS feed this regex causes the parsing to take ~1 second. Not too bad for one feed, but on my Cake News site I use 10 feeds right now, so this gets me to 10 seconds of feed parsing.

The problem is the /s modifier that turns on line break matching for the dot (.) character. Now in my regex I only need this behavior on attributes and node values, but not for the node names. So I came up with this little optimization:
/\<(.+)( [^\x00]*)?\>([^\x00]*)\<\/\\1\>|\<(.+)( [^\x00]*)?\/\>/U

Here I removed the /s and replaced the dots where I needed line break matching with the character class [^\x00], which basically means "match any character but 0x00". Since I doubt there is an RSS feed out there containing 0x00 (a binary NUL byte), this should be a safe thing to do. It speeds up the parsing by a factor of about 4, which is nice.
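
If you want to check the two variants against each other, the following Python sketch may help. It is a translation under the same assumption as before (the /U modifier emulated with lazy quantifiers), and all names in it (DOTALL_VERSION, NUL_CLASS_VERSION, parse_elements, time_pattern) are hypothetical. It only verifies that both patterns extract the same elements from an invented fragment and prints raw timings. It makes no claim that the factor of about 4 carries over to Python's regex engine.

```python
import re
import timeit

# Assumed Python translations; /U is emulated with lazy quantifiers.
DOTALL_VERSION = re.compile(
    r'<(.+?)( .*?)??>(.*?)</\1>|<(.+?)( .*?)??/>', re.DOTALL)
NUL_CLASS_VERSION = re.compile(
    r'<(.+?)( [^\x00]*?)??>([^\x00]*?)</\1>|<(.+?)( [^\x00]*?)??/>')

# Hypothetical feed fragment for illustration only.
SAMPLE = '''<title>Episode 1</title>
<enclosure url="http://example.com/ep1.mp3"
           type="audio/mpeg" />
<description>First line
second line</description>'''


def parse_elements(pattern, xml):
    """Return (name, attributes, value) tuples; value is None for <node />."""
    elements = []
    for m in pattern.finditer(xml):
        if m.group(1) is not None:
            elements.append((m.group(1), m.group(2), m.group(3)))
        else:
            elements.append((m.group(4), m.group(5), None))
    return elements


def time_pattern(pattern, xml, repeats=200):
    """Seconds needed to parse xml `repeats` times with pattern."""
    return timeit.timeit(lambda: parse_elements(pattern, xml), number=repeats)


same = (parse_elements(DOTALL_VERSION, SAMPLE)
        == parse_elements(NUL_CLASS_VERSION, SAMPLE))
print('same elements:', same)
print('with DOTALL:  ', time_pattern(DOTALL_VERSION, SAMPLE))
print('with [^\\x00]:', time_pattern(NUL_CLASS_VERSION, SAMPLE))
```

The node name group still uses the plain dot in the second pattern, so it can no longer run across line breaks, while attributes and values can.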

Get the new Version

So if you stayed awake while reading through my regex explanation, here comes the reward in the form of the link to the new version of the RSS model.

-- Felix Geisendörfer aka the_undefined

 
