CJU.com - Parsing and Using Blogger & Flickr Feeds with Perl

Parsing Flickr and Blogger Feeds with Perl and XML::Parser
By Christopher Uriarte (chrisjur@CJU.com)
Id: parseflickrblogger.cgi,v 1.2 2006/02/23 22:40:35 chrisjur Exp

You may have noticed a few places on this site where there is a table that looks something like this:

New on CJU.com
[Feed updated every 15 minutes - see details]
Latest from the CJU.com blog:

Latest pics from my Flickr Gallery:

The table above shows the last five entries posted to my blog, as well as the last six photos posted to my photostream on Flickr. If you're reading this, you probably already know what Flickr and Blogger are, so I won't waste our time going into that, but you may not know that the content you post on both of those sites can be syndicated via RSS and Atom-style feeds, both rather easily. If you're not familiar with what RSS is, take a quick read of the Wikipedia article, which addresses both RSS and Atom technologies.

There are many ways to read syndicated XML feeds: Web browsers such as Safari and Firefox have built-in support; third-party applications like NetNewsReader provide robust interfaces for reading feeds; and portals such as Google's customized front page put feeds at your fingertips every time you fire up your browser and go to Google's homepage. Reading RSS feeds is no problem, but what happens if you want to use the contents of these feeds for something?

The good news is that RSS and Atom feeds are both XML-based and both have standard, well-documented formats, although the technical documents can be a bit tricky when you just want to bang out a quick web app without getting into the nitty-gritty of the protocol spec. This article talks about how you can use Perl to retrieve, parse and utilize these types of feeds.

For simplicity's sake, I'll be focusing on the Atom format, which is easier to parse, IMO, and is the only format currently available via Blogger. Flickr, thankfully, provides the flexibility to syndicate feeds in just about any XML/RSS/Atom format available today.

Determining the URL of Your Feed
First and foremost, you must figure out what the URL is for the feed that you are trying to grab.

For Blogger feeds, it's pretty simple. The format of your Blogger feed is always:

http://yourblog.blogspot.com/atom.xml

For example, my blog URL is http://chrisjur.blogspot.com, so my Blogger Atom feed is http://chrisjur.blogspot.com/atom.xml.

For Flickr feeds, it's a bit more complicated. Flickr allows you to specify feeds for many different attributes of the site, which can include feeds for specific users and tags. Flickr has a specification for how to configure your feeds here. To get you started, however, it's easy to use your photostream feed, which represents the last ten photos that you've uploaded to Flickr. The feed's URL uses the following format for Atom feeds:

http://www.flickr.com/services/feeds/photos_public.gne?id=your_flickr_nsid&format=atom_03

where "your_flickr_nsid" should be substituted with your Flickr NSID, which is a unique identifier for you on the Flickr site. Note that this is NOT your Flickr username. You can have Flickr automatically generate this URL for you by going to your photostream and clicking on the "Feeds for (your username's) photostream Available as RSS 2.0 and Atom" link at the bottom of the page. Copy and paste this URL and keep it in a safe place. Note that your Flickr NSID will be in the URL after the "id=" token of the query string. My feed URL, for example, looks like this:

http://www.flickr.com/services/feeds/photos_public.gne?id=64426228@N00&format=atom_03
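
If you keep more than one feed around, it can be handy to build these URLs in code rather than pasting them. The following is a minimal sketch of the two URL formats described above; the function names blogger_feed_url and flickr_feed_url are made up for illustration, and the NSID shown is the one that appears in the $url setting of the script later in this article.

```perl
use strict;
use warnings;

# Hypothetical helper: build the Atom feed URL for a blog hosted on
# blogspot.com. $blog_name is the subdomain, e.g. "chrisjur".
sub blogger_feed_url {
    my ($blog_name) = @_;
    return "http://$blog_name.blogspot.com/atom.xml";
}

# Hypothetical helper: build the Atom feed URL for a Flickr photostream
# from a Flickr NSID (not a username).
sub flickr_feed_url {
    my ($nsid) = @_;
    return 'http://www.flickr.com/services/feeds/photos_public.gne'
         . "?id=$nsid&format=atom_03";
}

print blogger_feed_url('chrisjur'), "\n";
print flickr_feed_url('64426228@N00'), "\n";
```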

Required Perl Modules
There are a few ways to attack the parsing of the feed with Perl. Many people still parse XML "manually", by writing their own parsers that use a lot of regular expressions and pattern matching. This can be a lot of work, and you're always at risk of slight changes in the feed format throwing your parsing routines off. There are several Perl modules written for RSS and Atom feeds, such as XML::Atom, but I have found that the interfaces to these modules only allow you to pull out certain attributes of the feed, which limits what you can do with it. Many of the RSS/Atom modules, however, are built on top of Perl's XML::Parser module, which is a generic, event-based parser based on the Expat C XML parser. For flexibility's sake, I'll be using XML::Parser to do all of my feed parsing.

XML::Parser does not provide a way to retrieve these feeds via HTTP, however, so you'll have to do that yourself. The easiest way is to use the libwww-perl package, which provides you with the "LWP::" module namespace. Therefore, to get started, you'll have to march down to your local CPAN and pick up XML::Parser and LWP. To check whether you have these installed, issue these two commands:

perl -MLWP::Simple -e "print;"
perl -MXML::Parser -e "print;"

They should both return nothing. If you see something like this, however, you don't have the module installed:

Can't locate Foobar.pm in @INC (@INC contains: /System/Library/Perl/5.8.6/darwin-thread-multi-2level /System/Library/Perl/5.8.6 
/Library/Perl/5.8.6/darwin-thread-multi-2level /Library/Perl/5.8.6 
/Library/Perl /Network/Library/Perl/5.8.6/darwin-thread-multi-2level
/Network/Library/Perl/5.8.6 /Network/Library/Perl /System/Library/Perl/Extras/5.8.6/darwin-thread-multi-2level
/System/Library/Perl/Extras/5.8.6 /Library/Perl/5.8.1 .).
BEGIN failed--compilation aborted.
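
The same check can be done from inside a script, which is useful if you want a friendlier message than the @INC dump above. This is only a sketch: module_available is a hypothetical helper name, and the check simply tries to load each module inside an eval.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the named module can be loaded, 0 if not.
sub module_available {
    my ($module) = @_;
    # Turn Foo::Bar into Foo/Bar.pm, the file name form that require accepts.
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# Report on the two modules this article depends on.
for my $module ('LWP::Simple', 'XML::Parser') {
    printf "%-12s %s\n", $module,
        module_available($module) ? 'installed' : 'MISSING';
}
```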

Parsing the Feeds
We'll start with the Flickr feed. You may just want to jump to the "Summary" part of this article if you're not interested in how the code works, but just want to get the code and the details on the supporting files. The code we use to parse the Flickr feed is as follows:

1:      #!/usr/bin/perl
2:
3:      #(c) 2005 Christopher Uriarte
4:      # rssflckr.pl - parses a Flickr Atom feed to show the most recent MAXENTRIES, links and date added
5:
6:      use LWP::Simple;
7:      use XML::Parser;
8:
9:      ####
10:     #Static Values
11:     ####
12:     $MAXENTRIES = 5;
13:     $url = 'http://www.flickr.com/services/feeds/photos_public.gne?id=64426228@N00&format=atom_03';
14:     #Tag Tracker
15:     my $thistag;
16:     #Track if we're in an <entry> block
17:     my $entryflag;
18:     #count the number of entries
19:     my $count = 0;
20:
21:     ####
22:     #Retrieve XML Data
23:     ####
24:     my $data;
25:     $SIG{ALRM} = \&timed_out;
26:     eval    {
27:             alarm(30);
28:             #get the base URL
29:             $data = get($url) || die "Can't retrieve feed at $url";
30:             alarm(0);
31:             };
32:     die $@ if $@;  #exit if the retrieval failed or timed out
33:     sub timed_out   {
34:             my $time; 
35:             $time = localtime();
36:             die "$time: Timed out while accessing feed at $url\n";
37:     }
38:
39:
40:
41:
42:     my $parser = new XML::Parser(ErrorContext => 2);
43:     $parser->setHandlers(Start => \&start_handler,
44:                           End   => \&end_handler,
45:                           Char  => \&char_handler);
46:       
47:     $parser->parse($data);
48:
49:     #We start here when we encounter an XML tag
50:     sub start_handler {
51:             my $expat = $_[0];
52:             my $element = $_[1];
53:             #print "Encountered element $element\n";
54:
55:             #If we enter an <entry> tag, we have a new element
56:             #Increase the count by 1 and set the entry flag to 1
57:             if ($element eq "entry")
58:                     {
59:                     $count++;
60:                     $entryflag = 1;
61:                     #print "Count is now $count and entryflag is set to $entryflag.\n";
62:                     }
63:
64:             if ($element eq "title" && $entryflag == 1)
65:                     {
66:                     $thistag = "title";
67:                     }
68:
69:
70:             #Grab the title and href element of the second "link" tag
71:             #exclude service.edit links, we want the "alternate" tag link
72:             #print "Encountering Element=$element,entryflag=$entryflag,dollar_1,2,3=$_[1],$_[2],$_[3] \n";
73:             if ( ($element eq "link") and ($entryflag == 1) and ($_[3] eq "alternate") )
74:                     {
75:                     $ENTRIES{$count}->{link} = $_[7];
76:                     #print "Link: $_[7]\n";
77:                     }
78:
79:
80:             if ($element eq "issued" && $entryflag == 1)
81:                     {
82:                     $thistag = "issued";
83:                     #print "Added: $_";
84:                     }
85:     }
86:
87:
88:     #This is where we handle the values within the tag
89:     sub char_handler {
90:             my ($p, $data) = @_;
91:
92:             #print the modified date
93:             if ($thistag eq "issued" && $entryflag == 1)
94:                     {
95:                     #Get the first 10 chars of the date (YYYY-MM-DD), that's all we care about
96:                     $date = substr($data,0,10);
97:                     #print "$date\n";
98:                     $ENTRIES{$count}->{date} = $date;
99:                     $thistag = "";
100:                    }
101:
102:            if ( ($thistag eq "title") and ($entryflag == 1) )
103:                    {
104:                    $ENTRIES{$count}->{title} = $data;
105:                    $thistag = "";
106:                    }
107:
108:            1;
109:
110:    }
111:
112:
113:    sub end_handler {
114:            my $expat = shift; 
115:            my $element = shift;
116:
117:            #If we're at the end of an <entry> block, clear the entry flag
118:            if ($element eq "entry")
119:                    {
120:                    $entryflag = 0;
121:                    #print "\n\n";
122:                    }
123:
124:            1;
125:    }
126:
127:
128:
129:    #Determine how many entries to display
130:
131:    #If our set maximum amt of entries is less than what we encountered
132:    #we only display up to $MAXENTRIES 
133:    if ($MAXENTRIES < $count)
134:            {
135:            $MAX = $MAXENTRIES;
136:            }
137:    #Otherwise, we display what we encountered
138:    else
139:            {
140:            $MAX = $count;
141:            }
142:
143:    #Map through the %ENTRIES hash from 1 to $MAX  to display the links
144:    for ($c=1; $c<=$MAX; $c++)
145:            {
146:            #print "Loop is $c, count is $count\n";
147:            $title = $ENTRIES{$c}->{title};
148:            $link = $ENTRIES{$c}->{link};
149:            $date = $ENTRIES{$c}->{date};
150:            print "-<A HREF=\"$link\">$title</A> (added: $date)<BR>\n";
151:            }

Here's a walkthrough of some of the code:

Lines 12-13: Configurable Values
These are the only two configurable values in the script. The $MAXENTRIES variable indicates the maximum number of entries you want to print out after parsing the feed. If your feed contains 100 entries, you may only wish to print out, say 5. The $url variable specifies the URL to your feed.
Lines 24-47: Parser Setup and Timeouts
This block retrieves your XML feed via HTTP, with a timeout of 30 seconds on the retrieval, then creates the XML::Parser object and starts the parse. If the retrieval fails, the script exits.
Lines 50-85: The XML::Parser Start Handler
This block is the tag start handler for XML::Parser, which is the sub-routine executed when a new XML tag is encountered. For the Atom feeds, we're really only interested in the tags contained within an XML <entry> tag. When we find a new entry, we increment a global counter, which serves as the key into the %ENTRIES hash. On line 75 we make this hash multi-dimensional by setting $ENTRIES{$count}->{link} to the URL of the photo. XML::Parser passes each attribute to the handler as a name followed by its value, so the URL (the value of href) arrives as $_[7] for a <link> tag written like this one, e.g.
<link rel="alternate" type="text/html" href="http://www.flickr.com/photos/chrisjur/103480915/"/>
If we've encountered the title or issued date tag, we set a variable ($thistag) that indicates what we've encountered and wait until the next sub-routine for further processing of these tags.
Lines 88-110: The XML::Parser Char Handler
This block is the tag char handler for XML::Parser, which is the sub-routine executed when we are examining the data contained between a start and end XML tag. As noted earlier, we've set flags for when we've encountered the issued date and title tags. When we encounter the contents of each of these, we add them to the %ENTRIES hash using the same key. The issued and title tags are added as $ENTRIES{$count}->{date} and $ENTRIES{$count}->{title}, respectively.
Lines 113-125: The XML::Parser End Handler
This block is the tag end handler for XML::Parser, which is the sub-routine executed when the close of an XML tag is encountered. In this sub-routine, we mainly just do some cleanup of state variables. If we've hit the end of an <entry> tag, we clear the entry flag.
Lines 131-151: Printing the Contents of your Feed
In this block we determine the number of feed elements to print out, based on the number of elements encountered in the feed and what you've previously set $MAXENTRIES to in line 12. The format of the printing is done in line 150, where each entry is printed out in HTML format. This line can be modified accordingly to fit your requirements. All output is made to STDOUT.
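
The retrieval step from the walkthrough (lines 24-31 of the script) can also be pulled out into a small helper. What follows is only a sketch of the same alarm/eval pattern, not the script's own code: the name with_timeout is made up for illustration, and the commented usage assumes LWP::Simple's get() and the $url variable from the script.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code with a time limit of $seconds.
# Returns the code's result on success. On a timeout or any other die,
# returns undef and leaves the reason in $@.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result = eval {
        # A local handler keeps the alarm from affecting other code.
        local $SIG{ALRM} = sub { die "timed out after $seconds seconds\n" };
        alarm($seconds);
        my $value = $code->();
        alarm(0);
        $value;
    };
    alarm(0);    # make sure no alarm is left pending after a failure
    return $result;
}

# Quick demonstration with a stand-in for the real HTTP request.
my $data = with_timeout(5, sub { 'feed contents' });
print "Got: $data\n";

# Assumed usage in the feed script:
#   my $data = with_timeout(30, sub { get($url) });
#   die "Can't retrieve feed at $url\n" unless defined $data;
```

Checking the return value after the eval is the important part: a die inside an eval does not stop the script by itself, so the caller has to decide what to do when nothing came back.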

The strategy for parsing the Blogger feed is similar, which you can explore by examining the code itself (see the Summary section below).
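
Because the same start/char/end pattern serves both feeds, it may help to see it in a more defensive form. The sketch below is a variation under stated assumptions, not the code from either script: it parses a made-up inline Atom 0.3 sample instead of a downloaded feed, reads the <link> attributes by name rather than by position in @_, and collects character data until the closing tag, since XML::Parser may hand the Char handler a single piece of text in several calls.

```perl
use strict;
use warnings;
use XML::Parser;

# Inline sample standing in for a downloaded feed (entry values are made up).
my $xml = <<'XML';
<feed version="0.3" xmlns="http://purl.org/atom/ns#">
  <title>Sample feed</title>
  <entry>
    <title>Fish &amp; Chips</title>
    <link type="text/html" href="http://example.com/photos/1/" rel="alternate"/>
    <issued>2006-02-23T22:40:35Z</issued>
  </entry>
</feed>
XML

my @entries;      # one hash reference per <entry>
my $current;      # the entry being built; undef outside an <entry>
my $text = '';    # character data collected for the current element

sub start_handler {
    # Attributes arrive as name/value pairs, so they fit straight into a hash.
    my ($expat, $element, %attr) = @_;
    $text = '';
    if ($element eq 'entry') {
        $current = {};
    }
    elsif ($current && $element eq 'link'
           && ($attr{rel} || '') eq 'alternate') {
        $current->{link} = $attr{href};
    }
}

# Only append here; the text is used once the closing tag is seen.
sub char_handler {
    my ($expat, $chunk) = @_;
    $text .= $chunk;
}

sub end_handler {
    my ($expat, $element) = @_;
    return unless $current;
    if ($element eq 'title') {
        $current->{title} = $text;
    }
    elsif ($element eq 'issued') {
        $current->{date} = substr($text, 0, 10);    # keep YYYY-MM-DD
    }
    elsif ($element eq 'entry') {
        push @entries, $current;
        undef $current;
    }
}

my $parser = XML::Parser->new(ErrorContext => 2);
$parser->setHandlers(Start => \&start_handler,
                     End   => \&end_handler,
                     Char  => \&char_handler);
$parser->parse($xml);

for my $entry (@entries) {
    print "$entry->{title} | $entry->{link} | $entry->{date}\n";
}
```

Reading attributes through a hash means the handler keeps working if the feed ever lists href before rel, and appending in the Char handler keeps a title containing an entity such as &amp; in one piece.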

Using the Output of the Feeds
Now that you've parsed the feeds and have the output, you need to incorporate them into your webpage, email or whatever your delivery mechanism is for this information. As I noted earlier, the script above outputs to STDOUT, so you can easily "catch" the output into a file by using simple re-direction, e.g.:

perl rssflickr.pl > flckrfeed.txt

You can then incorporate the contents of this text file into your website by simply reading the contents of the file. In order to keep the feed up-to-date, however, you will need to run this script automatically, which can be done through a standard UNIX cron job. I have separate cronjob entries for both my Flickr and Blogger feeds, which run every hour, e.g.:

0 * * * * perl ~chrisjur/www/cgi-bin/rssblogspot.pl > ~chrisjur/www/cgi-bin/blogfeed.txt 
0 * * * * perl ~chrisjur/www/cgi-bin/rssflckr.pl > ~chrisjur/www/cgi-bin/flckrfeed.txt
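
One thing to keep in mind with plain redirection is that the shell empties the output file before the script even starts, so a failed or slow retrieval leaves your page with an empty feed until the next run. A possible way around this, sketched here as an assumption rather than as part of the scripts in this article, is to have the script write to a temporary file and rename it into place only when it has something to show. The name write_feed_file is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: write $content to $path by way of a temporary file
# in the same directory, then rename it into place. A rename within one
# filesystem swaps the file in a single step, so a page reading the feed
# file sees either the old contents or the new ones, never a partial file.
sub write_feed_file {
    my ($path, $content) = @_;
    my $tmp = "$path.tmp.$$";    # $$ is the process ID, to avoid collisions
    open(my $out, '>', $tmp) or die "Can't write $tmp: $!\n";
    print {$out} $content;
    close($out) or die "Can't close $tmp: $!\n";
    rename($tmp, $path) or die "Can't rename $tmp to $path: $!\n";
    return 1;
}

# Assumed usage at the end of the feed script, in place of printing to
# STDOUT and redirecting from cron ($html would hold the generated links):
#   write_feed_file('flckrfeed.txt', $html) if length $html;
```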

Since I have several pages that display the same feed information, I like to keep the interface to these feed files consistent. I do this by using a simple 'include' Perl routine, which returns the contents of the feed based on a feed "keyword" that is passed to it. This script is the "roadmap" to all my feed files and how I access them. A simple example of this type of file follows:

#!/usr/bin/perl

# dumpfeed.pl - returns HTML code pulled from CJU.com RSS/Atom feed grabbers.
# use:
#
# require 'dumpfeed.pl';
# $blog = dumpfeed('blog');
# print $blog;

sub dumpfeed {

        # Configuration hash uses key => value, where key is a keyword passed
        # to the routine, specifying what feed you want, and value is the file
        # containing the feed contents
        my %feeds = (
                'blog'  => 'blogfeed.txt',
                'flckr' => 'flckrfeed.txt'
                );

        my $requestfeed = shift();
        my $feedfile = $feeds{$requestfeed};

        # An unknown keyword has no file to open
        return "Error:  No feed is configured for token $requestfeed."
                unless defined $feedfile;

        # Open the feed file; report an error rather than read a failed handle
        open (my $fh, '<', $feedfile)
                or return "Error:  Could not open feed with token $requestfeed using source $feedfile.";

        # Read the whole file into $text
        my $text = '';
        while (<$fh>)
                {
                $text .= $_;
                }
        close $fh;

        # Send it back.
        return $text;
}

1;

You can call the subroutine as many times as you want from within your .cgi script or whatever is driving the display of your feed content. These 3 lines assign the contents of the blog feed to the $blog variable, which can be printed out at any point in your .cgi:

require 'dumpfeed.pl';
$blog = dumpfeed('blog');
print $blog;

Summary Information and Files

Required Perl Modules:

XML::Parser - available at http://search.cpan.org/~msergeant/XML-Parser-2.34/Parser.pm
LWP - libwwwperl - available at http://search.cpan.org/~gaas/libwww-perl-5.805/lib/LWP.pm

Files from Examples Above:

rssflickr.pl - Parses Flickr Atom feeds and prints summary output to STDOUT. Modify the $url variable to specify your Flickr feed URL. Modify the $MAXENTRIES variable to specify how many entries you want to print.
rssblogspot.pl - Parses Blogger Atom feeds and prints summary output to STDOUT. Modify the $url variable to specify your Blogger feed URL. Modify the $MAXENTRIES variable to specify how many entries you want to print.
dumpfeed.pl - include subroutine used to access various local feeds that you wish to incorporate into your output.