Scraping Well-Constructed Web Domains (Like Ficlets)

I've studied Les Orchard's "Ficlets enhanced author feed" hack in some detail (the XSL part of it, anyway). Because Ficlets.com has by nature a "linked-list" type of structure (you can write prequels and sequels to stories), the capabilities of XSL for "walking" node lists make it an ideal tool for gathering content (i.e., "scraping") from the site.

The Ficlets Atom Feed

Each author's feed is available at ficlets.com under feeds/author/AuthorName. For example, my rather small feed is at http://ficlets.com/feeds/author/diyincite, and author l_m_orchard's more substantial feed is at http://ficlets.com/feeds/author/l_m_orchard. If you look at one of these feeds, you see the following information structure for each of the author's ficlets:

  • entry
    • title
    • content
    • id
    • published
    • updated
    • author
      • name
      • uri

Extracting information from a properly constructed feed shouldn't be a difficult problem. The feed has a defined format and should be easily processed by a valid XSL transform.
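To make the entry structure concrete, here is a minimal sketch in Python (standard library only) rather than XSL. The sample feed, its URLs, and the helper name read_entries are invented for illustration, and the sketch assumes the feed uses the standard Atom 1.0 namespace; the real Ficlets feed may differ in detail.

```python
import xml.etree.ElementTree as ET

# Assumption: the feed declares the standard Atom 1.0 namespace.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Invented sample data that mirrors the entry structure listed above.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An Example Ficlet</title>
    <content type="html">Once upon a time...</content>
    <id>http://example.org/stories/1</id>
    <published>2007-01-01T00:00:00Z</published>
    <updated>2007-01-02T00:00:00Z</updated>
    <author>
      <name>example_author</name>
      <uri>http://example.org/authors/example_author</uri>
    </author>
  </entry>
</feed>"""


def read_entries(feed_xml):
    """Return one dict per <entry>, holding a few of its child elements."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM),
            "id": entry.findtext("atom:id", default="", namespaces=ATOM),
            "published": entry.findtext("atom:published", default="", namespaces=ATOM),
            # Nested path: the author's name lives inside <author>.
            "author": entry.findtext("atom:author/atom:name", default="", namespaces=ATOM),
        })
    return entries


if __name__ == "__main__":
    for item in read_entries(SAMPLE_FEED):
        print(item["published"], item["author"], item["title"])
```

Because every entry follows the same shape, the extraction code is little more than a list of element names.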

DECAFBAD RSS 2.0 Feed

But when you look at what comes out of the transformation process at DECAFBAD, for example l_m_orchard's enhanced feed, you see something quite different. First, the individual entries are named "item" rather than "entry," because the enhanced feed is RSS 2.0 rather than Atom. Each item records a ficlet's basic information (title, date, link, and the ficlet itself).

But beyond this, the enhanced feed has detailed information about each ficlet that is not available in the base feed provided by ficlets.com. For example, some items are comments that readers left on the ficlet.

If you look more closely, even the <title> elements are different in the DECAFBAD feed: there are labels that identify what kind of item you're looking at (story, comment, prequel, or sequel). If the item is a story that has been rated, its average star rating is displayed in parentheses. If the item is a comment and the commenter rated the story, the commenter's star rating is displayed in the feed.

Where Is All This Information Coming From?

This is really the key question, and the answer explains why this type of transformation and enhancement of the base Ficlets feed is possible. If you look at the ficlets.xsl file, you'll see this line:

<xsl:variable name="page"         select="document($url)" />

This line loads an individual ficlet's page into the XSL variable named "page." For example, when constructing my enhanced feed, the entire web page at http://ficlets.com/stories/202 was read into the variable page.
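For readers more at home outside XSL, a rough Python analogue of loading a page into a variable might look like the sketch below. The function name load_page is made up for illustration, and the sketch assumes the page is well-formed XML; a page that relies on HTML named entities or unclosed tags would make the parser raise an error.

```python
import urllib.request
import xml.etree.ElementTree as ET


def load_page(url):
    """Fetch a URL and parse the response as XML.

    Rough equivalent of binding document($url) to a variable in XSL:
    the whole page becomes a tree that later code can query.
    """
    with urllib.request.urlopen(url) as response:
        # ET.parse accepts any file-like object with a read() method.
        return ET.parse(response).getroot()
```

A call such as page = load_page(story_url) would then play the role of the "page" variable, with story_url standing in for the address of one story page.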

So? Why's That So Great?

So, the point is: Ficlets.com was designed with a structure that makes this possible, and that lets the document stored in the XSL variable "page" be further processed using the instructions in ficlets.xsl as though it were an XML document.

Load a ficlets story page into your browser and view the page source. Then look at ficlets.xsl. You'll see lines like:

<xsl:for-each select="$page//html:div[@id='comments']/html:ol/html:li">

Here, the XSL transform is "walking" the story page, selecting all the <li> items in the ordered list <ol> in the <div> element that has an ID of "comments." In other words, the XSL code that follows processes each comment for this particular ficlet.
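The same walk can be sketched in Python, whose standard-library XML module understands a small subset of XPath that happens to cover this expression. The sample page below is invented and far simpler than a real story page; only the div/ol/li shape with an ID of "comments" is taken from the select expression above, and the XHTML namespace binding for the html prefix is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumption: the "html" prefix is bound to the XHTML namespace.
NS = {"html": "http://www.w3.org/1999/xhtml"}

# Invented stand-in for a story page.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="story"><p>The ficlet text.</p></div>
    <div id="comments">
      <ol>
        <li><p>Great opening line.</p></li>
        <li><p>Please write a sequel.</p></li>
      </ol>
    </div>
  </body>
</html>"""


def comment_texts(page):
    """Return the text of each <li> in the comments list, whitespace-normalized."""
    items = page.findall(".//html:div[@id='comments']/html:ol/html:li", NS)
    return [" ".join("".join(li.itertext()).split()) for li in items]


if __name__ == "__main__":
    page = ET.fromstring(SAMPLE_PAGE)
    for text in comment_texts(page):
        print(text)
```

The path string is nearly identical to the one in the XSL; only the leading $page variable is replaced by the parsed tree the method is called on.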

Well-Structured Web Pages Are APIs

"What's so remarkable about that?" you ask.

What's remarkable is that it's possible! What's remarkable is that this particular Web page can be treated like a reliably structured XML document -- because it is one -- and hence it can be processed using a technology like XSL transformations. Ficlets story pages are a different kind of Web page.

Convinced? Do you see why we're excited about this? Is there a Ficlets API? A Ficlets web service? No.

Yet the very construction of the pages creates an API that can be queried and read, "scraped" if you will, but scraped with a difference. For example, the Ficlets.com story page format could be enhanced in many different ways without breaking an XSL-based scraper. New elements could be added to the page and have no effect at all on the API that the page represents from the point of view of the XSL.
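A small Python sketch can illustrate that robustness claim. Both pages below are invented: the second imagines a redesign that adds a banner, a wrapper, a heading, and a sidebar around the same comments list. The path expression is the comments query from ficlets.xsl with the html prefix assumed to be bound to the XHTML namespace.

```python
import xml.etree.ElementTree as ET

NS = {"html": "http://www.w3.org/1999/xhtml"}
COMMENTS = ".//html:div[@id='comments']/html:ol/html:li"

# Invented page before a hypothetical redesign.
ORIGINAL = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div id="comments"><ol><li>First comment</li><li>Second comment</li></ol></div>
</body></html>"""

# Invented page after the redesign: new elements surround the same list.
ENHANCED = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div id="banner"><p>New site banner</p></div>
<div class="wrapper">
  <div id="comments" class="threaded">
    <h2>Comments</h2>
    <ol><li>First comment</li><li>Second comment</li></ol>
  </div>
</div>
<div id="sidebar"><ol><li>Unrelated link</li></ol></div>
</body></html>"""


def scrape_comments(page_xml):
    """Run the same comments query against any version of the page."""
    root = ET.fromstring(page_xml)
    return [li.text for li in root.findall(COMMENTS, NS)]


if __name__ == "__main__":
    # The query is anchored on the id, so the added elements are ignored.
    print(scrape_comments(ORIGINAL) == scrape_comments(ENHANCED))
```

The guarantee only goes so far: renaming the "comments" ID or switching the list from ol to ul would change the part of the page the query depends on, and the scraper would need to follow.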

If the Web becomes an API that can be accessed and selectively processed using technologies like XSLT, and those APIs are also tagged for meaning (using, for example, microformats), then the semantic web we've heard about and dreamed about for so long begins to come into actual being!

To me, that is indeed exciting.

-- Kevin Farnham
O'Reilly Media

convergence towards structure

It's definitely exciting to see a convergence towards structured and meaningful data on the web. I talk about a similar idea in my most recent post, but applied to building standards-friendly HTML pages.

In the long run, all of this means a bigger, better, and richer web experience for all of us. Exciting indeed!