
pullparser

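pullparser provides a simple "pull API" for HTML parsing, modeled on Perl's HTML::TokeParser. Many straightforward HTML parsing tasks are easier to write this way than with the HTMLParser module, because the calling code asks for the next token when it wants one instead of implementing handler callbacks. pullparser.PullParser is a subclass of HTMLParser.HTMLParser.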
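
To make the pull idea concrete, here is a minimal sketch of how a pull interface can be layered on top of the callback-based HTMLParser class. It is an illustration only and not pullparser's actual implementation: the class name TinyPullParser, the get_token method, the chunk_size parameter, and the (type, value, attrs) tuple layout are all hypothetical. The import is written so the sketch runs on both Python 2 and Python 3.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2
import io


class TinyPullParser(HTMLParser):
    """Hypothetical illustration; NOT the real pullparser.PullParser."""

    def __init__(self, fh, chunk_size=1024):
        HTMLParser.__init__(self)
        self._fh = fh
        self._chunk_size = chunk_size
        self._tokens = []    # tokens parsed but not yet handed to the caller
        self._eof = False

    # The usual HTMLParser callbacks only queue what they see.
    def handle_starttag(self, tag, attrs):
        self._tokens.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self._tokens.append(("endtag", tag, None))

    def handle_data(self, data):
        # Text may arrive split across several data tokens.
        self._tokens.append(("data", data, None))

    def get_token(self):
        """Return the next (type, value, attrs) tuple, or None at end of input."""
        # Feed more input only when the queue is empty: the caller "pulls".
        while not self._tokens and not self._eof:
            chunk = self._fh.read(self._chunk_size)
            if chunk:
                self.feed(chunk)
            else:
                self._eof = True
                self.close()   # flush anything the parser still buffers
        if self._tokens:
            return self._tokens.pop(0)
        return None


parser = TinyPullParser(io.StringIO(u"<p>Hello <b>world</b></p>"))
tokens = []
tok = parser.get_token()
while tok is not None:
    tokens.append(tok)
    tok = parser.get_token()
```

The design choice worth noting is that the callbacks do no real work. They buffer tokens, and the caller decides when to consume them, which is what lets loops like the ones in the examples below read top to bottom.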

Examples:

This program extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <a>...</a> tags:

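import pullparser, sys

f = open(sys.argv[1])  # HTML file named on the command line
p = pullparser.PullParser(f)
# Iterate over <a> start and end tags only, skipping everything else.
for token in p.tags("a"):
    if token.type == "endtag": continue
    # token.attrs is a sequence of (name, value) pairs; "-" marks a missing href.
    url = dict(token.attrs).get("href", "-")
    # Collect the link text up to the closing </a>.
    text = p.get_compressed_text(endat=("endtag", "a"))
    print "%s\t%s" % (url, text)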
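
For comparison, the following sketch does roughly the same link extraction directly against the callback interface of the standard HTMLParser class, without pullparser. The class name LinkCollector and its attributes are hypothetical, the sample HTML is made up, and the whitespace collapsing is only an assumption about what "compressed" text means. Note how the state that the pull version keeps in local variables has to live on the parser object here.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class LinkCollector(HTMLParser):
    """Collects (url, text) pairs for each <a>...</a> element (illustrative)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href", "-")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Assumption: "compressed" text means runs of whitespace collapsed.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


collector = LinkCollector()
collector.feed('<p><a href="http://example.com/">Example\n  site</a>'
               ' and <a>no href</a></p>')
collector.close()
for url, text in collector.links:
    print("%s\t%s" % (url, text))
```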
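
This program extracts the <title> from the document:

import pullparser, sys

f = open(sys.argv[1])  # HTML file named on the command line
p = pullparser.PullParser(f)
# Advance to the <title> tag; the body is skipped if no such tag is found.
if p.get_tag("title"):
    # The text that follows the <title> tag is the title itself.
    title = p.get_compressed_text()
    print "Title: %s" % title

Thanks to Gisle Aas, who wrote HTML::TokeParser.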

Download

All documentation (including this web page) is included in the distribution.

Development release.

For installation instructions, see the INSTALL file included in the distribution.

FAQs

John J. Lee, May 2004.