A simple "pull API" for HTML parsing, after Perl's HTML::TokeParser. Many simple HTML parsing tasks are simpler this way than with the HTMLParser module.
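For comparison, here is a rough sketch of what callback-style ("push") link extraction can look like. It is written against Python 3's html.parser, the successor of the HTMLParser module, so that it runs on a current interpreter. The names LinkCollector and extract_links are made up for this illustration and are not part of pullparser or the standard library.

```python
from html.parser import HTMLParser  # Python 3 successor of the HTMLParser module


class LinkCollector(HTMLParser):
    """Hypothetical callback-style link extractor, for comparison only."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (url, text) pairs
        self._in_link = False  # True while inside an <a>...</a> element
        self._href = "-"      # href of the <a> currently open
        self._chunks = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href", "-")
            self._chunks = []

    def handle_data(self, data):
        if self._in_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._in_link = False


def extract_links(html):
    """Return a list of (url, text) pairs for the <a> elements in html."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

The parser state (whether a link is open, which text belongs to it) has to be carried across three separate callbacks. A pull API lets the calling code keep that state in ordinary local variables instead.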
pullparser.PullParser is a subclass of HTMLParser.HTMLParser.
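To illustrate how a pull interface can be layered on a callback-based parser through subclassing, here is a minimal sketch. It is not pullparser's implementation: it uses Python 3's html.parser, parses the whole input eagerly, and the names TinyPullParser, Token, get_token and get_text are assumptions made for this example. Only get_tag and the type and attrs token attributes mirror names used in the examples on this page.

```python
from collections import deque, namedtuple
from html.parser import HTMLParser

# Hypothetical token record; the real library's token class may differ.
Token = namedtuple("Token", "type data attrs")


class TinyPullParser(HTMLParser):
    """Illustrative only: queues parser events so the caller can pull them."""

    def __init__(self, html):
        super().__init__()
        self._queue = deque()
        self.feed(html)  # parse eagerly; a real implementation could read lazily
        self.close()

    # HTMLParser callbacks: record each event instead of acting on it.
    def handle_starttag(self, tag, attrs):
        self._queue.append(Token("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self._queue.append(Token("endtag", tag, None))

    def handle_data(self, data):
        self._queue.append(Token("data", data, None))

    def get_token(self):
        """Return the next token, or None when the input is exhausted."""
        return self._queue.popleft() if self._queue else None

    def get_tag(self, *names):
        """Skip ahead to the next start or end tag whose name is in names."""
        while True:
            token = self.get_token()
            if token is None:
                return None
            is_tag = token.type in ("starttag", "endtag")
            if is_tag and (not names or token.data in names):
                return token

    def get_text(self):
        """Collect text up to, but not including, the next tag."""
        parts = []
        while self._queue and self._queue[0].type == "data":
            parts.append(self._queue.popleft().data)
        return "".join(parts)
```

With this sketch, TinyPullParser(html).get_tag("title") followed by get_text() reads a document title in two calls, which is the style of use the examples on this page show.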
Examples:
This program extracts all links from a document. It prints one line for each link, containing the URL and the textual description between the <a>...</a> tags:
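
    import pullparser, sys

    f = file(sys.argv[1])
    p = pullparser.PullParser(f)
    for token in p.tags("a"):
        # Only opening <a> tags are of interest; skip the closing ones.
        if token.type == "endtag": continue
        # Use "-" as the URL when the tag has no href attribute.
        url = dict(token.attrs).get("href", "-")
        # Read the link text up to the closing </a> tag.
        text = p.get_compressed_text(endat=("endtag", "a"))
        print "%s\t%s" % (url, text)

This program extracts the <title> from the document: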

    import pullparser, sys

    f = file(sys.argv[1])
    p = pullparser.PullParser(f)
    if p.get_tag("title"):
        title = p.get_compressed_text()
        print "Title: %s" % title

Thanks to Gisle Aas, who wrote HTML::TokeParser.
All documentation (including this web page) is included in the distribution.
Development release.
For installation instructions, see the INSTALL file included in the distribution.
Python 2.2 or above is required.
pullparser is distributed under the Perl Artistic license (included in the distribution). This may change to the BSD license (more liberal) at some point.
John J. Lee, May 2004.