[This file contains a description of the file loading module. See hacking.txt or hacking.html for an overview of the manual.]
Every open input resource has a descriptor associated with it. The descriptor is created when the resource is opened with init_load().
The struct for this descriptor is declared in load.h; it holds, among other things, the input buffer, the current read positions inside it, the resource type, and the destination URL.
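A rough sketch of what it might look like (only the names used in this text are certain; the struct name, exact types, and the stream/socket members are guesses -- load.h has the real declaration):

   /* Rough sketch only -- see load.h for the real declaration. */
   struct Resource {
      int        type;      /* RES_FILE, RES_HTTP, RES_PIPE, RES_STDIN, RES_FAIL */
      struct Url *url;      /* merged destination URL */
      FILE       *input;    /* input stream: file, pipe, or reopened stdin */
      int        sock;      /* socket fd when the builtin HTTP code is used */
      char       *buf;      /* data block of size BUF_SIZE */
      char       *buf_ptr;  /* current read position inside buf[] */
      char       *buf_end;  /* end of the valid data inside buf[] */
   };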
init_load() takes two arguments: a string containing the target URL to be loaded, and a "struct Url" containing a base URL. (Normally the URL of the file visible in the pager up to now.) Besides creating the descriptor, it also allocates the "buf[]" and prepares the resource for reading data.
Calling this function with only a base URL and no target means that a page is to be reloaded from history.
For all other target URLs, first the effective URL is determined. This is done by merge_urls(), which first splits the URL up into its components using split_url(), and then merges it with the base URL components. A pointer to the resulting merged URL is stored in "res->url".
If merging the URLs failed for some reason, the resource type is set to "RES_FAIL", and init_load() returns normally. Thus, there is no special error handling necessary in the calling function; all necessary handling is done in load().
If the URL is "-", meaning load from stdin, "res->url->proto.type" is set to PT_INTERNAL (meaning no relative links are allowed, and the page is not to be kept in history).
No URL merging is done if a page from history is reloaded. The split URL passed as the base is already the destination URL.
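In outline, the URL handling at the start of init_load() thus behaves roughly like this (a condensed sketch: the struct name, argument order, and most details are assumptions, and all error checks are left out):

   /* Sketch only -- the real init_load() differs in detail. */
   struct Resource *init_load(char *url_str, struct Url *base)
   {
      struct Resource *res = malloc(sizeof(*res));
      res->buf = malloc(BUF_SIZE);

      if (url_str == NULL) {                    /* reload from history */
         res->url = base;                       /* base already is the destination */
      } else {
         res->url = merge_urls(url_str, base);  /* argument order assumed */

         if (strcmp(url_str, "-") == 0)         /* "-": load from stdin */
            res->url->proto.type = PT_INTERNAL; /* no history, no relative links */
         else if (res->url->proto.type == PT_INTERNAL) {  /* merging failed */
            res->type = RES_FAIL;               /* load() turns this into an error page */
            return res;
         }
      }
      /* ... open the resource according to res->url->proto.type,
         setting res->type to RES_FILE, RES_HTTP, RES_PIPE, or RES_STDIN ... */
      return res;
   }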
With the destination URL at hand, the resource is opened appropriately, depending on the protocol type set in the URL.
If it is an HTTP URL, by default http_init_load() is used to open a connection to the server and read the HTTP headers. "res->type" is set to "RES_HTTP".
Alternatively, "cfg.wget" may be set (using --no-builtin-http), in which case "wget" is used instead of the builtin HTTP code. This is done by init_wget(), which starts wget with the correct filename, and initiates reading its standard output through a pipe. The popen() function returns our pipe end as a stream; we use this as the input. "res->type" is set to "RES_PIPE".
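Roughly (a sketch; the real init_wget() and the exact wget options may differ):

   /* Sketch: start wget and read its standard output through a pipe. */
   static void init_wget(struct Resource *res)
   {
      char cmd[1024];

      /* "-O -" makes wget write the document to its stdout, "-q" keeps its
         progress messages from ending up in our pipe */
      snprintf(cmd, sizeof(cmd), "wget -q -O - '%s'", res->url->full_url);

      res->input = popen(cmd, "r");   /* our end of the pipe, as a stream */
      if (res->input == NULL) {
         res->type = RES_FAIL;
         return;
      }
      res->type = RES_PIPE;           /* remember: this must be closed with pclose() */
   }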
FTP URLs are always fetched using wget.
Local files are simply opened as the input stream, and "res->type" is set to "RES_FILE".
Internal URLs presently always mean loading from stdin. (Error pages also use "PT_INTERNAL", but init_load() returns immediately after setting it, so this needn't be handled in the switch.) The standard input is reopened as a stream, and this one is used as the input. As we need some way to read user commands from the terminal, "stderr" is reopened in place of "stdin". (Some programs reopen "/dev/tty" instead -- no idea which one is better...) This way the normal "stdin" descriptor points to a terminal again, while the input file is read from the pipe that is (hopefully) connected to the original standard input. "res->type" is set to "RES_STDIN".
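A sketch of the stream shuffling (the real code may do it slightly differently, e.g. with freopen()):

   /* Sketch: keep the original stdin as the input stream, then point fd 0 at
      the terminal again (taken from stderr) so user commands can still be read. */
   res->input = fdopen(dup(STDIN_FILENO), "r");   /* the (piped) document */
   res->type  = RES_STDIN;
   dup2(STDERR_FILENO, STDIN_FILENO);             /* stdin is a terminal again */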
Things get a bit more complicated if there was no protocol specification in the URL, and no base was supplied (meaning the URL has to be treated as an absolute URL in any case). In this case we have to guess whether it is a local file or an HTTP URL.
First we try to open a local file with the path returned by merge_urls(). (Should be identical to the given URL.) If this succeeds, we set the protocol to "PT_FILE" and the resource type to "RES_FILE", and that's it.
If opening the local file fails, and the URL string doesn't start with '/', the URL is assumed to be an HTTP URL. As these URLs are split differently, we have to prepend "http://" to the URL string and call merge_urls() again. Afterwards, we proceed just as with any other HTTP URL.
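Summarised as a sketch (not the literal code; the buffer size and helper usage are assumptions):

   /* Sketch: no protocol and no base -> guess between local file and HTTP. */
   res->input = fopen(res->url->path, "r");
   if (res->input != NULL) {                     /* it is a local file */
      res->url->proto.type = PT_FILE;
      res->type = RES_FILE;
   } else if (url_str[0] != '/') {               /* retry as an HTTP URL */
      char http_url[1024];
      snprintf(http_url, sizeof(http_url), "http://%s", url_str);
      res->url = merge_urls(http_url, NULL);     /* argument order assumed */
      /* ... from here on, proceed as for any other HTTP URL ... */
   }
   /* otherwise: give up -- see the error handling below */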
If opening a file fails, init_load() returns immediately, only setting "res->type" to "RES_FAIL" and "res->url->proto" to "PT_INTERNAL". "RES_FAIL" is then handled appropriately in load(); this way, the caller needn't bother about it. (Apart from printing an additional error message and setting an error code in parse_syntax().) "PT_INTERNAL" means that the page isn't to be kept in history, that relative URLs can't be followed from it, etc. Thus, the link/page history handling functions also do not need special handling for the error pages.
HTTP loading errors inside http_init_load() are handled the same way.
Data is read by calling load() with the resource descriptor returned by init_load().
Every call to load() reads one data block (of size BUF_SIZE) into a buffer. The parsing function (parse_syntax() or parse_header()) then processes the data, keeping track of the current read position inside the buffer by "res->buf_ptr"; when it reaches the end of the data block, load() is called again to read the next block.
If "res->type" is "RES_HTTP", read() is used to read a data block from the socket; otherwise, fread() is used to read data from the input stream. (It doesn't matter if this stream is a normal file (RES_FILE), stdin (RES_STDIN), or a pipe (RES_PIPE).)
"RES_FAIL" means that opening the input resource failed for some reason, or an error emerged in a previous load() call. In this case, no data is read; an empty buffer is returned, which normally would mean EOF. This causes parse_syntax() to generate an empty page, or stops parsing at this point if same data already has been read before the error occured. For the latter case, some (little) additional handling is necessary in parse_syntax(), to ensure that an appropriate error message is printed and an error code returned to main(). (Causing a keywait before starting the pager.)
The data block is stored in the "buf[]" referenced by the descriptor, and "buf_end" is set to the end of the data inside the buffer; "buf_ptr" is set to the beginning of the buffer.
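Condensed into a sketch (field names as in the descriptor sketch above; the return value convention is a guess):

   /* Sketch of load(): read the next data block into buf[]. */
   int load(struct Resource *res)
   {
      ssize_t len = 0;

      switch (res->type) {
      case RES_FAIL:         /* failed resource: return an empty buffer (EOF) */
         break;
      case RES_HTTP:         /* builtin HTTP: read directly from the socket */
         len = read(res->sock, res->buf, BUF_SIZE);
         if (len < 0)
            len = 0;         /* treat read errors as EOF here */
         break;
      default:               /* RES_FILE, RES_STDIN, RES_PIPE: use stdio */
         len = fread(res->buf, 1, BUF_SIZE, res->input);
         break;
      }

      res->buf_ptr = res->buf;         /* parsing starts at the beginning... */
      res->buf_end = res->buf + len;   /* ...and ends where the data ends */
      return len > 0;
   }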
uninit_load() is used to close the input stream, and free the memory used to read the file.
If the input was a pipe created by popen() (RES_PIPE), it needs to be closed with pclose() instead of fclose(). pclose() also returns wget's exit status, which is needed to decide whether the load via wget was successful.
After closing the stream, the input buffer and the "res" struct are freed.
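As a sketch (again with assumed names; the real function may report wget's exit status differently):

   /* Sketch of uninit_load(): close the input and free the descriptor. */
   int uninit_load(struct Resource *res)
   {
      int status = 0;

      if (res->type == RES_PIPE)
         status = pclose(res->input);   /* also yields wget's exit status */
      else if (res->input != NULL)
         fclose(res->input);

      free(res->buf);
      free(res);
      return status;
   }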
This file contains a couple of functions for handling URLs, which are used chiefly by the file loading module.
To allow operating on the URLs and loading the addressed files, the URL string given by the user or a link needs to be split up into its components. split_url() parses the URL string, and returns the components via a pointer to a newly allocated "struct Url".
The Url struct is used for all following processing steps; it holds the individual URL components, the reassembled full URL, and a couple of flags.
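A rough sketch (component names as used throughout this text; exact types and the order of the fields are guesses):

   /* Sketch of "struct Url"; the real declaration may differ. */
   struct Url {
      struct {
         int  type;        /* protocol type: PT_FILE, PT_INTERNAL, ... */
         char *str;        /* protocol string as given, e.g. "http" */
      } proto;
      char *host;          /* server name */
      char *port;          /* port number, if given */
      char *dir;           /* directory part of the path, e.g. "/dir1/dir2/" */
      char *name;          /* file name part, e.g. "name.ext" */
      char *params;        /* everything after the '?' */
      char *fragment;      /* everything after the '#' (local anchor) */
      char *full_url;      /* complete URL; only filled by merge_urls() */
      char *path;          /* pointer into full_url, starting at the dir part */
      int  absolute;       /* set by merge_urls(): URL was given as absolute */
      int  local;          /* set by merge_urls(): only a fragment, same page */
   };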
The parser is very similar to the HTML parser in parse_syntax.c: The URL string is processed char by char in a loop. In every iteration, one char is examined, and action is taken (in a switch statement) depending on what character it is, and in what mode the parser currently is.
There is a parsing mode for every URL component. (There are also some additional ones for constructs like the "://" after the protocol specification.) Every time we encounter a special character separating different components, the mode is switched to the one fitting the next component, and its beginning position (normally the char after the one causing the mode change) is stored in "word_start". Everything between the previous "word_start" and the current char is stored to the respective split URL field of the component parsed up to now, using store_component().
url: "http://domain:80/dir1/dir2/name.ext?params#fragment" ^ ^url_char word_start parse_mode: PM_HOST
url: "http://domain:80/dir1/dir2/name.ext?params#fragment" url_char^^ word_start parse_mode: PM_PORT components->host: "domain"
If the separating char is not the one that would introduce the component normally following now, that means that component is missing, and we immediately have to proceed with the next one.
url: "http://domain/dir1/dir2/name.ext#fragment" url_char^^ word_start parse_mode: PM_PARAMS
This is done by setting "recycle", meaning that the current character is to be parsed again. In this new iteration the parser will see the separating char again, thus introducing a second mode change. store_component() will store a NULL string for that component, and the parser will go on with parsing the next one.
url: "http://domain/dir1/dir2/name.ext#fragment" url_char^^ word_start parse_mode: PM_FRAG components->params: NULL
The first mode change is a bit more tricky: At the beginning of the URL, we do not know if it is a fully qualified one (starting with a protocol specification), or a relative URL without a protocol. At first we assume that it starts with the protocol. If a ":" follows the first word, our guess was right, and we proceed normally with the host.
url: "http://domain:80/dir1/dir2/name.ext?params#fragment" url_char^^ word_start parse_mode: PM_PROTO_END1 components->proto.str: "http"
If any other separating char occurs instead, we have to skip protocol, host, and port, and switch immediately to path parsing. We also do a "recycle" then, as the current char needs to be parsed in "PM_PATH" mode.
url: "dir1/dir2/name.ext?params#fragment" ^url_char ^word_start parse_mode: PM_PATH
Path parsing is also a bit more complicated, as the directory name and the file name are stored separately. For that purpose, while parsing the path we keep track of where the last '/' was (in "name_start"); everything before it (inclusive) belongs to the directory, and what follows it is the file name.
url: "http://domain:80/dir1/dir2/name.ext?params#fragment" ^ ^ ^url_char word_start name_start |<- dir ->| |<name>| parse_mode: PM_PARA components->dir: "/dir1/dir2/" components->name: "name.ext"
"full_url" and "path" aren't filled in split_url(), as this URL won't be used directly; they are only necessary for the final URL created in merge_urls(). The "absolute" and "local" flags are also set only in merge_urls().
If an error occurs during URL parsing (either the protocol specification contains an unknown protocol type, or an unexpected character is encountered), split_url() sets "proto.type" to "PT_INTERNAL" and immediately returns. As "PT_INTERNAL" normally can't be generated in split_url(), this is how the caller recognizes the error. Misusing "PT_INTERNAL" for this is surely a bit confusing; it has a big advantage, though: The created (empty) pages are correctly handled as temporary by all the page history functions, without needing any exception handling.
When loading a relative URL using the ":e" command, or when following a link, the absolute URL of the target needs to be determined by combining the URL of the current page with the given relative URL; and for both relative and absolute URLs, fields not given in the URL(s) need to be set to default values. All that is done by merge_urls().
This function takes a base URL supplied as a "Url" struct (already split up), and a main URL given as a string, which is split up from within merge_urls(); it returns a pointer to a newly allocated "Url" struct. A NULL base means the main URL is to be treated as an absolute one.
Apart from a few exceptions, merging is done component by component. If a component is present in the main URL, it is taken from there; if it's not, it is either taken from "base_url", or a default value is used if no "base_url" is given. Once the first component has been taken from "main_url", "base_url" is no longer used; all following components have to be specified, or default values are taken. This is achieved by setting "base_url" to NULL.
The "port" component has no test on its own -- it is always taken from where the "host" is taken.
The handling of "dir" is a bit more complicated: If "main_url" contains a relative "dir" (not starting with '/'), and a "base_url" is given, the new "dir" has to be created by joining both. (Concatenating the one in "main_url" to the one in "base_url".) There is also an exeption about the default value if nothing was supplied: For local (or unknown) URLs there is no default dir (the current directory is used), while HTTP and other use the root dir ('/') as default.
After merging the URLs, the "full_url" and "path" components have to be set. The "full_url" is created by concatenating all components (and separators). (Concatenating is done by the str_append() function created for that purpose. Maybe we should try to use asprintf() or something instead?) "path" is simply a pointer to the start of the "dir" (and/or following) component inside the "full_url" string.
The "absolute" flag is determined after merging the protocol specification. If "base_url" is NULL at this point, we know we have an absolute URL: Either it was NULL from the beginning (meaning absolute in any case), or it was reset when merging the protocol, because "main_url" contains a protocol specification -- also meaning an absolute URL.
Setting the "local" flag is very similar: If, after all components except the fragment identifier have been merged, "base_url" is still not NULL, we know that "main_url" didn't contain any components up to now; it *can* only consist of a fragment identifier, meaning it references a local anchor.
If an error occurred during URL splitting, merge_urls() only prints an additional error message, and proceeds normally. The result is that the "PT_INTERNAL" indicating the error is stored in the merged URL, and can be handled by the caller, just as for split_url() itself.
This function destroys a split URL structure by freeing the memory used by all the component strings, and afterwards the struct itself.
The functions for handling HTTP resources are a bit more complicated, and reside in a source file on their own.
http_init_load() is called from init_load(). It prepares the handle, and opens the HTTP connection. This includes creating a socket, connecting to the desired server, creating and sending an HTTP request for the desired page, and reading/parsing the HTTP header of the file returned by the server.
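In outline, that is the usual socket sequence (a sketch with modern resolver calls and without any error handling; the actual code is more involved):

   /* Sketch of the connection setup in http_init_load(); details assumed. */
   struct addrinfo hints = { 0 }, *addr;
   char request[1024];

   hints.ai_socktype = SOCK_STREAM;
   getaddrinfo(res->url->host, res->url->port ? res->url->port : "80",
               &hints, &addr);

   res->sock = socket(addr->ai_family, addr->ai_socktype, addr->ai_protocol);
   connect(res->sock, addr->ai_addr, addr->ai_addrlen);
   freeaddrinfo(addr);

   snprintf(request, sizeof(request),
            "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n",
            res->url->path, res->url->host);
   write(res->sock, request, strlen(request));

   /* ... then read the reply headers with parse_header() (status line,
      Location:, Content-Type:, ...) before load() takes over the body ... */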
Errors in HTTP loading are handled by setting "RES_FAIL" and "PT_INTERNAL", just like file loading errors are handled in init_load().