Sphinx-II User Guide

CMU Sphinx Group

Original by Ravishankar Mosur (Ravi) (rkm@cs.cmu.edu)
Maintained by Ravi and Kevin A. Lenzo (lenzo@cs.cmu.edu)

School of Computer Science
Carnegie Mellon University
Copyright (c) 1997-2005 Carnegie Mellon University.


This document is not complete, but it should be helpful even while under construction.
Last updated 2005-01-26.






Introduction

Sphinx2 is a decoding engine for the Sphinx-II speech recognition system developed at Carnegie Mellon University. It can be used to build small, medium, or large vocabulary applications. Its main features are outlined in the remainder of this introduction.

Sphinx2 consists of a set of libraries that include core speech recognition functions as well as auxiliary ones such as low-level audio capture. The libraries are written in C and have been compiled on several Unix platforms (Linux, DEC Alpha, Sun SPARC, HP) and on Pentium/PentiumPro PCs running Windows XP, Windows NT, or Windows 95. A number of demo applications based on this recognition engine are also provided.

Several features specifically intended for developing real applications have been included in Sphinx2. For example, many aspects of the decoder can be reconfigured at run time. New language models can be loaded or switched dynamically. Similarly, new words and pronunciations can be added. The audio input data can be automatically logged to files for any future analysis.

The rest of this document describes how to obtain the Sphinx2 software, the models needed to run Sphinx2 applications, the recognition engine, the application programming interface, example applications, compilation of the libraries and demos, the full set of initialization arguments, and answers to some frequently asked questions.







Obtaining Sphinx2 Software

The Sphinx2 software is available on SourceForge.
Note: This section is under construction.





Models for Running Sphinx2 Applications

In order to run Sphinx2 applications, several model files or databases are needed. They fall into three broad categories: the pronunciation lexicon or dictionary, the acoustic model, and the language model. Applications should configure the decoder with the appropriate pronunciation, acoustic and language models through command-line style arguments that are described later in this document.

The following is a very brief description of each model. (For those unfamiliar with these topics or with Sphinx, a somewhat longer description, and the associated Sphinx terminology, is in the html documentation for the Sphinx3 decoder, available from CMU Sphinx at SourceForge.)

Pronunciation Lexicon

The decoder must be initialized with one pronunciation lexicon or dictionary that defines all the words of interest to the application and the phonemic pronunciation for each word.

The dictionary can be modified at run time. New words and their pronunciations can be added to the lexicon in between utterances. (An utterance is the unit of decoding; see Section The Recognition Engine below.) There is no mechanism for removing a word from the lexicon.

In addition to ordinary words, a set of noise or filler words can be specified by the application, by placing them in a corresponding dictionary (indicated by the -ndictfn argument). Sphinx2 also automatically adds the silence word SIL, with the silence phone as its pronunciation, to the set of filler words. A filler dictionary is not required, but the silence word is always added. The significance of filler words is that they can occur anywhere in the utterance, transparent to the language model (see below).

Finally, Sphinx2 also adds the distinguished begin-sentence and end-sentence symbols, <s> and </s>, to the vocabulary. These words must be present in any N-gram language model (see below).

Acoustic Model

The Sphinx2 decoder can use either semi-continuous or continuous density acoustic models generated by the Sphinx acoustic model trainer. The decoder first checks the -mdeffn argument for a Sphinx3-format continuous density model definition file. If present, continuous model decoding is assumed. Otherwise, the decoder looks for semi-continuous acoustic models.

The decoder has to be initialized with one acoustic model. It is not possible to load multiple models and switch between them dynamically at run time.

Note: The current implementation does not include any speed optimizations to the continuous density model evaluation functions. Hence, applications using continuous density models may be restricted to small vocabulary or small model configurations if they are to run in real time.

Semi-continuous Models

A Sphinx2 format semi-continuous acoustic model is a set of several files: codebook files (.vec means and .var variances), senone weights files (.ccode, .d2code, .p3code, and .xcode), and HMM transition matrices files (.chmm).

Note: the Sphinx acoustic model trainer actually generates Sphinx3 format acoustic models, which consist of four files: means, variances, mixture weights and transition matrices. They have to be converted to Sphinx2 format using utilities provided with the trainer.

The semi-continuous model files generated by the trainer have model parameter values in 32-bit format. Sphinx2 memory requirements can be reduced considerably by combining all the mixture weights files into a single, compressed senone dump file that uses 8-bit parameter values. See Section Building 8-Bit Senone Dump Files for details.

Given the acoustic model and a pronunciation dictionary, there is also associated mapping information that defines the senone mapping for each triphone state encountered in the dictionary. For semi-continuous models, this information is available in .phone and .map files (specified by the -phnfn and -mapfn arguments). These two files are also generated by utilities in the Sphinx trainer package, based on a corresponding Sphinx3-format model definition file.

Continuous Density Models

Sphinx2 directly uses most of the Sphinx3 format continuous density acoustic model files without requiring any format conversion. As mentioned above, if the -mdeffn argument is specified, continuous density models are assumed. This argument specifies a Sphinx3 format model definition file, containing triphone-state to senone mapping information. There is no need to convert this file into .phone and .map files, unlike in the case of semi-continuous models.

However, in the current implementation, the decoder does internally generate .phone and .map files from the model definition file. It writes them to a directory specified by the -kbdumpdir argument. Applications may save the generated files, and specify them as input in subsequent decoder runs.

To summarize, if the -mdeffn argument is specified, the decoder assumes continuous density models. If the -phnfn and -mapfn arguments are not specified, the decoder automatically generates .phone and .map files in the -kbdumpdir directory, and reads them back in. If the -phnfn and -mapfn are specified, on the other hand, the decoder does not generate new .phone and .map files. Instead, it simply reads in the files specified by these arguments, with the assumption that they are compatible with the model definition file.

As for the continuous density acoustic models themselves, the decoder can directly read in the means, variances and mixture weights files generated by the Sphinx3 trainer, without any format conversion. The transition matrices file does, however, have to be converted to Sphinx2 format (.chmm files).
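
For instance, a continuous density configuration could be collected into a file and passed via the -argfile mechanism (see Arguments Reference below); the file names here are purely illustrative:

    # hypothetical continuous density model configuration
    -mdeffn     model/mdef
    -meanfn     model/means
    -varfn      model/variances
    -mixwfn     model/mixture_weights
    -hmmdirlist model/chmm_dir
    -kbdumpdir  dumpdir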

There are a few restrictions on the structure of continuous density acoustic models that the Sphinx2 decoder can handle.



Language Model

Sphinx2 accepts two flavors of language models (LMs) or grammars: finite-state (FSG), and N-gram (for N = 2 or 3, i.e., bigrams or trigrams). One can load multiple LMs, N-gram or FSG, into the decoder, either during initialization or at run time. However, only one LM can be active for a given utterance. LMs are identified by a string name. The application can switch LMs in between utterances. The N-gram language model specified by the -lmfn argument is unnamed; it has the empty string as its name.

The active vocabulary during the decoding of an utterance is the set of words that is present in both the pronunciation dictionary and the currently active LM. The decoder is incapable of recognizing any word outside the active vocabulary.

Any new word added to the dictionary is also automatically added as a unigram to the unnamed N-gram language model. (This is a HACK. There ought to be a mechanism for adding a word to a specified language model. Currently the only way to accomplish this is to delete a currently loaded language model, create a new model with the new word, and load it. This is also the case with FSGs; there is no way to dynamically modify a loaded FSG.)

As mentioned earlier, an N-gram language model must include the begin-sentence and end-sentence symbols, <s> and </s>. Filler words are transparent to any language model, N-gram or finite state. That is, the decoder can transparently try to insert them anywhere in the utterance, but they don't exist as far as the language model is concerned.

Large N-gram LMs load very slowly. The delay can be avoided by providing a binary dump version of LM files along with the original LMs. The Sphinx2 decoder can automatically create LM dump files for large N-gram LMs, which can be used in subsequent decoder runs. (See Section Building LM Dump Files.)

Note that the current implementation of finite-state grammars, or FSGs, is not the most efficient. In particular, transitions are represented using a full NxN matrix, where N is the number of states. Hence, FSGs containing several thousand states may run inefficiently.

No LM is required for operation in allphone or forced alignment recognition modes.







The Recognition Engine

The core speech decoder operates on finite-length segments of speech or utterances, one utterance at a time. (Operationally, an utterance is the chunk of speech data processed between calls to the Sphinx2 API functions uttproc_begin_utt and the next uttproc_end_utt; see further below.) An utterance can be up to a minute long. In practice, most applications would typically treat a sentence or a phrase as an utterance, which would be much shorter than the maximum of 60 sec.

Basic Recognition

The recognition structure in Sphinx2 depends on whether the currently active language model is an N-gram or an FSG model. (The API, however, is substantially the same for both.)

N-gram Decoding

In N-gram decoding, each utterance is decoded using up to three passes, two of which are optional: a tree-structured lexical Viterbi search, an optional flat-lexical Viterbi search (the -fwdflat option), and an optional global best-path search over the resulting word lattice (the -bestpath option). The optional passes generally improve recognition accuracy. However, the second pass (flat Viterbi search) can increase latency significantly. The passes to be active are configured once at initialization. From then on, the presence of the multiple passes is invisible to the application; it only receives the result from the last active pass. However, the word lattice can subsequently be searched for additional, alternative--or N-best--hypotheses by the application.

FSG Decoding

In FSG decoding, on the other hand, there is only one pass of Viterbi search. The optional passes described above are not used, and it is not (yet) possible to obtain N-best lists. (Moreover, the implementation of even this one pass is distinct from the N-gram decoder, in order to facilitate porting FSG decoding to Sphinx3.)

Details of the Sphinx2 recognition engine (with N-gram models) can be found in Ravishankar's Ph.D. thesis.

Forced Alignment and Allphone Recognition Modes

The recognizer can be run to time-align given transcripts to input speech, producing time segmentations for the input transcripts, as well as identifying silence regions. Time-alignment is only available in batch mode. It is covered in more detail below. (Note: Forced alignment can also be accomplished using the recently added finite-state grammar capability, which can also be used in live mode. However, it cannot provide phoneme and state-level segmentation.)

Sphinx2 can also be used in allphone mode to produce a purely phonetic recognition instead of the normal word recognition. The allphone recognition API is available to user-written applications as well. However, the input can only be from pre-recorded files.

Note: The recognition engine is configured in one of normal, forced-alignment, or allphone modes during initialization. It cannot be dynamically switched between these modes at run time.





The Application Programming Interface

There are three main groups of functions, or application programming interfaces (APIs), available with Sphinx2: raw audio access, continuous listening/silence filtering, and the core decoder itself.

As we shall see below, none of the core decoder API functions directly accesses any audio device. Rather, the application is responsible for collecting audio data to be decoded. This gives applications the freedom to decode audio data originating at any source at all---standard audio devices, pre-recorded files, data received from a remote location over a socket connection, etc. Since most applications ultimately need to access common audio devices and to perform some form of silence filtering to detect speech/no-speech conditions, the two additional modules are provided as a convenience.

(NOTE: The APIs often use int32 and int16 types, which are basically 32-bit and 16-bit integer types. Similarly, uint32 and uint16 are the unsigned versions.)

Low-Level Audio Access

No two platforms provide the same interface to audio devices. To accommodate this diversity, the platform-dependent code is encapsulated within a generic interface for low-level audio recording and playback. The following functions are for recording. Complete details can be found in include/ad.h.
  • ad_open:
  • Opens an audio device for recording. Returns a handle to the opened device. (Currently 8KHz or 16KHz mono, 16-bit PCM only.)
  • ad_start_rec:
  • Starts recording on the audio device associated with the specified handle.
  • ad_read:
  • Reads up to a specified number of samples into a given buffer. It returns the number of samples actually read, which may be less than the number requested. In particular it may return 0 samples if no data is available. Most operating systems have a limited amount of internal buffering (at most a few seconds) for audio devices. Hence, this function must be called frequently enough to avoid buffer overflow.
  • ad_stop_rec:
  • Stops recording. (However, the system may still have internally buffered data remaining to be read.)
  • ad_close:
  • Closes the audio device associated with the specified audio handle.
    See examples/adrec.c and examples/adpow.c for two examples demonstrating the use of the above functions.
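
    A minimal recording sketch using the functions above follows; the buffer size and the five-second duration are illustrative, and error handling is mostly elided:

        #include "ad.h"

        int record_five_seconds (void)
        {
            ad_rec_t *ad;
            int16 buf[4096];
            int32 k, total = 0;

            if ((ad = ad_open ()) == NULL)   /* default device and sampling rate */
                return -1;
            ad_start_rec (ad);               /* start A/D recording */
            while (total < 16000 * 5) {      /* approx. 5 sec at 16KHz */
                k = ad_read (ad, buf, 4096); /* may return 0 samples */
                if (k < 0)
                    break;                   /* error or device closed */
                total += k;
                /* ... hand the k samples in buf[] to the application ... */
            }
            ad_stop_rec (ad);
            ad_close (ad);
            return 0;
        }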

    A similar set of playback functions are provided (currently implemented only on PC/Windows platforms):
  • ad_open_play:
  • Opens an audio device for playback. Returns a handle to the opened device. (Currently 8KHz or 16KHz mono, 16-bit PCM only.)
  • ad_start_play:
  • Starts playback on the device associated with the given handle.
  • ad_write:
  • Sends a buffer of samples for playback. The function may accept fewer samples than provided, depending on available internal buffering; it returns the number of samples actually accepted. The application must provide data rapidly enough to avoid breaks in playback.
  • ad_stop_play:
  • Ends playback; playback continues until all internally buffered data has been consumed.
  • ad_close_play:
  • Closes the audio device associated with the specified handle.
    Finally, the audio library includes a function ad_mu2li for converting 8-bit mu-law samples into 16-bit linear PCM samples.

    See examples/adplay.c for an example that plays back audio samples from a given input file.

    The implementation of the audio API for the various platforms is contained in the analog-to-digital library for the given architecture.

    Continuous Listening and Silence Filtering

    As mentioned earlier, Sphinx2 can only decode utterances that are limited to less than about 1 min. at a time. However, one often wants to leave the audio recording running continuously and automatically determine utterance boundaries based on pauses in the input speech. The continuous listening module in Sphinx2 provides the mechanisms for this purpose.

    The silence filtering module is interposed between the raw audio input source and the application. The application calls the function cont_ad_read instead of directly reading the raw A/D input source. cont_ad_read returns only those segments of input audio that it determines to be non-silence. Additional timestamp information is provided to inform the application about silence regions that have been dropped.

    The complete continuous listening API is defined in include/cont_ad.h and is summarized below:
  • cont_ad_init:
  • Associates a new continuous listening module instance with a specified raw A/D handle and a corresponding read function pointer. E.g., these may be the handle returned by ad_open and the function ad_read described above.
  • cont_ad_calib:
  • Calibrates the background silence level by reading the raw audio for a few seconds. It should be done once immediately after cont_ad_init, and after any environmental change.
  • cont_ad_read:
  • Reads and returns the next available block of non-silence data in a given buffer. (Uses the read function and handle supplied to cont_ad_init to obtain the raw A/D data.) More details are provided below.
  • cont_ad_reset:
  • Flushes any data buffered inside the module. Useful for discarding accumulated, but unprocessed speech.
  • cont_ad_get_params:
  • Returns the current values of a number of parameters that determine the functioning of the silence/speech detection module.
  • cont_ad_set_params:
  • Sets a number of parameters that determine the functioning of the silence/speech detection module. Useful for fine-tuning its performance.
  • cont_ad_set_thresh:
  • Useful for adjusting the silence and speech thresholds. (It's preferable to use cont_ad_set_params for this purpose.)
  • cont_ad_detach:
  • Detaches the specified continuous listening module from its currently associated audio device.
  • cont_ad_attach:
  • Attaches the specified continuous listening module to the specified audio device. (Similar to cont_ad_init, but without the need to calibrate the audio device. The existing parameter values are used instead of being reset to default values.)
  • cont_ad_close:
  • Closes the continuous listening module.
    Some additional details on the cont_ad_read function are in order. Operationally, every call to cont_ad_read causes the module to read the associated raw A/D source (as much data as is available), scan it for speech (non-silence) segments, and enqueue them internally. It returns the first available segment of speech data, if any. In addition to returning non-silence data, the function also updates a couple of parameters that may be of interest to the application, notably a timestamp indicating the total amount of raw input consumed so far, including any silence regions that were dropped.

    So, for example, if on two successive calls to cont_ad_read the timestamp is 100000 and 116000, respectively, the application can determine that 1 sec (16000 samples at a 16KHz sampling rate) of silence has been gobbled up between the two calls.

    Silence regions aren't chopped off completely. About 50-100ms worth of silence is preserved at either end of a speech segment and passed on to the application.

    Finally, the continuous listener won't concatenate speech segments separated by silence. That is, the data returned by a single call to cont_ad_read will not span raw audio separated by silence that has been gobbled up.

    cont_ad_read must be called frequently enough to avoid loss of input data owing to buffer overflow. The application is responsible for turning actual recording on and off, if applicable. In particular, it must ensure that recording is on during calibration and normal operation.

    See examples/cont_adseg.c for an example that uses the continuous listening module to segment live audio input into separate utterances. Similarly, examples/cont_fileseg.c segments a given pre-recorded file containing audio data into utterances.
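
    The typical calling sequence is sketched below (modeled on examples/cont_adseg.c); buffer sizes are illustrative and error handling is elided:

        #include "ad.h"
        #include "cont_ad.h"

        void listen_loop (void)
        {
            ad_rec_t *ad = ad_open ();
            cont_ad_t *cont = cont_ad_init (ad, ad_read); /* attach raw A/D source */
            int16 buf[4096];
            int32 k;

            ad_start_rec (ad);      /* recording must be on during calibration */
            cont_ad_calib (cont);   /* calibrate background silence level */
            for (;;) {
                k = cont_ad_read (cont, buf, 4096);  /* non-silence data only */
                if (k < 0)
                    break;          /* A/D error */
                if (k > 0) {
                    /* buf[] holds k samples of speech; feed to the decoder, etc. */
                }
            }
            cont_ad_close (cont);
            ad_close (ad);
        }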

    Speech-to-Text Decoding

    There are several aspects to speech decoding: initialization, basic speech decoding, management of multiple grammars (LMs), logging and book-keeping, etc. This section briefly describes the related Sphinx2 API functions. Note that not all the available functions are documented here. Please refer to the functions and data types defined in include/fbs.h for further details.

    The two functions pertaining to initialization and final cleanup are:
  • fbs_init:
  • Initializes the decoder. The input arguments (in the form of the common command line argument list argc,argv) specify the input databases (acoustic, lexical, and language models) and various other decoder configuration options. (See Arguments Reference.) If batch-mode processing is indicated (see -ctlfn option below) it happens as part of this initialization.
  • fbs_end:
  • Cleans up the internals of the decoder, such as printing summaries and closing log files, before the application exits.
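
    For example, a minimal application skeleton, with model files and other options passed on the command line and forwarded to fbs_init:

        #include "fbs.h"

        int main (int argc, char *argv[])
        {
            fbs_init (argc, argv);  /* parses -dictfn, -lmfn, etc. (see Arguments Reference) */

            /* ... decode utterances, as described below ... */

            fbs_end ();             /* print summaries, close log files */
            return 0;
        }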


    Sphinx2 applications can use the following functions to decode speech into text, one utterance at a time:
  • uttproc_begin_utt:
  • Begins decoding the next utterance. The application can assign an id string to it. If not, one is automatically created and assigned.
  • uttproc_rawdata:
  • Processes (decodes) the next chunk of raw A/D data in the current utterance. This can be non-blocking, in which case much of the data may be simply queued internally for later processing. Note that only single-channel (mono) 16-bit linear PCM-encoded samples can be processed.
  • uttproc_cepdata:
  • This is an alternative to uttproc_rawdata if the application wishes to decode cepstrum data instead of raw A/D data.
  • uttproc_end_utt:
  • Indicates that all the speech data for the current utterance has been provided to the decoder.
  • uttproc_result:
  • Finishes processing internally queued up data and returns the final recognition result string. It can also be non-blocking, in which case it may return after processing only some of the internally queued up data.
  • uttproc_result_seg:
  • Like uttproc_result, but returns additional information for each word in the result, such as time segmentation (measured in 10msec frames), acoustic and language model scores, etc. (See structure search_hyp_t in file include/fbs.h.) One can use either this function or uttproc_result to finish decoding, but not both.
  • uttproc_partial_result:
  • This function can be used to obtain the most up-to-date partial result while utterance decoding is in progress. This may be useful, for example, in providing feedback to the user.
  • uttproc_partial_result_seg:
  • Like uttproc_partial_result, but returns word segmentation information (measured in 10msec frames) instead of the recognition string.
  • uttproc_abort_utt:
  • This is an alternative to uttproc_end_utt that terminates the current utterance. No further recognition results can be obtained for it.
  • search_get_alt:
  • Returns N-best hypotheses for the utterance. Currently, this does not work with finite state grammars. (See further details in include/fbs.h).
    The non-blocking option in some of the above functions is useful if decoding is slower than real-time, and there is a chance of losing input A/D data if processing them takes too long. In the non-blocking mode, the data may simply be queued up internally and processed only after all the input data for the current utterance has been acquired. Similarly, the non-blocking option in uttproc_result allows the application to respond to user-interface events in real-time.

    The application code fragment for decoding one utterance typically looks as follows:

        char *hyp;                          /* recognition string */
        int32 frm, k;                       /* frames decoded, samples read */
        int16 buf[4096];

        uttproc_begin_utt (NULL);           /* NULL => an id is assigned automatically */
        while (not end of utterance) {      /* indicated externally, somehow */
            k = ad_read (ad, buf, 4096);    /* any available A/D data; possibly 0 length */
            uttproc_rawdata (buf, k, 0);    /* last argument 0 => non-blocking */
        }
        uttproc_end_utt ();
        uttproc_result (&frm, &hyp, 1);     /* last argument 1 => blocking */
    
    See several demo applications in the directory examples/ for some variations.

    Multiple, named LMs can be resident with the decoder module, either read in during initialization, or dynamically at run time. However, exactly one LM must be selected and active for decoding any given utterance. As mentioned earlier, the active vocabulary for each utterance is given by the intersection of the pronunciation dictionary and the currently active LM. The following functions allow the application to control language modelling related aspects of the decoder:
  • lm_read:
  • Reads in a new N-gram language model from a given file, and associates it with a given string name. The application needs this function only if it needs to create and load LMs dynamically at run time, rather than at initialization via the -lmfn command line argument.
  • lm_delete:
  • Deletes the N-gram LM with the given string name from the decoder repertory.
  • uttproc_set_lm:
  • Tells the decoder to switch the active grammar to the N-gram LM with the given string name. Subsequent utterances are decoded with this grammar, until the next uttproc_set_lm or uttproc_set_fsg operation. This function can only be invoked between utterances, not in the midst of one.
  • uttproc_set_context:
  • Sets a two-word history for the next utterance to be decoded, giving its first words additional context that can be exploited by the LM. (Useful only with N-gram LMs.)
  • uttproc_load_fsgfile:
  • Loads the given finite-state grammar (FSG) file into the system and returns the string name associated with the FSG. (Unlike the N-gram LM, the string name is contained in the FSG file.) The application needs this function only if it needs to create and load FSGs dynamically at run time, rather than at initialization via the -fsgfn or -fsgctlfn command line arguments.
  • uttproc_load_fsg:
  • Similar to uttproc_load_fsgfile, but the input FSG is provided in the form of an s2_fsg_t data structure (see include/fbs.h), instead of a file.
  • uttproc_set_fsg:
  • Tells the decoder to switch the active grammar to the FSG with the given string name. Subsequent utterances are decoded with this grammar, until the next uttproc_set_fsg or uttproc_set_lm operation. This function can only be invoked between utterances, not in the midst of one.
  • uttproc_del_fsg:
  • Deletes the FSG with the given string name from the decoder repertory.
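
    For illustration, the fragment below loads a named LM at run time and switches to it between utterances. The file name is made up, and the lm_read argument list shown (file, name, language weight, unigram weight, word insertion penalty) is an assumption; check include/fbs.h for the exact signature:

        /* lm_read signature assumed here; verify against include/fbs.h */
        lm_read ("digits.lm", "digits", 6.5, 0.5, 0.65);

        uttproc_set_lm ("digits");  /* "digits" active for subsequent utterances */
        /* ... decode some utterances ... */
        uttproc_set_lm ("");        /* switch back to the unnamed -lmfn LM */
        lm_delete ("digits");       /* remove the named LM from the repertory */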


    The raw input data for each utterance and/or the cepstrum data derived from it can be logged to specified directories:
  • uttproc_set_rawlogdir:
  • Specifies the directory to which utterance audio data should be logged. An utterance is logged to the file <id>.raw, where <id> is the string ID assigned to the utterance by uttproc_begin_utt.
  • uttproc_set_mfclogdir:
  • Specifies the directory to which utterance cepstrum data should be logged. Like A/D files above, an utterance is logged to file <id>.mfc.
  • uttproc_get_uttid:
  • Retrieves the utterance ID string for the current or most recent utterance. Useful for locating the logged A/D data and cepstrum files, for example.

    In addition, the decoder configuration includes a number of parameters that can be tuned for a given application to give optimum performance. They are set at initialization time via command-line style arguments (during the fbs_init call). The parameters determine various aspects of the decoder, such as beam pruning thresholds, the relative weights of acoustic and language model scores, etc. They are covered in more detail in Section Arguments Reference below.

    Allphone Decoding

    In batch mode, Sphinx2 runs in allphone mode if the -allphone flag is TRUE. In this mode, no language model should be provided; i.e., the -lmfn, -lmctlfn, -fsgfn and -fsgctlfn arguments should be omitted. A phone transition probability matrix can be specified using the -phonetpfn argument.

    In addition, the API for allphone decoding includes a single function that supports recognition from pre-recorded files:
  • uttproc_allphone_file:
  • Performs allphone recognition on the given file and returns the resulting phone segmentation. The input file can contain either audio data, or cepstrum data. The -adcin argument, and any related ones, should be set accordingly. (See arguments reference below.)
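
    A sketch of its use follows, assuming (per the search_hyp_t structure in include/fbs.h) that the segmentation is returned as a linked list; the input file name is illustrative:

        search_hyp_t *seg, *h;

        seg = uttproc_allphone_file ("utt0001.raw");       /* A/D input file */
        for (h = seg; h != NULL; h = h->next)              /* one entry per phone */
            printf ("%s %d %d\n", h->word, h->sf, h->ef);  /* phone, start/end frames */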


    Forced Alignment

    Sphinx2 (in batch mode) can be used for aligning transcripts to speech, in order to obtain time-segmentations at the word, phone, or state levels. In this mode, no language model should be provided; i.e., the -lmfn, -lmctlfn, -fsgfn and -fsgctlfn arguments should be omitted.

    The set of utterances (speech data) is given by the -ctlfn argument, as usual. (See arguments reference below.) In addition, the corresponding transcripts should be given in a parallel file, specified by the -tactlfn argument. The first line of this file should contain just the string *align_all*. This should be followed by the transcripts for all the utterances to be aligned, one line per utterance, in the same order as in the -ctlfn file. The transcripts must not include any utterance ID.

    Alignments at the word, phone and state levels can be obtained by setting the flags -taword, -taphone, and -tastate individually to TRUE or FALSE. Alignments are written to stdout (the log file).
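
    For example, a -tactlfn transcript file for three utterances might look as follows (the sentences themselves are illustrative):

        *align_all*
        GO FORWARD TEN METERS
        TURN RIGHT
        STOP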





    Application Examples

    Two simple speech decoding applications, implemented with a tty-based interface as well as with a Windows interface, are included in the directory examples.







    Compiling the Libraries and Demos

    To compile the Sphinx2 libraries on Unix platforms, use the supplied configure script and makefiles.

    To compile the Sphinx2 libraries on Windows platforms, build the supplied MS Visual Studio workspace.

    MS Visual Studio will build the executables under .\bin\Release or .\bin\Debug (depending on the configuration you choose in MS Visual Studio), and the libraries under .\lib\Release or .\lib\Debug.

    In a successful installation, the test produces the recognition result GO FORWARD TEN METERS.







    Arguments Reference

    The core Sphinx2 decoding engine accepts a long list of arguments during initialization. These are the arguments to the library function fbs_init(int argc, char *argv[]) defined in include/fbs.h. (Applications built around the Sphinx2 libraries, of course, can have additional arguments.) Many arguments, such as the input model databases, must be specified by the user. We cover the more important ones below (the remainder have reasonable default values):

    Input Model Databases

  • -lmfn:
  • Optional DARPA format N-gram LM file with the empty string as its name. Default: None.
  • -lmctlfn:
  • Optional control file with a list of N-gram LM files and associated string names (one line per entry). This is how multiple LMs can be loaded during initialization. Default: None.
  • -kbdumpdir:
  • Optional directory containing precompiled binary versions of N-gram LM files (see Building LM Dump Files). Also the directory in which Sphinx2 format map and phone files are created if -mdeffn is specified but -phnfn and -mapfn are omitted. Default: None.
  • -fsgfn:
  • Optional finite state grammar file. Default: None.
  • -fsgctlfn:
  • Optional control file with a list of FSG grammar filenames, one line per entry. Blank and comment lines (beginning with a # char) are allowed. This is how multiple FSGs can be loaded during initialization. Default: None.
  • -fsgusealtpron:
  • Whether the decoder should expand a transition in an FSG to include all the alternative pronunciations of the associated word. Default: TRUE.
  • -fsgusefiller:
  • Whether the decoder should insert a filler word (self-loop) transition automatically at every state. Default: TRUE.
  • -dictfn:
  • Main pronunciation dictionary file. Default: None.
  • -oovdictfn:
  • Optional out-of-vocabulary (OOV) pronunciation dictionary. These words are added to the unnamed LM (read from the -lmfn file) with unigram probability given by -oovugprob. Default: None.
  • -ndictfn:
  • Optional "noise" words pronunciation dictionary. Noise words are not part of any LM and, like silence, can be inserted transparently anywhere in the utterance. Default: None.
  • -phnfn, -mapfn:
  • Phone and map files with senone mapping information for the given dictionary and acoustic model. Can be omitted for continuous models (i.e., if -mdeffn is specified). Defaults: None.
  • -cbdir:
  • Directory containing Sphinx2 format semi-continuous acoustic model codebook files (.vec and .var files). Default: None.
  • -hmmdir:
  • Directory containing Sphinx2 format semi-continuous acoustic model senone weights files (.ccode, .d2code, .p3code, and .xcode files). Default: None.
  • -hmmdirlist:
  • Directory containing Sphinx2 format acoustic model transition matrices files (.chmm files). (For both semi-continuous and continuous density models.) Default: None.
  • -sendumpfn, -8bsen:
  • Optional 8-bit senone mixture weights file created from the 32-bit mixture weights files (see Building 8-Bit Senone Dump Files). -8bsen should be TRUE if the 8-bit senones are used. Default: None.
  • -mdeffn:
  • Sphinx3 format model definition file for continuous density models. Default: None.
  • -meanfn, -varfn, -mixwfn:
  • Sphinx3 format acoustic model files: means, variances and senone mixture weights. (The transition matrices file has to be converted to Sphinx2 format and specified via the -hmmdirlist argument.) Defaults: None.
  • -varfloor:
  • Floor for variance values in the -varfn file; smaller variance values are raised to this floor. Default: 0.0001.
  • -mixwfloor:
  • Floor for mixture weight values in the -mixwfn file; smaller values are raised to this floor. Default: 0.0000001.
  • -phonetpfn:
  • Phone transition probability (actually counts) matrix input file, for use as a "language model" in allphone recognition mode. (Also see the associated arguments -ptplw and -uptpwt.) Default: None.


    Decoder Configuration

  • -ctlfn, -ctloffset, -ctlcount:
  • Batch-mode control file listing utterance files (without their file extension) to decode. -ctloffset is the number of initial utterances in the file to be skipped, and -ctlcount the number to be processed (after the skip, if any). -ctlfn must not be specified for live-mode or application-driven operation. Defaults: None, 0, All.
  • -datadir:
  • If the control file (-ctlfn argument) entries are relative pathnames, an optional directory prefix for them may be specified using this argument. Default: None.
  • -allphone:
  • Should be TRUE to configure the recognition engine for allphone mode operation. Default: FALSE.
  • -tactlfn:
  • Input transcript file, parallel to the control file (-ctlfn), in forced alignment mode. Default: None.
  • -adcin, -adcext, -adchdr, -adcendian:
  • In batch mode, -adcin selects A/D (TRUE) or cepstrum input data (FALSE). If TRUE, -adcext is the file extension appended to names listed in the -ctlfn argument file, -adchdr the number of bytes of header in each input file, and -adcendian their byte ordering: 0 for big-endian, 1 for little-endian. With these flags, most A/D data file formats can be processed directly. Defaults: FALSE, raw, 0, 1.
  • -normmean, -nmprior:
  • Cepstral mean normalization (CMN) option. If -nmprior is FALSE, CMN is computed on the current utterance only (usually batch mode); otherwise it is based on past history (live mode). Defaults: TRUE, FALSE.
  • -compress, -compressprior:
  • Silence deletion (within the decoder; not related to continuous listening). If -compressprior is FALSE, it is based on current utterance statistics (batch mode); otherwise on past history (live mode). -compress should be FALSE if continuous listening is used. Defaults: FALSE, FALSE.
  • -agcmax, -agcemax:
  • Automatic gain control (AGC) option. In batch mode only -agcmax should be TRUE, and in live mode only -agcemax. Defaults: FALSE, FALSE.
  • -live:
  • Forces some live-mode flags to TRUE: -nmprior, -compressprior, and (if any AGC is on) -agcemax. Default: FALSE.
  • -samp:
  • Sampling rate; must be 8000 or 16000. Default: 16000.
  • -fwdflat:
  • Run flat-lexical Viterbi search after the tree-structured pass (for better accuracy). Usually FALSE in live mode. Default: TRUE.
  • -bestpath:
  • Run global best path search over the Viterbi search word lattice output (for better accuracy). Default: TRUE.
  • -compallsen:
  • Compute all senones, whether active or inactive, in each frame. Default: FALSE.
  • -latsize:
  • Number of word lattice entries to be allocated. Longer sentences need larger lattices. Default: 50000.
  • -fsgbfs:
  • If FALSE, backtrace from the state with the best path score, instead of from the FSG final state. (FSG mode only.) Default: TRUE.


    Beam Widths

  • -top:
  • Number of codewords computed per frame. Usually narrowed to 1 in live mode. Default: 4.
  • -beam, -npbeam:
  • Main pruning thresholds for the tree search. Usually narrowed down to 2e-6 in live mode. Defaults: 1e-6, 1e-6.
  • -lpbeam:
  • Additional pruning threshold for transitions to leaf nodes of the lexical tree. Usually narrowed down to 2e-5 in live mode. Default: 1e-5.
  • -lponlybeam, -nwbeam:
  • Yet more pruning thresholds, for leaf nodes and exits from the lexical tree. Usually narrowed down to 5e-4 in live mode. Defaults: 3e-4, 3e-4.
  • -maxwpf:
  • Maximum number of words, ranked according to score, that can be recognized and entered in the Viterbi history in each frame. Essentially, an absolute pruning parameter complementing the beam pruning parameters. (Not used in FSG mode.) Default: Infinity.
  • -maxhmmpf:
  • Absolute pruning threshold: the maximum number of HMMs to keep active in each frame (approx.). Implemented only in FSG mode. Default: Infinity.
  • -fwdflatbeam, -fwdflatnwbeam:
  • Main and word-exit pruning thresholds for the optional flat-lexical Viterbi search. Defaults: 1e-8, 3e-4.
  • -topsenfrm, -topsenthresh:
  • Number of lookahead frames for predicting active base phones. (If <= 1, all base phones are assumed to be active in every frame.) -topsenthresh is the log(pruning threshold) applied to raw senone scores to determine the active phones in each frame. Defaults: 1, -60000.


    Language Weights/Penalties

  • -langwt, -fwdflatlw, -rescorelw:
  • Language weights applied during the lexical-tree Viterbi search, the flat-structured Viterbi search, and the global word lattice search, respectively. Defaults: 6.5, 8.5, 9.5.
  • -ugwt:
  • Unigram weight for interpolating unigram probabilities with a uniform distribution. Typically in the range 0.5-0.8. Default: 1.0.
  • -inspen, -silpen, -fillpen:
  • Word insertion penalty or probability (for words in the LM), insertion penalty for the silence word, and insertion penalty for noise words (from the -ndictfn file), if any. Defaults: 0.65, 0.005, 1e-8.
  • -oovugprob:
  • Unigram probability (logprob) for OOV words from the -oovdictfn file, if any. Default: -4.5.
  • -ascrscale:
  • Scaling of acoustic scores (continuous density acoustic models only). Raw acoustic scores are scaled down (i.e., shifted down) by this many bits. Default: 0 (no scaling).
  • -ptplw:
  • Phone transition language weight applied to the phone transition probability matrix (see the -phonetpfn argument). Default: 5.0.
  • -uptpwt:
  • Linear interpolation constant for interpolating phone transition probabilities between uniform and those specified by the -phonetpfn argument. Values closer to 1.0 weight the uniform distribution more; values closer to 0.0 weight it less. Default: 0.001.


    Output Specifications

  • -matchfn:
  • Filename to which the final recognition string for each utterance is written. (Old format: the utterance id appears at the end of each line.) Default: None.
  • -matchsegfn:
  • Like -matchfn, but includes word segmentation information: startframe #frames word... (New format: the utterance id appears at the beginning of each line.) Default: None.
  • -partial:
  • If greater than 0, print any available partial hypothesis to stdout every so many frames. Default: 0.
  • -partialseg:
  • Like -partial, but includes segmentation information for each word in the partial hypothesis. Default: 0.
  • -reportpron:
  • Causes word pronunciations to be included in output files. Default: FALSE.
  • -rawlogdir:
  • If specified, logs raw A/D input samples for each utterance to the indicated directory. (One file per utterance, named <uttid>.raw.) Default: None.
  • -mfclogdir:
  • If specified, logs cepstrum data for each utterance to the indicated directory. (One file per utterance, named <uttid>.mfc.) Default: None.
  • -dumplatdir:
  • If specified, dumps the word lattice for each utterance to a file in this directory. The filename is created from the utterance ID. Default: None.
  • -logfn:
  • Filename to which decoder logging information is written. Default: stdout/stderr.
  • -backtrace:
  • Includes detailed word backtrace information in the log file. Default: TRUE.
  • -nbest:
  • Number of N-best hypotheses to be produced. Currently, this flag is only useful in batch mode, but an application can always directly invoke search_get_alt to obtain them. Also, the current implementation is lacking in some details (e.g., in returning detailed scores). Default: 0.
  • -nbestdir:
  • Directory to which N-best files are written (one per utterance). Default: current directory.
  • -phoneconf:
  • Writes a form of phoneme scores for the utterance to the log file. Mainly for diagnostic purposes. Default: FALSE.
  • -pscr2lat:
  • Writes a phoneme lattice for the utterance (i.e., the top-scoring phonemes per frame) to the log file. Mainly for diagnostic purposes. Default: FALSE.
  • -taword, -taphone, -tastate:
  • Whether word, phone, and state alignment output should be produced when running in forced alignment mode. Defaults: TRUE, TRUE, FALSE.


    Finally, one of the arguments can be: -argfile filename. This causes additional arguments to be read in from the given file. Lines beginning with the '#' character in this file are ignored. Recursive -argfile specifications are not allowed.

    Alphabetical List of Arguments

  • 8bsen:
  • Use 8-bit senone dump file.
  • adcendian:
  • A/D input file byte-ordering.
  • adcext:
  • A/D input file extension.
  • adchdr:
  • No. bytes of header in A/D input file.
  • adcin:
  • Input file contains A/D samples or cepstra (TRUE/FALSE).
  • agcemax:
  • Compute AGC (max C0 normalized to 0; estimated, live mode).
  • agcmax:
  • Compute AGC (max C0 normalized to 0 based on current utterance).
  • argfile:
  • Arguments file.
  • ascrscale:
  • Acoustic scores scaling for continuous density models.
  • backtrace:
  • Provide detailed backtrace in log file.
  • beam:
  • Main pruning beamwidth.
  • bestpath:
  • Run global best path algorithm on word lattice.
  • cbdir:
  • Directory containing semi-continuous acoustic model codebook files.
  • compallsen:
  • Compute all senones.
  • compress:
  • Remove silence frames (based on C0 statistics).
  • compressprior:
  • Remove silence frames (based on C0 statistics from prior history).
  • ctlcount:
  • No. of utterances to decode in batch mode.
  • ctlfn:
  • Control file listing utterances to decode in batch mode.
  • ctloffset:
  • No. of initial utterances to be skipped from control file.
  • datadir:
  • Directory prefix for control file entries.
  • dictfn:
  • Main pronunciation dictionary.
  • dumplatdir:
  • Directory for dumping word lattices.
  • fillpen:
  • Noise word penalty (probability).
  • fsgctlfn:
  • Control file listing several FSG files to be loaded at initialization.
  • fsgfn:
  • Finite state grammar file to be loaded at initialization time.
  • fsgusealtpron:
  • Consider alternative pronunciations for every FSG transition.
  • fsgusefiller:
  • Add a filler self-loop transition at every FSG state.
  • fwdflat:
  • Run flat-lexical Viterbi search.
  • fwdflatbeam:
  • Main beam width for flat search.
  • fwdflatlw:
  • Language weight for flat search.
  • fwdflatnwbeam:
  • Word-exit beam width for flat search.
  • hmmdir:
  • Directory containing semi-continuous senone mixture weights files.
  • hmmdirlist:
  • Directory containing acoustic model HMM transition matrices files (.chmm files; for both semi-continuous and continuous density models).
  • inspen:
  • Word insertion penalty (probability).
  • kbdumpdir:
  • Directory containing LM dump files.
  • langwt:
  • Language weight for lexical tree search.
  • latsize:
  • Size of word lattice to be allocated.
  • live:
  • Live mode.
  • lmctlfn:
  • Control file listing named language model files to be loaded at initialization.
  • lmfn:
  • Unnamed language model file to load at initialization.
  • logfn:
  • Output log file.
  • lpbeam:
  • Transition to last phone beam width.
  • lponlybeam:
  • Last phone internal beam width.
  • mapfn:
  • Senone mapping file.
  • matchfn:
  • Output match file.
  • matchsegfn:
  • Output match file with word segmentation.
  • maxhmmpf:
  • Max HMMs to keep active per frame (absolute pruning, approx.).
  • maxwpf:
  • Max words to be recognized per frame (absolute pruning).
  • mdeffn:
  • Model definition file for continuous density acoustic model.
  • meanfn:
  • Continuous density acoustic model Gaussian means file.
  • mfclogdir:
  • Directory for logging cepstrum data for each utterance.
  • mixwfloor:
  • Floor value for continuous density senone mixture weights.
  • mixwfn:
  • Continuous density acoustic model senone mixture weights file.
  • nbest:
  • No. of N-best hypotheses to be produced/utterance.
  • nbestdir:
  • Directory for writing N-best hypotheses files.
  • ndictfn:
  • Noise words dictionary.
  • nmprior:
  • Cepstral mean normalization based on prior utterances statistics.
  • normmean:
  • Cepstral mean normalization.
  • npbeam:
  • Next phone beam width for tree search.
  • nwbeam:
  • Word-exit beam width for tree search.
  • oovdictfn:
  • Out-of-vocabulary words pronunciation dictionary.
  • oovugprob:
  • Unigram probability for OOV words.
  • partial:
  • Frequency of partial hypothesis reporting.
  • partialseg:
  • Frequency of partial hypothesis (with segmentation) reporting.
  • phnfn:
  • Phone file (senone mapping information).
  • phoneconf:
  • Whether to generate phone segmentation and scores.
  • pscr2lat:
  • Whether to output a phone lattice.
  • rawlogdir:
  • Directory for logging A/D data for each utterance.
  • reportpron:
  • Show actual word pronunciation in output match files.
  • rescorelw:
  • Language weight for best path search.
  • samp:
  • Input audio sampling rate (16000/8000).
  • sendumpfn:
  • (8-bit) Senone dump file.
  • silpen:
  • Silence word penalty (probability).
  • tactlfn:
  • Forced alignment transcript file.
  • taphone:
  • Whether phone-level alignment information should be output.
  • tastate:
  • Whether state-level alignment information should be output.
  • taword:
  • Whether word-level alignment information should be output.
  • top:
  • No. of top codewords to evaluate in each frame.
  • topsenfrm:
  • No. of frames to lookahead to determine active base phones.
  • topsenthresh:
  • Pruning threshold applied to determine active base phones.
  • ugwt:
  • Unigram weight for interpolating unigram probability with uniform probability.
  • varfloor:
  • Floor value for continuous density Gaussian variance values.
  • varfn:
  • Continuous density acoustic model Gaussian variances file.







    Frequently Asked Questions

    A fair amount of the available functionality in Sphinx2 hasn't been documented above. In addition to the topics covered below, the reader is encouraged to look at include/fbs.h carefully, in order to find out more.

    Decoder Tuning

    There are several ways to speed up decoding: narrowing the main pruning beams (-beam, -npbeam, -lpbeam, -lponlybeam, -nwbeam), reducing the number of codewords computed per frame (-top), using the absolute pruning parameters (-maxwpf, -maxhmmpf), and disabling the optional flat-lexical and best-path passes (-fwdflat, -bestpath) at some cost in accuracy.

    When using continuous density models, the tuning parameters can be quite different from those for semi-continuous models. In particular, the dynamic range of acoustic scores is usually larger, hence the language weight (-langwt argument) probably needs to be increased. All the pruning thresholds (-beam, -npbeam, -nwbeam, -lpbeam, -lponlybeam, -fwdflatbeam and -fwdflatnwbeam) have to be lowered as well. It is preferable to disable phone lookahead (i.e., not specify the -topsenfrm argument), since phone lookahead requires the evaluation of all senones in every frame, which is expensive for continuous density models.

    Finally, to reduce the problem of integer overflow caused by the larger dynamic range of scores, an acoustic scaling parameter (-ascrscale argument) has been added. This parameter specifies the number of bits by which raw acoustic likelihoods are scaled down. For instance, a value of 1 implies that raw acoustic scores are halved, thus compressing the dynamic range. (Unfortunately, this usually requires further tuning of language-weight and pruning threshold parameters.)

    Building LM Dump Files

    LM files are usually ASCII files. If they are large, it is time consuming to read them into the decoder. A binary "dump" file is much faster to read and more compact.

    LM dump files can be created either by the standalone program examples/lm3g2dmp.c or by the decoder itself. The standalone version can be compiled from the examples directory. The program takes two arguments: the LM source file and a directory in which the dump file is to be created. It reads the header from the original LM file to determine the size of the LM. It then forms the binary dump file name by appending a .DMP extension to the LM file name, and writes this file to the second (directory) argument. (NOTE: The dump file must not already exist!)
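
    For example, assuming the compiled binary is named lm3g2dmp and given a hypothetical LM file turtle.lm:

        lm3g2dmp turtle.lm /my/dumpdir    # creates /my/dumpdir/turtle.lm.DMP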

    Any version of the decoder can also automatically create binary "dump" files similar to the standalone version described above. It first looks for the dump file in the directory given by the -kbdumpdir argument. If the dump file is present it reads it and ignores the rest of the original LM file. Otherwise, it reads the LM file and creates a dump file in the -kbdumpdir directory so that it can be used in subsequent decoder runs.

    The decoder does not create dump files for small LMs that have fewer than an internally defined number of bigrams and trigrams.

    Building 8-Bit Senone Dump Files

    The Sphinx-II senonic acoustic model files contain 32-bit data. (These are in the directory specified by the -hmmdir argument.) However, they can be clustered down to 8 bits for memory efficiency, without loss of recognition accuracy. The clustering is carried out by an offline process as follows:
    1. Create a temporary 32-bit senone dump file by running the decoder with the -sendumpfn flag set to the temporary file name, the -8bsen flag set to FALSE, and omitting the -lmfn argument. The decoder can be killed after it creates the 32-bit senone dump file, which happens during the initialization and is announced in the log output.
    2. Run: /afs/cs/project/plus-2/s2/Sphinx2/bin/alpha/pdf32to8b 32bit-file 8bit-file
      to create the 8-bit senone dump file. That is, the first argument to pdf32to8b is the temporary 32-bit dump file created above, and the second argument is the 8-bit output file.
    3. Delete the temporary 32-bit file.
    The 8-bit senone dump file can now be used as the -sendumpfn argument to the decoder with the -8bsen argument set to TRUE.

    Finite State Grammar File Format

    The finite state grammar file format is quite rudimentary. It can be thought of as an "assembly language" level specification. It is hoped that most, if not all, existing formats can be compiled or pre-processed down to this level. (Note: The file format may change in the future, depending on feedback from users.)

    An FSG file looks as follows:

       FSG_BEGIN [<fsgname>]
       NUM_STATES <#states>
       START_STATE <start-state>
       FINAL_STATE <final-state>
       TRANSITION <from-state> <to-state> <prob> [<word-string>]
       TRANSITION <from-state> <to-state> <prob> [<word-string>]
       ... (any number of state transitions)
       FSG_END
    

    The FSG spec begins with the line containing the keyword FSG_BEGIN. (All preceding lines are comments.) It may have an optional FSG name string. If no name is present, the FSG has the empty string as its name. Following the FSG_BEGIN declaration is the number of states in the FSG, the start state, and the final state, each on a separate line. States are numbered in the range [0 .. <#states>-1]. These are then followed by all the state transitions, in no particular order, and each transition on a separate line. The FSG specification is ended by the FSG_END line. The keywords NUM_STATES, START_STATE, FINAL_STATE, and TRANSITION, can be abbreviated to N, S, F, and T, respectively.

    A transition specifies the source state, the destination state, the prior probability of the transition being taken, and, optionally, a word that is emitted when the transition is taken. If no word emission is specified, it is an epsilon or null transition. If a transition emits a word that is not in the pronunciation lexicon, the transition is simply ignored.

    Note that the decoder automatically adds filler word transitions (such as the silence word) at every state (i.e., from a state to itself; assuming the -fsgusefiller argument is TRUE). This is the principal mechanism for transparently allowing optional silences to occur in between any two words in any sentence allowed by the FSG.

    Comment lines can also be embedded anywhere in the file. Any line that begins with the # character is treated as a comment line. Blank lines are also allowed anywhere in the file.
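
    For illustration, here is a complete FSG accepting the sentences "go forward" and "go backward":

        # Example grammar: go (forward | backward)
        FSG_BEGIN demo
        NUM_STATES 3
        START_STATE 0
        FINAL_STATE 2
        TRANSITION 0 1 1.0 go
        TRANSITION 1 2 0.5 forward
        TRANSITION 1 2 0.5 backward
        FSG_END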

    Phone Transition Probability File Format

    This file, specified by the -phonetpfn argument, is actually a counts file, indicating the frequency with which transitions between any pair of phonemes take place. This information can be easily obtained from some suitable training text. Each line in the file specifies a source phoneme, a destination phoneme, and the associated count for transitions from the former to the latter. Comment lines are allowed, indicated by a # character in the first column.
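
    For illustration, a fragment of such a file might look as follows (the phones and counts are made up):

        # src-phone dst-phone count
        SIL HH 250
        HH AH 117
        AH SIL 92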



    Ravishankar Mosur
    Last modified: Tue May 24 13:12:24 EDT 2005