Original by Ravishankar Mosur (Ravi) (rkm@cs.cmu.edu)
Maintained by Ravi and Kevin A. Lenzo (lenzo@cs.cmu.edu)
School of Computer Science
Carnegie Mellon University
Copyright (c) 1997-2005 Carnegie Mellon University.
Sphinx2 consists of a set of libraries that include core speech recognition functions as well as auxiliary ones such as low-level audio capture. The libraries are written in C and have been compiled on several Unix platforms (Linux, DEC Alpha, Sun Sparc, HPs) and Pentium/PentiumPro PCs running WindowsXP, WindowsNT or Windows95. A number of demo applications based on this recognition engine are also provided.
Several features specifically intended for developing real applications have been included in Sphinx2. For example, many aspects of the decoder can be reconfigured at run time. New language models can be loaded or switched dynamically. Similarly, new words and pronunciations can be added. The audio input data can be automatically logged to files for any future analysis.
The rest of this document is structured as follows:
The following is a very brief description of each model. (For those unfamiliar with these topics or with Sphinx, a somewhat longer description, and the associated Sphinx terminology, is in the HTML documentation for the Sphinx3 decoder, available from CMU Sphinx at SourceForge.)
The dictionary can be modified at run time. New words and their pronunciations can be added to the lexicon in between utterances. (An utterance is the unit of decoding; see Section The Recognition Engine below.) There is no mechanism for removing a word from the lexicon.
In addition to ordinary words, a set of noise or filler words can be specified by the application, by placing them in a corresponding dictionary (indicated by the -ndictfn argument). Sphinx2 also automatically adds the silence word SIL, with the silence phone as its pronunciation, to the set of filler words. A filler dictionary is not required, but the silence word is always added. The significance of filler words is that they can occur anywhere in the utterance, transparent to the language model (see below).
Finally, Sphinx2 also adds the distinguished begin-sentence and end-sentence symbols, <s> and </s>, to the vocabulary. These words must be present in any N-gram language model (see below).
The -mdeffn argument specifies a Sphinx3-format continuous density model definition file. If present, continuous model decoding is assumed. Otherwise, the decoder looks for semi-continuous acoustic models.
The decoder has to be initialized with one acoustic model. It is not possible to load multiple models and switch between them dynamically at run time.
Note: The current implementation does not include any speed optimizations to the continuous density model evaluation functions. Hence, applications using continuous density models may be restricted to small vocabulary or small model configurations if they are to run in real time.
Semi-continuous acoustic models consist of three sets of files:

- codebook files (.vec and .var files in the directory specified by the -cbdir argument),
- senone weights files (.ccode, .d2code, .p3code, and .xcode files, in the directory specified by the -hmmdir argument), and
- transition matrices (.chmm files, in the directory specified by the -hmmdirlist argument).
Note: the Sphinx acoustic model trainer actually generates Sphinx3 format acoustic models, which consist of four files: means, variances, mixture weights and transition matrices. They have to be converted to Sphinx2 format using utilities provided with the trainer.
The semi-continuous model files generated by the trainer have model parameter values in 32-bit format. Sphinx2 memory requirements can be reduced considerably by combining all the mixture weights files into a single, compressed senone dump file that uses 8-bit parameter values. See Section Building 8-Bit Senone Dump Files for details.
Given the acoustic model and a pronunciation dictionary, there is also associated mapping information that defines the senone mapping for each triphone state encountered in the dictionary. For semi-continuous models, this information is available in .phone and .map files (specified by the -phnfn and -mapfn arguments). These two files are also generated by utilities in the Sphinx trainer package, based on a corresponding Sphinx3-format model definition file.
If the -mdeffn argument is specified, continuous density models are assumed. This argument specifies a Sphinx3-format model definition file, containing triphone-state to senone mapping information. Unlike in the case of semi-continuous models, there is no need to convert this file into .phone and .map files.
However, in the current implementation, the decoder does internally generate .phone and .map files from the model definition file. It writes them to a directory specified by the -kbdumpdir argument. Applications may save the generated files, and specify them as input in subsequent decoder runs.
To summarize: if the -mdeffn argument is specified, the decoder assumes continuous density models. If the -phnfn and -mapfn arguments are not specified, the decoder automatically generates .phone and .map files in the -kbdumpdir directory, and reads them back in. If the -phnfn and -mapfn arguments are specified, on the other hand, the decoder does not generate new .phone and .map files. Instead, it simply reads in the files specified by these arguments, with the assumption that they are compatible with the model definition file.
As for the continuous density acoustic models themselves, the decoder can directly read in the means, variances, and mixture weights files generated by the Sphinx3 trainer, without any format conversion. The transition matrices file does, however, have to be converted to Sphinx2 format (.chmm files).
There are a few restrictions on the structure of continuous density acoustic models that the Sphinx2 decoder can handle. In particular, they must be built with a single feature stream of cepstra, deltas, power, and double-deltas (the 1s_12c_12d_3p_12dd feature type in the Sphinx3 trainer).
The N-gram language model loaded via the -lmfn argument is unnamed; it has the empty string as its name.
The active vocabulary during the decoding of an utterance is the set of words that is present in both the pronunciation dictionary and the currently active LM. The decoder is incapable of recognizing any word outside the active vocabulary.
Any new word added to the dictionary is also automatically added as a unigram to the unnamed N-gram language model. (This is a HACK. There ought to be a mechanism for adding a word to a specified language model. Currently the only way to accomplish this is to delete a currently loaded language model, create a new model with the new word, and load it. This is also the case with FSGs; there is no way to dynamically modify a loaded FSG.)
As mentioned earlier, an N-gram language model must include the begin-sentence and end-sentence symbols, <s> and </s>.

Filler words are transparent to any language model, N-gram or finite state. That is, the decoder can transparently try to insert them anywhere in the utterance, but they don't exist as far as the language model is concerned.
Large N-gram LMs load very slowly. The delay can be avoided by providing a binary dump version of LM files along with the original LMs. The Sphinx2 decoder can automatically create LM dump files for large N-gram LMs, which can be used in subsequent decoder runs. (See Section Building LM Dump Files.)
Note that the current implementation of finite-state grammars, or FSGs, is not the most efficient. In particular, transitions are represented using a full NxN matrix, where N is the number of states. Hence, FSGs containing several thousands of states may run inefficiently.
No LM is required for operation in allphone or forced alignment recognition modes.
The recognition engine processes speech one utterance at a time. (An utterance is all the speech between one uttproc_begin_utt call and the next uttproc_end_utt; see further below.) An utterance can be up to a minute long. In practice, most applications would typically treat a sentence or a phrase as an utterance, which would be much shorter than the maximum of 60 sec.
Details of the Sphinx2 recognition engine (with N-gram models) can be found in Ravishankar's Ph.D. thesis.
As we shall see below, none of the core decoder API functions directly accesses any audio device. Rather, the application is responsible for collecting audio data to be decoded. This gives applications the freedom to decode audio data originating at any source at all---standard audio devices, pre-recorded files, data received from a remote location over a socket connection, etc. Since most applications ultimately need to access common audio devices and to perform some form of silence filtering to detect speech/no-speech conditions, the two additional modules are provided as a convenience.
(NOTE: The APIs often use int32 and int16 types, which are basically 32-bit and 16-bit integer types. Similarly, uint32 and uint16 are the unsigned versions.)
The low-level audio recording API is defined in include/ad.h and summarized below:
ad_open: Opens an audio device for recording. Returns a handle to the opened device. (Currently 8KHz or 16KHz mono, 16-bit PCM only.)

ad_start_rec: Starts recording on the audio device associated with the specified handle.

ad_read: Reads up to a specified number of samples into a given buffer. It returns the number of samples actually read, which may be less than the number requested. In particular, it may return 0 samples if no data is available. Most operating systems have a limited amount of internal buffering (at most a few seconds) for audio devices. Hence, this function must be called frequently enough to avoid buffer overflow.

ad_stop_rec: Stops recording. (However, the system may still have internally buffered data remaining to be read.)

ad_close: Closes the audio device associated with the specified audio handle.
See examples/adrec.c and examples/adpow.c for two examples demonstrating the use of the above functions.
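For concreteness, here is a minimal recording sketch, assuming the ad_* declarations in include/ad.h; error handling is abbreviated, and process_block() is a hypothetical consumer of the captured samples:

```c
#include "ad.h"    /* ad_rec_t, ad_open, ad_read, etc.; also int16/int32 types */

/* Record roughly nsec seconds of audio (a sketch, assuming 16KHz sampling). */
static void record_seconds(int32 nsec)
{
    ad_rec_t *ad;
    int16 buf[4096];
    int32 k, total = 0;

    if ((ad = ad_open()) == NULL)
        return;                          /* could not open the audio device */
    ad_start_rec(ad);                    /* begin capturing */

    while (total < nsec * 16000) {
        k = ad_read(ad, buf, 4096);      /* may return 0 if no data available */
        if (k < 0)
            break;                       /* A/D error */
        /* process_block(buf, k); */     /* hypothetical consumer of k samples */
        total += k;
    }

    ad_stop_rec(ad);                     /* buffered data may still remain */
    ad_close(ad);
}
```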
A similar set of playback functions is provided (currently implemented only on PC/Windows platforms):

ad_open_play: Opens an audio device for playback. Returns a handle to the opened device. (Currently 8KHz or 16KHz mono, 16-bit PCM only.)

ad_start_play: Starts playback on the device associated with the given handle.

ad_write: Sends a buffer of samples for playback. The function may accept fewer than the samples provided, depending on available internal buffers. It returns the number of samples actually accepted. The application must provide data sufficiently rapidly to avoid breaks in playback.

ad_stop_play: Ends playback. Playback continues until all buffered data has been consumed.

ad_close_play: Closes the audio device associated with the specified handle.
A utility function, ad_mu2li, is also provided for converting 8-bit mu-law samples into 16-bit linear PCM samples. See examples/adplay.c for an example that plays back audio samples from a given input file.
The implementation of the audio API for the various platforms is contained in the analog-to-digital library built for the given architecture.
The silence filtering module is interposed between the raw audio input source and the application. The application calls the function cont_ad_read instead of directly reading the raw A/D input source. cont_ad_read returns only those segments of input audio that it determines to be non-silence. Additional timestamp information is provided to inform the application about silence regions that have been dropped.
The complete continuous listening API is defined in include/cont_ad.h and is summarized below:
cont_ad_init: Associates a new continuous listening module instance with a specified raw A/D handle and a corresponding read function pointer. E.g., these may be the handle returned by ad_open and the function ad_read described above.

cont_ad_calib: Calibrates the background silence level by reading the raw audio for a few seconds. It should be done once immediately after cont_ad_init, and after any environmental change.

cont_ad_read: Reads and returns the next available block of non-silence data in a given buffer. (Uses the read function and handle supplied to cont_ad_init to obtain the raw A/D data.) More details are provided below.

cont_ad_reset: Flushes any data buffered inside the module. Useful for discarding accumulated, but unprocessed, speech.

cont_ad_get_params: Returns the current values of a number of parameters that determine the functioning of the silence/speech detection module.

cont_ad_set_params: Sets a number of parameters that determine the functioning of the silence/speech detection module. Useful for fine-tuning its performance.

cont_ad_set_thresh: Useful for adjusting the silence and speech thresholds. (It's preferable to use cont_ad_set_params for this purpose.)

cont_ad_detach: Detaches the specified continuous listening module from its currently associated audio device.

cont_ad_attach: Attaches the specified continuous listening module to the specified audio device. (Similar to cont_ad_init, but without the need to calibrate the audio device. The existing parameter values are used instead of being reset to default values.)

cont_ad_close: Closes the continuous listening module.
A few additional notes on the cont_ad_read function are in order. Operationally, every call to cont_ad_read causes the module to read the associated raw A/D source (as much data as possible and available), scan it for speech (non-silence) segments, and enqueue them internally. It returns the first available segment of speech data, if any. In addition to returning non-silence data, the function also updates a couple of parameters that may be of interest to the application:
- The signal level of the data most recently read, available in the siglvl member variable of the cont_ad_t structure returned by cont_ad_init().
- A timestamp: the total number of samples of raw A/D data consumed by the module, up to the end of the most recent cont_ad_read() call. This is in the read_ts member variable of the cont_ad_t structure.

For example, if the timestamps after two successive calls to cont_ad_read are 100000 and 116000, respectively, the application can determine that 1 sec (16000 samples) of silence has been gobbled up between the two calls.
Silence regions aren't chopped off completely. About 50-100ms worth of silence is preserved at either end of a speech segment and passed on to the application.
Finally, the continuous listener won't concatenate speech segments separated by silence. That is, the data returned by a single call to cont_ad_read will not span raw audio separated by silence that has been gobbled up.
cont_ad_read must be called frequently enough to avoid loss of input data owing to buffer overflow. The application is responsible for turning actual recording on and off, if applicable. In particular, it must ensure that recording is on during calibration and normal operation.
See examples/cont_adseg.c for an example that uses the continuous listening module to segment live audio input into separate utterances. Similarly, examples/cont_fileseg.c segments a given pre-recorded file containing audio data into utterances.
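The typical calibrate-and-read loop might look as follows; this is a sketch assuming the declarations in include/ad.h and include/cont_ad.h, with error handling abbreviated and handle_speech() a hypothetical consumer:

```c
#include "ad.h"
#include "cont_ad.h"

static void listen(void)
{
    ad_rec_t *ad = ad_open();
    cont_ad_t *cont = cont_ad_init(ad, ad_read);  /* wrap the raw A/D source */
    int16 buf[4096];
    int32 k;

    ad_start_rec(ad);       /* recording must be on during calibration */
    cont_ad_calib(cont);    /* measure the background silence level */

    for (;;) {
        k = cont_ad_read(cont, buf, 4096);  /* returns only non-silence data */
        if (k < 0)
            break;                          /* A/D error */
        if (k > 0) {
            /* handle_speech(buf, k); */    /* hypothetical consumer */
            /* cont->read_ts and cont->siglvl updated as described above */
        }
    }

    cont_ad_close(cont);
    ad_stop_rec(ad);
    ad_close(ad);
}
```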
The core decoder API is defined in include/fbs.h; see that file for further details.
The two functions pertaining to initialization and final cleanup are:

fbs_init: Initializes the decoder. The input arguments (in the form of the common command line argument list argc,argv) specify the input databases (acoustic, lexical, and language models) and various other decoder configuration options. (See Arguments Reference.) If batch-mode processing is indicated (see the -ctlfn option below), it happens as part of this initialization.

fbs_end: Cleans up the internals of the decoder, such as printing summaries and closing log files, before the application exits.
Sphinx2 applications can use the following functions to decode speech into text, one utterance at a time:

uttproc_begin_utt: Begins decoding the next utterance. The application can assign an id string to it. If not, one is automatically created and assigned.

uttproc_rawdata: Processes (decodes) the next chunk of raw A/D data in the current utterance. This can be non-blocking, in which case much of the data may be simply queued internally for later processing. Note that only single-channel (mono) 16-bit linear PCM-encoded samples can be processed.

uttproc_cepdata: This is an alternative to uttproc_rawdata if the application wishes to decode cepstrum data instead of raw A/D data.

uttproc_end_utt: Indicates that all the speech data for the current utterance has been provided to the decoder.

uttproc_result: Finishes processing internally queued up data and returns the final recognition result string. It can also be non-blocking, in which case it may return after processing only some of the internally queued up data.

uttproc_result_seg: Like uttproc_result, but returns additional information for each word in the result, such as time segmentation (measured in 10msec frames), acoustic and language model scores, etc. (See structure search_hyp_t in file include/fbs.h.) One can use either this function or uttproc_result to finish decoding, but not both.

uttproc_partial_result: This function can be used to obtain the most up-to-date partial result while utterance decoding is in progress. This may be useful, for example, in providing feedback to the user.

uttproc_partial_result_seg: Like uttproc_partial_result, but returns word segmentation information (measured in 10msec frames) instead of the recognition string.

uttproc_abort_utt: This is an alternative to uttproc_end_utt that terminates the current utterance. No further recognition results can be obtained for it.

search_get_alt: Returns N-best hypotheses for the utterance. Currently, this does not work with finite state grammars. (See further details in include/fbs.h.)
The non-blocking mode of uttproc_rawdata and uttproc_result allows the application to respond to user-interface events in real time.
The application code fragment for decoding one utterance typically looks as follows:
```
uttproc_begin_utt (....);
while (not end of utterance) {    /* indicated externally, somehow */
    read any available A/D data;  /* possibly 0 length */
    uttproc_rawdata (A/D data read above, non-blocking);
}
uttproc_end_utt ();
uttproc_result (...., blocking);
```

See several demo applications in the directory examples/ for some variations.
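A somewhat more concrete version of the fragment above, assuming the uttproc_* declarations in include/fbs.h; get_adc_data() and utt_ended() are hypothetical application-supplied helpers, not part of Sphinx2:

```c
#include <stdio.h>
#include "fbs.h"

/* Hypothetical application-supplied helpers: */
extern int32 utt_ended(void);                       /* end-of-utterance test */
extern int32 get_adc_data(int16 *buf, int32 max);   /* next available A/D block */

static void decode_one_utterance(void)
{
    int16 adbuf[4096];
    int32 nsamp, frm;
    char *hyp;

    uttproc_begin_utt(NULL);                /* NULL => auto-assigned utterance id */
    while (!utt_ended()) {
        nsamp = get_adc_data(adbuf, 4096);  /* possibly 0 length */
        uttproc_rawdata(adbuf, nsamp, 0);   /* 0 => non-blocking */
    }
    uttproc_end_utt();
    uttproc_result(&frm, &hyp, 1);          /* 1 => block until final result */
    printf("Recognized: %s\n", hyp ? hyp : "");
}
```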
Multiple, named LMs can be resident with the decoder module, either read in during initialization, or dynamically at run time. However, exactly one LM must be selected and active for decoding any given utterance. As mentioned earlier, the active vocabulary for each utterance is given by the intersection of the pronunciation dictionary and the currently active LM. The following functions allow the application to control language modelling related aspects of the decoder:
lm_read: Reads in a new N-gram language model from a given file, and associates it with a given string name. The application needs this function only if it needs to create and load LMs dynamically at run time, rather than at initialization via the -lmfn command line argument.

lm_delete: Deletes the N-gram LM with the given string name from the decoder repertory.

uttproc_set_lm: Tells the decoder to switch the active grammar to the N-gram LM with the given string name. Subsequent utterances are decoded with this grammar, until the next uttproc_set_lm or uttproc_set_fsg operation. This function can only be invoked between utterances, not in the midst of one.

uttproc_set_context: Sets a two-word history for the next utterance to be decoded, giving its first words additional context that can be exploited by the LM. (Useful only with N-gram LMs.)

uttproc_load_fsgfile: Loads the given finite-state grammar (FSG) file into the system and returns the string name associated with the FSG. (Unlike for an N-gram LM, the string name is contained in the FSG file.) The application needs this function only if it needs to create and load FSGs dynamically at run time, rather than at initialization via the -fsgfn or -fsgctlfn command line arguments.

uttproc_load_fsg: Similar to uttproc_load_fsgfile, but the input FSG is provided in the form of an s2_fsg_t data structure (see include/fbs.h), instead of a file.

uttproc_set_fsg: Tells the decoder to switch the active grammar to the FSG with the given string name. Subsequent utterances are decoded with this grammar, until the next uttproc_set_fsg or uttproc_set_lm operation. This function can only be invoked between utterances, not in the midst of one.

uttproc_del_fsg: Deletes the FSG with the given string name from the decoder repertory.
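For example, an application might alternate between an N-gram LM and an FSG like this (a sketch; the grammar names "news" and "digits" are hypothetical and must match names assigned when the grammars were loaded):

```c
/* Switch grammars only in between utterances. */
uttproc_set_lm("news");      /* N-gram LM named "news" becomes active */
/* ... decode one or more utterances ... */

uttproc_set_fsg("digits");   /* FSG named "digits" becomes active */
/* ... decode subsequent utterances ... */
```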
The raw input data for each utterance and/or the cepstrum data derived from it can be logged to specified directories:

uttproc_set_rawlogdir: Specifies the directory to which utterance audio data should be logged. An utterance is logged to file <id>.raw, where <id> is the string ID assigned to the utterance by uttproc_begin_utt.

uttproc_set_mfclogdir: Specifies the directory to which utterance cepstrum data should be logged. Like the A/D files above, an utterance is logged to file <id>.mfc.

uttproc_get_uttid: Retrieves the utterance ID string for the current or most recent utterance. Useful for locating the logged A/D data and cepstrum files, for example.
In addition, the decoder configuration includes a number of parameters that can be tuned for a given application to give optimum performance. They are set at initialization time via command-line style arguments (during the fbs_init call). The parameters determine various aspects of the decoder, such as beam pruning thresholds, the relative weights of acoustic and language model scores, etc. They are covered in more detail in Section Arguments Reference below.
Allphone recognition mode is selected by setting the -allphone flag to TRUE. In this mode, no language model should be provided; i.e., the -lmfn, -lmctlfn, -fsgfn and -fsgctlfn arguments should be omitted. A phone transition probability matrix can be specified using the -phonetpfn argument.
In addition, the API for allphone decoding includes a single function that supports recognition from pre-recorded files:
uttproc_allphone_file: Performs allphone recognition on the given file and returns the resulting phone segmentation. The input file can contain either audio data or cepstrum data. The -adcin argument, and any related ones, should be set accordingly. (See arguments reference below.)
Forced alignment mode likewise requires that the -lmfn, -lmctlfn, -fsgfn and -fsgctlfn arguments be omitted. The set of utterances (speech data) is given by the -ctlfn argument, as usual. (See arguments reference below.) In addition, the corresponding transcripts should be given in a parallel file, specified by the -tactlfn argument. The first line of this file should contain just the string *align_all*. This should be followed by the transcripts for all the utterances to be aligned, one line per utterance, in the same order as in the -ctlfn file. The transcripts must not include any utterance ID.
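For illustration, a hypothetical transcript file for a two-utterance control file might look like this (the sentences are illustrative):

```
*align_all*
GO FORWARD TEN METERS
TURN LEFT NINETY DEGREES
```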
Alignments at the word, phone and state levels can be obtained by setting the flags -taword, -taphone, and -tastate individually to TRUE or FALSE. Alignments are written to stdout (the log file).
A number of demo applications are provided in the directory examples:

sphinx2-ptt: demonstrates an application in which the user explicitly indicates the start and end of each utterance using the <RETURN> keyboard key. (On WindowsNT/Windows95 systems, the ending <RETURN> is not used. Instead, the utterance is terminated after a fixed duration.)
sphinx2-continuous: demonstrates the interaction of continuous listening and decoding. An endless audio input stream is automatically segmented into utterances using the continuous listening module, and the utterances are decoded. The timestamps returned by the continuous listening module are used to locate gaps in speech data of at least 1 sec, thus marking the utterance boundaries.
To compile Sphinx2 on Unix platforms:

- sh autogen.sh, if necessary (not necessary if you are using the Sphinx2 release package)
- ./configure (or ./configure --prefix=[path_to_install_dir] if you do not want to install Sphinx2 in the default location)
- make
- make test
- make install
To compile Sphinx2 libraries on Windows platforms:

- Load the workspace file .\sphinx2.dsw into Microsoft Visual C++ 6.0 or better, and build the projects, minimally sphinx2-continuous.
- To run the test, open a command prompt ("Start" -> "Run" and type cmd), cd to the location where you installed Sphinx2 (e.g., c:\sphinx2-0.5), then cd .\win32\batch and run sphinx2-test.bat.

MS Visual Studio will build the executables under .\bin\Release or .\bin\Debug (depending on the setting you choose in MS Visual Studio), and the libraries under .\lib\Release or .\lib\Debug.
In a successful installation, the test produces the recognition result GO FORWARD TEN METERS.
All arguments are passed to the decoder through the function fbs_init(int argc, char *argv[]) defined in include/fbs.h. (Applications built around the Sphinx2 libraries, of course, can have additional arguments.) Many arguments, such as the input model databases, must be specified by the user. We cover the more important ones below; the rest have reasonable default values.
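For instance, an application might initialize and shut down the decoder as follows (a minimal sketch; the model paths are hypothetical):

```c
#include "fbs.h"

int main(void)
{
    char *args[] = {
        "sphinx2-demo",
        "-dictfn",  "/models/mytask.dict",  /* main pronunciation dictionary */
        "-lmfn",    "/models/mytask.arpa",  /* unnamed N-gram LM */
        "-phnfn",   "/models/mytask.phone", /* senone mapping: phone file */
        "-mapfn",   "/models/mytask.map",   /* senone mapping: map file */
        "-cbdir",   "/models/hmm",          /* semi-continuous codebooks */
        "-hmmdir",  "/models/hmm",          /* senone weights files */
        "-hmmdirlist", "/models/hmm",       /* .chmm transition matrices */
    };

    fbs_init(sizeof(args) / sizeof(args[0]), args);
    /* ... decode utterances ... */
    fbs_end();
    return 0;
}
```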
Flag | Description | Default
---|---|---
-lmfn | Optional DARPA format N-gram LM file with the empty string as its name. | None
-lmctlfn | Optional control file with a list of N-gram LM files and associated string names (one line per entry). This is how multiple LMs can be loaded during initialization. | None
-kbdumpdir | Optional directory containing precompiled binary versions of N-gram LM files (see Building LM Dump Files). Also, the directory in which Sphinx2-format map and phone files are created if -mdeffn is specified but -phnfn and -mapfn are omitted. | None
-fsgfn | Optional finite state grammar file. | None
-fsgctlfn | Optional control file with a list of FSG grammar filenames, one line per entry. Blank and comment lines (beginning with a # char) are allowed. This is how multiple FSGs can be loaded during initialization. | None
-fsgusealtpron | Whether the decoder should expand a transition in an FSG to include all the alternative pronunciations of the associated word. | TRUE
-fsgusefiller | Whether the decoder should insert a filler word (self-loop) transition automatically at every state. | TRUE
-dictfn | Main pronunciation dictionary file. | None
-oovdictfn | Optional out-of-vocabulary (OOV) pronunciation dictionary. These words are added to the unnamed LM (read from the -lmfn file) with unigram probability given by -oovugprob. | None
-ndictfn | Optional "noise" words pronunciation dictionary. Noise words are not part of any LM and, like silence, can be inserted transparently anywhere in the utterance. | None
-phnfn, -mapfn | Phone and map files with senone mapping information for the given dictionary and acoustic model. Can be omitted for continuous models (i.e., if -mdeffn is specified). | None
-cbdir | Directory containing Sphinx2-format semi-continuous acoustic model codebook files (.vec and .var files). | None
-hmmdir | Directory containing Sphinx2-format semi-continuous acoustic model senone weights files (.ccode, .d2code, .p3code, and .xcode files). | None
-hmmdirlist | Directory containing Sphinx2-format acoustic model transition matrices files (.chmm files). (For both semi-continuous and continuous density models.) | None
-sendumpfn, -8bsen | Optional 8-bit senone mixture weights file created from the 32-bit mixture weights files (see Building 8-Bit Senone Dump Files). -8bsen should be TRUE if the 8-bit senones are used. | None
-mdeffn | Sphinx3-format model definition file for continuous density models. | None
-meanfn, -varfn, -mixwfn | Sphinx3-format acoustic model files: means, variances and senone mixture weights. (The transition matrices file has to be converted to Sphinx2 format and specified via the -hmmdirlist argument.) | None
-varfloor | Floor for variance values in the -varfn file. Smaller variance values are raised to this floor. | 0.0001
-mixwfloor | Floor for mixture weight values in the -mixwfn file. Smaller values are raised to this floor. | 0.0000001
-phonetpfn | Phone transition probability (actually counts) matrix input file, for use as a "language model" in allphone recognition mode. (Also see associated arguments -ptplw and -uptpwt.) | None
Flag | Description | Default
---|---|---
-ctlfn, -ctloffset, -ctlcount | Batch-mode control file listing utterance files (without their file extension) to decode. -ctloffset is the number of initial utterances in the file to be skipped, and -ctlcount the number to be processed (after the skip, if any). -ctlfn must not be specified for live-mode or application-driven operation. | None, 0, All
-datadir | If the control file (-ctlfn argument) entries are relative pathnames, an optional directory prefix for them may be specified using this argument. | None
-allphone | Should be TRUE to configure the recognition engine for allphone mode operation. | FALSE
-tactlfn | Input transcript file, parallel to the control file (-ctlfn), in forced alignment mode. | None
-adcin, -adcext, -adchdr, -adcendian | In batch mode, -adcin selects A/D (TRUE) or cepstrum input data (FALSE). If TRUE, -adcext is the file extension to be appended to names listed in the -ctlfn argument file, -adchdr the number of bytes of header in each input file, and -adcendian their byte ordering: 0 for big-endian, 1 for little-endian. With these flags, most A/D data file formats can be processed directly. | FALSE, raw, 0, 1
-normmean, -nmprior | Cepstral mean normalization (CMN) option. If -nmprior is FALSE, CMN is computed on the current utterance only (usually batch mode); otherwise it is based on past history (live mode). | TRUE, FALSE
-compress, -compressprior | Silence deletion (within the decoder; not related to continuous listening). If -compressprior is FALSE, it is based on current utterance statistics (batch mode); otherwise on past history (live mode). -compress should be FALSE if continuous listening is used. | FALSE, FALSE
-agcmax, -agcemax | Automatic gain control (AGC) option. In batch mode only -agcmax should be TRUE, and in live mode only -agcemax. | FALSE, FALSE
-live | Forces some live-mode flags: -nmprior, -compressprior, and -agcemax to TRUE if any AGC is on. | FALSE
-samp | Sampling rate; must be 8000 or 16000. | 16000
-fwdflat | Run flat-lexical Viterbi search after the tree-structured pass (for better accuracy). Usually FALSE in live mode. | TRUE
-bestpath | Run global best path search over the Viterbi search word lattice output (for better accuracy). | TRUE
-compallsen | Compute all senones, whether active or inactive, in each frame. | FALSE
-latsize | Number of word lattice entries to be allocated. Longer sentences need larger lattices. | 50000
-fsgbfs | If FALSE, backtrace from the state with the best path score, instead of from the FSG final state. (FSG mode only.) | TRUE
Flag | Description | Default
---|---|---
-top | Number of codewords computed per frame. Usually narrowed to 1 in live mode. | 4
-beam, -npbeam | Main pruning thresholds for tree search. Usually narrowed down to 2e-6 in live mode. | 1e-6, 1e-6
-lpbeam | Additional pruning threshold for transitions to leaf nodes of the lexical tree. Usually narrowed down to 2e-5 in live mode. | 1e-5
-lponlybeam, -nwbeam | Yet more pruning thresholds, for leaf nodes and exits from the lexical tree. Usually narrowed down to 5e-4 in live mode. | 3e-4, 3e-4
-maxwpf | Maximum number of words, ranked according to score, that can be recognized and entered in the Viterbi history in each frame. Essentially an absolute pruning parameter complementing the beam pruning parameter. (Not used in FSG mode.) | Infinity
-maxhmmpf | Absolute pruning threshold. Maximum number of HMMs to keep active in each frame (approx.). Implemented only in FSG mode. | Infinity
-fwdflatbeam, -fwdflatnwbeam | Main and word-exit pruning thresholds for the optional flat-lexical Viterbi search. | 1e-8, 3e-4
-topsenfrm, -topsenthresh | Number of lookahead frames for predicting active base phones. (If <=1, all base phones are assumed to be active in every frame.) -topsenthresh is the log(pruning threshold) applied to raw senone scores to determine the active phones in each frame. | 1, -60000
Flag | Description | Default
---|---|---
-langwt, -fwdflatlw, -rescorelw | Language weights applied during lexical tree Viterbi search, flat-structured Viterbi search, and global word lattice search, respectively. | 6.5, 8.5, 9.5
-ugwt | Unigram weight for interpolating unigram probabilities with the uniform distribution. Typically in the range 0.5-0.8. | 1.0
-inspen, -silpen, -fillpen | Word insertion penalty or probability (for words in the LM), insertion penalty for the silence word, and insertion penalty for noise words (from the -ndictfn file), if any. | 0.65, 0.005, 1e-8
-oovugprob | Unigram probability (logprob) for OOV words from the -oovdictfn file, if any. | -4.5
-ascrscale | Scaling of acoustic scores (continuous density acoustic models only). Raw acoustic scores are scaled down (i.e., shifted down) by this many bits. | 0 (no scaling)
-ptplw | Phone transition language weight applied to the phone transition probability matrix (see the -phonetpfn argument). | 5.0
-uptpwt | Linear interpolation constant for interpolating phone transition probabilities between uniform and those specified by the -phonetpfn argument. Values closer to 1.0 weight the uniform distribution more; values closer to 0.0 weight it less. | 0.001
Flag | Description | Default
---|---|---
-matchfn | Filename to which the final recognition string for each utterance is written. (Old format, word-id at the end.) | None
-matchsegfn | Like -matchfn, but contains word segmentation info: startframe #frames word... (New format, word-id at the beginning.) | None
-partial | If greater than 0, print any available partial hypothesis to stdout every so many frames. | 0
-partialseg | Like -partial, but includes segmentation information for each word in the partial hypothesis. | 0
-reportpron | Causes word pronunciations to be included in output files. | FALSE
-rawlogdir | If specified, logs raw A/D input samples for each utterance to the indicated directory. (One file per utterance, named <uttid>.raw.) | None
-mfclogdir | If specified, logs cepstrum data for each utterance to the indicated directory. (One file per utterance, named <uttid>.mfc.) | None
-dumplatdir | If specified, dumps the word lattice for each utterance to a file in this directory. The filename is created from the utterance ID. | None
-logfn | Filename to which decoder logging information is written. | stdout/stderr
-backtrace | Includes detailed word backtrace information in the log file. | TRUE
-nbest | Number of N-best hypotheses to be produced. Currently, this flag is only useful in batch mode, but an application can always directly invoke search_get_alt to obtain them. Also, the current implementation is lacking in some details (e.g., in returning detailed scores). | 0
-nbestdir | Directory to which N-best files are written (one per utterance). | Current dir.
-phoneconf | Writes a form of phoneme scores for the utterance to the logfile. Mainly for diagnostic purposes. | FALSE
-pscr2lat | Writes a phoneme lattice for the utterance to the logfile, i.e., the top-scoring phonemes per frame. Mainly for diagnostic purposes. | FALSE
-taword, -taphone, -tastate | Whether word, phone, and state alignment output should be produced when running in forced alignment mode. | TRUE, TRUE, FALSE
Finally, one of the arguments can be -argfile filename. This causes additional arguments to be read in from the given file. Lines beginning with the '#' character in this file are ignored. Recursive -argfile specifications are not allowed.
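For example, a hypothetical arguments file might contain:

```
# decoder arguments (lines starting with '#' are ignored)
-dictfn /models/mytask.dict
-samp 16000
-live TRUE
```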
For reference, the full set of arguments is summarized alphabetically below:

Flag | Description
---|---
-8bsen | Use 8-bit senone dump file.
-adcendian | A/D input file byte-ordering.
-adcext | A/D input file extension.
-adchdr | No. bytes of header in A/D input file.
-adcin | Input file contains A/D samples or cepstra (TRUE/FALSE).
-agcemax | Compute AGC (max C0 normalized to 0; estimated, live mode).
-agcmax | Compute AGC (max C0 normalized to 0 based on current utterance).
-argfile | Arguments file.
-ascrscale | Acoustic scores scaling for continuous density models.
-backtrace | Provide detailed backtrace in log file.
-beam | Main pruning beamwidth.
-bestpath | Run global best path algorithm on word lattice.
-cbdir | Directory containing semi-continuous acoustic model codebook files.
-compallsen | Compute all senones.
-compress | Remove silence frames (based on C0 statistics).
-compressprior | Remove silence frames (based on C0 statistics from prior history).
-ctlcount | No. of utterances to decode in batch mode.
-ctlfn | Control file listing utterances to decode in batch mode.
-ctloffset | No. of initial utterances to be skipped from control file.
-datadir | Directory prefix for control file entries.
-dictfn | Main pronunciation dictionary.
-dumplatdir | Directory for dumping word lattices.
-fillpen | Noise word penalty (probability).
-fsgctlfn | Control file listing several FSG files to be loaded at initialization.
-fsgfn | Finite state grammar file to be loaded at initialization time.
-fsgusealtpron | Consider alternative pronunciations for every FSG transition.
-fsgusefiller | Add a filler self-loop transition at every FSG state.
-fwdflat | Run flat-lexical Viterbi search.
-fwdflatbeam | Main beam width for flat search.
-fwdflatlw | Language weight for flat search.
-fwdflatnwbeam | Word-exit beam width for flat search.
-hmmdir | Directory containing semi-continuous senone mixture weights files.
-hmmdirlist | Directory containing semi-continuous HMM transition matrices files.
-inspen | Word insertion penalty (probability).
-kbdumpdir | Directory containing LM dump files.
-langwt | Language weight for lexical tree search.
-latsize | Size of word lattice to be allocated.
-live | Live mode.
-lmctlfn | Control file listing named language model files to be loaded at initialization.
-lmfn | Unnamed language model file to load at initialization.
-logfn | Output log file.
-lpbeam | Transition to last phone beam width.
-lponlybeam | Last phone internal beam width.
-mapfn | Senone mapping file.
-matchfn | Output match file.
-matchsegfn | Output match file with word segmentation.
-maxhmmpf | Max HMMs to keep active per frame (absolute pruning, approx.).
-maxwpf | Max words to be recognized per frame (absolute pruning).
-mdeffn | Model definition file for continuous density acoustic model.
-meanfn | Continuous density acoustic model Gaussian means file.
-mfclogdir | Directory for logging cepstrum data for each utterance.
-mixwfloor | Floor value for continuous density senone mixture weights.
-mixwfn | Continuous density acoustic model senone mixture weights file.
-nbest | No. of N-best hypotheses to be produced/utterance.
-nbestdir | Directory for writing N-best hypotheses files.
-ndictfn | Noise words dictionary.
-nmprior | Cepstral mean normalization based on prior utterances statistics.
-normmean | Cepstral mean normalization.
-npbeam | Next phone beam width for tree search.
-nwbeam | Word-exit beam width for tree search.
-oovdictfn | Out-of-vocabulary words pronunciation dictionary.
-oovugprob | Unigram probability for OOV words.
-partial | Frequency of partial hypothesis reporting.
-partialseg | Frequency of partial hypothesis (with segmentation) reporting.
-phnfn | Phone file (senone mapping information).
-phoneconf | Whether to generate phone segmentation and scores.
-pscr2lat | Whether to output a phone lattice.
-rawlogdir | Directory for logging A/D data for each utterance.
-reportpron | Show actual word pronunciation in output match files.
-rescorelw | Language weight for best path search.
-samp | Input audio sampling rate (16000/8000).
-sendumpfn | (8-bit) Senone dump file.
-silpen | Silence word penalty (probability).
-tactlfn | Forced alignment transcript file.
-taphone | Whether phone-level alignment information should be output.
-tastate | Whether state-level alignment information should be output.
-taword | Whether word-level alignment information should be output.
-top | No. of top codewords to evaluate in each frame.
-topsenfrm | No. of frames to lookahead to determine active base phones.
-topsenthresh | Pruning threshold applied to determine active base phones.
-ugwt | Unigram weight for interpolating unigram probability with uniform probability.
-varfloor | Floor value for continuous density Gaussian variance values.
-varfn | Continuous density acoustic model Gaussian variances file.
Read include/fbs.h carefully, in order to find out more.
The following adjustments can be used to speed up the decoder (possibly at some cost in accuracy):

- Tighten the beam pruning thresholds -beam, -npbeam, -lpbeam, -lponlybeam, and -nwbeam uniformly by a factor >1.
- Use the absolute pruning parameters -maxwpf (N-gram mode) or -maxhmmpf (FSG mode). Statistics on the number of active HMMs and words hypothesized are written to the log file (the -logfn argument), which can be examined to estimate reasonable values for these parameters.
- Reduce -top from 4 to 2 or 1.
- Use phone lookahead, by setting -topsenfrm >1 and adjusting the corresponding pruning beamwidth -topsenthresh. The former can be set, for example, to 4, and the latter between -60000 and -80000. (Threshold values closer to 0 provide tighter pruning.)
- Adjust -compallsen. When -top is 1, it is generally more efficient to compute all senones, but not when -top is 4. However, when using very small vocabularies of just tens of words, it is preferable to compute only the active senones, regardless of the value of -top. (But if -topsenfrm >1, all senones are computed anyway.)
When using continuous density models, the tuning parameters can be quite different from those for semi-continuous models. In particular, the dynamic range of acoustic scores is usually larger; hence the language weight (-langwt argument) probably needs to be increased. All the pruning thresholds (-beam, -npbeam, -nwbeam, -lpbeam, -lponlybeam, -fwdflatbeam and -fwdflatnwbeam) have to be lowered as well.
It is preferable to disable phone lookahead (i.e., to leave -topsenfrm at its default value of 1), since phone lookahead requires the evaluation of all senones every frame, which is expensive for continuous density models.
Finally, to reduce the problem of integer overflow caused by the larger dynamic range of scores, an acoustic scaling parameter (-ascrscale argument) has been added. This parameter specifies the number of bits by which raw acoustic likelihoods are scaled down. For instance, a value of 1 implies that raw acoustic scores are halved, thus compressing the dynamic range. (Unfortunately, this usually requires further tuning of the language-weight and pruning threshold parameters.)
LM dump files can be created either by a standalone program, examples/lm3g2dmp.c, or by the decoder. The standalone version can be compiled from the examples directory. The program takes two arguments: the LM source file and a directory in which the dump file is to be created. It reads the header from the original LM file to determine the size of the LM. It then forms the binary dump file name by appending a .DMP extension to the LM file name. This file is written to the second (directory) argument. (NOTE: The dump file must not already exist!!)
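For example, assuming the compiled program is named lm3g2dmp and given a hypothetical LM file:

```
lm3g2dmp /models/mytask.arpa /models/lmdump
```

This would create the dump file /models/lmdump/mytask.arpa.DMP.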
Any version of the decoder can also automatically create binary dump files similar to the standalone version described above. It first looks for the dump file in the directory given by the -kbdumpdir argument. If the dump file is present, it reads it and ignores the rest of the original LM file. Otherwise, it reads the LM file and creates a dump file in the -kbdumpdir directory so that it can be used in subsequent decoder runs. The decoder does not create dump files for small LMs that have fewer than an internally defined number of bigrams and trigrams.
The semi-continuous senone mixture weights files generated by the trainer contain 32-bit weights. (These are the files loaded via the -hmmdir argument.) However, they can be clustered down to 8 bits for memory efficiency, without loss of recognition accuracy. The clustering is carried out by an offline process, as follows:
1. Run the decoder with the -sendumpfn flag set to a temporary file name, the -8bsen flag set to FALSE, and omitting the -lmfn argument. The decoder can be killed after it creates the 32-bit senone dump file, which happens during the initialization and is announced in the log output.
2. Run /afs/cs/project/plus-2/s2/Sphinx2/bin/alpha/pdf32to8b 32bit-file 8bit-file, where the first argument to pdf32to8b is the temporary 32-bit dump file created above, and the second argument is the 8-bit output file.
3. Finally, supply the 8-bit dump file via the -sendumpfn argument to the decoder, with the -8bsen argument set to TRUE.
An FSG file looks as follows:
```
FSG_BEGIN [<fsgname>]
NUM_STATES <#states>
START_STATE <start-state>
FINAL_STATE <final-state>
TRANSITION <from-state> <to-state> <prob> [<word-string>]
TRANSITION <from-state> <to-state> <prob> [<word-string>]
... (any number of state transitions)
FSG_END
```
The FSG spec begins with the line containing the keyword FSG_BEGIN. (All preceding lines are comments.) It may have an optional FSG name string. If no name is present, the FSG has the empty string as its name. Following the FSG_BEGIN declaration are the number of states in the FSG, the start state, and the final state, each on a separate line. States are numbered in the range [0 .. <#states>-1]. These are then followed by all the state transitions, in no particular order, each transition on a separate line. The FSG specification is ended by the FSG_END line.
The keywords NUM_STATES, START_STATE, FINAL_STATE, and TRANSITION can be abbreviated to N, S, F, and T, respectively.
A transition specifies the source state, the destination state, the prior probability of the transition being taken, and it optionally emits a word. If no word emission is specified, it is an epsilon or null transition. If a transition emits a word that is not in the pronunciation lexicon, the transition is simply ignored.
Note that the decoder automatically adds filler word transitions (such as the silence word) at every state (i.e., from a state to itself), assuming the -fsgusefiller argument is TRUE. This is the principal mechanism for transparently allowing optional silences to occur in between any two words in any sentence allowed by the FSG.
Comment lines can also be embedded anywhere in the file. Any line that begins with the # character is treated as a comment line. Blank lines are also allowed anywhere in the file.
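For illustration, here is a small hypothetical FSG accepting the sentences "go forward" and "go backward":

```
# A 3-state grammar; state 0 is the start state, state 2 the final state.
FSG_BEGIN demo
NUM_STATES 3
START_STATE 0
FINAL_STATE 2
TRANSITION 0 1 1.0 go
TRANSITION 1 2 0.5 forward
TRANSITION 1 2 0.5 backward
FSG_END
```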
The phone transition probability matrix file, indicated by the -phonetpfn argument, is actually a counts file, indicating the frequency with which transitions between any pair of phonemes take place. This information can be easily obtained from some suitable training text. Each line in the file specifies a source phoneme, a destination phoneme, and the associated count for transitions from the former to the latter. Comment lines are allowed, indicated by a # character in the first column.
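For example, a few lines of a hypothetical counts file (the phone names and counts are illustrative):

```
# source-phone destination-phone count
AE B 120
B AE 95
SIL AE 40
```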