Overview

Espec (experiment specification) files are general-purpose files for specifying just about anything. Their general usage is documented separately. This document describes one particular use of the espec file: interfacing with a .corpus file to extract needed information. The interface described here uses certain keys and values (i.e., `corpus:', `set:', etc.) -- all lines that don't match these keys are simply ignored.

Many SUMMIT tools that compute quantities across many utterances understand the espec control usage, including:

ctl_from_espec
word_graph_nbest
word_graph_forced_search
ctl_compute_subset
ngram_create
ngram_perplexity
rec_server
rec

Espec files are easier and faster to use than control files, but control files are still accepted and remain useful when a corpus file has not been created. These tools all accept either a control file or an espec file and decide how to load the file based on its extension.

Using an espec file, one can succinctly specify a particular set of data, expressing it in terms of unions, intersections, negations, etc., of other sets or individual data. Sets can be either built into the corpus (e.g., <train>, <dev>, etc.) or user-defined in the espec file using define. Sets can be further constrained based on individual properties of the data -- these are called "constraint sets". Once the set is chosen, a template mechanism is used to fill in and compute a shell-style control line.

Set Designation

The control keywords in an espec file designate two pieces of information. The first is which set of data you wish to compute over, which is described in this section. The second is what to compute across that data.

The espec file recognizes a simple language for working with sets of data. These sets can be those defined in the corpus, or user-defined sets in the espec file. In all cases, the set in the espec file can be overridden by specifying -set on the command line.

To specify a set, there must be a `corpus:' line in the espec file which points to the corpus file you are working with. For example:

corpus: /usr/sls/summit/recognizers/corpora/voyager.corpus

Next, there must be a `set:' line which designates which set you are working with. This set can be expressed as a function of corpus-defined sets, and user-defined sets, using some basic set-algebraic operations:

<A> + <B> is the union of sets A and B
<A> * <B> is the intersection of sets A and B
- <A> is all data not contained in A
<A> - <B> is the intersection of A with the negation of B

Sets can be arbitrarily combined according to these operations, using ()'s for grouping. The operators associate as they do in standard algebra -- left to right, with * taking precedence over + and -. Here are some examples:

<all_data> - <test> - <dev>
(<all_data> - <test>) * <some_set>
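
To make the semantics of these operators concrete, here is a minimal Python sketch that models sets as collections of utterance tags. It is only an illustration, not how SUMMIT evaluates espec files: the tags, the set contents, and the helper function names are all invented, and it assumes that negation is taken relative to all data in the corpus.

```python
# Illustrative model only: corpus sets as Python frozensets of utterance tags.
# All tags and set contents below are made up for the example.
ALL_DATA = frozenset({"utt-1", "utt-2", "utt-3", "utt-4", "utt-5", "utt-6"})
TEST = frozenset({"utt-5"})
DEV = frozenset({"utt-6"})
SOME_SET = frozenset({"utt-1", "utt-5"})


def union(a, b):
    """<A> + <B>"""
    return a | b


def intersection(a, b):
    """<A> * <B>"""
    return a & b


def negation(a, universe=ALL_DATA):
    """- <A>  (assumption: relative to all data in the corpus)"""
    return universe - a


def difference(a, b, universe=ALL_DATA):
    """<A> - <B>, i.e. A intersected with the negation of B"""
    return intersection(a, negation(b, universe))


# <all_data> - <test> - <dev>, evaluated left to right
train_like = difference(difference(ALL_DATA, TEST), DEV)

# (<all_data> - <test>) * <some_set>
filtered = intersection(difference(ALL_DATA, TEST), SOME_SET)

print(sorted(train_like))
print(sorted(filtered))
```

In this toy corpus the first expression leaves every utterance outside the test and dev sets, and the second keeps only the members of some_set that are not test data.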

The sets or individual data referred to in the `set:' line (or equivalently in the command-line `-set' option) must be defined either in the corpus or in the espec file. You can define a new set in the espec file like this:

define <my_set> = <abr> + <mbr> + <chs> - utt-s-abr-14

The right-hand side of this definition can again be any general set expression, and can refer to other defined espec sets or corpus sets. Besides corpus-defined and espec-defined sets, you can also use constraint sets. A constraint set is created on the fly from some constraint on the data. For example, you could create the set of all data whose utterance_duration is greater than 5.0 seconds like this:

<[utterance_duration > 5.0]>

The expression inside the []'s is a general logic expression, and can consist of any number of terms connected with &&'s, ||'s, and !'s, and using ()'s to group terms:

<[gender == m && (type != read || orthography includes boston)]>

Each term in the formula must be of the form `property operator value', where property is a corpus-defined property, value is a string or number value, and the operator is one of the following:

== - true if strings are identical
!= - true if strings are not identical
includes - true if the property has the value as a substring
excludes - true if the property does not have the value as a substring
< - true if property is less than the value
> - true if property is greater than the value
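
The way a constraint selects data can be sketched in Python as a filter over per-utterance properties. This is a hedged illustration rather than the actual espec evaluator: the utterance records and the helper names (term, constraint_set) are hypothetical, the constraints are written directly as Python predicates instead of being parsed, and != is included only because the examples in this document use it.

```python
# Hypothetical utterance records; property names follow the examples in the
# text, but the values are invented for illustration.
UTTERANCES = {
    "utt-1": {"gender": "m", "type": "read",
              "orthography": "flights to boston", "utterance_duration": 3.2},
    "utt-2": {"gender": "m", "type": "read",
              "orthography": "show me hotels", "utterance_duration": 6.1},
    "utt-3": {"gender": "f", "type": "spontaneous",
              "orthography": "boston please", "utterance_duration": 7.4},
    "utt-4": {"gender": "m", "type": "spontaneous",
              "orthography": "where is harvard square", "utterance_duration": 5.5},
}

# One Python function per term operator.
OPERATORS = {
    "==": lambda prop, value: str(prop) == str(value),
    "!=": lambda prop, value: str(prop) != str(value),
    "includes": lambda prop, value: str(value) in str(prop),
    "excludes": lambda prop, value: str(value) not in str(prop),
    "<": lambda prop, value: float(prop) < float(value),
    ">": lambda prop, value: float(prop) > float(value),
}


def term(utt, prop, op, value):
    """Evaluate a single `property operator value' term for one utterance."""
    return OPERATORS[op](utt[prop], value)


def constraint_set(predicate):
    """Return the tags of all utterances for which predicate(utt) is true."""
    return {tag for tag, utt in UTTERANCES.items() if predicate(utt)}


# <[utterance_duration > 5.0]>
long_utts = constraint_set(lambda u: term(u, "utterance_duration", ">", 5.0))

# <[gender == m && (type != read || orthography includes boston)]>
selected = constraint_set(
    lambda u: term(u, "gender", "==", "m")
    and (term(u, "type", "!=", "read")
         or term(u, "orthography", "includes", "boston"))
)

print(sorted(long_utts))
print(sorted(selected))
```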

Sets can be created which combine constraint sets with the standard set algebra:

corpus: voyager.corpus

define <train_read> = <train> * <[type == read]>
define <train_spont> = <train> * <[type == spontaneous]>
define <my_test> = (<dev> + <test> - <abr>) * <[type != read]>

set: (<train_spont> * <[gender == f]>) + <my_test>

The above functionality is sufficient for most needs. However, there are some additional espec set "functions" which can be used to select random subsets of sets, and also to save sets out to a file in the espec format. The first of these functions is trim, and its usage is:

  trim(set, num, random|direct)

Trim takes the specified set, trims away enough elements to leave exactly num elements, and returns the resulting set. It can trim randomly, which yields a different set every time it is used, or directly, which yields the same set every time it is used (i.e., it is deterministic). Trim can be useful, for example, for randomly selecting a small subset of a training or development set to examine in detail with other tools (e.g., sapphire). An example is:

set: trim(<dev>, 25, random)
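
The difference between the two trim modes can be sketched in Python as follows. This is an illustration under stated assumptions, not the SUMMIT implementation: the text does not say which elements "direct" keeps, so the sketch assumes it keeps the first num tags in sorted order, and it assumes a set that already has num or fewer elements is returned unchanged. The tag names are invented.

```python
import random


def trim(data, num, mode, rng=None):
    """Return a subset of `data` with `num` elements (all of them if fewer)."""
    ordered = sorted(data)
    if num >= len(ordered):
        # Assumption: nothing to trim away, so return the set unchanged.
        return set(ordered)
    if mode == "random":
        if rng is None:
            rng = random.Random()
        return set(rng.sample(ordered, num))
    if mode == "direct":
        # Assumption: "direct" keeps the first `num` elements in sorted order.
        return set(ordered[:num])
    raise ValueError(f"unknown trim mode: {mode!r}")


# Made-up development set of 100 utterance tags.
dev = {f"utt-dev-{i:03d}" for i in range(100)}

subset_a = trim(dev, 25, "direct")
subset_b = trim(dev, 25, "direct")   # identical to subset_a
subset_r = trim(dev, 25, "random")   # generally differs from run to run
```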

Trim is useful for choosing a single random subset of data. If, however, you'd like to choose several subsets which are orthogonal with respect to a particular partition of the data (usually speakers), you can use the function rand. Rand is useful for creating speaker-independent training and testing sets, for example. Its usage is this:

  rand(set, partition-name, num)

Rand will choose a subset of the specified set containing at least num elements. It builds the subset by choosing elements of the partition (for example, whole speakers) one at a time until it has num or more elements. The partition name must be the name of a partition-set in the corpus, for example `by_speaker'. This could be used as follows:


  corpus: voyager.corpus
  
  define <train0> = rand(<all_data>, by_speaker, 2500)
  define <test0> = <all_data> - <train0>
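
The selection procedure can be sketched in Python as below. This is a rough model under assumptions, not the actual SUMMIT code: it assumes a partition can be represented as a mapping from a cell name (here a speaker) to the tags in that cell, that whole cells are drawn in random order, and that drawing stops as soon as the running total reaches num. The speaker names and tags are made up.

```python
import random


def rand(data, partition, num, rng=None):
    """Pick whole partition cells at random until at least `num` elements are chosen.

    `partition` maps a cell name (e.g. a speaker) to the set of tags in that cell.
    """
    if rng is None:
        rng = random.Random()
    cells = sorted(partition)   # sort first so a seeded rng is reproducible
    rng.shuffle(cells)
    chosen = set()
    for cell in cells:
        if len(chosen) >= num:
            break
        chosen |= partition[cell] & data
    return chosen


# Made-up partition: 10 speakers with 30 utterances each.
by_speaker = {
    f"spk{s:02d}": {f"utt-spk{s:02d}-{i:02d}" for i in range(30)}
    for s in range(10)
}
all_data = set().union(*by_speaker.values())

train0 = rand(all_data, by_speaker, 100, rng=random.Random(0))
test0 = all_data - train0
```

Because whole speakers are added at a time, the result can overshoot num: with 30 utterances per speaker, a request for 100 elements yields 120. No speaker ends up in both train0 and test0, which is the point of partition-aware selection.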

The final function is save, which will save a set to a file in a format that can later be included in an espec file. Its usage is:

  save(set, file-name, set-name)

It will save the set to the file `file-name', naming it the set `set-name'. This file can then be included in the espec file and those sets used for future computation. This is useful for computing a random set and saving it, so that you can reuse it later.

What to Compute

The second purpose of the espec file is to specify what should be computed across the set of data. This is specified using control lines in the espec file, for example:

  line: -in <apnet>
  line: -ref "<orthography>"
  line: -word_graph_out /t/summit/<speaker>/<tag>.wg.gz

Control lines are lines that start with the keyword "line:". They act simply as templates: any text surrounded by <>'s is taken to refer to an utterance property, unless the < is preceded by a backslash (\).

This template is instantiated for every element in the specified set of data. To translate from an espec file to the equivalent control file, use the tool ctl_from_espec.
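
The template expansion described above can be sketched in Python. This is a minimal illustration, not ctl_from_espec itself: the property values are invented, the function names (expand, control_line) are hypothetical, and the sketch assumes that an escaped \< becomes a literal < and that the expanded `line:' entries are joined with spaces into a single control line.

```python
import re

# Matches either an escaped "\<" or a "<property>" placeholder.
_PLACEHOLDER = re.compile(r"\\<|<([^<>]+)>")


def expand(template, tag, properties):
    """Fill one `line:' template for the datum named `tag`."""
    values = dict(properties, tag=tag)   # <tag> behaves like a property

    def substitute(match):
        if match.group(0) == "\\<":
            return "<"                   # assumption: the backslash is dropped
        return str(values[match.group(1)])

    return _PLACEHOLDER.sub(substitute, template)


def control_line(templates, tag, properties):
    """Join all expanded `line:' templates into one control line (assumed layout)."""
    return " ".join(expand(t, tag, properties) for t in templates)


templates = [
    "-in <apnet>",
    '-ref "<orthography>"',
    "-word_graph_out /t/summit/<speaker>/<tag>.wg.gz",
]

# Invented property values for a single utterance.
props = {
    "apnet": "/data/abr/utt-s-abr-14.apnet",
    "orthography": "show me flights to boston",
    "speaker": "abr",
}

line = control_line(templates, "utt-s-abr-14", props)
print(line)
```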

The special property <tag> is not formally a corpus property, but behaves just like one. It will be expanded according to the name of each datum in the designated set. This variable, coupled with the <speaker> property in the corpus, is useful for composing filenames with the proper directory structure (as in the -word_graph_out key above). Any key which ends in out will be interpreted as a file-based output, and any tools loading a control file will make the appropriate directories as necessary.