Modify Rules

Prerequisites

This document describes how to modify After the Deadline. You can edit the spell checker dictionary, add words to the misused word detector, and update the grammar rules. This document assumes you have the After the Deadline service installed. For steps that require a language model rebuild, it also assumes you have installed the bootstrap data.

1. Modify the Spell Checker Dictionary

1.1 Overview

After the Deadline generates the spell checker dictionary by counting how many times words in a master word list occur in a corpus of text.  Words from the master word list that occur 2 or more times are accepted into the spell checker dictionary.
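The counting step can be pictured with a small sketch. This is not AtD's build code; the function name, the tokenizer, and the sample data are assumptions made for illustration only.

```python
import re
from collections import Counter

def build_dictionary(master_words, corpus_text, threshold=2):
    """Keep master-list words that occur at least `threshold` times in the corpus."""
    # Simplistic tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", corpus_text.lower())
    counts = Counter(tokens)
    return sorted(w for w in master_words if counts[w.lower()] >= threshold)

master = ["apple", "banana", "cherry"]            # stand-in master word list
corpus = "Apple pie and apple juice. One banana."  # stand-in corpus
dictionary = build_dictionary(master, corpus)
print(dictionary)  # ['apple']
```

Here banana appears only once and cherry never appears, so only apple is accepted at the default threshold of 2.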

1.2 Add a Word

To add words to the spell checker dictionary, create a file (any name will do) in the [atd_directory]/data/wordlists directory and add your words to the file.  Then find some documents that use these words and add them to [atd_directory]/data/corpus_extra.  Then rebuild the AtD models with:

./bin/all.sh

Rebuilding the models is always necessary for this step, because AtD needs context to make smart decisions about how a word is used.  The rebuild retrains all AtD models and regenerates their scores.

1.3 Remove a Word

To remove a word, go to [atd_directory]/data/wordlists and use the grep command to find the file(s) with that word.  Remove the word from these files and rebuild the AtD models.
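If you would rather script the search than use grep, a minimal sketch follows. It assumes the word lists are plain text files with one word per line, which this document does not state; the helper name and the sample files are hypothetical.

```python
import tempfile
from pathlib import Path

def files_containing_word(wordlist_dir, word):
    """Return the names of files in wordlist_dir that have a line equal to word."""
    hits = []
    for path in sorted(Path(wordlist_dir).iterdir()):
        if path.is_file():
            lines = path.read_text(encoding="utf-8").splitlines()
            # Whole-line comparison, so "dogma" does not count as a hit for "dog".
            if any(line.strip() == word for line in lines):
                hits.append(path.name)
    return hits

# Demonstration on a throwaway directory instead of a real data/wordlists.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "animals.txt").write_text("cat\ndog\n", encoding="utf-8")
    Path(tmp, "misc.txt").write_text("dogma\ncat\n", encoding="utf-8")
    found = files_containing_word(tmp, "dog")

print(found)  # ['animals.txt']
```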

If you’re in a hurry, you can edit models/dictionary.txt and remove the word directly.  This change will be lost the next time you rebuild your models, so edit the word lists as well to make it permanent.  If you don’t rebuild the models, make sure you rebuild the edit cache.  The edit cache holds pre-computed suggestions for the most common misspellings; rebuilding it ensures the word you removed no longer appears in those suggestions:

./bin/buildedits.sh

1.4 Change the Threshold

The spell checker dictionary is set to include words that occur 2 or more times.  This value is set in bin/buildmodel.sh near the bottom.  You can regenerate the spell checker dictionary using a new threshold with:

java -Xmx2536M -XX:NewSize=512M -jar lib/sleep.jar utils/bigrams/builddict.sl 2

It’s good to set a higher threshold when possible.  If your dictionary is over 150K words, you can probably increase this threshold.  Raising the threshold increases the quality of AtD’s spelling corrections and drops misspelled words from the spell checker dictionary.  Of course more corpus data is necessary to make this possible.  Try to keep the spell checker dictionary above 100K words.  Too few words and many legitimate spellings will be flagged as wrong.

2. Extend the Misused Word Detector

2.1 Overview

The misused word detector knows about a set of words and the words each one is potentially confused with.  It checks each potentially misused word to see whether one of the words it could be confused with is a better fit. A word and its potential suggestions are called a confusion set.

2.2 Add a Misused Word

Edit data/rules/homophonedb.txt and add a new entry e.g.:

write, right

Then rebuild the AtD models.  This is necessary as AtD needs to generate trigrams (sequences of three words) for all words in the misused word database.   You may also want to add definitions for your words to data/rules/homo/definitions.txt.
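To see why trigram data matters for a confusion set, here is a toy sketch that picks the member of the set whose surrounding trigram is most frequent. The counts are made up and the scoring is deliberately simplistic; AtD's actual statistical model is not described in this document and is likely more involved.

```python
def best_fit(previous, candidates, following, trigram_counts):
    """Pick the confusion-set member whose (previous, word, following) trigram is most frequent."""
    return max(candidates, key=lambda w: trigram_counts.get((previous, w, following), 0))

# Made-up counts, for illustration only.
trigram_counts = {
    ("to", "write", "a"): 40,
    ("to", "right", "a"): 3,
}
confusion_set = ["right", "write"]

suggestion = best_fit("to", confusion_set, "a", trigram_counts)
print(suggestion)  # write
```

Without trigrams for both write and right, a comparison like this has nothing to work with, which is why the rebuild is required after adding an entry.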

It’s worth noting that some words are not good candidates for statistical misused word detection.  Examples include it’s vs. its and to vs. too.  You may want to test your new words against a large body of text before choosing to include them.  To do this, create a rule file for your words:

its::word=its, it's::filter=homophone

Then run the test rules program with your new rule file and a corpus:

./bin/testr.sh its.rules data/wikipedia_sentences.txt

Rules, and how to test them, are explained in section 3.

2.3 Remove a Word from Misused Word Detection

Edit data/rules/homophonedb.txt, find the word, and remove it.  You don’t have to rebuild the language models; rebuilding the AtD rules model is enough for this change to take effect:

./bin/buildrules.sh

3. How to Create Rules

3.1 Rule Syntax

After the Deadline is a rule-based system for finding style and grammar errors. The rules can be as simple as mapping a word to another word (e.g. utilized to used) or more complex with regular expressions and tag patterns. After reading this section you will know how to develop, test, and deploy new rules.

An AtD rule-file can hold comments and rules. Each rule takes up one line in the file. Here is an example rule file:

# infinitive phrases
# http://www.chompchomp.com/terms/infinitivephrase.htm

to is::filter=kill
to .*/VBZ .*/DT|NN::word=\0 \1:base \2::pivots=\1,\1:base
To .*/VBZ .*/DT|NN::word=\0 \1:base \2::pivots=\1,\1:base

Comments begin with a # on their own line. Here I documented the type of rule and a resource to learn more about the rule. A rule consists of a pattern describing a phrase to match followed by multiple declarations separated by ::. The first rule matches the word to followed by is. The declaration filter is set to the word kill. This rule means if AtD finds to is, it should treat it like it found an error but not show anything (in effect, killing other rules that might match this phrase). This mechanism exists to catch false positives for certain rules and stop them from showing errors.
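The line structure described above (a pattern, then declarations separated by ::) can be sketched as a small parser. This is only an illustration of the format, not the parser AtD uses; the function name is hypothetical.

```python
def parse_rule(line):
    """Split a rule line into (pattern, declarations); return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    pattern, *declarations = line.split("::")
    parsed = {}
    for declaration in declarations:
        # Each declaration is key=value; split on the first "=" only.
        key, _, value = declaration.partition("=")
        parsed[key] = value
    return pattern, parsed

kill_rule = parse_rule("to is::filter=kill")
its_rule = parse_rule("its::word=its, it's::filter=homophone")
comment = parse_rule("# infinitive phrases")

print(kill_rule)  # ('to is', {'filter': 'kill'})
print(its_rule)   # ('its', {'word': "its, it's", 'filter': 'homophone'})
print(comment)    # None
```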

Phrase and Word Patterns

The pattern part of the rule deserves special attention. A phrase pattern matches one or more words, in sequence, as specified by word patterns separated from each other by white space. For example, to .*/VBZ .*/DT|NN is the phrase pattern in the second rule. Its word patterns are to, .*/VBZ, and .*/DT|NN, so it matches exactly three words. The rule language has no way to match a variable number of words; you have to write a separate rule for each phrase length you want to cover.

Word patterns can match on the content of a word, its part-of-speech tag, or both. The content of the word and the part-of-speech tag are separated by a /. The word content part of the pattern is a regular expression and so is the tag part. If you don’t know regular expressions well, you’ll be ok. You can use a word or tag as-is and that is valid. .* matches anything and | means OR.

A part-of-speech tag is a label identifying the grammatical role of a word, such as adjective, noun, or verb. The word pattern .*/VBZ matches any word that is tagged as a third person singular verb. The VBZ is a part-of-speech tag. AtD uses the Penn Tagset to tag words. I recommend you print a list of tags from: http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html.

A word pattern can optionally omit the part-of-speech tag. AtD will assume a /.* for you.
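The matching behaviour described in this section can be approximated in a few lines. This sketch assumes each regular expression must match the whole word or tag and that matching is case-sensitive; neither detail is confirmed here, and the function names are hypothetical.

```python
import re

def parse_word_pattern(word_pattern):
    """Split 'word/tag' into two regexes; a missing tag defaults to '.*'."""
    word_re, slash, tag_re = word_pattern.partition("/")
    return word_re, (tag_re if slash else ".*")

def matches_at(phrase_pattern, tagged_words, start):
    """Check whether the phrase pattern matches the tagged words beginning at index start."""
    parts = [parse_word_pattern(p) for p in phrase_pattern.split()]
    window = tagged_words[start:start + len(parts)]
    if len(window) < len(parts):
        return False  # not enough words left in the sentence
    return all(
        re.fullmatch(word_re, word) and re.fullmatch(tag_re, tag)
        for (word_re, tag_re), (word, tag) in zip(parts, window)
    )

sentence = [("He", "PRP"), ("likes", "VBZ"), ("to", "TO"),
            ("walks", "VBZ"), ("the", "DT"), ("dog", "NN")]

hit = matches_at("to .*/VBZ .*/DT|NN", sentence, 2)
miss = matches_at("to .*/VBZ .*/DT|NN", sentence, 0)
print(hit, miss)  # True False
```

The pattern always consumes exactly three words, which is why a phrase of a different length needs its own rule.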

Sometimes you’ll see a word pattern that begins with an &. This tells the rule-parser to call a Sleep function with that name and substitute its results into the rule. This exists as a convenient way to add larger patterns to the rules. Here are some of the functions you may want to know about in utils/rules/rules.sl:

  • &absolutes – nouns that represent absolutes e.g. dead, disappeared, empty, false, etc.
  • &uncountable – uncountable nouns e.g. garbage, luggage, money, etc.
  • &comparisons_base – the base form of comparison words e.g. good, hot, bad, etc.
  • &comparisons – comparison words e.g. better, hotter, colder, etc.
  • &past – matches a past participle verb
  • &irregular_noun_singular – matches several singular form irregular nouns e.g. alumnus
  • &irregular_noun_plural – matches several plural form irregular nouns e.g. alumni
  • &irregular_verb_past – past tense irregular verbs e.g. began
  • &irregular_verb_base – base tense irregular verbs e.g. begin
  • &irregular_verb – a bunch of present past irregular verbs with different past participle forms

The first word pattern in a phrase pattern deserves special attention. It does not use a regular expression for the part-of-speech or word portion. This is because AtD tries to group together rules that trigger on a certain tag or word. This makes checking a document much faster, as only the rules that apply are checked.

Shortcuts do exist for the first word. You may use | in the first word to create copies of the rule for each of the words. This is a syntactical shortcut and does not make the rule any faster. You may use the & shortcuts in the first word.
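The effect of | in the first word can be sketched as a plain text expansion. This is a guess at the behaviour based on the description above, not AtD's implementation: it handles only a first word without a tag part, and the rule used is a hypothetical example.

```python
def expand_first_word(rule):
    """Return one copy of the rule per |-separated alternative in its first word."""
    pattern, sep, declarations = rule.partition("::")
    first, space, rest = pattern.partition(" ")
    return [alternative + space + rest + sep + declarations
            for alternative in first.split("|")]

expanded = expand_first_word("to|To is::filter=kill")
for rule in expanded:
    print(rule)
# to is::filter=kill
# To is::filter=kill
```

Each copy is an ordinary rule with a literal first word, which is consistent with the shortcut being no faster than writing the rules out by hand.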

Specifying Suggestions

After the rule pattern, you have the option of specifying declarations.  A declaration is a key= followed by a value.  Declarations are separated by ::s.   The word declaration describes one or more suggestions to be used when the rule is matched.  If no word declaration is provided, AtD will flag the error but won’t show any suggestions.  The passive voice and hidden verb rules work this way.  A word declaration can reproduce any word from the matched phrase using \0 for the first word, \1 for the second word, and \n for the word at position n+1.  A word declaration can also specify transforms to apply to the words.  The following transforms are available:

  • :base – transforms the first word (assumed to be a verb) into its base verb
  • :nosuffix – attempts to stem a word ending with a -able or -ible suffix.
  • :participle – transforms the first word (assumed to be a verb) into its past participle
  • :past – transforms the first word (assumed to be a verb) into the past tense
  • :plural – transforms the first word (assumed to be singular) into the plural form
  • :positive – transforms the first word (assumed to be a negative) into a positive word
  • :possessive – transforms the first word into the possessive form
  • :present – transforms the first word (assumed to be a verb) into the present tense
  • :singular – transforms the first word (assumed to be plural) into the singular form

You may specify multiple suggestions by separating them with a comma and a space.
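To make the substitution concrete, here is a sketch that expands a word declaration against the matched words. It assumes zero-based references (\0 for the first matched word, consistent with \1 naming the second). The transforms are hypothetical lookup tables standing in for AtD's real ones, which this document does not show.

```python
import re

# Stand-in transforms (assumption): tiny lookup tables, not AtD's actual logic.
TOY_TRANSFORMS = {
    "possessive": {"companies": "company's"},
    "base": {"walks": "walk"},
}

def expand_suggestion(template, matched_words):
    """Replace \\n and \\n:transform references with words from the matched phrase."""
    def substitute(match):
        word = matched_words[int(match.group(1))]
        transform = match.group(2)
        if transform is None:
            return word
        return TOY_TRANSFORMS[transform].get(word, word)
    return re.sub(r"\\(\d+)(?::([a-z]+))?", substitute, template)

def expand_word_declaration(declaration, matched_words):
    """Handle multiple suggestions separated by a comma and a space."""
    return [expand_suggestion(t, matched_words) for t in declaration.split(", ")]

suggestions = expand_word_declaration(r"\0 \1:possessive", ["your", "companies"])
print(suggestions)  # ["your company's"]
```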

The pivots declaration tells AtD which part of the matched phrase (and later the suggestion) to look at.  AtD uses these pivots to calculate the statistical score of the suggestion compared to what was matched.  The \0 through \n references work here, as do the transforms.  The first pivot should describe the part of the matched phrase that is interesting.  The second pivot should describe the first suggestion. The third pivot should describe the second suggestion. All pivots are included in one pivots declaration, separated by commas with no spaces.

Finally, AtD has different types of statistical filters you can use on the suggestions.  You specify these filters with the filter declaration.  The following filters are available:

  • homophone – uses the misused word detector to filter suggestions.  Suggestions and the matched phrase can only be one word. Include the matched phrase as a suggestion too.
  • kill – accepts the rule and avoids showing it as an error.  Use this to match a false positive condition that is more specific than the error you’re trying to catch.  This will cause AtD to ignore the false positive condition.
  • none – performs no statistical filtering

By using the pivots declaration, you’re implicitly using the normal statistical filter.  The pivots will not work with these other filters.

3.2 Developing and Testing Rules

To develop a new rule, you should create a few sentences that demonstrate the error you want to catch.  Put these into a text file (here, errors.txt):

I wonder if this is your companies way of providing support

This error features the plural word companies used instead of the possessive form company’s.  Put as many examples of the error as you can think of into the text file.  Then run the tagit program to learn how AtD sees these errors.

./bin/tagit.sh errors.txt

From the output of this program you will now have an idea of what tags AtD assigns to the words when the error you’re looking for occurs:

I/PRP wonder/VBP if/IN this/DT is/VBZ your/PRP$ companies/NNS way/NN of/IN providing/VBG support/NN
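If you have many example sentences, it can help to scan the tagged output programmatically. The sketch below assumes the word/TAG format shown in the output line above; the helper name is hypothetical.

```python
def parse_tagged(line):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    # rsplit on the last "/" so a word containing "/" keeps its content intact.
    return [tuple(token.rsplit("/", 1)) for token in line.split()]

tagged = parse_tagged(
    "I/PRP wonder/VBP if/IN this/DT is/VBZ your/PRP$ "
    "companies/NNS way/NN of/IN providing/VBG support/NN"
)
plural_nouns = [word for word, tag in tagged if tag == "NNS"]
print(plural_nouns)  # ['companies']
```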

Notice that companies is tagged as a plural noun (NNS).  A good start for a rule is:

your .*/NNS::word=\0 \1:possessive::pivots=\1,\1:possessive

The next step is to save the rule to a rule file (here, test.r) and test it against your example errors:

./bin/testr.sh test.r errors.txt

Which outputs:

Warning: Dictionary loaded: 124314 words
Warning: Looking at: your|companies|way = 3.8133735008675426E-5
Warning: Looking at: your|company’s|way = 2.955364463172345E-4
I wonder if this is your companies way of providing support
I/PRP wonder/VBP if/IN this/DT is/VBZ your/PRP$ companies/NNS way/NN of/IN providing/VBG support/NN
0) [ACCEPT] is, your companies -> @('your company's')
id         => c17ed0984ed4d01ac172f0afd95ee00c
pivots     => \1,\1:possessive
path => @('your', '.*') @('.*', 'NNS')
word       => your \1:possessive

This output says that AtD accepted your company’s as a replacement for your companies.  The warnings on the second and third lines show the statistical comparison between the two pivot words.  When testing a rule you want to make sure the right thing is being looked at.  The pivot phrase is enclosed in two |s.
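When a test run produces a lot of output, you may want to pull the scores out of the "Looking at" lines and compare them yourself. This sketch assumes the line format shown in the sample output; how AtD turns the two scores into an accept decision is not covered in this document, so the ratio here is only a convenience for reading the numbers.

```python
import re

LOOKING_AT = re.compile(r"Looking at: (.+) = (\S+)$")

def parse_scores(output):
    """Map each 'context|pivot|context' phrase to its reported score."""
    scores = {}
    for line in output.splitlines():
        match = LOOKING_AT.search(line)
        if match:
            scores[match.group(1)] = float(match.group(2))
    return scores

sample = """Warning: Dictionary loaded: 124314 words
Warning: Looking at: your|companies|way = 3.8133735008675426E-5
Warning: Looking at: your|company's|way = 2.955364463172345E-4"""

scores = parse_scores(sample)
ratio = scores["your|company's|way"] / scores["your|companies|way"]
print(round(ratio, 2))
```

In this sample the suggestion scores several times higher than the matched text.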

Once the rule works on your examples, you’ll want to try it on a bunch of text to see what kind of errors it finds.  Take any false positives and actual errors and add them to your errors.txt file.  Then repeat this process until you have a rule that doesn’t flag correctly used text and catches the error most of the time.

./bin/testr.sh test.r data/wikipedia_sentences.txt >out.wp
./bin/testr.sh test.r data/gutenberg_sentences.txt >out.gb

3.3 Deploying Rules

If you want to deploy a new set of rules, edit utils/rules/rules.sl and find the right place to load your rules file.  There is a Sleep function loadRules() that makes this easy.  Alternatively, you can edit one of the files in data/rules/ and add your rules there.

Once your rules are in place, rebuild the rules model with:

./bin/buildrules.sh