Technology Overview

1. Introduction

After the Deadline is a comprehensive suite of writing improvement tools meant to alert writers to spelling mistakes, misused words, grammar errors, and poor style. This document covers the software service.

2. Finding Spelling Mistakes

2.1 How it works

Finding a spelling mistake is a three-step process. First, I must decide whether a word is misspelled; I do this by checking whether the word is present in the dictionary. The next step is generating candidate suggestions for the misspelled word. The final step is sorting these candidates so that the intended word is on top. Each of these steps is an opportunity to improve the task of finding mistakes and suggesting corrections.

2.2 Spelling Dictionary

After the Deadline’s spelling dictionary is generated automatically. The dictionary is the intersection of a large list of words and words found when analyzing a corpus of raw data. The dictionary is case-sensitive. The master list is 726,458 words. The current dictionary size is 157,666 words.
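The intersection idea can be sketched in a few lines. The word list, corpus, and frequency threshold below are illustrative assumptions, not AtD’s actual data:

```python
import re
from collections import Counter

def build_dictionary(master_words, corpus_text, min_count=2):
    """Keep only master-list words that also appear in the corpus."""
    counts = Counter(re.findall(r"[A-Za-z']+", corpus_text))
    # Case-sensitive: 'Apple' and 'apple' are distinct entries.
    return {w for w in master_words if counts[w] >= min_count}

master = {"the", "cat", "Cat", "sat", "zyzzyva"}
corpus = "the cat sat. the cat sat again. the Cat did not."
print(sorted(build_dictionary(master, corpus)))  # ['cat', 'sat', 'the']
```

Rare or unobserved master-list words (like "zyzzyva" here) never make it into the working dictionary, which keeps its size manageable.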

2.3 Generating a Suggestion Pool

Once I’ve determined a word is misspelled, the next step is to generate a pool of potential suggestions. The method AtD uses is to take all words within two edits of the misspelled word as described in Peter Norvig’s How to Write a Spelling Corrector essay [1].
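A minimal sketch of this candidate generation, following Norvig’s approach [1]. The toy dictionary is an assumption for illustration:

```python
import string

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, dictionary):
    """Dictionary words within two edits of the misspelling."""
    pool = edits1(word)
    pool |= {e2 for e1 in pool for e2 in edits1(e1)}
    return pool & dictionary

print(candidates("teh", {"the", "tea", "ten", "dog"}))
```

This brute-force version generates every string within two edits and then filters; walking a Trie, as described next, avoids materializing candidates that can’t possibly be words.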

AtD collapses the spell checker dictionary into a Trie and walks this Trie to efficiently generate these suggestions.  Even with this optimization, generating the suggestion pool is still a time-consuming process.  AtD caches misspelled word suggestions to protect against this penalty.  This makes sense as a misspelled word has a high likelihood of occurring again in the same document.
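The Trie itself can be sketched as a dict-of-dicts. The edit-distance-pruning walk and the suggestion cache are omitted here for brevity:

```python
def build_trie(words):
    """Collapse a word list into a dict-of-dicts Trie."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def in_trie(trie, word):
    """Walk the Trie one character at a time to test membership."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = build_trie(["the", "then", "tea"])
print(in_trie(trie, "then"), in_trie(trie, "th"))  # True False
```

Shared prefixes ("the", "then", "tea" share a path through 't') are stored once, and a suggestion walk can abandon an entire subtree as soon as the accumulated edit distance exceeds two.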

2.4 The Spelling Corrector

The spelling corrector’s job is to sort the suggestion pool and arrive with the correct word on top. After the Deadline uses a neural network to score each suggestion against the misused word. The neural network is trained by example and learns how much each factor it’s trained with matters when deciding whether a suggestion is a good fit or not.

The current factors used are:

  • firstLetter – does the first letter of the suggested word match the first letter of the misspelling? 1.0 – yes, 0.0 – no
  • editDistance – how many edits must be made to the misspelling to transform it into the suggested word? This value is normalized to the 0.0-1.0 range.
  • P(suggestion|previous word) – what is the probability of the suggestion occurring given the word that came before it?
  • P(suggestion|next word) – what is the probability of the suggestion given the word that comes after it?

The neural network generates a score between 0.0 and 1.0 for each suggestion and this is the key used to sort the suggestions. The accuracy of this feature is measured by how often the correct word comes out on top.
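The four factors can be sketched as follows. The normalization scheme and the bigram probability tables are assumptions, and AtD feeds these values into a trained neural network rather than combining them directly:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def factors(misspelling, suggestion, prev_bigrams, next_bigrams):
    """Compute the four ranking factors for one suggestion."""
    first_letter = 1.0 if suggestion[:1] == misspelling[:1] else 0.0
    # Normalize edit distance into the 0.0-1.0 range (assumed scheme).
    dist = edit_distance(misspelling, suggestion)
    edit_score = 1.0 - dist / max(len(misspelling), len(suggestion))
    p_prev = prev_bigrams.get(suggestion, 0.0)  # P(suggestion|previous word)
    p_next = next_bigrams.get(suggestion, 0.0)  # P(suggestion|next word)
    return (first_letter, edit_score, p_prev, p_next)

print(factors("teh", "the", {"the": 0.3}, {"the": 0.1}))
```

Each suggestion’s factor tuple goes to the network, and the resulting 0.0-1.0 score sorts the pool.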

3. Detecting Misused Words

3.1 How it Works

Most spell checkers miss real-word spelling errors: errors that result from a writer typing one word when they mean another. After the Deadline addresses these errors using a statistical approach and a fixed list of confusion sets. A confusion set is a set of words that are sometimes mistaken for each other. When a document is checked, the misused word detector checks each word in the confusion set for a statistically better fit.

Similar to the spelling corrector, this feature uses a neural network trained on the following factors:

  • P(word) – how often does the word appear relative to other words?
  • P(word|next word) – what is the probability of the word given the next word?
  • P(word|previous word) – what is the probability of the word given the previous word?
  • P(word|previous word, previous previous word) – what is the probability of the word given the two previous words?

From these four factors, the neural network generates a single score.  The scores for all words in the confusion set are compared.  The current word receives a biased score.  The highest score wins.
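This comparison can be sketched with a hypothetical score function and an illustrative 1.5x bias multiplier in favor of the word the writer actually typed (biasing is covered in section 3.3):

```python
def best_fit(current_word, confusion_set, score, bias=1.5):
    """Return the confusion-set word with the highest score; the current
    word's score is multiplied by a bias so alternatives must clearly win."""
    top_word = current_word
    top_score = score(current_word) * bias
    for word in confusion_set:
        if word != current_word and score(word) > top_score:
            top_word, top_score = word, score(word)
    return top_word

# Hypothetical neural-network scores for one context.
scores = {"their": 0.2, "there": 0.7, "they're": 0.05}
print(best_fit("their", scores.keys(), scores.get))  # there
```

If `best_fit` returns a word other than the current one, AtD flags the current word as misused and offers the winner as the correction.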

3.2 Trigrams

AtD stores trigrams (sequences of three words) that end in a potentially misused word.  I do this to limit the size of the language model.  These trigrams let me calculate P(word|previous word, previous previous word).  This extra context greatly boosted the misused word detection precision and recall.
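A sketch of the trigram estimate follows; the counts are invented for illustration, and smoothing for unseen trigrams is omitted:

```python
from collections import Counter

# Only trigrams ending in a confusion-set word are stored, which is
# what keeps the language model small.
trigram_counts = Counter({("going", "to", "their"): 1,
                          ("going", "to", "there"): 9})
bigram_counts = Counter({("going", "to"): 10})

def p_trigram(word, prev, prev_prev):
    """P(word | previous word, previous previous word) from raw counts."""
    context = bigram_counts[(prev_prev, prev)]
    return trigram_counts[(prev_prev, prev, word)] / context if context else 0.0

print(p_trigram("there", "to", "going"))  # 0.9
```

With the context "going to …", the estimate strongly prefers "there" over "their", which is exactly the extra signal the bigram factors alone can’t provide.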

3.3 A note about biasing

I measure the misused word detector by false positives and false negatives. A false positive is a correctly used word that is flagged as misused. A false negative is a misused word that is not flagged.

I assume writers use a word correctly more often than not. Because of this assumption I bias heavily against false positives in the misused word detector. A candidate word must have a score greater than the current word by some multiplier for After the Deadline to flag it as an error. My target false positive rate is 0.5% or 1 in 200 words.

4. Grammar and Style Checking

4.1 How it works

Grammar and style checking use the same technology, so they’re covered in one section. The grammar and style checkers deal with whole phrases, not just words. The process consists of splitting raw text into sentences and tagging each word with its relevant part-of-speech (adjective, noun, verb, etc.). The grammar and style rules are then applied to this marked-up data. Meta-information, such as a word’s part-of-speech and the beginning and end of sentences, is used by the rules.

Functionally, the grammar and style checker is similar to Daniel Naber’s LanguageTool [2] project, with the exception that AtD uses its statistical language model to filter out suggestions that don’t fit the context of the text they replace.

4.2 Text Segmentation

The text segmentation code uses a rule-based approach to split raw text into paragraphs, sentences, and words.
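A minimal sketch of such a rule-based splitter; real segmentation needs additional rules for abbreviations, quotes, and numbers, which this single-rule version gets wrong:

```python
import re

def split_sentences(text):
    """Split on sentence-final punctuation followed by whitespace
    and a capital letter (one illustrative rule, not AtD's full set)."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

print(split_sentences("It rained. We stayed in! Why not?"))
# ['It rained.', 'We stayed in!', 'Why not?']
```

A production segmenter layers more rules on top of this, for example an abbreviation list so "Dr. Smith" does not trigger a sentence break.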

4.3 Part-of-Speech Tagger

The part-of-speech tagger uses a mixed statistical and rule-based approach for tagging words. If a word is known and has tags associated with it, the tagger tries to find the tag that maximizes the following probability:

P(tag_n | word_n) * P(tag_n | tag_n-1, tag_n-2)

For words that are not known, the last three letters of the word are taken and an alternate model containing tag probabilities based on word endings is consulted. Again, the goal is to maximize this probability. Rules are applied to fix some known cases of incorrect tagging.
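A sketch of the tag-selection step for known words; the probability tables are illustrative assumptions, and the word-ending fallback for unknown words is omitted:

```python
def best_tag(word, prev_tag, prev_prev_tag, p_tag_given_word, p_tag_given_context):
    """Pick the tag maximizing P(tag|word) * P(tag|previous two tags)."""
    candidates = p_tag_given_word[word]  # known word: has candidate tags
    return max(candidates,
               key=lambda t: candidates[t] *
                             p_tag_given_context.get((prev_prev_tag, prev_tag, t), 1e-6))

# Hypothetical tables: "run" is a verb 60% of the time overall, but after
# a determiner and adjective ("the quick run"), the noun reading wins.
lexical = {"run": {"VB": 0.6, "NN": 0.4}}
context = {("DT", "JJ", "NN"): 0.5, ("DT", "JJ", "VB"): 0.05}
print(best_tag("run", "JJ", "DT", lexical, context))  # NN
```

The context term is what lets the tagger override a word’s most common tag when the surrounding tags argue for a different reading.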

4.4 Rule Engine

The grammar and style rules are manually created and tested. There are 27,734 rules at this time. The rule language uses regular expressions and can call pre-defined rule pieces, making it easy to define a single rule that expands into multiple rules.
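A sketch of the expansion idea; the macro table and rule syntax below are invented for illustration and are not AtD’s actual rule language:

```python
# A hypothetical pre-defined rule piece: irregular past-tense forms that
# are wrong after "have" (e.g. "have sang" should be "have sung").
macros = {"&irregular_verb": ["sang", "drank", "swam"]}

def expand(rule):
    """Expand a rule template containing macros into concrete rules."""
    rules = [rule]
    for name, words in macros.items():
        if name in rule:
            rules = [r.replace(name, w) for r in rules for w in words]
    return rules

print(expand("have &irregular_verb"))
# ['have sang', 'have drank', 'have swam']
```

One template line thus stands in for many concrete pattern rules, which is how a hand-maintained rule set can grow to tens of thousands of rules.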

5. References

  1. Norvig, Peter. “How to Write a Spelling Corrector”. Last accessed: 19 Aug 09.
 http://norvig.com/spell-correct.html
  2. Naber, Daniel. “A Rule-Based Style and Grammar Checker”. Bielefeld University, Masters Thesis. Dec 2003. Last accessed: 19 Aug 09.
 http://www.danielnaber.de/languagetool/download/style_and_grammar_checker.pdf