AtD Service Projects
Missing Word Detection
AtD doesn’t look for missing words before certain types of nouns. Common missing words are determiners like a, an, and the. These rules will require a special kind of filter in the engine. For certain kinds of words (possibly this rule could be tripped by trigger words), the model will compare
P(word | previous) ** 2 vs.
P(word | inserted word) * P(inserted word | previous).
If the probability with the inserted word is higher, the inserted word is suggested; otherwise nothing is flagged. Examples:
I’m doing as well [as] I can.
I’d like a couple [of] extra hands to help out.
Let me show how to add it to [an] application.
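The comparison above can be sketched with a plain bigram model. This is a minimal Python sketch, not AtD’s actual code (AtD itself is written in Java): the function names, the add-one smoothing, and the candidate list are all assumptions made for illustration.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and bigrams from a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        tokens = ["<s>"] + tokens  # sentence-start marker
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def prob(word, previous, unigrams, bigrams):
    """P(word | previous) with add-one smoothing (an assumption of this sketch)."""
    vocab_size = len(unigrams)
    return (bigrams[(previous, word)] + 1) / (unigrams[previous] + vocab_size)


def suggest_missing_word(previous, word, candidates, unigrams, bigrams):
    """Return the candidate whose insertion beats P(word | previous) ** 2, or None."""
    best = None
    best_score = prob(word, previous, unigrams, bigrams) ** 2
    for inserted in candidates:
        score = (prob(word, inserted, unigrams, bigrams)
                 * prob(inserted, previous, unigrams, bigrams))
        if score > best_score:
            best, best_score = inserted, score
    return best
```

Squaring the baseline keeps both sides of the comparison a product of two probabilities, so the inserted-word path is not penalized just for being one step longer.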
Better Rule Engine Traversal
One of the weaknesses of the AtD rule engine is that it looks for the first match from the current place in the phrase and then calls it a day. An improvement would be to find all matching errors and then let AtD pick the first one that isn’t pruned out by the statistical filters.
When implemented, this feature needs to be aware of the kill filter present in AtD. The kill filter is used to match false positive sentences and prevent further processing of them.
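A rough sketch of the proposed traversal, in Python for brevity. The regex-based rules, the kill_filters list, and the is_pruned callback are hypothetical stand-ins for AtD’s rule engine, kill filter, and statistical filters; the real engine does not necessarily look like this.

```python
import re


def first_surviving_match(sentence, rules, kill_filters, is_pruned):
    """Collect every rule match, then return the earliest one that is not pruned.

    rules        -- list of (name, compiled regex) pairs (hypothetical rule format)
    kill_filters -- compiled regexes marking known false-positive sentences
    is_pruned    -- callable(name, text, sentence) standing in for the
                    statistical filters; returns True to drop a match
    """
    # Kill filter: a matching sentence gets no further processing at all.
    if any(k.search(sentence) for k in kill_filters):
        return None

    # Gather all matches instead of stopping at the first one.
    matches = []
    for name, pattern in rules:
        for m in pattern.finditer(sentence):
            matches.append((m.start(), name, m.group(0)))

    # Walk the matches in document order and keep the first survivor.
    for start, name, text in sorted(matches):
        if not is_pruned(name, text, sentence):
            return (start, name, text)
    return None
```

The point of the design is that a pruned early match no longer hides a valid later one.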
Support for Other Languages (Getting to that first non-English language)
- Verify AtD can read and preserve UTF-16 data files
- Some way to convert between HTML entities and Unicode characters, and back to the correct HTML entities when providing the suggestions
- Need a corpus of raw text data written in the language (Wikipedia and WordPress blogs are good candidates to start with)
- Need a list of misspelled words and their correct spelling to train and measure neural networks.
- Need a spellchecking dictionary for the language (can in theory generate one if enough data is available)
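For the HTML entity item in the list above, one possible round trip is sketched below using Python’s standard html module. The function names are made up for this example. Note that it normalizes entities rather than preserving the original spelling (a numeric reference such as &#233; would come back as &eacute;), which may or may not be acceptable.

```python
import html
from html.entities import codepoint2name


def decode_entities(text):
    """Turn HTML entities into plain Unicode characters for checking."""
    return html.unescape(text)


def encode_entities(text):
    """Turn non-ASCII and markup-significant characters back into entities."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128 and ch not in '&<>"':
            out.append(ch)                      # plain ASCII passes through
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")  # named entity
        else:
            out.append("&#%d;" % cp)            # numeric fallback
    return "".join(out)
```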
Misused Word Detection
- Need a list or criteria for deciding which words are commonly mistaken for each other
Grammar / Style Checking
- Need a list of rules to create (LanguageTool.org is a good start for rules)
- Need to figure out how to do POS tagging for other languages (LanguageTool.org is a good reference on this)
- Will need error explanations localized to the language being checked (this feature can be omitted for non-English languages)
Misused Punctuation Detection
AtD doesn’t look at punctuation, but it could. I haven’t thought about this problem yet.
Automated Data Acquisition, Processing, and Training
One of the things I want to see AtD do (in the near future even) is automatically acquire fresh data from certain sources each week, evaluate it using the statistical quality score, and:
- Decide whether to discard the data based on some empirically set threshold
- Find all words present in the data that are not present in the AtD wordlists (we have this too)
- Depending on some empirically set threshold, add words with a certain relative count to the AtD wordlists (we can do this easily!!)
- Rebuild models and generate reports (a one-step process does all this now), then leave it to me for a manual deploy.
As you can see, the only thing missing to do this project is deciding where to get data from and automatically downloading it on a regular basis. If anyone has ideas about what qualifies as a good data source to visit once a week, let me know. This is an easy win to make AtD better automatically and probably less than one week’s worth of work to do.
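The discard and word-harvesting steps could look roughly like the sketch below. It assumes the quality score is computed elsewhere and passed in as a number; the function name, the tokenizer, and both thresholds are placeholders, not AtD’s actual values or code.

```python
import re
from collections import Counter


def words_to_add(text, quality_score, wordlist, min_quality, min_relative_count):
    """Return unknown words worth adding, or an empty set if the batch is discarded.

    quality_score      -- assumed to come from AtD's statistical quality scoring
    min_quality        -- empirically set threshold for keeping the data
    min_relative_count -- empirically set share of all tokens a word must reach
    """
    if quality_score < min_quality:
        return set()  # discard low-quality data outright

    # Crude tokenizer for illustration: lowercase letters and apostrophes only.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return set()

    counts = Counter(tokens)
    total = len(tokens)
    return {
        word for word, count in counts.items()
        if word not in wordlist and count / total >= min_relative_count
    }
```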
Scrape through user_attributes table…
When users select Ignore Always, the phrase is saved in the user_attributes table. I should harvest this data and see if there is any kind of pattern emerging. It may be possible to use this information to make the system smarter or come up with ideas about what is missing from this data to make the system smarter. Some ideas:
- How can this information benefit the spellchecker? [need to know the error was a misspelling]
- How can this information benefit misused word detection? [need to know the context of the error]
- How can this effort benefit grammar/style checking? [is it possible to mine exceptions to the rules from this information?]
The simplest way to get use out of this data is to make AtD send the whole sentence (not just the ignored part) to the service and feed it to the AtD corpus. This would allow AtD to become smarter, as it would be able to learn from the context of the ignored error and add it to its list of writing that is considered correct. Just a thought. This would require mods to the AtD/WP.com plugin.
Prune the Misused Words List
AtD uses a statistical approach to identify if a better fit for a word exists within its confusion set. I used homophone groups to seed these confusion sets, which has worked well as a start. Now I’m discovering some words do very poorly using this approach. These are:
- interred, interned – use is too similar, statistical approach is just dice rolling
- dough, do – reason TBD
I think it’d be useful to find a way to look for words that are too similar or wrong for the statistical approach and prune them from the misused words list automatically. This would allow me to lower the tolerance a word must pass before getting flagged.
Create Platform Neutral Format for Language Model
Current AtD language model, neural networks, and rules model are serialized Java objects. Should investigate creating a neutral format for these trained models to make it easier to port AtD-runtime to other programming languages.
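As one illustration of what a neutral format might look like, the sketch below writes n-gram counts as JSON. The layout (a "format" tag, a version number, and bigrams stored as [word1, word2, count] triples) is entirely an assumption for this example; it is not how AtD stores its models, and the neural networks and rules would need their own encoding.

```python
import json


def model_to_json(unigrams, bigrams):
    """Serialize n-gram counts to a language-neutral JSON string."""
    payload = {
        "format": "ngram-counts",   # hypothetical format tag
        "version": 1,
        "unigrams": dict(unigrams),
        # JSON object keys must be strings, so bigrams become triples.
        "bigrams": [[w1, w2, n] for (w1, w2), n in sorted(bigrams.items())],
    }
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)


def model_from_json(text):
    """Rebuild the unigram and bigram dictionaries from the JSON string."""
    payload = json.loads(text)
    unigrams = payload["unigrams"]
    bigrams = {(w1, w2): n for w1, w2, n in payload["bigrams"]}
    return unigrams, bigrams
```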
Develop Low-Memory Language Model
Create a low-memory version of the AtD language model. Current version loads everything into memory which restricts the size of the language model.
1. It should be possible to make the language model act like an LRU cache, swapping data to and from the disk as necessary. Prior to joining Automattic, AtD used a similar scheme. It was abandoned for the sake of performance.
2. Another option is to update org.dashnine.preditor.LanguageModel to store a List and a BloomFilter for each part. May be worthwhile to create a MemorySmartHashMap.java that defaults to a normal HashMap implementation for larger data sets and uses a List and a BloomFilter for smaller data sets where O(n) lookup is negligible. See http://jaspell.sourceforge.net/ for a BloomFilter implementation.
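Both options can be sketched briefly. The Python below is illustrative only (the real change would be in Java, inside org.dashnine.preditor.LanguageModel): the class names, the default sizes, and the SHA-256-based hashing are assumptions, and the backing store is any mapping, for example a disk-backed database.

```python
import hashlib
from collections import OrderedDict


class LRUCountCache:
    """Option 1: keep recently used counts in memory, read the rest from a slower store."""

    def __init__(self, backing_store, capacity):
        self.store = backing_store      # any mapping, e.g. a disk-backed database
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)         # mark as most recently used
            return self.cache[key]
        value = self.store.get(key, 0)          # slow path: go to the store
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict the least recently used
        return value


class BloomFilter:
    """Option 2 helper: a compact set that may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

With the second option, the Bloom filter answers "definitely absent" cheaply, so the O(n) scan of the List only runs for keys that are probably present.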
Real-time “as you type” Checking
An open project is to implement as-you-type checking in AtD. AtD analyzes on a sentence-by-sentence basis, so the editor would detect when a user has completed a sentence and send it in for processing. Before proceeding with this, decide what the user experience should be like.
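Detecting a completed sentence could start from something as simple as the sketch below, written in Python for illustration (a real implementation would live in the editor’s JavaScript). The rule used here, terminal punctuation followed by whitespace, is an assumption and will misfire on abbreviations such as "Dr. Smith".

```python
import re

# A sentence counts as complete once ., ! or ? is followed by whitespace.
SENTENCE_END = re.compile(r"[.!?]+\s+")


def split_completed_sentences(buffer):
    """Return (completed sentences, remaining unfinished text)."""
    sentences = []
    start = 0
    for m in SENTENCE_END.finditer(buffer):
        sentences.append(buffer[start:m.end()].strip())
        start = m.end()
    return sentences, buffer[start:]
```

Only the completed sentences would be sent for checking; the remainder stays in the buffer until the user finishes it.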
AtD Core Front-End API
A lot of code is shared between the TinyMCE and jQuery extensions for After the Deadline. So much so that I believe it’s possible to create a core library that just needs to be told how to do certain things in the environment it’s in. With that information alone, it’d be possible to make AtD work seamlessly with other WYSIWYG editors in very little time.
AtD WordPress Plugin Ideas
- Create ATD_* constants to set default AtD options for auto-proofread and which options AtD checks.
- Create a means to summarize AtD errors and present these to the user (possibly as part of the auto-proofread dialog).
- Make AtD spell check post titles using the jQuery extension.