Context

This is a sketch of a solution. The site currently has absolutely no need of this feature (without user-submitted data, a filter to ensure content is hygienic is naive), and the security-focussed clean-up is more urgent. To achieve reasonable features I intend to use external binaries, which will need managing. I could write everything from scratch, but duplicating source just so it lives in a single process isn't Unix practice or OSS practice.
This is to be used alongside a spellcheck, but isn't a combined feature. Common sense says put the spellcheck as close to the user as possible, whereas word filtering is part of the content processing. It may be useful to run several batches of word filtering to support contextual behaviour (e.g. any mention of a certain football club means the content gets a club-strip branding style-sheet applied to it). In terms of outcomes, much of this is similar to a Bayesian-net spam filter. By volume the use case for this is swear filtering, although that is not why the solution is interesting.

Code like this is why I'm writing iceline.

Goals

  • To create an extendable sub-system that “understands” all nouns in a given text. “Understand” is quoted as this is machine processing; when the target concepts are mentioned, a message is returned to the caller.
  • This subsystem should be written to support any human language, although practicality may constrain deployed versions to en_GB. See the discussion on language structures.
  • To deal with real-world spelling correctly, the way a human does; this is discussed below.
  • Where feasible, create this so it can be ported into other systems.

Example 1

Bhenchod! When that fracking phreak was on the way out he did something clever to the FS. Fecking fsck has crashed, wtf do I do to get my email back?
Word-by-word:
  • Bhenchod is Punjabi (“sisterfucker”), and should be filtered. Catching this will be hard, as all dictionaries will need to be loaded.
  • fracking is a US synonym for 'fucking', stems to 'fuck', and should be filtered.
  • phreak, although it sounds the same, means something completely different, and should be left alone.
  • Fecking is an Irish synonym for 'fucking', stems to 'fuck', and should be filtered.
  • fsck is spelt correctly, if one's dictionary is large enough, and should be preserved. With a small dictionary it is likely to be filtered, which is incorrect.
  • wtf is aliased swearing, but could probably be left in, as it is encoded. The system should understand that it means “what-the-fuck”. Partial matches like this should be flagged depending on options, as they will fail language style guidelines.

Please note the handling of 'feck' and 'fsck' is completely different, even though there is only one letter of difference. If the author was annoyed enough to write bhenchod out in Devanagari, this system probably can't deal with it until we have Devanagari dictionaries. Likewise, spelling words in regional extended ASCII or UTF-8 characters is likely to slide through naive implementations.
In the rest of this document, I continue to mention “dictionaries”. Human text is just an arbitrary sequence of letters attached to a meaning; different groups of people created different letter-cluster -> meaning bindings. None of this is computable. Arbitrary symbol replacements can be coded as a recipe (e.g. try adding a second set of words with every 'e' replaced by '3'), but are also not dynamically or variably computable.
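
As a reminder of how simple that recipe is, here is a minimal sketch; the function name and substitution map are mine, not from any existing library.

    <?php
    // Sketch only: build a second set of dictionary words with common symbol
    // substitutions applied, so "f3ck" style spellings also match.
    function leetVariants(array $words, array $map = array('e' => '3', 'a' => '4', 'i' => '1', 'o' => '0'))
    {
        $variants = array();
        foreach ($words as $word) {
            // strtr() applies every substitution in one pass; as noted above,
            // this is a fixed recipe, not something dynamically computable.
            $variant = strtr($word, $map);
            if ($variant !== $word) {
                $variants[$word] = $variant;
            }
        }
        return $variants;
    }

    // leetVariants(array('feck', 'fracking')) => array('feck' => 'f3ck', 'fracking' => 'fr4ck1ng')
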
Due to the complexity of the requested features, some sort of asynchronous processing may be required.

Requirements

  • The target concepts are to be selected with each usage, and chosen at runtime.
  • This is just to understand and detect the target words; actions as a result are controlled by the caller code. Look at the structure of pspell in PHP for reference (a sketch of the calling structure follows this list).
  • Everything must be locale-aware where possible, e.g. Arabic uses word-separation characters other than ASCII 0x20. Perl-style regexes are quite good for this type of awareness.
  • The feature is to decompose the block into words and “homogenise” them, then feed them into a semantic parser loaded with additional dictionaries. The semantic parser should report what concept each word means (a sketch of this pipeline follows the list).
  • To cleanly report supported and unsupported locales.
  • NICE-TO-HAVE Auto-populating “sp3lling” synonyms to feed the dictionary. As this system is to understand language as it is written, rather than any formal definition, this stage is more important here than in other systems.
  • NICE-TO-HAVE A contextual guesser, to estimate which language a section of text is in, and so which stemming, dictionaries and grammar rules apply. This is hard on short sections.
  • NICE-TO-HAVE An option to tag all items that the feature can't parse, so it is cleanly possible to detect poor dictionaries.
  • NICE-TO-HAVE A utility function to “hash out” words, according to a config setting (sketched after this list).
  • NICE-TO-HAVE A strictness option (like a grammar check does). If you are just running this to capture “things that might lose me contracts”, it's a different usage to “that guy can't write in formal English, I had better do linguistic fascism”. In the first case, preserving authentic voice is important.
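
For the pspell reference above, this is roughly the calling structure I mean. The pspell_* calls are the real PHP extension; the wordfilter_* names are only a guess at what this feature's API could look like.

    <?php
    // Real pspell usage: create a handle for a locale, then query word by word.
    $speller = pspell_new('en');
    if (!pspell_check($speller, 'fsck')) {
        $suggestions = pspell_suggest($speller, 'fsck');
    }

    // A word filter keeping the same shape (hypothetical names only):
    // $filter = wordfilter_new('en_GB', array('swearing', 'football-clubs'));
    // $hits   = wordfilter_check($filter, $text);  // reports matched concepts
    // Any action (masking, styling, rejecting) stays in the caller, as required.
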
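
A minimal sketch of the decompose / homogenise / look-up pipeline. Everything here is an assumption: the concept map stands in for the loaded dictionaries, and the homogenising step only undoes the trivial symbol substitutions.

    <?php
    // Assumed sketch of decompose -> homogenise -> dictionary look-up.
    function findConcepts($text, array $conceptMap)
    {
        // \p{Z} covers Unicode separators beyond ASCII 0x20, \p{P} punctuation,
        // \p{C} control characters; /u makes the regex UTF-8 aware.
        $words = preg_split('/[\p{Z}\p{P}\p{C}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

        $hits = array();
        foreach ($words as $word) {
            // "Homogenise": lower-case and undo the simple symbol substitutions.
            $canon = strtr(mb_strtolower($word, 'UTF-8'),
                           array('3' => 'e', '4' => 'a', '1' => 'i', '0' => 'o'));
            if (isset($conceptMap[$canon])) {
                $hits[$word] = $conceptMap[$canon]; // word => concept, reported to the caller
            }
        }
        return $hits;
    }

    // findConcepts('Fecking fsck has crashed', array('fecking' => 'swearing:fuck'))
    //   => array('Fecking' => 'swearing:fuck')
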
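
The “hash out” nice-to-have is small enough to sketch directly; the helper name and mask default are hypothetical.

    <?php
    // Keep the first letter and mask the rest with a configured character.
    function hashOut($word, $mask = '*')
    {
        $len = mb_strlen($word, 'UTF-8');
        return mb_substr($word, 0, 1, 'UTF-8') . str_repeat($mask, max(0, $len - 1));
    }

    // hashOut('fracking') => 'f*******'
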

Current practice

I mention stemming in this article. An implementation is described here, although it looks like a PHP5 clone edition would be better.
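
As a reminder of the behaviour that implementation needs to provide, a toy stem-then-map step; a real stemmer does far more, and the suffix list and synonym map here are placeholders.

    <?php
    // Toy illustration only: strip a few common suffixes, then resolve
    // regional synonyms ('frack', 'feck') to a single stem.
    function toyStem($word, array $synonyms)
    {
        $stem = preg_replace('/(ing|ed|er|s)$/', '', mb_strtolower($word, 'UTF-8'));
        return isset($synonyms[$stem]) ? $synonyms[$stem] : $stem;
    }

    // toyStem('fracking', array('frack' => 'fuck', 'feck' => 'fuck')) => 'fuck'
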

Some dictionaries

This is what seems to happen in the blogosphere.

Writing test cases for this will be “fun”. I will expand this article, probably as a second resource.


This article contains 18+ language; please don't read it if you are a minor under your state's legislation.
