This is a document I have copied from traditional academic formatting; it comes with Harvard-style references to match the institution I was working with at the point of writing. It was produced just after my MSc, in 2004.

Introduction and Background

CAIN script is a special-purpose language -- included in CAIN [see following paragraph] -- which should have a low entry cost. It is designed for making decisions about webpages. Most of the information and documentation for this language is available from Beresford (2004). There are many computer languages mentioned in this document, many of which are not in frequent use. To provide more information, there is a list of language manuals available from Beresford (2004).

First a little background: The University of Portsmouth has an ongoing research project into web accessibility, and how to improve it. The prototype system implementing the current ideas and experiments is called Computer Aided Information Navigation, or CAIN. This is described in Lamas, Jerrams-Smith, and Heathcote (2000) as “increasing the Web's value as a pedagogical tool”. The paper describes CAIN's conceptual extension of systems such as Dublin Core. As this work is ongoing, there are some currently unpublished PhD theses authored by Good and Wong (these are planned to be available in 2005 and 2004 respectively). CAIN script is a tool to support Good's PhD.

As the web gets bigger, and is more heavily used, its “common use” (Ashman and Moore, 2003) and median population will change. Current search tools focus on locating the data the user asked for (Beresford, 2003). A more useful system would return the most comprehensible data on the subject. If everybody favours sites with good usability, the current low standards of website usability (Shirky, 2003) are likely to rise. 'Usable' is a highly subjective term, and needs to be separately defined for each user. As discussed in Tricot and Tricot (n.d.), usability is different from utility. More experienced users are likely to favour utility, as they have adapted to the usability (or lack of it), as extensively described in Garfinkel, Weise, and Strassmann (1994). Following statistical norms, the majority of users do not have the level of experience of advanced users (Paek and Horvitz, n.d.).

The average web user is frequently portrayed as a young male introvert, with a large investment of time and learning experience in computing devices. This was absolutely correct in the 1995 era, but the population is changing towards a representative sample of the North American population (Cimino, 2001). This is not necessarily a strong statement, as most 'social internet use' research appears to be politically motivated. To pick an example: Knight (2000) critiquing a SIQSS (Feb 2000) report. Kirsner (2000) provides a harsher critique, although focused on the interpretation and use of the report. There do not currently appear to be any reliable conclusions to be drawn from this research area. The number of frequent computer users is increasing, and has definitely exceeded the initial population segment. Personal experience includes the following people: elderly people who use the internet because they lack physical mobility; people trying to avoid large mobile phone bills; people who are tied to one physical location. These have a range of perception models (a term that avoids any offensive connotations), and a range of financial and time investment in computers. Few of them want to frequently upgrade their software.

Analysis of requirements

The usability value will need to be assembled combinatorially from a range of sub-factors. A user with dyslexia will prefer a range of conditions, such as no low-intensity backgrounds. Orthogonally, the user may also be short-sighted, and so want large text. The ideal site for this user would have both of these example values. Calculating all of the potential combinations would be neither feasible nor logical (since some combinations are very rare). Since short-sightedness doesn't correlate with dyslexia, they can be computed separately and combined as necessary.

In the real world, people give different functions different priorities (a strongly auditory person will fear loss of vision less than a visually driven person will). The relative weighting of these conditions must be left to the user ('score these on a scale of 1-10'). Conditions biased by the current environment (such as lighting contrasts) may need to be adjusted frequently for best effect.
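The scheme described above, separately computed sub-factors combined under user-supplied 1-10 weightings, can be sketched as a normalised weighted sum. The factor names, weights and page scores below are hypothetical illustrations, not values from the CAIN implementation.

```python
# Sketch of combining independent usability sub-factors into one score.
# Factor names, weights and per-page scores are hypothetical.

def usability_score(page_scores, user_weights):
    """Combine per-factor page scores (0.0-1.0) using user weights (1-10).

    Factors are assumed independent (e.g. dyslexia-friendliness and
    large text do not correlate), so each is scored separately and the
    results combined as a normalised weighted sum.
    """
    total_weight = sum(user_weights.values())
    score = 0.0
    for factor, weight in user_weights.items():
        # Factors the indexer did not score default to neutral (0.5).
        score += weight * page_scores.get(factor, 0.5)
    return score / total_weight

# A short-sighted, dyslexic user: both relevant factors weighted highly.
weights = {"large_text": 9, "high_contrast_background": 8, "short_pages": 3}
page = {"large_text": 1.0, "high_contrast_background": 0.2, "short_pages": 0.5}
print(round(usability_score(page, weights), 3))  # → 0.605
```

Because the combination is a simple weighted sum over independent factors, adding a new perception-model factor never requires recomputing the existing ones.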

Whilst creating a programming language and claiming ease of use for non-programmers may seem contradictory, it appears to be the optimum solution. The detailed specification of what constitutes a 'good web page' is highly complex, and requires a system equivalent to a language. It seems unwise to disregard a system humans have spent about a million years evolving. Healthy adults have considerable linguistic finesse, many in more than one language. All tools that try to hide a language (for example ms-access) still require the same detailed, precise information to be entered. Assuming the user is a reasonable typist, input via a keyboard is much faster, and more comfortable, than a contrived GUI. The typical GUIs used for this purpose frequently become difficult to read in any non-trivial use (Christiansen, 1992).

This range of possible values (and the software axiom about flexible software; see Pree, Beckenkamp, and Viademonte, n.d.) means that a scripting language is probably the most sensible solution. The way perception models interact with webpages is open to a large amount of debate, so having a scripting language helps support changes in opinion. Assuming the software is correct, the psychologists can continually improve their understanding of webpage interpretation without involving the code maintainer. This is better than editing the project source or the binary form, as the scripting language abstracts irrelevant concerns (handling the database responses is not expected to be among the psychologists' competencies), and this enables a faster development time.

As noted, the language must be usable by intelligent non-programmers. Anecdotal evidence suggests that this goal is not really achievable, as non-programmers cannot obtain a screenful of information from a screenful of data. The target is therefore low-use programmers (hereafter LUP): people who have done basic programming, and understand the concepts.

As being able to update the software is strictly a secondary function to using it, more time was invested in making it useful. All attempts at “natural language programming” have failed (Beute, 2003), and one will not be undertaken here.

There are some unlisted objectives, added to make creation and debugging faster:

  • If the solution includes a language -- it should be easy to read;
  • It should link to other languages easily;
  • Each special symbol should only have one meaning (there are a wide range of awkward DWIMC (Anon, n.d.) errors caused by Perl having several meanings for some symbols);
  • Dependency on special symbols should be minimised, as non-programmers find memorising them expensive;
  • As a corollary to the first point, it should have a simple syntax and structure;

A scripting language has the further advantage of coping with web designers' abuses of HTML. This is critical, as the system is dependent on HTML for determining the location of web resources, and the CAIN system has to understand HTML to state its comprehensibility.

While it requires a number of programming skills and concepts, CAIN script should be easy for people who rarely program to use. The syntax design is delineated by a conflict between 1) making everything explicitly defined, so users know what they are doing, and 2) making everything short and easy to type, so they can achieve their goals rather than floundering in the scripting. Truly easy-to-use languages do not exist, especially if the cost of debugging is included. Many less experienced people claim VB is easy to use until the first inexplicable error, just before the first real deadline (for example sophisticatediowan, 09/2003). If the implementation language is assembly, in many cases the length of the codebase is directly proportional to its runtime. In higher-level languages this does not hold, but many people tend to assume it (Jayson, 2002; particularly in Perl, see Wall, Christensen, and Orwant, 2000). Many features that encourage shortcuts have leaked into high-level languages (such as the ternary operator '?:' in C). These shortcuts make the resultant code hard to read, and impossible for LUP. Since this language has been designed for LUP, the shortcuts have been omitted. Another example: with CAIN script it is quite difficult to invoke functions in logical tests. This is common in C, but frequently leads to omitted error handling.

McIver and Conway (1996) describe the failure of normal programming languages for LUP. To summarise: general programming languages, not being designed as teaching tools, make poor teaching tools. There is a range of errors mostly restricted to LUP: misuse of types; inaccurate use of syntax; poor use of scoping (Wook Kim, Park, and Eigenmann, n.d.), (Sajaniemi, n.d.), (Pane and Myers, 1996) or (Cockburn and le Churcher, 1997). These are merely the absence of learned behaviours for the programming languages, rather than cognitive failure. This is eloquently described in Chomsky (1965), although the language has changed since this was printed. Some more recent work in the same field, de Beaugrande (1991) or Pit-Corder (1973), discusses the same problems. People make the same type of error when learning a new human language (if it uses different rules to their own). People have a high accuracy in making rare words plural (or equivalent operations) in their native language, as they have learned behaviours to do this (CDI, 2003). This artificial language -- as computational performance is not a critical goal -- should make it difficult to make errors.

To partially contradict the previous paragraph, a highly explicit semantic structure can be beneficial. A language where line structure is independent of physical layout (as HTML is) is much safer to 'carry around'. Many tools (particularly email clients and word processors) alter the physical layout, which means they cannot be used to transmit or edit a fussier language. Visual Basic and Javascript have both attempted to correlate the logical layout with the physical one. Experience in debugging both of them indicates this looseness adds errors, as the system doesn't totally accept the physical layout. Optionally, both of them have a strict mode, which enforces logical layout notation. An academic response to this idea of logical layout is, unfortunately, LISP, or its younger descendant Scheme. The source for both of these is hand-coded s-expressions (not originally intended for human consumption; Wikipedia, 2004), which in the author's opinion are too abstract and alien for usability.

Perl is a language that attempts to be easy to use, and enforces a 'logical layout' with line terminators and block delimiters. While this is intimidating to absolute novices, it encourages clearer thinking in LUP (Skjerning, 2003), which is probably a better goal. Visual Basic (VB), by contrast, attempted to remove all non-alphanumeric characters, which are perceived as ugly and clumsy (so block delimiters are 'if' and 'end if' rather than '{' and '}'). Since Perl is intended as the implementation language, its style seems a good one to adopt. However, as VB is a language popular with non-programmers, that type of syntax structure is also supported.

Language Design

The task of converting a markup language into a tree structure is well understood, and has been done many times (for example: ICE, n.d.). This is intended to be done by a library, and is a relatively minor feature of this software. The most sensible cache mechanism is a database. Databases scale efficiently (so many users can use CAIN concurrently), and provide structured access to information. As with markup languages, this will be relegated to a library. The user interface (the webpages) has been specified by Good. These are implemented in PHP, and should be available via Beresford (2004).
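To show how routine the markup-to-tree conversion is, here is a minimal sketch using only Python's standard html.parser; this is illustrative, not the library CAIN actually relegates the work to.

```python
# Minimal HTML-to-tree conversion using only the standard library.
# Illustrative sketch; CAIN delegates this work to a library.
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "attrs": {}, "children": []}
        self.stack = [self.root]          # open elements

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:           # tolerate stray close tags
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():                  # ignore inter-tag whitespace
            self.stack[-1]["children"].append(
                {"tag": "#text", "attrs": {}, "children": [], "text": data})

builder = TreeBuilder()
builder.feed("<html><body><p id='x'>hello</p></body></html>")
p = builder.root["children"][0]["children"][0]["children"][0]
print(p["tag"], p["attrs"].get("id"))  # → p x
```

Note that html.parser, like real-world HTML itself, is forgiving: unbalanced tags do not abort the parse, which matches the tolerance the CAIN system needs.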

Since this language is specifically designed for manipulating HTML tree structures, these structures should have an integrated and simple syntax. Perl has made similar concessions for text processing, and Occam for threads and parallelism. Javascript is also used for manipulating tree structures, but access to HTML structures is relegated to extra libraries (DOM). The strict library interfaces are designed to be compatible with all computer languages, and are highly verbose (Hickson and Hyatt, 2004) or (Goossens, 2000).

As part of the text processing theme, Perl has integrated Regular Expressions (REs). These are normally delimited by a '/' character, although other values can be used. To enable guessing of the meaning of difficult sections (such as Javascript source), REs have been included in CAIN script. This is an anomaly, as they are not easy to use, although fairly common. CAIN script uses '~' to indicate tree structures. Another convenience feature is '~#', which indicates HTML ID values. This will make advanced operations easier.

There are two complete syntaxes, which may be used together or separately. Users with a programming background are expected to favour the short, single-character version, similar to Perl. Non-programmers and Visual Basic devotees are provided with a word-based alternative. The word-based version requires that every token is separated by spaces (which word-oriented people are likely to do anyway). The few remaining symbols in the word-based system are grouping brackets and line separators. Many subtle errors have been introduced by incorrect use of precedence. It is hoped to convince everyone to have one operation per line (making precedence errors impossible), so the brackets can be left out. The line separator (note line separator, not line terminator) is currently required. Making the language able to ignore this whitespace alteration is considered a good idea, and more user friendly. Future versions may make the line separator optional.
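As a one-line illustration of the two syntaxes, here is the same assignment in each form. This is extrapolated from a single assignment example observed during testing; it is not a specification of the grammar, and the annotations in brackets are not part of the language.

```
$myname is 'Owen'     (word-based syntax: every token space-separated)
$myname = 'Owen'      (symbol-based syntax: Perl-like)
```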

A number of the features in the syntax are designed to prevent 'and a bit more' coding; that is, code that is repeatedly extended in a poorly planned fashion. This style of development is frequently used by LUP (personal observation), but leads to unreadable code, which doesn't perform as expected. In the current version all of the operators are binary, and stacking them (except for arithmetic) requires round brackets. There is no 'else' or logical continuation structure; this absence is intended to stop LUP losing track of the accumulated logical conditions. The extra typing will hopefully encourage more thought about structure (or at least less careless copying and pasting).

Since the language currently has a small scope and application, there is little need for the features of large software projects. These would reduce the usability for LUP, and increase software development time. To provide better control over load order, the current specification includes named blocks. These are similar to functions, but don't need to return a value, and have no formal parameters. They are literally a scope block with an attached name. The name allows access by other parts of the code. This is similar to 'goto labels' in BASIC or the DOS shell, but better structured. The named blocks are not affected by the presence or absence of comments, unlike the DOS shell. The scope block means the system can de-allocate local variables, which is not possible in the previous two languages. It is suspected that many programmers will dislike the expression 'named blocks'; as they are similar to procedures in Pascal and related languages, 'procedure' can be used as an alternative term. The small scope means there is little need to compile the language. The raw source is normalised into an abstract syntax tree, and later executed. The tree could be converted into an assembly language, and stored, but currently there is little point.
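The semantics of named blocks, parameterless, non-returning scope blocks invoked by name, can be approximated in Python for illustration. The block names and registry mechanism here are hypothetical, not CAIN script's actual implementation.

```python
# Sketch of 'named blocks': scope blocks with an attached name, no
# formal parameters and no return value, invoked by name from other
# code. Block names and the registry are hypothetical.

blocks = {}

def named_block(func):
    """Register a parameterless block under its name."""
    blocks[func.__name__] = func
    return func

results = []

@named_block
def load_weights():
    # Local variables live only for the block's execution, and can be
    # de-allocated afterwards (unlike BASIC/DOS-shell goto labels).
    local_total = 3 + 4
    results.append(("load_weights", local_total))

@named_block
def score_page():
    results.append(("score_page", None))

# Other code controls load order explicitly by invoking blocks by name.
for name in ("load_weights", "score_page"):
    blocks[name]()

print(results)
```

The dictionary lookup makes the load-order control explicit: the caller, not the source layout, decides when each block runs.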

There are no module structures (packages in Java, namespaces in C++/XML). Should they ever seem necessary, a better idea is to write the solution in a general programming language (such as Perl, given the similarity of syntax). The default data type is a string, as most of the use is expected to be tree manipulation. Numerical operations are supported (for weighting values), and co-opt the data into a fixed-precision float type. The use of fractional numbers is required for several statistical operations. The slight loss of precision induced by float storage is not expected to cause problems, and the fixed precision is an attempt to limit the loss of accuracy. Restricting numerical operations to integers has induced problems in other languages (such as VimL), so this would be a weaker alternative. All data is public, and shares a flat namespace. Since the language runs inside an interpreter, modification access control has been implemented: data can be declared read only, write only or neither. A calling stack and scope rules are still required (although alternative implementations may be better; see Tismer, 2004, or Tismer, 2002).
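The string-by-default scheme, with numerical operators co-opting the value into a fixed-precision type, can be sketched with Python's decimal module standing in for CAIN script's fixed-precision float. The choice of nine significant digits is an assumption for illustration.

```python
# Sketch: data defaults to strings; numerical operators co-opt values
# into a fixed-precision type. decimal stands in for CAIN script's
# fixed-precision float; 9 significant digits is an assumed setting.
from decimal import Decimal, getcontext

getcontext().prec = 9        # fixed precision limits accumulated error

def numeric(value):
    """Co-opt a string variable into the fixed-precision number type."""
    return Decimal(value)

weight = "0.3"               # default type: string
score = "7"
weighted = numeric(weight) * numeric(score)
print(weighted)              # → 2.1
```

Unlike binary floats, the decimal representation keeps values such as 0.3 exact, which suits weighting arithmetic where the inputs are human-entered base-ten fractions.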

With the calling stack, exception handling has been implemented. Specially named blocks can be attached after any block; these are only invoked when an error happens. Some errors (notably compilation errors) cannot be caught, as it is likely the exception handler will be broken by the error, or not yet compiled.

The language is nearly a declarative language (it states what should be done, not how to achieve it; the most common example is CSS, used in webpages). Structurally it is equivalent to mod_rewrite rules, or other small-scope declarative languages. The major exception to this is the maths operators. Declarative languages normally avoid arithmetic, but if the goal is a numerical weighting for the webpage, numerical calculation must be performed. It seems a better idea to allow manipulation of named values, rather than extensive use of 'implicit this' (Perez, 2003), which some Perl programmers use. This requires variables, as used in C-type languages.

A formal grammar for CAIN script and some example scripts have been included as optional appendices (from Beresford, 2004). Although this is orthogonal to the current design, the system supplies everything needed for a complex filtering web proxy. The only major change required is improving the writeFile system function, so it can write to local TCP ports. As noted, 'real world' HTML is frequently not used correctly. Part of the HTML interpretation is done using the scripting engine. It could also be used as an HTML cleaning tool (moving the poor HTML produced by tools such as ms-word towards standards compliance).

Implementation Discussion

Parsers can be divided into two general classes: top-down recursive ones, and flat linear token eaters (Ullrich, 2004). Americans formalised this into LL and LALR grammars (Lewis, n.d.), to complement their theories on state machines. Traditionally token eaters are better, as they can operate with smaller 'memory budgets'. Every older language that has intermediate forms is probably a token eater (Clint et al, 1993). There are certain classes of problems that cannot be solved with a one-token-at-a-time approach, so top-down approaches are becoming more popular (Drexel, 2002).
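The top-down recursive class can be shown with a toy recursive-descent parser. The grammar below is a hypothetical fragment, chosen to resemble CAIN script's 'strictly binary operators, brackets required for stacking' rule; it is not the actual CAIN script grammar.

```python
# Toy top-down recursive-descent parser. The grammar (at most one
# binary operator per expression, brackets required for stacking) is a
# hypothetical fragment, not CAIN script's real grammar.
import re

def tokenize(src):
    return re.findall(r"\d+|[()+*]", src)

def parse(tokens):
    """expr := term (('+'|'*') term)?  -- at most one binary operator."""
    left = parse_term(tokens)
    if tokens and tokens[0] in "+*":
        op = tokens.pop(0)
        right = parse_term(tokens)
        return (op, left, right)
    return left

def parse_term(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens)        # recursion handles nesting
        assert tokens.pop(0) == ")", "unbalanced brackets"
        return inner
    return int(tok)

# Stacked operators need brackets, keeping precedence explicit.
tree = parse(tokenize("(1 + 2) * 3"))
print(tree)  # → ('*', ('+', 1, 2), 3)
```

A flat token eater would instead consume the same token list left to right with an explicit state table; the recursive version trades the small memory budget for a much shorter definition.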

Line numbers are the traditional method for feedback about the location of errors. As Java has a relatively simple syntax, it provides useful and fairly accurate error reporting. Perl looks for the end of a structure before interpreting it, so its line numbers can be a little loose. Older C compilers provide very vague data about errors, although newer compilers can be convinced to provide a better level of detail.

For compatibility with standard debugging environments, the early stages of the compilation report errors against the physical line numbers. This is the most sensible approach for missing symbols: since the structure is fatally broken, it cannot be used to describe the error's location. More developed versions of the script interpreter will provide formattable output, so the error messages can be precisely adjusted for the environment (e.g. integration into Vim or Emacs).

After the script structure has been parsed (and is therefore relatively correct), the physical line endings are removed. As the language stands, it is relatively difficult to make a semantic error (compared to Java, Oberon or Pascal). It has no types, and most context conversions are still valid (although they may give useless results). More developed versions will log odd context conversions.

The language automatically creates variables on first mention (as PHP does). As with Perl, it warns on single use of variables (which are frequently spelling errors). The variable scheme is closer to databases than to C. Variables can be declared readonly or writeonly. This seems a more useful set of primitives than the const and cast of C. Although this can't happen in CAIN script (lacking pointers), the difference between a const pointer, a pointer to const data and a const pointer to const data is difficult for many programmers. This induces many subtle errors, frequently when the programmer is attempting to make the code 'be correct or die'. Since the readonly/writeonly operations are reversible, this should lead to more readable code, with fewer sections of code performing 'copy the data with this set of access permissions'. Calling it 'readonly' avoids any connotations about variable life span (although this is more of a problem for the word 'static').
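The variable scheme just described, creation on first mention, single-use warnings, and reversible readonly/writeonly flags, can be sketched as a small table. The class, method and flag names are hypothetical, not taken from the interpreter.

```python
# Sketch of the variable scheme: variables are created on first
# mention, reads/writes respect reversible access flags, and a use
# count supports warnings about single-use (likely misspelt) names.
# Class, method and flag names are hypothetical.

class VariableTable:
    def __init__(self):
        self.values = {}
        self.access = {}    # "readonly", "writeonly" or None
        self.uses = {}

    def write(self, name, value):
        if self.access.get(name) == "readonly":
            raise PermissionError(name + " is readonly")
        self.values[name] = value          # created on first mention
        self.uses[name] = self.uses.get(name, 0) + 1

    def read(self, name):
        if self.access.get(name) == "writeonly":
            raise PermissionError(name + " is writeonly")
        self.uses[name] = self.uses.get(name, 0) + 1
        return self.values.get(name, "")   # default type is the string

    def set_access(self, name, mode):
        self.access[name] = mode           # reversible, unlike C's const

    def single_use_warnings(self):
        return [n for n, c in self.uses.items() if c == 1]

table = VariableTable()
table.write("myname", "Owen")
table.read("myname")
table.write("mynane", "Owen")              # likely a spelling error
table.set_access("myname", "readonly")
print(table.single_use_warnings())         # → ['mynane']
```

Because the flags live in the table rather than in the type system, 'make this readonly from here on' is one reversible operation, with no copy of the data under different permissions.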

The language is not a simulation of any hardware. It omits a large number of features from C which don't seem relevant to the intended use. It has no bitwise operators. Currently all numbers must be expressed in base ten, which is better for LUP. The use of exponents in literal values is permitted, with a lower-case e and an assumed base of ten. This language was not intended for large-scale use, and so there have been a few shortcuts to make designing the language easier. For example, currently all operators are binary; they require two items of data. This simplifies the parser. There are no increment or decrement operators, since these are fundamentally unary.

The determination of structural elements is complete for all HTML. The handling of presentation aspects, either as HTML attributes or as CSS, is currently partially complete. The presentation data is normalised to make it easier to validate (mostly collapsing synonyms to a single name). All of the raw CSS declarations are put into the style attribute of the appropriate HTML elements. The style attribute is then normalised into special attributes for the presentation aspects that were considered important in the research (published as Good, XXX). The attributes considered important by the research are many times fewer than the available range. Extending this translation may be beneficial for future uses.
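The synonym-collapsing step might look like the sketch below. The synonym table and canonical names are hypothetical illustrations, not the set of attributes Good's research identified as important.

```python
# Sketch of normalising presentation data: synonymous names from HTML
# attributes and CSS collapse to one canonical name, making the
# element easier to validate. The synonym table is hypothetical.

SYNONYMS = {
    "bgcolor": "background-color",
    "background": "background-color",
    "color": "text-color",
    "font-size": "text-size",
}

def normalise(declarations):
    """Fold raw presentation declarations into canonical names."""
    out = {}
    for name, value in declarations.items():
        out[SYNONYMS.get(name, name)] = value
    return out

raw = {"bgcolor": "#fff", "font-size": "12pt"}
print(normalise(raw))  # → {'background-color': '#fff', 'text-size': '12pt'}
```

After this pass, a usability script only ever has to test one attribute name per presentation aspect, however the page author originally spelt it.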


To summarise (in chronological order), static webpages are indexed as an asynchronous activity. The indexing records the usability and utility of each webpage, using a set of scripts. The scripts are designed to simulate aspects of user perception models. CAIN depends on other tools (such as Google) for the general search. Before a user can use CAIN, they specify their perception model, and provide a comparative weighting for each of the aspects. With the user perception model, the CAIN system filters the inappropriate webpages.

CAIN script is a success as a means of manipulating HTML documents. Due to the integration of tree references as a first-class data type, scripts are much shorter than equivalents in other languages. As the tree references are less complex to use, there is likely to be a lower occurrence of errors (at the least, there will be fewer spelling errors). The interpreter is designed to be easy to integrate into other systems, so it could be reused as a linked library with little effort.

CAIN script attempts to minimise looping across arrays, in favour of sets. The set collection structures differ a little from mathematical sets, as members are addressed by ordinals. Many of the tree reference operations default to an empty set, which is perceived as a benefit. The language attempts to hide the mechanics of applying something to each item, which should eliminate an entire class of LUP errors. An alternative approach is a 'map' primitive (as available in LISP and Perl), but this has a higher conceptual difficulty. Many of the explanations for this feature in other languages are obscure and not 'easy going' (for example XXX). The default access method for a set as a whole eliminates any undefined values, which prohibits another class of errors.
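The set behaviour described above, ordinal addressing, whole-set operations that hide the per-item loop, undefined values eliminated, empty-set defaults, can be sketched as follows. The class and method names are hypothetical, not CAIN script's own.

```python
# Sketch of CAIN script's set collections: members addressed by
# ordinal, whole-set operations hiding the per-item loop, undefined
# members eliminated, and empty sets as safe defaults. Names are
# hypothetical.

class OrdinalSet:
    def __init__(self, members=()):
        self.members = list(members)

    def __getitem__(self, ordinal):
        return self.members[ordinal]       # addressed by ordinal

    def apply(self, operation):
        """Apply an operation to the whole set, hiding the loop and
        dropping undefined values (two classes of LUP error)."""
        return OrdinalSet(operation(m) for m in self.members
                          if m is not None)

headings = OrdinalSet(["Intro", None, "Design"])
upper = headings.apply(str.upper)
print(upper.members)                       # → ['INTRO', 'DESIGN']

# Tree-reference lookups that find nothing yield an empty set, so
# whole-set operations on them are safe no-ops.
print(OrdinalSet().apply(str.upper).members)  # → []
```

This is the same idea as LISP/Perl 'map', but with the mapping mechanics and the undefined-value filtering folded into the default access method, so the user never writes the loop.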

Some initial and non-representative usage indicates that most of the problems of programming languages are still present in CAIN script. Non-programmers cannot obtain a screenful of information from a screenful of data. The provision of two flavours of operator is currently considered a debatable decision. It was intended that LUP use only one flavour, but they are using both sequentially (i.e. $myname is = 'Owen', rather than $myname is 'Owen'). This is probably a consequence of the rigid semantics, and if so it is not possible to avoid.

A general-purpose programming language (which CAIN script isn't) probably cannot be user friendly. A useful problem definition (or a potential solution) requires very precise terminology and attention to detail. Most people do not apply this, or do not apply the detail in a useful fashion. CAIN script attempts to reduce the 'alienness' of the syntax, but human languages do not have a rigid and linear progression. The attempt may make it easier than other programming languages, but further work is still required.


These references are from 2004, and may be old.
Where these references are to organisations, an abbreviation for the name has been used, to minimise length.
All references to “Good” are for the PhD student I was working with; she hadn't published, as we were still doing the work. This is why the reference has no date.