As this is my own data-space, I would like to add alternative forms of access. For me personally, linear indices have fairly low utility, as I 'chunk' English differently. This is not “cool technology”, but a way to avoid regimented English use, which is opaque to casual or first-time users.
Discussion on usage.
I haven't used tag-clouds in blogs very much. The prototype that I picked up at University of London was just a GUI rendering of a PhD student's thesis. The interesting bits were the analysis and interpretation. I have seen alternative UIs to Google since then, which played on the same theme 1. Zeitgeist summarisation/rendering.
As a list of keywords in the current document, a cloud seems low utility. Size implying weighting allows one to absorb a summary very quickly, but hopefully an abstract and a list of titles would do the same thing.
My intended use is a link to any of the content in the entire site. To clarify, people put content in structures that represent the organisational and philosophical structures of the author(s). The reader is a different person, and may not chunk the data in the same fashion. “I need a thingie” is a common problem, but tag-clouds should invert the knowledge/learning expectations, so the computer tells you what 'thingie' you meant. As ever, more technology makes the correct and precise use of language more important.
My initial sketch solution
- Fill in all the keywords properly
- Extend keywords to take a weighting (as in HTTP language selection)
- Build a cache of all keywords and weightings to resources, in both directions
- Create SPAN elements with font size = weighting (will need to be normalised)
- Let the HTML renderer pack them
- When selecting via tag, add to strapline somehow
- Add a box below the menu
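The first five steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the site's implementation: the `tag;q=0.5` syntax is borrowed from HTTP quality values as one possible way to write a weighting, and the function names, the linear scaling and the 80% to 200% font range are all assumptions.

```python
from collections import defaultdict
from html import escape


def parse_keywords(header):
    """Parse 'perl;q=0.5, cache' into {tag: weight}; a missing q means 1.0."""
    weights = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, param = item.partition(";")
        param = param.strip()
        weight = float(param[2:]) if param.startswith("q=") else 1.0
        weights[tag.strip().lower()] = weight
    return weights


def build_index(resources):
    """Cache the mapping in both directions: resource -> tags and tag -> resources."""
    by_resource = {}
    by_tag = defaultdict(dict)
    for url, header in resources.items():
        by_resource[url] = parse_keywords(header)
        for tag, weight in by_resource[url].items():
            by_tag[tag][url] = weight
    return by_resource, dict(by_tag)


def render_cloud(by_tag, min_pct=80, max_pct=200):
    """Emit one SPAN per tag; summed weights are scaled linearly to a font size."""
    totals = {tag: sum(urls.values()) for tag, urls in by_tag.items()}
    if not totals:
        return ""
    low, high = min(totals.values()), max(totals.values())
    spread = (high - low) or 1.0  # avoid dividing by zero when all totals match
    spans = []
    # Heaviest tags first (ties broken by name), rather than a purely alphabetical list.
    for tag in sorted(totals, key=lambda t: (-totals[t], t)):
        size = min_pct + (totals[tag] - low) / spread * (max_pct - min_pct)
        spans.append('<span style="font-size:%d%%">%s</span>' % (round(size), escape(tag)))
    return " ".join(spans)
```

Packing is left to the HTML renderer, as in the list: the spans are simply joined with spaces and allowed to wrap. The tag-to-resources half of the index is what a click on a tag would look up.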
Data capture discussion
On the basis that the outcome required is a list of links the user may select, it is a better use of screen real estate to stem everything. A word may always be used in the plural form while the tag is singular; this makes cancelling duplicates easier. All spelling is to be dictionary standard.
Text is to be normalised; currently this process is defined as:
- Text chunks are exploded and trimmed down to lists of “just words”,
- stop word filtered,
- spell checked (failures logged, for human analysis),
- and stemmed
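A rough sketch of that pipeline in Python, under several assumptions: the stop-word list is a tiny placeholder, the "dictionary" is just a set of known words passed in by the caller, `naive_stem` is a crude stand-in for a real stemmer, and dropping misspelt words after logging them is a guess at what should happen to failures.

```python
import logging
import re

log = logging.getLogger("tagcloud.normalise")

# Placeholder list; a real stop-word list would be much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "for"})


def naive_stem(word):
    """Very rough singularisation, standing in for a proper stemming algorithm."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalise(text, dictionary):
    """Explode text to words, filter stop words, spell check, stem, drop duplicates."""
    # 1. Explode and trim down to "just words".
    words = re.findall(r"[a-z]+", text.lower())
    # 2. Stop-word filter.
    words = [w for w in words if w not in STOP_WORDS]
    # 3. Spell check; failures are logged for human analysis and skipped.
    checked = []
    for word in words:
        if word in dictionary:
            checked.append(word)
        else:
            log.warning("unknown word: %s", word)
    # 4. Stem, keeping first-seen order and cancelling duplicates.
    tags = []
    for word in checked:
        stem = naive_stem(word)
        if stem not in tags:
            tags.append(stem)
    return tags
```

Because stemming happens last, "caches" and "cache" collapse into the single tag "cache", which is the duplicate cancelling described above.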
Sources for tags
- The metadata currently supports a list of keywords, to be used in the HTML headers. This is a convenient point for non-explicit concepts, for example when the resource's subject isn't listed in the titles and doesn't occur frequently in the content.
- This list has no weighting attached to it, so the format will need to be extended to include this data. This change is tagged for development in format2.
- Extract resource titles, and process them. I will need to perform this computation “out of line”; there is no need to install textual analysis tools on the live host, as I am the only author. This source will be of little value for documents which resemble CPAN project pages. The text will be normalised and given “contextual interpretation”; that is, a title of “References” shouldn't become a tag for the resource. This stage would need to be done with WordNet or similar analysis tools.
- Backlinks as a concept are not proprietary to Google, and are highly effective. As a second “out of line” process, spider the site, caching normalised visible text to describe resources.
- Frequent noun extraction. This is sensitive: with my writing style, I don't know if this list would be valuable. Words would need to be classified using WordNet. All the normalised words over a certain frequency could be used.
Applying the above should build a list of tags for each document. The importance of each tag can be assigned from how many of the above factors include it, how many other resources include it, and a hard-coded weighting spread, similar to Google's semantic extraction.
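One way that scoring could look, as a sketch only. The source names, the numbers in `SOURCE_WEIGHTS` and `SHARED_BONUS` are invented for illustration. Treating a tag shared with other resources as a boost is also an assumption; discounting common tags instead would be an equally plausible reading.

```python
from collections import Counter

# Hypothetical weighting spread: explicit keywords count most, frequent nouns least.
SOURCE_WEIGHTS = {"keywords": 1.0, "titles": 0.8, "backlinks": 0.6, "nouns": 0.3}
SHARED_BONUS = 0.1  # hypothetical boost per other resource carrying the same tag


def score_tags(tags_by_source):
    """Map {url: {source: set_of_tags}} to {url: {tag: score}}."""
    # How many resources mention each tag, through any source.
    resource_count = Counter()
    for sources in tags_by_source.values():
        resource_count.update(set().union(*sources.values()))

    scores = {}
    for url, sources in tags_by_source.items():
        per_tag = {}
        # Each source that includes the tag adds its hard-coded weight.
        for source, tags in sources.items():
            for tag in tags:
                per_tag[tag] = per_tag.get(tag, 0.0) + SOURCE_WEIGHTS.get(source, 0.0)
        # Tags that also appear on other resources are boosted slightly.
        for tag in per_tag:
            others = resource_count[tag] - 1
            per_tag[tag] *= 1.0 + SHARED_BONUS * others
        scores[url] = per_tag
    return scores
```

The resulting per-resource scores are the kind of weighting that would then be normalised into font sizes.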
I am creating this widget as a study of data systems, and NoSQL style structures. I reference the following caveats:
There is an article written in 2007 recommending tag-clouds, but its Technorati example clearly demonstrates why not to use them.
Personally, alphabetically sorting the tags seems low utility (this type of character-centric mentality, rather than semantics, is what I was attempting to avoid). Humans are not a computation, but a relationship.
The tag cloud on this is nice (but it's implemented in Flash).
Third party tag-cloud libraries.
I want to make good use of the source, but below are examples of some of the libraries I looked at. The libraries seem to focus on rendering rather than information spaces. Making a UL list and styling it is trivial, and not the focus of my attention. That is, a library defining a function “import_tags” with a static PHP array of keywords is valueless for current requirements.