The Relevance of Language Engineering to Public Policy

Talk given at the University of Sunderland

Date: 29/10/2002
Venue: University of Sunderland, UK


We may never be able to make an accurate assessment of the learning resources available to the ancient world; following the desecration and then destruction of the Library of Alexandria - a subject now clogged with partisan polemics masquerading as history - many great works only survive through indirect, tantalising references to them and there must have been many others, references to which have not survived. We do, however, know that up until the end of the 17th Century it was perfectly possible for a figure such as John Milton, statesman and poet, to have read every book his culture thought significant.

It is precisely at this period, of Milton, Newton, Wren and Wilkins in England and of Descartes, Spinoza and Leibniz in Continental Europe, that the first real signs of the impossibility of polymathy emerge. The proliferation of texts arose from a variety of factors: from the technological possibilities of printing; from a gradual expansion and ingestion of 'classical' and Arabic learning into the European framework; from the Reformation's impact on learning through the advancement of theories of personal development and the advance of vernacular language use; and from the increased technocratic requirement of the nation state met through bourgeois education, so well exemplified in England by Shakespeare.

The new technology and the proliferation each had an effect on cultural attitudes to literary resources. First, the way of writing a text changed. If you look at St. Augustine's CITY OF GOD you can immediately see that the speed of writing was such - even though he was regarded in his day as prolific - that his great work was not so much a treatise with a proposition, a discussion and a conclusion as an intellectual journey; he began with a set of assumptions which didn't change very much but the conclusions he drew from them, particularly after the sack of Rome in 410, changed radically through time. There was no leisure for a second edition. By Shakespeare’s time, with printing some 200 years old, we see a man who is terribly careless with his manuscripts, pirate copies of plays and, within years of his death, a number of editions of his work put together more or less professionally and honestly with an attempt to arrive at an authoritative text from a variety of sources. It was no doubt a subconscious trait in authors and printers to work in the knowledge that any edition need not be the last word, intellectually or typographically. Our contemporary word processing packages are simply the last word, psychologically if not physically, on provisionality.

At the same time, the sheer volume of resources altered the status of the book from that monumental artefact of classical learning, THE BIBLE and other works in a named canon, via the 18th Century library to the current measurement of information in LOCs (Libraries of Congress). It is not so much the heterogeneity of resources brought about initially by printing and the nexus of the Renaissance and Reformation but the sheer weight of creation, criticism and metacriticism which has changed our attitude to created work. There was a time in the 18th Century when criticism could be perceived as reinforcing the canon but the breakout of creative energy which led to the overthrow of classical and Biblical text as central motifs in favour of the contemporary, realistic novel, meant that no book or books could retain a totally central place. To imagine this you simply need to think of the UK's change from one, two, three and then four terrestrial television broadcasting channels to the new environment we are moving towards of middlecasting and, not far away, narrowcasting.

In the space of two centuries we will have moved from the literary canon, via an explosion of commercial, if not always profitable, publishing and broadcasting, to narrowcasting and individual web publishing.

The reason for this somewhat lengthy and arcane opening statement is that I immediately want to put our attitude to text creation and assimilation into some kind of historical and intellectual framework but, above all, I want to lift the discussion out of the quasi moralistic, totally sterile domain of analysing culture in terms of standards. The classic formulation of this is that students today only have a higher examination success rate because standards are lower. Another formulation of it, of which I have direct experience as editor of a peer reviewed academic journal for six years, is that the standard of written English amongst academics is shocking. Indeed it is very bad but I am not shocked. You only have to look at the quantity of ephemeral and authoritative prose being generated by each of us as students, academics and intellectuals to know that only a very few of us will be able to maintain Augustan quality. The requirements of the world we live in are such that only a very few of us, the rich and the fortunate, the novelist with the big advance, will be allowed to sit in our studies writing such limpid periods that we need hardly consider the galley proofs.

We lead fast, varied, complex, provisional lives. Even those among us who are classified in one way or another as ignorant or illiterate operate more rapidly, with more knowledge and options than most of their predecessors. The vocabulary of the 18th Century craftsman was much greater than that of most of his successors but it was stable, readily sourced and applied in a narrow framework; he had the words of the Bible and a few other books, perhaps including a smattering of Greek and Roman mythology, the immense argot of his trade, a small geographical and moral environment and, until the end of the Century, the accumulated stability and relevance of a culture which still owed much to the England of the Wars of the Roses. A hundred years later the industrial worker had been cut loose from these ancient moorings and what remained of them was mere sentiment, represented today in the absurd architectural wastefulness of semi-detached housing and the posturings of the Countryside Alliance. The Bible was losing its influence, mass production and the division of labour had reduced the technical argot but there was a whole new world of cheap newspapers and magazines, the impact of the romantic novel and the formalisation of sport as compensations.

Now, another hundred years on, we adjust to, acquire and lose information at an immense pace; we are the first generation not only to produce more intellectual resources in one decade than in the whole of human history before it but also the first to have forgotten more than we know.

This is the world of the constantly ephemeral, febrile and instant which forms the context for a public sector requirement for the applied resources of language engineering.


What I want to talk about, then, is the contemporary political and cultural context for assistive literacy in general and language engineering in particular and to set a public policy framework which I think people involved in language engineering should be aware of so that they can have an effective dialogue with society and, not least, its elected representatives, the politicians who are responsible for the formulation and implementation of public policy.

I want to look first at four basic ideas:

  • The clarity of the author's intention
  • The requirement for correspondence and backwards compatibility
  • The role of intermediaries and
  • The importance of translation and attribution.

I then want to go on to talk a bit about:

  • The future of intellectual property
  • The integration of tools and metadata into data and, finally,
  • The crucial role of language engineering in the development of the 21st Century toolkit.

I want to make three points before going into detail:

  • First, I am going to discuss all my topics with public policy as the focus, so I won't keep repeating this in the various sections.
  • Secondly, I am discussing digital information systems but assume that what we are really thinking about for many years to come are hybrid systems where digital information (software) and human beings (skin wear) work together. I assume that as time goes by machines will improve in doing what they do well, automated processing, but that they will become more 'fuzzy' because they will have much larger, context-sensitive rule sets.
  • Thirdly, this is not a PowerPoint presentation; it is an essay, and so we might travel quite a long way away from the central topic in order to elucidate a context before coming back to the core of the subject. I would advise you to listen and note questions but not write notes; you can have the whole text when I have finished.

1. The Clarity of the Author's Intention.

Last week the Government put on the Order Paper of the House of Commons a debate on the Reinvention of Urban Post Offices. It turned out to be a debate on the closure of some 3,000 of these but in the Minister's opening Statement there was also the promise of matched funding for the Offices that did not close so that they could develop their customer bases by providing a more modern and attractive environment for customers. The Opposition spokesman deliberately took "Re-invention" to be an Orwellian euphemism for closure. I would take it to be a deliberate attempt to focus on the positive side of the package. Does it matter? Well, in terms of the whole debate it doesn't because the text of the Minister's Statement puts the word "Re-invention" into its proper framework as connected to the funding package and the Minister also did not shy away from using the word "close" to describe what would have to happen to Post Offices; so the Opposition's charge was flawed. Yet, if you were simply trying to use the headline on the Order Paper to get the essence of what was being proposed it would be entirely misleading. This is not a problem for people like me who exist for the consumption of small print but in the world of sound bites it is crucial; the Government made a cynical decision to write a headline dealing with a secondary aspect of its package and therefore was attacked for misleading "spin".

This is a simple example but, at the other end of the spectrum, what about the authorial intention of parts of the Bible? Even without the theological disputes which are currently raging about how far various texts are a minute contemporary ethical primer, there are issues around historicity, anthropology, poetics, language translation and emendation. These problems, however, are not confined to venerable texts; you only have to look at a Salman Rushdie novel like MIDNIGHT'S CHILDREN to see that the reader is faced with a multitude of problems which are very similar.

It seems to me, controversially perhaps, that in the digital world we will have to distinguish very clearly between what we think of as an artistic artefact and a collective artefact. The first will be textually and structurally sacrosanct; it might even have to be kite marked as such. The communal document, on the other hand, must be stripped of any artistic pretensions such as irony, allusion, ambiguity, humour and concurrence. It must, above all else, be crafted specifically so that it can be readily amplified or simplified. Perhaps a word is needed here about the idea of concurrence. This takes place when an author takes two ideas and gives them reciprocal support through intertwining them in a single sentence. This is an intrinsic part of our culture and we do not take readily to sequencing ideas, partly because the real world is not like that and partly because we want to use the shorthand of causality rather than a specifically causal syntax now demanded, for example, in resource description frameworks (RDF). This cleavage between the culturally complex and the syntactically simple is a clear illustration of both the problem and the challenge for the preparation of collective documents which, if you like, become public rather than authorial property. I will look at some of the implications later; my object at this point is simply to point out the difference between two kinds of authorial intention, the one individually and the other collectively owned.

Whatever your feeling about these observations, there is one central point. It is much more important to establish clear syntax than to quibble about minor points of usage. The main problem I had as an editor was not sorting out split infinitives but simply trying to work out what I was being told. The essence of grammar is the clarity with which the author's intention is transmitted to the recipient. I would not go so far as to say that we must all write in the intellectual slipstream of Frege but that at least expresses the sense of what I am saying.

2. The Requirement for Correspondence and Backwards Compatibility.

This is a statement of the obvious but such things are necessary in setting out a public policy requirement. If a document is amplified or simplified it must correspond to and be traceable to the initial or source file. If, for example, you take a Government White Paper and edit it to half the length and to a lexicographic ratio of 20% of the original, then, to avoid any accusation of manipulation, the user must be able to click back to the source file from which the simplification was derived.

I have in mind here a multiple-choice offer to users, something like this: a source, say the Government, offers a document in a variety of lengths, say 100%, 20% and 1% or, in other words, the full text, the briefing paper and the executive summary; these requirements are then put at right angles to the lexicographic richness of the paper, say, at 100%, 5,000 words and 2,000 words, where the two lexicographically simplified versions have a specialist micro vocabulary with a glossary. Bear in mind that to shorten is often to condense, to make more difficult, to increase the quantity of jargon and acronyms. If the client chooses the shortest document with the smallest lexicographic range you can see how important it is to be able to return to the original document.
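
The grid of choices described here can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `Rendering`, `offer` and `choose`, and the labels for the tiers, are hypothetical and do not describe any existing Government system; the percentages and word counts are simply the figures used in this talk.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional

# Hypothetical tiers, using the figures mentioned in the talk.
LENGTHS = {"full text": 1.00, "briefing paper": 0.20, "executive summary": 0.01}
VOCABULARIES = {"unrestricted": None, "5,000 words": 5000, "2,000 words": 2000}


@dataclass(frozen=True)
class Rendering:
    source_id: str                    # every rendering points back to its source
    length_label: str
    length_ratio: float               # share of the source document's length
    vocabulary_label: str
    vocabulary_limit: Optional[int]   # None means the full vocabulary

    @property
    def needs_glossary(self) -> bool:
        # Assumption: only the restricted-vocabulary versions carry a glossary.
        return self.vocabulary_limit is not None


def offer(source_id: str) -> List[Rendering]:
    """Build every length x vocabulary combination for one source document."""
    return [
        Rendering(source_id, length_label, ratio, vocab_label, limit)
        for (length_label, ratio), (vocab_label, limit)
        in product(LENGTHS.items(), VOCABULARIES.items())
    ]


def choose(renderings: List[Rendering], length_label: str,
           vocabulary_label: str) -> Rendering:
    """Return the rendering the user asked for, or raise KeyError."""
    for rendering in renderings:
        if (rendering.length_label == length_label
                and rendering.vocabulary_label == vocabulary_label):
            return rendering
    raise KeyError((length_label, vocabulary_label))
```

The point of the sketch is that the shortest, simplest rendering still carries `source_id`, so the route back to the original is a property of the rendering itself rather than something added afterwards.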

There is a second kind of authentication that we need to look at. When the Government started putting its documents onto the Internet it used PDF files. I understand why. It feared that if it put a text file onto the Internet people could interpolate and ship all or chunks of the document as if they were authentic. This is an extreme case. Somewhere in the middle we now have consultations where White Papers pose questions and there are answering facilities but only in respect of the questions posed. What we need to look at is how we can structure and label interpolation or what I would call ecritology, the science of distributed commentary and criticism. We need to be able to see an initial text simultaneously with a variety of comments rather than seeing the document and then migrating to a tangle of threads. This presents some immense organisational problems centred around the power of context sensitive searching and sorting and it raises questions about intermediaries, which is my next topic.

3. The Role of Intermediaries.

This section also begins with a statement of the obvious. In promoting discussion between an author and clients or consumers, there are two kinds of intermediaries, the one entrusted by the author, the other by the client or consumer. The first typically moderates bulletin board discussions, the second acts on behalf of an individual who, for a variety of reasons, cannot interact directly with the information system. The first sort need to be trusted and the second sort need to be entrusted. At the moment there is an intense discussion about the relationship between what are called impartial, career civil servants and political advisers, commonly and mostly inaccurately known as 'spin doctors'. There are a large number of difficult issues to disentangle in this discussion so let me stick to three:

  • Can a career civil servant ever be impartial in the sense in which a soccer referee is supposed to be impartial or is there always a tendency either to back one's elected rulers which is, after all, an explicit condition of the job or a tendency to insert prejudices, if only subconsciously, into the form of moderation?
  • Are we not better off with people who declare their prejudices, like newspapers, but then moderate a debate where all users know those prejudices?
  • Would the problem be diminished if generalised, simple statements could be sourced to initial documents and then back through formative research?

My answer to the first question is that the impartial civil servant is a myth which is perpetuated by the governing class to provide politicians with a safety net; there are all kinds of politics and turf wars within and between Government Departments but the public face is not so much impartial as bland. The consultative process is, currently, largely a sham. I therefore believe that people with open political commitment are better moderators than those without; there is, after all, always the suspicion that the referee is biased.

The question about sourcing and research is much more interesting. There has been some solid work done on deliberative polling which shows that people are much more balanced in their judgments once they understand the factual basis for argument. It is also interesting to note a recent article by David Blunkett in THE TIMES which said that political debate is being wrecked by flimsy surveys which are deliberately produced by the media in advance of impartial Government statistics. He cites a very partial BBC phone round of police forces to derive street crime statistics, which were trumpeted weeks before much better Government statistics; these were then ignored because the story had passed its peak.

The other kind of intermediary is the individual or small group's trusted intermediary to act in consultation, expressing or in some cases representing the needs or aspirations of people who cannot interact directly with an information system.

What we have to consider in the context of language manipulation is the way in which human intermediary facilities change as the technology becomes more powerful. Is the idea of end-to-end software processing in document simplification or amplification achievable? Either way, we need to think about the roles, training and accountability of intermediaries.

4. The Importance of Translation and Attribution.

If you want a paradigmatic example of what is wrong with public interest information flows you only need to pick up a newspaper. These are full of editing, obviously, but they also deal extensively in processes which would, theoretically, be characterised as

  • Clarification
  • Simplification
  • Amplification
  • Interpretation.

These are all valuable, if often abused, tools but they are meaningless without attribution. If I pick up a newspaper and read a simplified story about a speech but can then get hold of the text of the speech, the abuse of the editor is open to investigation. (I set aside here the obvious problem that most people do not have the time, resources or inclination to check sources.) The real problem arises with non-attribution, as in: "Sources said" or "Senior officials said". I can think of a myriad examples, in fact, where I have heard the initial statement of a politician and its later reporting has borne little resemblance to the original. These are what I would call the political equivalents of the "Play It Again, Sam!" syndrome. One of my favourite pastimes is to listen to Prime Minister's Question Time on a Wednesday afternoon and then to follow its later reporting. The central point here is that translation without attribution is harmful rather than simply inadequate.

Transferring that message to the digital world means that any textual adjustment to an original document must be checkable against the original and the identity of the 'translator' must be declared unless, of course, it is a piece of software.
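
One way to picture this requirement is a record that travels with every adjusted text. What follows is a minimal sketch under my own assumptions: the names `DerivedText`, `derive` and `is_checkable` are hypothetical, and the use of a SHA-256 fingerprint of the source is one possible mechanism for checking against the original, not a description of how any existing system works.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(text: str) -> str:
    """Return a SHA-256 digest that identifies one exact version of a text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class DerivedText:
    body: str                 # the simplified, amplified or clarified text
    process: str              # e.g. "simplification" or "interpretation"
    translator: str           # who made the adjustment; may be empty for software
    is_software: bool
    source_fingerprint: str   # ties the derived text to one exact original


def derive(source: str, body: str, process: str,
           translator: str = "", is_software: bool = False) -> DerivedText:
    """Create a derived text, refusing anonymous human 'translators'."""
    if not is_software and not translator.strip():
        raise ValueError("a human translator must be declared")
    return DerivedText(body, process, translator, is_software, fingerprint(source))


def is_checkable(derived: DerivedText, candidate_source: str) -> bool:
    """True if the candidate is exactly the original the text was derived from."""
    return derived.source_fingerprint == fingerprint(candidate_source)
```

The design choice worth noticing is that attribution is enforced at the moment of creation: a record equivalent to "Sources said" cannot be constructed at all unless the adjustment was made by software.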

These are some rather high-level observations about the way that information is handled in the public domain. I will come back at the end of this presentation to slot these back into a discussion about the importance of language engineering but before that I want to say a few quick words about two subsidiary but important topics.

5. The Future of Intellectual Property.

The current model of intellectual property rights which we have inherited from the analogue age is totally inadequate for our new conditions. Of all the books commercially published only 20% merit the collection of royalties for the author. Many publications, such as academic journals, pay authors nothing, presumably on the ground that they write in their employers' time, as I am writing this! In the analogue world the means of production meant that considerable capital was required to publish but now that is not the case. There are a myriad of small music CD publishers, stretching down the culture into individual church choirs and bands that sell their own CDs on the door at gigs and nowhere else. Any self-regarding poet can stick all his stuff on the Internet; and soon there will be community television and even peer-to-peer publishing, or the digital equivalent of a letter with a bunch of snaps. Whether or not the new model will be based on micropayments or whether it will be based on an initial payment for launch rights but no downstream collection is an interesting debate but the aspect of it that interests us here is the distinction I drew earlier between the declared individual artefact and the communal document. If documents are posted which may be amplified, simplified, clarified or otherwise deconstructed and re-assembled the intellectual property right required will not have anything to do with ownership but simply with attribution and backwards sourcing requirements. There might also be a condition which relates to the proper flagging of interpolation or ecritology. In other words, the original author will have rights but these will be concerned with the preservation of the integrity of the source material in the context of all manner of inputs to it.

There are some people who resist this, who say that documents must necessarily evolve as they go through waves of commentary but without the integrity of the initial source we will soon be in trouble as I pointed out earlier. Without attribution, too, we would soon get into the problem of documents within documents and impossible tangles of attribution.

6. The Integration of Tools and Metadata into Data.

This is pretty obvious when you think about it. In reality there isn't much of a distinction in common digital environments between data, metadata and tools but we think of the text as the core with other things being brought to bear on it. So we finish a document and:

  • Check the architecture, the metadata
  • Run grammar and spelling checkers
  • Use another application to parse, simplify or translate.

In communal documents I would assert that the properties of the document should be defined; that, in other words, a text without an in-built architecture or certain functionalities isn't a document at all but is simply an unauthoritative fragment. This will move standards into the area of document creation instead of their being retrospectively applied. At the moment such standards are primarily concerned with accessibility issues arising from measurable and severe disability but we need to think very carefully about the public requirements for people with small vocabularies, those for whom English is not their first language and, perhaps above all, that one fifth of our population which is classified as functionally illiterate.

Having said that, I will finally turn to what you have, presumably, all been waiting for, a discussion of the public policy requirements for language engineering. Here, then, are four points:

  • Rights of Access
  • Specialisation and heterogeneity
  • Choice
  • Transparent intermediation

7. Rights of Access.

Not long ago I heard the following, most revealing story. A financial journalist in conversation with a Treasury official remarked that the recent Finance Bill was incomprehensible and asked for whom the official thought the Bill had been written. Without hesitation the official confirmed that it had been written for the lawyers. So, you see, not for citizens and not even for financial experts. This demonstrates the need, does it not, for a much wider understanding of rights of access to information than the mere assertion that I should be allowed the text, the whole text and nothing but the text. A second example that comes to mind is the furore - artificial, of course - which blew up over then Chancellor Kenneth Clarke's admission that he had never read the Maastricht Treaty. Of course he hadn't; it reads like a series of cryptic instructions to a printer; e.g.: delete the "and" after "race" and insert a comma and then after "creed" insert a comma and then add "sexual orientation and disability." Now that is a relatively simple example of a phrase which might have been amended from "... on the grounds of sex, race and creed" to: "... on the grounds of sex, race, creed, sexual orientation and disability". Even for lawyers this kind of accretion is difficult but for most of us it is impossible.
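
The printer's instructions in that example are exactly the kind of thing a machine can apply on the reader's behalf. Here is a minimal sketch, assuming a toy instruction format of my own invention: the functions `delete_after`, `insert_after` and `consolidate` are hypothetical and are not drawn from any real legislative drafting tool.

```python
from typing import List, Tuple


def delete_after(text: str, anchor: str, target: str) -> str:
    """Delete the first occurrence of `target` that follows `anchor`."""
    start = text.index(anchor) + len(anchor)
    position = text.index(target, start)
    return text[:position] + text[position + len(target):]


def insert_after(text: str, anchor: str, addition: str) -> str:
    """Insert `addition` immediately after the first occurrence of `anchor`."""
    position = text.index(anchor) + len(anchor)
    return text[:position] + addition + text[position:]


OPERATIONS = {"delete": delete_after, "insert": insert_after}


def consolidate(text: str, amendments: List[Tuple[str, str, str]]) -> str:
    """Apply (operation, anchor, words) amendments in order to the source text."""
    for operation, anchor, words in amendments:
        text = OPERATIONS[operation](text, anchor, words)
    return text


# The amendment from the talk, written as machine-readable instructions.
original = "on the grounds of sex, race and creed"
amendments = [
    ("delete", "race", " and"),
    ("insert", "race", ","),
    ("insert", "creed", ", sexual orientation and disability"),
]
consolidated = consolidate(original, amendments)
```

Under these assumptions `consolidated` holds the phrase as the citizen would want to read it, while the list of amendments preserves what the lawyers wrote; both remain available and each can be checked against the other.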

At a level of less complexity, then, we need to think about the citizen's right to information, now the subject of EU legislation and soon to be invoked here under the FREEDOM OF INFORMATION ACT. Is this going to be a simple, theoretical right or are we going to make it meaningful through transparent simplification and amplification; are we going to shift the Latin? Are we going to delete: "Anything to the contrary heretofore notwithstanding" and simply state: "and this over-rides all previous laws on this subject"? Will we, then, take the legal document and translate it into English that can be understood by a person with five "O" Level passes or should we take this latter as the base document and get legal draughtsmen to sort out their side of the matter? Either way, what we do not want is a mass of manual re-writing.

In short, we can no longer issue one document and expect all citizens to access it with equal benefit. The technology allows us to do better than that. Illiterate people and those for whom English is not their first language still pay taxes, still vote and are entitled to the level of citizenship which the technology can provide at reasonable cost; what is reasonable, of course, is a political and not a technical matter. Nonetheless, rights legislation will increasingly put pressure on the public sector not only for translation between languages but translation within them.

8. Specialisation and Heterogeneity.

There are reasons other than differing intellectual levels which lead to the requirement for a variety of renderings of a source document. We are both more specialised and more heterogeneous than we have ever been before in our intellectual and lifestyle pursuits. We may want a White Paper on housing in full text but one on foreign policy in a shortened or simplified form. We may simply, for example, want the Iraq Dossier without comment or we might want the ten major points as defined by the author in his tagging. Whatever the reason, this is not simply a call for simplification to accord proper rights to those with cognitive difficulties, it is a recognition of our lifestyle.

9. Choice.

These issues naturally lead on to the idea that what consumers of public affairs information require is choice; in this case not choice of raw material or choice of source material but choice of intellectual rendering.

10. Transparent Intermediation.

I now come to the most crucial point in this lecture; and if this is the only point you take away with you my journey will have been worthwhile. Currently the varying intellectual gifts and aptitudes, time and resources of our citizens mean that in the sphere of public and political debate they rely upon the services of intermediaries. The problem is that these intermediaries are biased. Even public broadcasting has been dragged by commercial newsgathering and presentation into almost unconscious mendacity. Let me give you one example from the Afghan crisis last year which led to the first war in history where the media told more lies than the politicians. After the initial urging of George Bush not to lash out after 9/11 the 'story' looked after itself for a while, until 'Ground Zero' began to lose its pulling power. Then the 24/7 news operations began to press for hostilities against the Taliban to begin. Almost immediately it became commonplace to state that the meteorology and topography of the country meant that any campaign would have to be over by the middle of November because of the heavy snow. As it turned out the campaign ended in mid November with correspondents reporting from Kabul in their shirt sleeves; I checked the weather records which showed that it rarely snowed in the campaign area before early December and the map showed that two-thirds of Afghanistan is indeed elevated but flat.

We have reached the very dangerous situation where elected politicians are weaker than commercial news suppliers and public broadcasters dragged in their wake. Unless you have access to source material the politician cannot communicate with you in a way which explains his thinking to you. The intermediaries he relies upon are totally unreliable. Yet the source material is horribly opaque; this explains the role I see for language engineering within languages as central to the survival of the democratic process. Technology can provide transparent intermediation.

I do not think there could be a more important conclusion to this lecture than that; what language engineers have within their grasp is a set of vital tools to preserve our democratic system of governance.

But the culture is against it, which is where I began this lecture.

As a society we are extremely conservative. This isn't remarkable; the periods when societies are not conservative are very short and very special. We usually express this conservatism in rather high flown language; in our case that is the language of standards: standards of authorship, standards in examinations, standards even in handwriting. These standards are always articulated in absolute terms and free of their cultural environment. So, for example, the fact that the standard of written English is steadily falling is never put up against the more important fact that the number of people behaving as authors is steadily rising; the simplification of some subject matter is never considered against the more important phenomenon of wider learning; the handwriting issue is frequently not considered in terms of other forms of writing, like computer keyboards or text messaging. Well, that last is interesting, of course, because the simple fact of its ubiquity has led to its being condemned by the guardians of our cultural standards.

It is, then, in this political/cultural context that we have to look at machine processes for manipulating text. The title of this presentation mentions public policy as distinct from current politicians. One of the terrible ironies of the current political situation is that our politicians, as I mentioned earlier, are plagued by partial media but they have not seen the relevance of what technology has to offer. It has something to offer not only them but also citizens, who are equally prisoners of plutocratic bias.

What underlies these problems, however, is not some narrow allegiance to outdated standards though that is the form it will take; what we are looking at is a failure to understand the way that technology can change the way society operates. What is on offer is immense flexibility which will allow an individual to operate at a variety of levels of richness and complexity; it will apply the classical principles of comprehensive education, the division of labour, playing to strengths, to every aspect of life involving ideas. We will have to get used to the wonderful idea of prodigies, like David Beckham, springing up in all kinds of intellectual pursuits. This is the prospect that really excites me in what you are doing. We need to stop thinking of the knowledge economy as some kind of metaphor and think of what it would mean in reality. It does not - and we know this deep in our hearts - mean an ever larger number of people talking the talk of Oxbridge SCRs.

So, I think that you will very soon be in the forefront of an intense and intensely important debate about public policy. This is a modest beginning to that debate.