Computer-graded essays?

A recent article from the Associated Press elaborates on a topic that’s already been kicked around some, but never yet, so far as I know, as hard as it’s being kicked right now. Kaggle is offering thousands upon thousands of dollars for people who can come up with an algorithm “that can automatically grade student essays.”

Is this a good idea? Is the next step a tool for editors to evaluate submissions? And after that, an algorithm for writing essays. Poems. Stories. Screenplays (well, some of those already are the work of robots).

The whole Kaggle enterprise is a bit scary – they’re looking also for a way to determine the profiles of those persons most likely to end up in the hospital. As if the HMOs didn’t already have one hand around your neck and the other at your…

It sounds more Orwellian than I like to think about.



You assume this isn’t already happening in some shape or form.

No, I know it (the essay thing) happens already, on a relatively small scale, and in a few places. It’s small and few for now, but with a hundred grand being offered for someone to develop a successful algorithm, it’s liable to take off. What concerns me is that this – the bland assumption that it can be done – is not only accepted, it is applauded, encouraged, rewarded.

As for the HMO thing, and the other money-grubbers looking for yet another edge, soso.

(Please note I said it “concerns” me. This is not an anti-tech rant.) (Yet.)

As a not-quite-irrelevant aside, I got a nice letter from Bank of America yesterday, asking me to open a “high yield” savings account. Their definition of “high yield” rolled in at 0.75%. Yes, they want to give me three-quarters of a percent interest each and every year for letting them hold my money. I sent back the form, politely declining, and asked if, on the other hand, they would like to let me hold some of their money at the same rate.

I don’t expect an answer.


How is this concept different from face detection in iPhoto or Google Picasa, or any other photo software? What about autocorrect? Voice recognition? Retinal detected auto focus? The auto-parking feature on the wife’s new car?

My point is that “advancement” is predetermined by those who define what we need. The assumption that “it will be done” is simply following on the established history that technology has followed since the stone age. We want better and someone figures it out. Stone tools give way to bronze, which gives way to iron, which gives way to steel, which gives way to silicon. The pattern that we will build it once we think it up is well established. Why not encourage it?

To me the larger issue here is the obvious move toward a more realized version of artificial intelligence. Maybe I am too close to the systems to see what everyone else sees as so great in AI, but the very idea of it terrifies me. Once the computers start to “think” in a decisive manner, what use do humans really serve?

And yes, I am serious.

That is a really depressing idea. Its underlying assumption appears to be that there is no “art” involved in creating essays or, indeed, in writing generally. An algorithm would be unable to recognise novelty of approach or argument, so could not reward those aspects of essay-writing that push the writer into the highest mark band.

Admittedly, there is little room for novelty in secondary-school essays, which I imagine will be the software’s target market – but that raises questions of its own. We hear complaints already about “teaching to the test”, with pupils so geared towards specific exams that they end up knowing little about their subject; how will it be when they are trained to pass an essay-algorithm test and don’t even know how to string their own words together independently?

In a way, though, I sympathise with the concept of standardising essay marking. My daughter has recently been through a remark/appeal process over a history exam, through which we had access to the marking comments from the exam board, and I have been shocked at the subjectiveness and non-standardisation of the marking and re-marking. In the absence of adequately consistent examiners, perhaps an algorithm might be preferable!

I would argue that anything below University level that has novelty of approach or original thought is a fail. I would be very surprised if human readers would give any conscious thought or credit to ‘style points’ or ‘art’ in marking academic papers. Content, by which I mean the regurgitation of facts learnt and opinions already established, is the key to exam success at these levels.

Basically, the secondary school system in the UK is essentially designed to teach kids to write acceptable Wikipedia articles. I have no problem with Autobots reviewing Wikipedia articles. At least that way the grading is fair, based on content, knowledge and application of the correct facts to the question at hand, and not subject to an unconscious bias towards language or style that suggests a wealthier upbringing or a private school education.

Well, if you want to put it that way, what purpose do humans really serve now?

And, yes, I am serious too.
Life is its own purpose. I think one has to take that as res ipsa loquitur.

I agree with you that novelty of thought doesn’t really apply at secondary school, and that facts are important. Secondary school education should be about building a foundation of actual knowledge and the skills to use it. But, assuming that the base is there, why ignore (or, worse, penalise) novelty of expression? As long as all the key arguments/facts are there, they don’t have to be strung together in a set way as long as the essay makes sense internally and as a whole. It is that aspect that is most subjective.

Marking an essay is never going to be objective, algorithm or no algorithm. Even with an algorithm, the subjectivity is still there, built in to reflect the designer’s preconceptions. My children are doing A-levels (final year of secondary school), and every syllabus I have seen (across a range of science, arts, humanities and technology subjects) acknowledges the impact of writing style, in that “quality of written communication” is taken into account. The exact terminology of the assessment category varies according to the exam board, but the principle is the same. And having seen examiners’ actual comments on a real paper, defended by the exam board, it is very clear that assessment is not solely subject to conscious thought, and other intangibles play a significant (and apparently defensible) role.

Now that really is bleak. It is fair only in that everybody is assessed by the same “individual”, not in that the assessment criteria themselves are fair. What about budding writers with a flair for language, or for communication, or for imaginative linking of material and ideas? What about avid readers, influenced by things beyond the exam board? There are loads of writers from disadvantaged backgrounds who would never have flourished if restricted by their school to formulaic modes of expression. (There are plenty who succeeded despite their schooling, as well, but that is hardly the point of education.) The lowest common denominator of realistic attainment should not mark the top end of the mark scheme. Unless we’re talking about multiple-choice questions, of course, or something with a straight right/wrong answer – not an essay, in other words.

It’s probably an ideological position. School is where writing/communication skills should be taught, or at least fostered and encouraged, or at the very least not beaten out of you (I’d settle for the last one). There is insufficient time for teaching outside the syllabus, and schools tend to value their league table position too much to jeopardise their exam results by adding more work that won’t count. If the syllabus says “write like this or you won’t pass the algorithm-administered test”, pupils are most unlikely to be encouraged to write in any other way – except, of course, in the private schools and wealthy or better educated homes which your argument seeks to level out. In fact, marking by algorithm seems ideal for schools/families with sufficient time and resources: they will be able to “crack the code”, or pay someone else to do it, so that their children are guaranteed top marks by the simple application of a few algorithm-compliant rules.

Besides, if you have all the creativity bashed out of you when you’re young, how are you going to do well at university where the requirements for a Pass 1 or Distinction require “Special signs of excellence, for example: unusual clarity; excellence of presentation; originality of argument” (from the assignment booklet of an Open University undergraduate course)? Universities already argue that the school system isn’t producing the right calibre of student, so they are having to dumb down first year classes to get students to catch up. Marking essays by algorithm is a slippery slope to shifting the burden of teaching flexible thinking/writing skills onto the higher education sector instead of positioning them as the life skills that they really are.

Bah! I’m depressed now. Time for a coffee. :smiley:

See, we’re all writers here. Most of us probably could (and did) aspire to “art” in secondary school essays. Creating equal measures of joy and frustration for our teachers.

But most of us could also structure literate sentences and assemble coherent paragraphs. With very little nudging, we could construct an extended argument over several pages. For many of us, those skills came almost instinctively, the way a musician with perfect pitch can feel a wrong note.

Most secondary students aren’t us. In most cases, correcting secondary school essays involves slogging through a morass of errors in everything from basic mechanics to the fundamentals of exposition. As pigfender said, in most cases getting a student to write a decent Wikipedia article is quite a challenging goal by itself.

I’m perfectly happy to let machines score the Wikipedia articles. It will give the teachers more time to actually read the few nuggets of art that they might encounter.


I think it’s a good idea to take the ‘art’ part out of the grading. A student can struggle at writing fragrant prose without it impacting on their ability to understand history or geography.

I do think that creative writing - hell, just writing let’s be honest - should be an important part of the general curriculum, but it should be assessed in the English Language class, not imposed double jeopardy style across every subject a student studies.

On an unrelated point:

Dear Microsoft,

It would appear that your popular program, MS WORD, has over the past ten years slowly eaten away at my previously good standard of spelling to the point now where I struggle to spell the word February without checking for red squiggly underlining. And I was born in February.

I will accept $1,000,000 as without prejudice compensation in lieu of a civil suit.

Love and hugs,


All of this is very depressing! I’ll admit, I hate marking essays. And it’s not because they are original and therefore demanding of a more subjective judgement … that is difficult to do, time consuming and nerve-racking. But what is soul-destroying is being faced with a pile of 120+ essays, all of which you know are going to trot out the same well-worn … it’s not as if I feel I can even call them “ideas”, though they must have been that at some point in the mists of time, which all fundamentally share the same source, usually a coursebook full of truisms. So they all say the same things, and the only way to differentiate them is on their writing, the quality of expression. Then you’re up against how to evaluate one piece full of grammatical errors and misspellings against another piece equally full, but of a different set and distribution of grammatical and spelling errors. And this is not a writing class …

So there’s the side of me that thinks, “I’d love an app that ran an algorithm that would relieve me of these end of semester nightmares”. Then I think of one of the MA students I had at the University of Westminster. For her paper in the exam at the end of one of my courses — traditional 3-hour, closed book exam; this was in the late 80s — I gave her 85%, when the ceiling for a good distinction grade was 75%. Why? Because under exam conditions, she had written better answers than I myself could have done, sitting at my desk and taking my time over it with reference books handy, and certainly better than anything I would have given as the criteria to look for for the purposes of an automated marking system. So the automated system would have given her a low mark for not including my requirements and for writing a whole load of different stuff.

For the last few years, I’ve been teaching within a culture that for the last 3000 years has basically adhered to a line the equivalent of “You can only think independently when you have got your PhD. Until then, your job is to read, learn, reproduce the accepted wisdom of the masters”. (Congratulations, Dr Nom, you are now allowed some original thoughts! :wink: ). We in the west are already moving rapidly along that line, as is basically expressed in some of the previous posts.

But I think it’s moving outside education. I had been thinking of writing a rant in the ANFTL forum … “Professionally Designed Templates”. It’s not that I object to templates, or to sharing any template I have created if someone else thinks they will find it useful. It’s the growing number of “creative” apps that take virtually all the creation out by offering “Professionally Designed Templates” … iPhoto, iMovie … “84 (or however many) Professionally Designed Templates for Pages, on special offer for only $49.99!” There’s even a web-hosting company advertising here in the UK that is pushing, not just “professionally” designed pages but even that they have the text appropriate to your industry that they have already produced for you to choose from and enter in their template. What does that “Professional” mean? To me, it means “We work in the computer industry, therefore we are professionals, therefore our designs/texts are professional, therefore you should use them rather than using/writing your own … therefore you too can have a website that looks like the website of all the other companies that have used our designs and texts.”

I’ve never used a Pages template, I’ve never used a Numbers template, I’ve created my own Nisus templates … and heavily modified Nisus New Page. I have tried a couple of Keynote templates … I can’t imagine ever finding a use for the majority of them, and those that I have used didn’t really work for my needs. I ended up taking the plainest of them and modifying the background, the font and text size, text boxes on the slides … virtually everything. I could have done it as quickly starting from “Blank”, but it was only afterwards that I realised that.

Please note, I’m not disparaging the templates that come with Scrivener. I think, in a real sense, they serve a different purpose, at least what I think is the important bit. That for me is largely the compile options … if/when I write a paper which I want to submit to a journal or journals, and the publishers specify they want submissions in a specific style, Chicago, APA6, Harvard, whatever … then it makes sense to use a template set up by Keith et al. which has all that set up already, rather than have to go through pages of style manual in order to set up the compile options myself. Or producing an epub … same thing. But the point about those is that they are set up so you can meet someone else’s requirements without trouble. The Keynote templates, the Pages templates … they are not there to meet requirements set by others; they are there to take the creativity out of being creative, and I’m sure that the text most people insert into those templates is frequently as uninspired as merely taking someone else’s design.

And before anyone jumps down my throat, I know that many of those templates and themes that so annoy me have actually been designed by people whose profession is design. But to me, design is as transient as fashion, and often as vacuous as Brit Art. And I admit that I couldn’t code a website in HTML4 or 5 — I did code by hand in HTML3, but that was a decade and a half ago — and so would use Rapid Weaver or similar software … that forces me into using a template; I’d rather do it all myself, but even after I retire, I’m not sure I’ll have the energy and time to learn that. If I put together a little movie in iMovie — I have done that … I liked iMovie HD, but no version since — I would hope it would stand on its own, not need to be wrapped in a “theme”. But in no way is designing a layout in Pages or Keynote as daunting a task as coding a site in HTML5 … mind you, if one hand coded XML or RTF or whatever underlies the page, it would be equally daunting.


Jaysen, you must be. That, it seems to me, must be a disturbing thought for you. :slight_smile:


But she was still 15% wrong?

Everyone is always at least 15 per cent wrong.

Corollary 2.b of Sturgeon’s Law.


And a painful one too! I don’t like this “serious” stuff at all. Way too scary.

As I see it, the problem (yours, mine, PJS’s, Siren’s, maybe kewms’s; I always hate to lump her in with me) isn’t about creativity but a false sense of equality. All these efforts to standardize claim a fairness that is, by its very definition, unfair. Allow me to ramble for a bit.

Is it unfair that Michael Jordan is a better basketball player than me? Is it fair that I am a better “computer nerd” than Mr K? Is it fair that Mr X is a better linguist than my daughter? Is it fair that Mr K is a better humorist than … You get my point.

The real value in education, art, science, and everything I can think of that isn’t economics, is inequality as demonstrated in an individual’s ability to excel or fail. By being better than me in the art and science of linguistics, Mr X establishes a unique identity that makes him of value. My seemingly instinctual understanding of computer systems makes me a unique value in my sphere, in some ways of more value than Mr X, but in other ways of less value. Neither one of us would be of any value on a basketball court, especially compared with Mr Jordan, but then how would he talk to folks in China or design a complex integration between an internet front end and a mainframe?

All these attempts at standardized grading through AI reduce mankind to a base point that is nothing more than a parrot taught to repeat the mantra of some ruling body (school councils in this particular case). It seems so obvious to me that the very complexity of mankind screams “WE CAN NOT BE LUMPED TOGETHER AS A HOMOGENEOUS MASS OF GREY MATTER!!!” Either I am truly missing the bigger picture or there is a real problem with a global society that wants us all to be “the same”.

For those that want to bring up “factual instruction”, I would counter that facts are only of value as a basis for complex mental exercise. All the important “facts” should be learned by 3rd, maybe 4th year. After that point education should become an abstract analysis and modeling of our world as seen through history, art, science and mathematics. Once you venture into the idea that we are no longer teaching “facts”, the very idea that you can standardize the grading of essays becomes a cruel joke.

Notice that I haven’t even started on the AI problem yet?

I can sum it up in a quick paraphrase of every AI doomsday film: The most destructive creature on earth is mankind, so AI would need to protect us and itself from mankind. Thus AI would arrive at the conclusion that mankind would need to be contained or eliminated. Try to prove that wrong.

Good grief Phil, you’re right :smiley:. However, I’d quite like it if henceforth the Corollary was known as Hugh’s Corollary 2.b of Sturgeon’s Law.

P.S. Olaf Stapledon’s brilliantly prophetic novel Last and First Men is very good on the risks of AI - especially when linked by what he seems to have foreseen (in the 1930s) as the equivalent of the Internet.

It is terse, precise, and unassailable. It deserves its own place in the literature. Simply “Hugh’s Corollary,” with no external reference.

And Last and First Men? Good grief. Yours must have been as richly mis-spent a youth as my own. (Youth, as I age, grows ever more vast and lovely. Current setting: approximately forty years of sunshine and lollipops.)


I mentioned Hugh’s book recommendation to my son a few minutes ago, and apparently (at the ripe old age of 18) he has read it, too. And (he thinks) a sequel. A sci-fi fiend and lost cause, obviously…

Well, I see it like this, based on the apparently absurd grading system in use, 75% is deemed 100% right, therefore 85% is akin to the possibility in mathematics — so I understand — of a paper being awarded 110%.

Actually, to me, this is the whole point … you cannot take an essay, or a translation for that matter, and go through it word by word giving minus points for errors (how many minus points? How does error type 1 stack up against error type 2?) and plus points for every correct bit (same “how many points” issues), and if you, or at least I, try, you/I go mad after about 3 papers. The only way to do it is to work in grades. A typical A/Distinction/First Class answer is 70+% (UK) or 85+% (China — they’ve recently reduced it from 90+ while at the same time giving instructions not to hand out so many A grades … some teachers were giving A grades to well over 50% of their students!); a typical B/Credit/Upper Second is 60-69% (75-85%), etc. So you ask yourself, “In my experience, is this an A, a B …?” Say the answer is “Not good enough for an A, so a B. How good a B? Top of the range, marginal to A? Middle of the range? Just making it into B?” When you’ve worked that out, you give it an appropriate number.

On that basis, it really doesn’t matter where your grade boundaries are placed within the system — 40% as a bare pass for a BA in the UK. The numbers are only there so that a final grade can be worked out across all the courses a student has taken … they don’t get a final result of 66% or 67% to be argued over, they get an Upper Second at BA level or a Credit at MA.

In the UK, with every paper blind double marked by the teacher and a colleague who then have to come to an agreement on what the paper is given … and then a large sample of the papers, based on a specific set of rules — any passes, any fails, any grade borderline cases and a representative sample of the rest — sent to an external examiner from a different university, whose job it is to check that the internal examiners have been fair and balanced in their marking … The whole thing is a total nightmare to be involved in, but very few results go askew, and they can be and are argued over at the final board. I don’t think relying on an AI solution can match it for fairness of result while allowing exceptional students like mine to get the result they deserve — No one, including the external, queried my 85%.

Couple of examples.

One year, in Linguistics, the two internals disagreed totally: the person who’d taught the course gave all the students good passes; the other failed the lot of them; they couldn’t come to an agreement. So they sent all the papers to the external, who sent them straight back with a harsh note saying it was not his job to adjudicate between the internals, and that they had to settle on a mark and then he’d look at them. I can’t remember the final outcome … too long ago.

My first finals exams: If a student got a final average of 68.5 or above they were eligible for upgrading to a first. There was a girl doing Italian and Linguistics, whose final average was 68.4 or 68.3, not in the consideration band. At the pre-final board — no externals present — I put it to the chair that since all but one of her final year assessments were clear firsts and that it was her less stellar performance during her second year that had reduced her to just below the line, she should be considered for a First as clearly her academic ability was on an upward path, etc. I was stamped on firmly by the chair … she’s outside the band, she can’t be considered. After the meeting, the course leader in Italian thanked me for bringing that up, as they hadn’t picked up on it, and they’d let their external examiner know. At the final board, the Italian external took it up, told the chair in no uncertain terms he shouldn’t be so rigid as the whole point of the system was to look for exceptions to the mathematical rule; he was backed by the Linguistics external; she got her First, which is what all the board wanted apart from the chair.

Could such a case be programmed into an AI algorithm? I doubt it, certainly for some time to come … or maybe DARPA could do it … after all they have built a remote-controlled mechanical humming-bird … but the cost would no doubt be prohibitive. I defer to Jaysen.



Mr X,

Theoretically yes. Some theories behind AI suggest that there should be core “flavors” (what we would think of as personalities) of AI that would self-evolve into (write new) further sub-flavors, eventually leading to flavor A developing a flavor B “offspring”. The flavors could be used in tandem with a mediator to perform the logical analysis, just like two trained humans.

If the idea that folks want the flavors to write new AI flavors doesn’t give you the willies then it might be that the sterile nature of the statement masks the fact that the AI is “giving birth” to new AI variants. We call this “making babies” where I am from. Computers effectively procreating new personalities that would need to compete for computer resources.

I just don’t like it.