Measuring text information content through the ages…

Earlier this week I met with Myq, a linguistics PhD student from Victoria University, and we discussed a variety of topics. I shared my experience with OpenCog and suggested he check out RelEx. He described his work on disproving a study that investigated how many words a piece of text needs to retain its core meaning. Basically, a lot of the words in text and speech, although useful for stringing ideas together, are not vital to the message being carried.

This got me thinking…

Since I’m working on NetEmpathy, which is currently focussed on analysing the sentiment of tweets, the density of meaning within tweets (when it exists) is very high. There’s little space for superfluous flowery text when you only have 140 characters.

Myq mentioned how academic papers are a lot like this now. The meaning is highly compressed, particularly in scientific papers. You’ve got to summarise past research, state your method so that it’s reproducible, analyse the results, etc. All in half a dozen pages. This wasn’t always the case though. In the past, academic papers were long works which meandered their way to the point. Part of this might have to do with the amount of preexisting knowledge in society, i.e. earlier on there was less global scientific knowledge available, so adequately covering the background of a subject wasn’t a major difficulty and authors could spend more time philosophising. That’s a topic for another post though…

What I was interested in is how densely information is packed. Is this density increasing?

My immediate thoughts were: text compression! Measure the entropy!
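As a rough first pass at the entropy idea, here’s a minimal sketch (my own illustration, not from any of the studies mentioned) that computes the Shannon entropy of a text’s character distribution, in bits per character:

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the probability p of each character.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(char_entropy("the quick brown fox jumps over the lazy dog"))
```

This only captures character-level redundancy; real information density also lives at the word and phrase level, which is where compressors come in.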

Basically, information theory dictates that text containing less information can be represented in fewer bytes. This is why lossless compression is possible: you assign frequent symbols shorter representations. For example, because ‘the’ is one of the most common English words, you might replace it with ‘1’ (and, crudely, swap ‘1’ with ‘the’ at the same time so that the mapping stays reversible and you can still use ‘1’ normally). This way, you’ve reduced the size of that symbol by two thirds without loss of information. Obviously this particular substitution wouldn’t improve your compression factor on a spreadsheet full of numbers though.
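The crude swap described above can be sketched in a few lines (a toy illustration only; real compressors build their symbol tables automatically):

```python
def crude_compress(text):
    # Crude reversible substitution: swap the common word 'the' with the
    # shorter placeholder '1'. Applying the function twice restores the
    # original, so no information is lost.
    return " ".join(
        "the" if w == "1" else "1" if w == "the" else w
        for w in text.split(" ")
    )

s = "the cat chased 1 mouse under the table"
c = crude_compress(s)
assert crude_compress(c) == s   # round-trips losslessly
print(len(s), len(c))           # → 38 36
```

The saving is tiny here, but a real compressor applies the same idea across its whole learned symbol table.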

A guy called Douglas Biber has apparently already investigated this historical change in information content, but from a more linguistic, manual angle.

What I’d like to do one day is examine the compression factors of early scientific journals, recent journals, tweets, txt messages, wikipedia, etc. and see just how the theoretical information content has changed, if at all.
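A minimal sketch of that experiment, assuming a standard DEFLATE compressor as the density proxy (the sample snippets below are made up stand-ins for the corpora, not real data):

```python
import zlib

def compression_factor(text):
    """Ratio of raw size to DEFLATE-compressed size; a higher factor means
    more redundancy, i.e. lower theoretical information density."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))

# Hypothetical stand-ins for an old flowery journal and a terse modern abstract.
flowery = ("It is with the greatest of pleasure that the author does hereby "
           "present to the esteemed reader the results of the investigation. ") * 4
terse = "CNN beats SVM on MNIST: 99.2% vs 98.6% acc, 10 epochs, lr=0.01."

print(compression_factor(flowery))  # repetitive prose compresses well
print(compression_factor(terse))    # dense text barely compresses
```

With enough real samples from each era and medium, comparing these factors would give a crude but quantitative version of the question above.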

Another project for when I’m independently wealthy.