Random Text Generation With Trigrams

At a quick glance, the contents of the "poetry" page may seem to be nothing more than a poem with a new line fading in every couple of seconds. A closer look, however, shows that the contents are only poetry-like: small sections often seem to make sense, but overall the text follows no structure or single train of thought, and most of the time it is not even grammatically correct.

The text fading onto the page is a product of word transition probabilities, using large amounts of poetry as a reference. This is accomplished using n-gram frequencies, in this case trigrams.

Trigrams and n-grams

Trigram: a group of three consecutive written units such as letters, syllables, or words.
In linguistics, an n-gram is simply a group of n adjacent items (in this case words) in a piece of text. For example, all the 3-grams, or trigrams, and their frequencies from the sentence "The quick brown fox jumps over the lazy dog" would be:

First Word   Second Word   Third Word   Frequency
The          quick         brown        1
quick        brown         fox          1
brown        fox           jumps        1
fox          jumps         over         1
jumps        over          the          1
over         the           lazy         1
the          lazy          dog          1

A sample this small isn't really enough to generate any text from, but it shows how text is broken down into trigrams.
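
As a rough illustration, the breakdown above could be reproduced with a few lines of Python. This is only a sketch: the function name count_trigrams is made up for this example, and splitting on whitespace is an assumption about how words are separated, not necessarily how the poetry page does it.

```python
from collections import Counter


def count_trigrams(text):
    """Count every run of three adjacent words in `text` (hypothetical helper)."""
    # Assumption: words are separated by whitespace and case is kept as-is,
    # so "The" and "the" count as different words.
    words = text.split()
    # Zipping three offset views of the word list yields each trigram once.
    return Counter(zip(words, words[1:], words[2:]))


counts = count_trigrams("The quick brown fox jumps over the lazy dog")
for trigram, frequency in counts.items():
    print(*trigram, frequency)
```

A nine-word sentence produces seven trigrams, each with a frequency of one, matching the table.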

Generating text

The text on the poetry page is generated from around eight million words, consisting mainly of poetry but also some novels and other writings.

To begin generating, two starting words are chosen, in this case "in" and "the".
All of the trigrams that begin with "in the" are then considered. The ten most frequent are shown below, but there are almost nine thousand in total, more than half of them with a frequency of only one.
A weighted random selection is performed over all of these trigrams to choose the next word.
For example, "dark" is 655 times more likely to be chosen than a word with a frequency of only one.

First Word   Second Word   Third Word   Frequency
in           the           dark         655
in           the           air          517
in           the           world        478
in           the           morning      457
in           the           sky          360
in           the           sun          358
in           the           night        331
in           the           middle       308
in           the           end          295
in           the           same         247
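
One way to sketch the weighted selection in Python is with random.choices from the standard library, which picks items in proportion to the weights given. The followers dictionary and the pick_next function are names invented for this example, and only the ten frequencies from the table are included; the real selection would run over all of the trigrams for the pair.

```python
import random

# Frequencies copied from the table above. This is only the top ten;
# the full list of words following "in the" is far longer.
followers = {
    "dark": 655, "air": 517, "world": 478, "morning": 457, "sky": 360,
    "sun": 358, "night": 331, "middle": 308, "end": 295, "same": 247,
}


def pick_next(followers, rng=random):
    """Pick one word at random, weighted by its frequency (hypothetical helper)."""
    words = list(followers)
    weights = list(followers.values())
    # random.choices returns a list of k items; take the single pick.
    return rng.choices(words, weights=weights, k=1)[0]


print("in the", pick_next(followers))
```

Because the choice is weighted rather than always taking the most frequent word, "dark" comes up most often but the rarer words still appear now and then.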

Suppose that "middle" is randomly chosen as the next word.
The process is then repeated, now using trigrams that begin with "the middle",
and so on for the desired number of iterations.
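
Put together, the whole loop might look something like the sketch below. It is a minimal version under stated assumptions, not the code behind the poetry page: build_model and generate are hypothetical names, the tiny corpus is made up for the demonstration, and stopping early when a word pair has no followers is a choice made here that the description above does not specify.

```python
import random
from collections import Counter, defaultdict


def build_model(words):
    """Map each pair of adjacent words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for first, second, third in zip(words, words[1:], words[2:]):
        model[(first, second)][third] += 1
    return model


def generate(model, first, second, iterations, rng=random):
    """Extend the two starting words by up to `iterations` weighted random picks."""
    output = [first, second]
    for _ in range(iterations):
        # Look up the followers of the last two words generated so far.
        followers = model.get((output[-2], output[-1]))
        if not followers:
            break  # Assumption: stop at a pair that no trigram begins with.
        next_word = rng.choices(list(followers), weights=list(followers.values()))[0]
        output.append(next_word)
    return output


# Made-up toy corpus; the real reference text is millions of words long.
corpus = "in the middle of the night in the middle of the day".split()
model = build_model(corpus)
print(" ".join(generate(model, "in", "the", 5)))
```

Each step only ever looks at the previous two words, which is why the output can drift from one idea to another within a single line.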


Most of the content generated this way is nonsense, although sometimes, completely by accident,
multiple lines will seem to make sense and even share a common idea.
Even then, the randomness often still shows through.
Here are some of my favorite generated pieces of poetry.

in the habitations of these lines she solaces herself
Will she be all the men who walked with buoyant feet beside her
to the ground beneath the horizon's span
Unheeding of wings

The winds are breathing low ,
They glided asunder without taking leave
but I see , the prevalence of a sensation of admiration or delight .
Yes , in the bowers Where clouds are driven by tempests

in the cement spaces
overbearing expectations
The TV light grew dim
and the more aware
Eight legs brace for the guilt of happiness

Red is the rushing torrents , trickles and fails to conform
in the way creating more
does not matter much if I had a dream
I wake .
unaware friends . Are you afraid of touching the crimson drenched sky
with each floating white petal