At a quick glance, the contents of the "poetry" page may seem to be nothing more than a poem with a new line fading in every couple of seconds. However, a closer look shows that the contents are only poetry-like: small sections often seem to make sense, but overall the text follows no structure or single train of thought, and most of the time it is not even grammatically correct.
The text fading onto the page is a product of word transition probabilities, using large amounts of poetry as a reference. This is accomplished using n-gram frequencies, in this case trigrams.
Trigram: a group of three consecutive written units such as letters, syllables, or words.
In linguistics, an n-gram is simply a group of n adjacent items (in this case words) in a piece of text. For example, all the 3-grams, or trigrams, and their frequencies from the sentence "The quick brown fox jumps over the lazy dog" would be:
First Word | Second Word | Third Word | Frequency |
---|---|---|---|
The | quick | brown | 1 |
quick | brown | fox | 1 |
brown | fox | jumps | 1 |
fox | jumps | over | 1 |
jumps | over | the | 1 |
over | the | lazy | 1 |
the | lazy | dog | 1 |
A sample this small isn't really enough to generate any text with, but it shows how text is broken down into trigrams.
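The breakdown in the table above can be sketched in a few lines of Python. This is only an illustration, not the code behind the page: the function name `count_trigrams` is made up here, and splitting on whitespace without changing case is an assumption chosen so that the output matches the table.

```python
from collections import Counter

def count_trigrams(text):
    """Count every run of three consecutive words in text (hypothetical helper)."""
    words = text.split()  # assumption: naive whitespace tokenization, case kept as-is
    # zip over three staggered views of the word list yields each trigram once
    return Counter(zip(words, words[1:], words[2:]))

sentence = "The quick brown fox jumps over the lazy dog"
trigram_counts = count_trigrams(sentence)

for (first, second, third), frequency in trigram_counts.items():
    print(first, "|", second, "|", third, "|", frequency)
```

A nine-word sentence contains seven trigrams, and here each of them occurs exactly once.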
The text on the poetry page is generated from around eight million words, consisting mainly of poetry but also some novels
and other writings.
To begin generating, two starting words are chosen, in this case "in" and "the".
All of the trigrams that begin with "in the" are then considered. The ten most frequent are shown below, but there are almost nine thousand in total, more than half with a frequency of only one.
A weighted random selection is performed to choose the next word using all of the trigrams.
For example, "dark" is 655 times more likely to be chosen than a word with a frequency of only one.
First Word | Second Word | Third Word | Frequency |
---|---|---|---|
in | the | dark | 655 |
in | the | air | 517 |
in | the | world | 478 |
in | the | morning | 457 |
in | the | sky | 360 |
in | the | sun | 358 |
in | the | night | 331 |
in | the | middle | 308 |
in | the | end | 295 |
in | the | same | 247 |
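A weighted random selection of this kind might look like the sketch below, which uses Python's standard `random.choices`. The dictionary holds only the ten rows from the table above, whereas the real selection would draw on all of the roughly nine thousand trigrams. The names `in_the_counts` and `pick_next_word` are made up for this example.

```python
import random

# Frequencies copied from the table above (top ten followers of "in the" only).
in_the_counts = {
    "dark": 655, "air": 517, "world": 478, "morning": 457, "sky": 360,
    "sun": 358, "night": 331, "middle": 308, "end": 295, "same": 247,
}

def pick_next_word(counts, rng=random):
    """Pick one word at random, with probability proportional to its frequency."""
    words = list(counts)
    weights = [counts[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]

print(pick_next_word(in_the_counts))
```

Because the frequencies are used directly as weights, no normalization step is needed: `random.choices` accepts relative weights.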
Suppose that "middle" is randomly chosen as the next word.
The process is then repeated, now using trigrams that begin with "the middle", and so on for the desired number of iterations.
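Put together, the whole loop could be sketched roughly as follows. This is a minimal illustration under stated assumptions rather than the code that drives the page: the tiny `corpus` string is invented for the example (the real text comes from around eight million words), and `build_model` and `generate` are hypothetical names.

```python
import random
from collections import Counter, defaultdict

def build_model(words):
    """Map each pair of consecutive words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for first, second, third in zip(words, words[1:], words[2:]):
        model[(first, second)][third] += 1
    return model

def generate(model, first, second, max_words, rng=random):
    """Extend the two starting words one weighted random pick at a time."""
    output = [first, second]
    while len(output) < max_words:
        followers = model.get((output[-2], output[-1]))
        if not followers:  # no trigram starts with the last two words, so stop
            break
        candidates = list(followers)
        weights = [followers[word] for word in candidates]
        output.append(rng.choices(candidates, weights=weights, k=1)[0])
    return output

# Invented toy corpus, far too small to produce anything interesting.
corpus = ("in the middle of the night in the dark of the night "
          "in the middle of the day").split()
model = build_model(corpus)
print(" ".join(generate(model, "in", "the", max_words=12)))
```

Each step looks only at the last two words generated, which is why short stretches can read sensibly while the text as a whole wanders.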
Most of the content generated this way is nonsense, although sometimes, completely by accident, multiple lines will seem to make sense and even share a common idea, though the randomness can still often be seen.
Here are some of my favorite generated pieces of poetry:
in the habitations of these lines she solaces herself
Will she be all the men who walked with
buoyant feet beside her
to the ground beneath the horizon's span
Unheeding of wings
The winds are
breathing low ,
They glided asunder without taking leave
but I see , the prevalence of a sensation of admiration
or delight .
Yes , in the bowers Where clouds are driven by tempests
in the cement spaces
overbearing
expectations
The TV light grew dim
and the more aware
Eight legs brace for the guilt of happiness
Red is the rushing torrents , trickles and fails to conform
in the way creating more
does not matter
much if I had a dream
I wake .
unaware friends . Are you afraid of touching the crimson drenched sky
with each floating white petal