Experimenting with Markov Chains

...to Generate Readable Text

18 June 2016

According to Victor Powell, “Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another.” Many people have used Markov chains to generate literature based on a pre-existing corpus. This blog post explains some experiments I recently did with Markov chains, with the goal of generating readable text.
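If you haven’t seen one in action, the mechanics are simple enough to sketch in a few lines of plain Ruby. This toy version is my own illustration (not any particular library’s implementation): each word is a state, the chain records which words it has seen follow that word, and then it hops between states at random.

# A toy word-level Markov chain: each word is a state, and we record which
# words have been observed to follow it, then hop between states at random.
corpus = "the cat is friendly the dog is friendly the sky is unfriendly"

transitions = Hash.new { |hash, key| hash[key] = [] }
corpus.split.each_cons(2) { |current, following| transitions[current] << following }

# Start from a chosen state and follow observed transitions.
state = "the"
output = [state]
5.times do
  break if transitions[state].empty?
  state = transitions[state].sample
  output << state
end

puts output.join(" ") # e.g. "the sky is friendly the dog" (varies per run)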

Marky_Markov is a Ruby gem that can be used to generate Markov chains. I used it to quickly set up a Markov chain and start fooling around with it.

require 'marky_markov'

markov = MarkyMarkov::TemporaryDictionary.new
markov.parse_string "The cat is very friendly."
markov.parse_string "The dog is slightly friendly."
markov.parse_string "The sky is very unfriendly."
markov.parse_string "The window is slightly unfriendly."

markov.generate_10_sentences

=> "The window is slightly unfriendly. The window is slightly friendly. The sky is very unfriendly. The sky is very unfriendly. The window is slightly friendly. The sky is very friendly. The cat is very unfriendly. The cat is very unfriendly. The window is slightly unfriendly. The cat is very friendly."

Some of the sentences are copies of the original corpus, but other sentences are new! The algorithm has learned how to write original sentences!

These sentences are fairly dull though. The generator also repeats itself sometimes.

At this point, the budding Markov Chainer will start searching for a pre-made corpus that can then be fed into the Markov chain. But I’m not a budding Markov Chainer…if anything, I’m slightly jaded. The benefit of feeding a pre-made corpus into a Markov chain is that the text starts sounding more interesting. The problem is that the text also becomes less coherent. Here’s some sample text from a Markov chain trained on “The Raven”:

Once upon a bust of Pallas just above my chamber door; – This it is, and this mystery explore; – ‘Tis the wind and nothing more.

I can understand why some people may be happy with text that seems meaningful so long as they don’t read it too deeply. In fact, this text could plausibly be used as poetry. Any oddities in the text can be explained away as stylistic choices and metaphors.

But if we decide to use Markov chains outside of poetry, well…um….

Here’s some sample text from a Markov chain trained on the ‘tinyshakespeare’ corpus:

DUKE VINCENTIO:

Well, your wit is in the care of side and that.

FRIAR LAURENCE:

Or walk liest;

And the ears.

And hell!

In self.

PETRUCHIO:

Persuading to our the enemy, even woman, I’ll afford show’d and speaking of

England all out what least. Be satisfied! Now, sir.

Second Lord:

They would be ruled after this chamber, and

my fair nues begun out of the fact, to be conveyed,

Whose noble souls I’ll have the heart of the wars.

…yeah, no. The blogger was surprised at how his Markov chain learned Old English, but I absolutely hate this output. This is why I had always been reluctant to use Markov chains…because what they actually produce is utter nonsense.

When I experiment with text generation, my goal is to produce text that reaches human quality. This bias may have been caused by my previous experience as a journalist, where I was expected to write words that people could actually read…not just text that appears to be readable. As a result, I prefer more ‘conservative’ techniques in text generation (shuffling paragraphs, using pre-written templates, etc.).

Recently, I started a Medium blog where I showed off some computer-generated fiction. This blog was somewhat controversial on Reddit, as can be seen in the comment thread. Many people raised the valid point that the algorithms used were just too simplistic for their tastes (I was too reliant on my ‘conservative’ techniques). While I was happy with my output (it was readable), I also realized that I needed to do slightly more experimentation. One Reddit poster (kleer001) made a rather interesting comment:

let’s see some markov chains maybe?

here, have som fun with this: http://projects.haykranen.nl/markov/demo/

sorry I noped you, you sound like a cool dude, but still, you can do hella better. i believe in you.

Hence why I was fooling around with marky_markov.

Looking at the output of my toy Markov chain (“The window is slightly friendly”) makes me realize that the main problem I had with Markov chains isn’t necessarily the chain itself, but the corpus people use to train it. If you provide complicated prose to a Markov chain (such as Shakespeare’s plays), you are not going to get good results out of it. What I need to do is simplify the prose to a level that a Markov chain can grasp and imitate properly.

I need to generate a corpus that follows these two rules:

  1. Each sentence should have very simplistic and repetitive grammar.
  2. Each sentence should be context-less; there should be no connection to previous or future sentences (the Markov chain has no ability to understand the context of words).

Now, writing out all these sentences for a corpus is rather dull…which may explain why some people reach for a pre-made corpus instead. But then I remembered: I already have prior experience in text generation. Why don’t I write a program that can generate text…and then use its output as a corpus? I used the Ruby gem ‘Calyx’ to quickly set up a template-based text generator.

require 'marky_markov'
require 'calyx'

markov = MarkyMarkov::TemporaryDictionary.new

class CorpusGenerator < Calyx::Grammar
  start '{dialogue}, {gender} {speak} {mood}.'
  rule :dialogue, 'Do you think she is out there', 'Do you think she cares', 'Will she ever care',
  'We have a job to do', 'She has a job to do', 'I did the right thing',
  'She did what was right', 'Do not kill me', 'I must kill you',
  'You have a job to do', 'I am sorry', 'She is sorry',
  'I hate you', 'I am sorry I met you'
  rule :gender, 'he', 'she', 'the computer', 'the monster'
  rule :speak, "says", "giggles", "laughs", "cries", "sings"
  rule :mood, "sadly", "madly"
end

generator = CorpusGenerator.new

corpus = []

10.times do
  sentence = generator.generate
  corpus << sentence
  markov.parse_string sentence
end

Now, the problem is that the sentences are so basic that the Markov chain is likely to just re-generate those same basic sentences. That’s no fun. But the Markov chain can also generate unique sentences. So I need to measure how many ‘duplicated’ sentences it generates, to determine how much “uniqueness” the Markov chain can produce.

I decide to run my program with 10 corpus sentences, generate 10 sentences using my Markov chain, and then calculate the uniqueness (the number of unique sentences divided by the total number of sentences generated).

SAMPLE = 10

# Generate SAMPLE sentences from the Markov chain.
generated_sentences = []

SAMPLE.times do
  generated_sentences << markov.generate_1_sentence
end

# A generated sentence counts as a duplicate if it already appears in the corpus.
duplicated_sentences = generated_sentences.select { |sentence| corpus.include?(sentence) }

def calculate_odds(duplicated_sentences)
  duplicates = duplicated_sentences.length.to_f
  percent_of_duplicates = duplicates / SAMPLE.to_f
  1 - percent_of_duplicates
end

puts "The number of duplicated sentences is #{duplicated_sentences.length}."
puts "The uniqueness of the text is #{calculate_odds(duplicated_sentences)}."

I also added some print statements to this code to view the original corpus and the generated sentences. I highlighted any unique sentences by using dashes.
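For reference, the print statements looked something like this (reconstructed after the fact, so treat the exact form as approximate):

puts "#Input Corpus (Generated By Templates):"
corpus.each { |sentence| puts sentence }

puts "#Output Sentences (Generated by Markov Chains):"
generated_sentences.each do |sentence|
  if corpus.include?(sentence)
    puts sentence
  else
    # Dashes mark sentences that do not appear in the original corpus.
    puts "- #{sentence} -"
  end
end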

#Input Corpus (Generated By Templates):
Do you think she cares, the monster says madly.
Do you think she cares, the monster sings madly.
She has a job to do, she giggles madly.
I am sorry, she sings sadly.
I am sorry I met you, the computer cries sadly.
I hate you, he says sadly.
I did the right thing, she laughs madly.
I am sorry I met you, she sings madly.
She has a job to do, the monster says sadly.
Will she ever care, the monster says madly.

#Output Sentences (Generated by Markov Chains):
I hate you, he says sadly.
I did the right thing, she laughs madly.
Will she ever care, the monster says madly.
She has a job to do, she giggles madly.
I hate you, he says sadly.
- I am sorry I met you, she sings sadly. -
She has a job to do, she giggles madly.
- Will she ever care, the monster says sadly. -
- Will she ever care, the monster sings madly. -
Will she ever care, the monster says madly.

The number of duplicated sentences is 7.
The uniqueness of the text is 0.19999999999999996.

So, about 20% of the generated text is unique. That’s…actually kinda justifiable. Sure, the corpus generator is doing most of the work, but the Markov chain is able to add some additional “variation” to the output.

Usually, Markov chains work well when you give them a large corpus. But luckily, since we already have a corpus text generator, we can just tell it to produce as many sentences as we want. So, I generate a corpus of 1000 sentences, use that corpus to train a Markov chain, and then use the Markov chain to generate 1000 new sentences.
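In code, nothing changes except the counts; here’s a sketch of the larger run as a fresh script, where the 10s simply become 1000s:

# The larger run: a 1,000-sentence corpus and 1,000 generated sentences.
# Fresh dictionary and generator; SAMPLE is reused by the duplicate check shown earlier.
markov = MarkyMarkov::TemporaryDictionary.new
generator = CorpusGenerator.new

SAMPLE = 1000
corpus = []
generated_sentences = []

SAMPLE.times do
  sentence = generator.generate
  corpus << sentence
  markov.parse_string sentence
end

SAMPLE.times do
  generated_sentences << markov.generate_1_sentence
end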

The number of duplicated sentences is 712.
The uniqueness of the text is 0.28800000000000003.

Excellent. 288 unique sentences! I wonder what they read like…

I met you, the monster giggles madly.
a job to do, she sings madly.
I met you, she cries madly.
must kill you, she sings madly.
do, the monster laughs sadly.
You have a job to do, the computer laughs sadly.
he cries madly.
job to do, the computer cries madly.
a job to do, she sings madly.
me, he giggles madly.
I met you, the monster cries madly.
you think she is out there, she says madly.
he says madly.
She is sorry, the monster cries sadly.
she cries madly.
I met you, the monster laughs sadly.
right, she laughs sadly.

Oh.

Clearly, there’s still work to be done, especially with the initial text generator. The sentence fragments are annoying (though I presume I can write some more code to scan the generated text and delete any sentences that aren’t complete; a rough sketch of that is below). The bigger problem is that this wall of text is…fairly dull and boring. I don’t really care why all these characters are speaking.
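As for the fragment-scanning code, here’s one naive version, sketched under the assumption that ‘complete’ just means ‘starts the way a template sentence starts’:

# A naive completeness filter (an illustrative assumption, not a real grammar check):
# template sentences always begin with a capital letter, so anything that starts
# lowercase must be a fragment picked up mid-sentence.
complete_sentences = generated_sentences.select { |sentence| sentence =~ /\A[A-Z]/ }

This would still let fragments that happen to begin with a capital letter, like “I met you, …”, slip through, and it does nothing about the dullness.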

I need to try to create a somewhat sensible theme, at least well enough to justify having all these random people talking to each other. Or maybe I need different types of sentences that could be generated (descriptions and actions, perhaps) to make the generated text slightly more engaging.

Interestingly, the Markov chain was able to generate a phrase that was not specifically hardcoded into the generator…“I met you” (obviously an extract from “I am sorry I met you”). And yes, it can repeat that phrase a lot.

I met you, the monster says sadly.
I met you, the computer giggles madly.
I met you, the monster says sadly.
I met you, the computer laughs madly.
I met you, the monster sings madly.

I am unsure whether this experiment was a success. On the one hand, the Markov chain did add ‘value’ to the process by generating some genuinely new sentences. On the other hand, much of this additional variation could have been easily accomplished by simply running the text generator again. The variations that the Markov chain does produce are fairly trivial and may not be interesting on their own.

I am also dubious about whether more uniqueness would actually be a worthwhile goal. Infinite text generation itself may not be very interesting. You’re usually only going to read a small fraction of whatever the program generates before you get bored and move on to the next piece of content. If that’s the case, then why try to increase the number of possible variations? The reader is still going to get bored anyway. It may be smarter to focus on making each individual variation interesting instead.

The one bright spot in this research is its possible application to machine learning. The blogger who trained Markov chains on the ‘tinyshakespeare’ corpus asked “Is Deep Learning a Markov Chain In Disguise?” The output of a machine learning algorithm (when trained on text) is only slightly better than that of a Markov chain, after all. If you can write a program that generates a corpus that can then train a Markov chain to produce meaningful text, then you could reuse that same program on a machine learning algorithm. Maybe its output would be more interesting to read.
