Markov models are remarkably suited to mimicking the structure of a phenomenon and as a result, I thought it would be interesting to explore that application in the context of textual analysis. This post describes the experiment of taking a few books penned by Mark Twain with the goal of building a Markov model that probabilistically generates text that shares Twain’s writing style and syntactical structure.

Method

I start by collecting various writings from Mark Twain and combining them to create one big corpus. I then clean this big corpus by stripping all punctuation and lowercasing all text, allowing the corpus to be more easily parsed later on.

function clean_corpus(text, regex; normalize = true, lower_case = true)
    if normalize
        # replace control characters with spaces
        text = normalize_string(text, stripmark = true, stripignore = true, stripcc = true)
    end
    
    if lower_case
        text = lowercase(text)
    end
    
    # remove unwanted characters
    text = replace(text, regex, "")
    
    # remove ""
    text = split(text)
    target_index = 1
    for i in 1:length(text)
        target_index = findnext(text, "", target_index)
        if target_index == 0
            break
        else
            splice!(text, target_index)
        end
    end        
    text = join(text, " ")

end;

# import books
f = open("mark_twain_books/adventures_of_tom_sawyer.txt")
ats = readall(f);
f = open("mark_twain_books/huckleberry_finn.txt")
hf = readall(f)
f = open("mark_twain_books/the_prince_and_the_pauper.txt")
tpatp = readall(f)

# clean books
# create regex object (I prefer whitelisting characters I want to keep)
chars_to_remove = r"[^a-z ]"

ats_clean = clean_corpus(ats, chars_to_remove);
hf_clean = clean_corpus(hf, chars_to_remove)
tpatp_clean = clean_corpus(tpatp, chars_to_remove)
# combine all books
big_corpus_clean = ats_clean * " " * hf_clean * " " * tpatp_clean

The next step is to convert each word in the text into a numerical representation which will make building a frequency array, which will be discussed below, from the corpus both convenient and computationally efficient.

function text_to_numeric(text, symbols)
    numeric_text = []
    for each in text
        push!(numeric_text, findfirst(symbols, each))
    end

    numeric_text
end;

I’ll also create a function to map the numerical representation back to the original text.

function numeric_to_text(numeric, symbols)
    text= []
    for num in numeric
        push!(text, symbols[num])
    end
    
    text
end;

Since Markov models do not care about the past and writing style depends on what has already been written, I can group ngram words into each state as a workaround to the model’s inherently limited memory. As an aside, it’s typically common to consider ngram values of 1 to 4 and usually as ngram increases, the quality of the text generated by the Markov model does as well.

In order to know how to jump from one state to another, I need to generate probabilities for these jumps. For our purposes, I’m going to generate a ngram+1 dimensional frequency array, $P$, that contains frequency counts for each ngram+1 words. I can achieve this through the following:

  • Combine all texts into one massive corpus.
  • I then find all unique words in the corpus and assign each word to a unique integer.
  • In the corpus, I extract $n = ngram + 1$ words at a time, iterating through every word in the text in a sliding-window-like fashion.
    1. In each iteration, I associate each word in the ngram+1 words with its corresponding unique number.
    2. I then use these numbers as indices in the frequency array, $P$, and increment that spot in $P$ by one.
    3. This loop repeats until I reach the end of the text minus ngram+1 words.
function get_corpus_frequencies(corpus, ngram; groupby = "words")
    # to get frequency of symbol x after ngram symbols
    ngram = ngram + 1
    if groupby == "chars"
        corpus = split(corpus, "")
    else        
        corpus = split(corpus)
    end
    
    # find unique symbols
    unique_symbols = unique(corpus)   
    # convert text to numbers
    corpus_numeric = text_to_numeric(corpus, unique_symbols);
    # create M
    dimensions = repeat([length(unique_symbols)], outer=[ngram])
    M = repeat(zeros(UInt16, 1), outer = dimensions)
    # get frequencies for ngram of text
    for i in 1:length(corpus)-ngram+1
        M[corpus_numeric[i:i+ngram-1]...] += 1
    end
    
    M
end;

M_2 = get_corpus_frequencies(big_corpus_clean, 2)

Now $P$ will give me insight into which word, and by inference the state, I should go to next given ngram words and as such, $P$ is a transition matrix which functions as our Markov model. With $P$, I can determine the next word to jump to by the following:

  • Choose ngram words as a starting point.
  • Convert those words to their numerical representations.
  • For $x$ steps, do the following:
    1. Look up in $P$ my last ngram words to find probabilities for all possible next words.
    2. Create ranges from the above probabilities.
    3. Choose a random number, $r$, from 0 to 1.
    4. Find out which range $r$ lands in, and that is the next word to go to.

For this, we need a function to choose the next state.

function choose_next_state(distribution, r)
    # only consider entries that are non-zero
    nonzero_entries = findn(distribution)

    distribution_nonzero = distribution[nonzero_entries]
    ranges = cumsum(distribution_nonzero)

    for (idx, range) in enumerate(ranges)
        if r < range
            return nonzero_entries[idx]
        end
    end
end

Implementation Challenges

There’s a couple implementation challenges I need to account for. Firstly, I have to approach the fact that there might not be a next state to jump to for some states. Initially, I decided to address this problem by randomly choosing the next state to go to. This is better than letting the entire process grind to a halt but still negatively affects sentence structure.

To overcome this, I instead use a ngram-dimension version of $P$ to attain a probability for going to another state. If this failed, then I’d resort to just randomly choosing the next state. In practice, I believe this “trickle-down” method of transition matrices is quite effective.

For example, if I want to test with trigram words, I’ll first create the original 4-dimensional $P$, $P4$, but then also a 3-dimensional $P$, $P3$, for when there is no probability entry in $P4$, and create a 2-dimensional $P$, $P2$, for when there is no probability entry in $P3$. This concept can be extended to ngram words and the total number of transition matrices needed to use this method would be ngram.

Below is the implementation of this “trickle-down” logic.

function trickle_down(current_state, M)
    none_worked = true
    for (idx, P) in enumerate(M)
        sigma = convert(Int, sum(P[current_state[idx:end]..., :][:]))
        if sigma != 0 # avoid division by 0 error
            distribution = P[current_state[idx:end]..., :][:] /
                           sum(P[current_state[idx:end]..., :][:])
            r = rand()
            next_word_idx = choose_next_state(distribution, r)
            none_worked = false
            break
        end
    end
    if none_worked
        # just choose next state at random
        next_word_idx = rand(1:length(M[1][current_state..., :][:]))
    end

    next_word_idx
end

Now finally I build the machinery that uses the transition matrix that I generated from the Mark Twain books.

function markov_model(ϕ, num_steps, unique_symbols, ngram, M, groupby)
    if groupby == "chars"
        ϕ = split(ϕ, "")
    else
        ϕ = split(ϕ)
    end

    # create empty array to store result of Markov jumping from state to state
    markov_chain_text = []
    append!(markov_chain_text, ϕ)

    current_state = text_to_numeric(ϕ, unique_symbols)

    # "trickle-down" transition matrices
    for step in 1:num_steps
        next_word_idx = trickle_down(current_state, M)
        next_word = numeric_to_text([next_word_idx], unique_symbols)[1]
        push!(markov_chain_text, next_word)
        current_state = text_to_numeric(markov_chain_text[end-ngram+1:end], unique_symbols)
    end

    markov_chain_text
end

Now I’m just going to write a function to run this Markov model-based text generator.

function run(corpus, M; num_steps = 10, ngram = 2, groupby = "words")
    unique_symbols = unique(split(corpus))
    # choose random ngram set of symbols from text
    ϕ = get_phi(corpus, ngram, groupby = groupby)
    @show ϕ

    markov_chain_text = markov_model(ϕ, num_steps, unique_symbols, ngram, M, groupby)
    join(markov_chain_text, " ")
end

function get_phi(cleaned_corpus, ngram; groupby = "words")
    if groupby == "chars"
        cleaned_corpus_array = split(cleaned_corpus, "")
    else
        cleaned_corpus_array = split(cleaned_corpus)
    end
    starting_point = rand(1:length(cleaned_corpus_array)-ngram)
    ϕ = join(cleaned_corpus_array[starting_point:starting_point+ngram-1], " ")
end

Now let’s run it! Hopefully the output sounds a bit like Mark Twain!

ngram2results = run(big_corpus_clean, M_2, num_steps = 200, ngram = 2);

Results

For ngram = 2 with an initial state of “high boardfence”, we get promising results from the Markov Twain bot:

“high boardfence and disappeared over it his aunt polly asked him questions that were full of these wonderful things and many a night as he went on and told me all about such things jim would happen in and say hm what you know bout witches and that when he came home emptyhanded at night he knew the model boy of the window on to know just how long he can torment me before i could budge it was all right because she done it herself her sister miss watson a tolerable slim old maid with goggles on had just come to live with her and took a set at me now with a string and said it was minutes and minutes that there warnt a sound that a ghost makes when it was to see with gay banners waving from every balcony and housetop and splendid pageants marching along by night it was to go around all day long with you id made sure youd played hookey and been aswimming but i bet you ill she did not wait for the rest as he lay in the part where tom canty lapped in silks and satins unconscious of all this fuss”