# Markov models for text processing

Markov models are often used to analyze long sequences of random events.
Each discrete event is followed by another discrete event with a certain probability, and that probability depends only on the event immediately before it.
For example, we can build a Markov model of a weather system:

In this weather model, if today is sunny, there is a 70% chance that tomorrow will be sunny, a 20% chance that it will be cloudy, and a 10% chance that it will rain. If today is rainy, there is a 50% chance of rain tomorrow, a 25% chance of sun, and a 25% chance of clouds.
The following points are worth noting.
• The probabilities on all arrows leading away from any node must add up to 100%. No matter how complex the system is, exactly one of the possible events must occur in the next step.
• Although the weather system has only three possible states at any time, you can use this model to generate an infinitely long list of weather states.
• Only the state of the current node affects the state of the next day. If you are on the "sunny" node, the probability of sun tomorrow is still 70%, regardless of whether the previous 100 days were sunny or rainy.
• Some nodes may be harder to reach than others. The math behind this is fairly complicated, but it is easy to see intuitively: at any point in this system, "rainy" is a much less likely next state than "sunny" or "cloudy", because the probabilities on the arrows pointing to it add up to less than 100%.
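The weather model and the first three points above can be sketched in a few lines of Python. This is only an illustration: the names `TRANSITIONS`, `check_rows`, `next_state`, and `simulate` are made up for this sketch, and the text gives only the sunny and rainy rows, so the cloudy row here is an assumption.

```python
import random

# Outer key = today's weather, inner dict = probabilities for tomorrow.
# The sunny and rainy rows come from the text; the cloudy row is ASSUMED.
TRANSITIONS = {
    "sunny":  {"sunny": 0.70, "cloudy": 0.20, "rainy": 0.10},
    "cloudy": {"sunny": 0.15, "cloudy": 0.70, "rainy": 0.15},  # assumed values
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.50},
}


def check_rows(transitions):
    # The arrows leaving every node must add up to 100%
    for state, row in transitions.items():
        assert abs(sum(row.values()) - 1.0) < 1e-9, state


def next_state(current, rng):
    # Only the current state matters, not the history before it
    row = TRANSITIONS[current]
    return rng.choices(list(row), weights=list(row.values()))[0]


def simulate(start, days, seed=None):
    # Returns the start state followed by `days` randomly drawn states
    rng = random.Random(seed)
    states = [start]
    for _ in range(days):
        states.append(next_state(states[-1], rng))
    return states


check_rows(TRANSITIONS)
print(simulate("sunny", 10, seed=42))
```

Because `simulate` can be called with any number of days, the three-state model produces a weather list of whatever length you ask for.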
Obviously, this is a very simple system, and Markov models can grow into complex systems of any size. In fact, Google's PageRank algorithm is also based on a Markov model, with websites as the nodes and inbound/outbound links as the connections between nodes. The "likelihood" of landing on a node indicates the relative popularity of a website. In other words, if our weather system represented a tiny Internet, the PageRank of "rainy" would be relatively low, while the PageRank of "cloudy" would be relatively high.
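One way to make the "harder to reach" idea concrete is to compute the long-run share of time spent in each state by repeatedly pushing a probability distribution through the transition table. The sketch below is a simplified illustration, not Google's actual algorithm (which, among other things, adds a damping factor). The function name `stationary_distribution` is hypothetical, and the cloudy row is again an assumption, since the text does not state it.

```python
# Sunny and rainy rows come from the text; the cloudy row is ASSUMED.
TRANSITIONS = {
    "sunny":  {"sunny": 0.70, "cloudy": 0.20, "rainy": 0.10},
    "cloudy": {"sunny": 0.15, "cloudy": 0.70, "rainy": 0.15},  # assumed values
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.50},
}


def stationary_distribution(transitions, steps=200):
    # Start from a uniform guess and apply the transition table repeatedly
    states = list(transitions)
    dist = {state: 1.0 / len(states) for state in states}
    for _ in range(steps):
        new_dist = {state: 0.0 for state in states}
        for src, row in transitions.items():
            for dst, prob in row.items():
                # Probability mass flows from src to dst along each arrow
                new_dist[dst] += dist[src] * prob
        dist = new_dist
    return dist


ranks = stationary_distribution(TRANSITIONS)
for state in sorted(ranks, key=ranks.get, reverse=True):
    print(state, round(ranks[state], 3))
```

With the assumed cloudy row, "cloudy" comes out with the largest share and "rainy" with the smallest, which matches the ranking described above.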

## Text analysis and writing

Using the text of William Henry Harrison's inaugural speech analyzed in the previous example, we can write the following code, which generates Markov chains of arbitrary length (the chain length is set to 100 here) based on the structure of the speech:

```python
from urllib.request import urlopen
from random import randint


def wordListSum(wordList):
    # Total number of occurrences across all words in the list
    total = 0
    for word, value in wordList.items():
        total += value
    return total


def retrieveRandomWord(wordList):
    # Pick a word at random, weighted by how often it occurs
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word


def buildWordDict(text):
    # Eliminate line breaks and quotation marks
    text = text.replace("\n", " ")
    text = text.replace("\"", "")
    # Surround each punctuation mark with spaces so that it is treated
    # as its own "word" and kept in the Markov chain
    punctuation = [',', '.', ';', ':']
    for symbol in punctuation:
        text = text.replace(symbol, " " + symbol + " ")
    words = text.split(" ")
    # Filter empty words
    words = [word for word in words if word != ""]
    wordDict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            # Create a new dictionary for this word
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1
    return wordDict


# Download the speech and decode the raw bytes into a string
text = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), "utf-8")
wordDict = buildWordDict(text)

# Generate a Markov chain with chain length of 100
length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += currentWord + " "
    currentWord = retrieveRandomWord(wordDict[currentWord])
print(chain)
```

The output of the code changes on every run.
The basic principle is as follows:
The buildWordDict function takes the text of the speech, downloaded from the Internet, as a string. It cleans and formats the string by removing quotation marks and putting spaces around the other punctuation marks, so that each punctuation mark is handled as a word of its own. It then builds a nested dictionary that maps each word to the words that follow it, together with how often each one follows. The entry for the current word is passed to the retrieveRandomWord function, which picks one of the following words at random, weighted by its frequency in the dictionary.
Once a starting word is chosen (the frequently used "I" in this example), we can walk the Markov chain at random and generate text of any length we need.
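To see what the nested dictionary looks like, it can help to run the same idea on a tiny, made-up text instead of the full speech. The sketch below is a compact variant, not the listing above: `build_word_dict` and `generate` are hypothetical names, the sample sentence is invented, and the weighted pick uses `random.choices` from the standard library in place of the hand-written loop. It also stops early if it reaches a word that has no recorded follower, a case the original loop does not handle.

```python
import random
from collections import defaultdict


def build_word_dict(text):
    # Treat each punctuation mark as its own "word"
    for symbol in [",", ".", ";", ":"]:
        text = text.replace(symbol, " " + symbol + " ")
    words = text.split()
    # word_dict[previous_word][next_word] = number of times next follows previous
    word_dict = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(words, words[1:]):
        word_dict[prev][nxt] += 1
    return word_dict


def generate(word_dict, start, length, seed=None):
    rng = random.Random(seed)
    chain = [start]
    current = start
    for _ in range(length - 1):
        followers = word_dict.get(current)
        if not followers:
            # Dead end: this word was never followed by anything
            break
        # Weighted random pick, proportional to the follower counts
        current = rng.choices(list(followers), weights=list(followers.values()))[0]
        chain.append(current)
    return " ".join(chain)


sample = "the cat sat. the cat ran. the dog sat."  # invented sample text
word_dict = build_word_dict(sample)
print(dict(word_dict["the"]))  # {'cat': 2, 'dog': 1}
print(generate(word_dict, "the", 8, seed=0))
```

In this sample, "the" is followed by "cat" twice and by "dog" once, so "cat" is twice as likely to be chosen after "the".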

Added by Chelsove on Sun, 16 Jan 2022 09:23:33 +0200