Text Data for Trading - Reproducing the Sentiment Analysis Code

The Text Data for Trading - Sentiment Analysis series of articles explains the contents of Chapter 14 of Machine Learning for Algorithmic Trading and reproduces the relevant code. Because Chinese and English text analysis differ substantially, this series does not use Chinese-market material as the data for the reproduction, but instead works from the source code accompanying the book.

From tokens to numbers: the document-term matrix
The bag-of-words model represents a document by the frequency of the terms or tokens it contains. Each document becomes a vector with one entry per token in the vocabulary, reflecting that token's relevance to the document.
Given a vocabulary, the document-term matrix can be computed directly. It is a crude simplification, however, because it abstracts away word order and grammatical relationships. Nevertheless, it often achieves good results quickly in text classification, which makes it a very useful starting point.
There are several ways to weight a token's vector entry to capture its relevance to the document. We will show how to use sklearn to produce binary flags indicating presence or absence, counts, and weighted counts that account for differences in term frequency across all documents, i.e., across the corpus.

#Import the required libraries and packages to set up the environment
from collections import Counter
from pathlib import Path
import numpy as np
import pandas as pd
from scipy import sparse
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
import seaborn as sns
from ipywidgets import interact, FloatRangeSlider
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

Import the BBC data mentioned in the previous sections into a list of documents, doc_list.
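A minimal loading sketch, assuming the raw BBC articles sit locally in one subfolder per topic (for example data/bbc/business/001.txt) with the headline on the first line of each file; the path and encoding are assumptions:

#Assumed layout: data/bbc/<topic>/<id>.txt, headline on the first line
path = Path('data', 'bbc')
doc_list = []
for file in sorted(path.glob('**/*.txt')):
    topic = file.parts[-2]                                  #the parent folder names the topic
    lines = file.read_text(encoding='latin1').split('\n')
    heading = lines[0].strip()
    body = ' '.join(line.strip() for line in lines[1:])
    doc_list.append([topic, heading, body])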

#Convert data to dataframe format
docs = pd.DataFrame(doc_list, columns=['topic', 'heading', 'body'])
docs.info()

Inspect the results and draw a random sample:

docs.sample(10)

The BBC data comprises articles from five topic categories.

#Get the proportion of five categories in bbc data
docs.topic.value_counts(normalize=True).to_frame('count').style.format({'count': '{:,.2%}'.format})

The resulting proportions of the five topic categories in the BBC corpus are shown in the output.
1. Count tokens

#Count the number of words in the whole corpus
word_count = docs.body.str.split().str.len().sum()
print(f'Total word count: {word_count:,d} | per article: {word_count/len(docs):,.0f}')
#There are 842,910 words in total
Total word count: 842,910 | per article: 379

2. Count the 50 tokens with the highest frequency

token_count = Counter()
for i, doc in enumerate(docs.body.tolist(), 1):
    if i % 500 == 0:
        print(i, end=' ', flush=True)
    token_count.update([t.strip() for t in doc.split()])

#Collect the counts in a Series indexed by token, sorted by frequency
tokens = (pd.DataFrame(token_count.most_common(), columns=['token', 'count'])
          .set_index('token')
          .squeeze())

n = 50
(tokens
 .iloc[:n]
 .plot
 .bar(figsize=(14, 4), title=f'Most frequent {n} of {len(tokens):,d} tokens'))
sns.despine()
plt.tight_layout();
#The 50 tokens with the highest frequency are extracted and visualized

3. Generate the document-term matrix using CountVectorizer
The sklearn.feature_extraction.text module provides two tools to create a document-term matrix. CountVectorizer measures the term frequency tf(d, t) of each token t in each document d using binary flags or absolute counts.
TfidfVectorizer, in contrast, weights the (absolute) term frequency by the inverse document frequency (idf), so a term that appears in more documents receives a lower weight.
The TF-IDF vector of each document is normalized. The TF-IDF measure was originally used in information retrieval to rank search-engine results and later proved very useful for text classification and clustering.
The key parameters affecting vocabulary size are (a short usage sketch follows the list):
stop_words: use a built-in list, or provide your own list, of (frequent) words to exclude (such as "and", "the")
ngram_range: include n-grams in the range of n defined by the (nmin, nmax) tuple
max_features: cap the number of tokens in the vocabulary
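A minimal sketch of how these parameters might be combined; the specific values (the built-in English stop-word list, unigrams plus bigrams, a 20,000-token cap) are illustrative assumptions rather than the book's settings:

#Illustrative settings: English stop words, unigrams + bigrams, vocabulary capped at 20,000 tokens
vectorizer = CountVectorizer(stop_words='english',
                             ngram_range=(1, 2),
                             max_features=20000)
dtm = vectorizer.fit_transform(docs.body)
print(f'Documents x tokens: {dtm.shape}')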

4. Document frequency distribution

binary_vectorizer = CountVectorizer(max_df=1.0,
                                    min_df=1,
                                    binary=True)

binary_dtm = binary_vectorizer.fit_transform(docs.body)
binary_dtm
n_docs, n_tokens = binary_dtm.shape
tokens_dtm = binary_vectorizer.get_feature_names()
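The binary document-term matrix makes it easy to inspect how many documents each token appears in. A minimal sketch of the document frequency distribution (the histogram settings are assumptions):

#Document frequency: the number of documents in which each token occurs
doc_freq = pd.Series(np.array(binary_dtm.sum(axis=0)).squeeze(), index=tokens_dtm)
doc_freq.plot.hist(bins=50, logy=True, figsize=(10, 4),
                   title='Document frequency distribution')
sns.despine()
plt.tight_layout();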

5. min_df and max_df - interactive visualization
This section discusses the influence of the min_df and max_df settings on vocabulary size. Read the documents into a DataFrame, set up CountVectorizer to generate binary flags and use all tokens, and call its fit_transform() method to generate a document-term matrix.
The visualization requires tokens to appear in at least 1% and less than 50% of the documents, which limits the vocabulary to about 10% of the nearly 30K tokens. This leaves slightly more than 100 unique tokens per document.

df_range = FloatRangeSlider(value=[0.0, 1.0],
                            min=0,
                            max=1,
                            step=0.0001,
                            description='Doc. Freq.',
                            disabled=False,
                            continuous_update=True,
                            orientation='horizontal',
                            readout=True,
                            readout_format='.1%',
                            layout={'width': '800px'})
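The slider above only captures the document-frequency range; a minimal sketch of how it might be wired to ipywidgets' interact to refit the vectorizer interactively (the callback name and the reported statistics are assumptions):

@interact(df_range=df_range)
def show_vocabulary_size(df_range):
    #Refit a binary CountVectorizer with the selected document-frequency bounds
    min_df, max_df = df_range
    vectorizer = CountVectorizer(min_df=min_df, max_df=max_df, binary=True)
    dtm = vectorizer.fit_transform(docs.body)
    print(f'Tokens in vocabulary: {len(vectorizer.vocabulary_):,d} | '
          f'Avg. unique tokens per document: {dtm.sum(axis=1).mean():.1f}')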

6. Most similar documents
The output of CountVectorizer lets us find the most similar documents with the pdist() function for pairwise distances provided by the scipy.spatial.distance module.
It returns a condensed distance matrix whose entries correspond to the upper triangle of the square distance matrix.
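To make the condensed layout concrete, scipy's squareform() converts the condensed vector back into the full square distance matrix; a small sketch on a toy array (the 4x3 example data is an assumption):

from scipy.spatial.distance import squareform

toy = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 1, 1]])
condensed = pdist(toy, metric='cosine')   #length n*(n-1)/2 = 6 for n=4 rows
square = squareform(condensed)            #symmetric 4x4 matrix with a zero diagonal
print(condensed.shape, square.shape)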

#Find the pair of most similar documents
m = binary_dtm.todense()
pairwise_distances = pdist(m, metric='cosine')
closest = np.argmin(pairwise_distances)
#Map the condensed index back to a (row, col) pair; note that pdist's condensed output
#strictly corresponds to the upper triangle excluding the diagonal, i.e. np.triu_indices(n_docs, k=1)
rows, cols = np.triu_indices(n_docs)
rows[closest], cols[closest]
(6, 245)

This indicates that the documents at index 6 and index 245 are the most similar pair.

pd.DataFrame(binary_dtm[[6, 245], :].todense()).sum(0).value_counts()
0    28972
1      265
2       38

The two documents share 38 tokens, but after reading them we find that one is about how employment growth in the United States remains very slow, while the other describes the fraud at WorldCom. Although both belong to the business topic, their content is very different, which shows that token- and word-based similarity measures have a very limited ability to capture deeper semantic similarity.

7. Term frequency - inverse document frequency: how it works
For a small text sample, TF-IDF is calculated as follows:

#First, calculate the frequency of each token in each document
sample_docs = ['call you tomorrow', 
                'Call me a taxi', 
                'please call me... PLEASE!']
#Use these documents as an example
vectorizer = CountVectorizer()
tf_dtm = vectorizer.fit_transform(sample_docs).todense()
tokens = vectorizer.get_feature_names()
term_frequency = pd.DataFrame(data=tf_dtm, 
                              columns=tokens)
print(term_frequency)  #how often each term appears in each document
   call  me  please  taxi  tomorrow  you
0     1   0       0     0         1    1
1     1   1       0     1         0    0
2     1   1       2     0         0    0
#call occurs once in each of the three documents; me appears once in the second and third documents but not in the first

Next, count the number of documents that contain each of these tokens (the document frequency):

vectorizer = CountVectorizer(binary=True)
df_dtm = vectorizer.fit_transform(sample_docs).todense().sum(axis=0)
document_frequency = pd.DataFrame(data=df_dtm,
                                  columns=tokens)
print(document_frequency)
   call  me  please  taxi  tomorrow  you
0     3   2       1     1         1    1
#three documents contain call; two documents contain me; one document contains please

8. Calculate TF-IDF

tfidf = pd.DataFrame(data=tf_dtm/df_dtm, columns=tokens)
print(tfidf)
       call   me  please  taxi  tomorrow  you
0  0.333333  0.0     0.0   0.0       1.0  1.0
1  0.333333  0.5     0.0   1.0       0.0  0.0
2  0.333333  0.5     2.0   0.0       0.0  0.0
#in this simplified illustration, the TF-IDF weight is the ratio of term frequency to document frequency
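For comparison, sklearn's TfidfVectorizer does not use this raw tf/df ratio: with its default settings it applies a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each document vector. A minimal sketch on the same sample documents using those defaults:

#sklearn's variant: smoothed idf plus L2 normalization of each row (these are the defaults)
tfidf_vect = TfidfVectorizer(smooth_idf=True, norm='l2')
sklearn_tfidf = pd.DataFrame(tfidf_vect.fit_transform(sample_docs).todense(),
                             columns=tfidf_vect.get_feature_names())
print(sklearn_tfidf)
#on sklearn >= 1.2, use get_feature_names_out() instead of get_feature_names()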

9. Summarize news articles using TF-IDF weights
Pick an article at random:

article = docs.sample(1).squeeze()
article_id = article.name
print(f'Topic:\t{article.topic.capitalize()}\n\n{article.heading}\n')
print(article.body.strip())

Topic: Business
France Telecom gets Orange boost
Strong growth in subscriptions to mobile phone network Orange has helped boost profits at owner France Telecom. Orange added more than five million new customers in 2004, leading to a 10% increase in its revenues. Increased take-up of broadband telecoms services also boosted France Telecom’s profits, which showed a 5.5% rise to 18.3bn euros ($23.4bn; £12.5bn). (extracted article excerpts)

This article mainly describes the impact of strong subscriber growth at the mobile network Orange on France Telecom's profits.
Select the most relevant tokens by their TF-IDF values (a short setup sketch follows):
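The variables dtm_tfidf and tokens are not defined in the excerpt above; a minimal setup sketch, assuming a TfidfVectorizer fitted on the article bodies (the use of the English stop-word list is an assumption):

#Assumed setup for dtm_tfidf and tokens used below
tfidf = TfidfVectorizer(stop_words='english')
dtm_tfidf = tfidf.fit_transform(docs.body)
tokens = tfidf.get_feature_names()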

article_tfidf = dtm_tfidf[article_id].todense().A1
article_tokens = pd.Series(article_tfidf, index=tokens)
article_tokens.sort_values(ascending=False).head(10)
#The selected tokens and their weights are as follows
telecom         0.540529
france          0.341326
equant          0.261060
euros           0.244469
orange          0.186060
telecoms        0.160378
services        0.108252
growth          0.106366
shareholders    0.102073
businesses      0.097149

Given the topic of the article, the extracted keywords are indeed highly relevant to the document.
Next, we compare the tokens extracted by TF-IDF weight with randomly selected tokens:

#Pick ten tokens from the document at random
pd.Series(article.body.split()).sample(10).tolist()
#The ten randomly selected tokens
['our',
 'in',
 'at',
 'Breton.',
 'and',
 'the',
 'improve',
 'to',
 'subscriber',
 'in']

Clearly, extracting tokens by TF-IDF weight works much better than random selection.
