Using Ancol PCA method, the results of ancestral source calculator are combined with the actual situation for visual analysis

preface

Do you have 10000 friends who see the problem ❓ Do you want to ask what the hell Ancol PCA is. I don't know normal, because I made the word 2333333
Why it's called this name: as we all know, the English of descent is Ancestry, and the English of location is location. These two words take the first three letters, loc and then reverse to remove c. when combined, it's Ancol. PCA means principal component analysis. The meaning remains unchanged

The following tutorial begins:

Programming language: Python 3 eight
Module: pandas numpy sklearn matplotlib geopy
Overall idea: first reduce the multi-dimensional data of the calculator to two-dimensional data and make it as the x and y axes, then convert the position data into one-dimensional data and make it as the z axis, and finally combine it into three-dimensional data and visualize it

code:

# Gets the latitude and longitude of the position
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent = 'google_map')
location = geolocator.geocode('location')
(location.latitude, location.longitude)

# Dimensionality reduction and visualization
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import  pandas as pd
import numpy as np

# Data preprocessing part
## data fetch
df = pd.read_csv('new_pca.csv', index_col = 0, header = 0)
## Obtain the ethnic group of the sample
label = df.index
## Select the e11 calculator data section
data = df.loc[:, 'Africa':'America']
## Select the latitude and longitude part
df = df.loc[:, 'latitude':'longitude']
# Start dimensionality reduction
## Convert longitude and latitude into radian system and reduce the dimension to one-dimensional data
pca = PCA(n_components = 1)
z = pca.fit_transform(np.deg2rad(df))
## Reduce the dimension of 11 dimensional data of e11 calculator to two dimensions
pca = PCA(n_components = 2)
data_2 = pca.fit_transform(data)
# Merge two arrays
data = np.concatenate((data_2, z), axis = 1)
# Scatter visualization
fig = plt.figure()
ax = Axes3D(fig)
for i in range(len(label)):
    x, y, z = data[i][0], data[i][1], data[i][-1]
    if label[i] == 'the republic of korea':
        ax.scatter(x, y, z, color = 'blue')
    elif label[i] == 'Jilin Province':
        ax.scatter(x, y, z, color = 'red')
plt.show()

Result display:


(this example is an Ancol PCA diagram based on the e11 data and location information of Koreans and Korean people in Jilin Province)

Advantages of this method:

Two or more populations with very similar calculator results can be dispersed on the scatter diagram, and a more scientific combination analysis of gene level and individual level is realized

Disadvantages of this method:

The Euclidean distance between points can not accurately reflect the genetic distance between populations. In addition, it is difficult to read data for people who are dizzy in 3D.

Significance of this method:

In the past, when we looked at the analysis method of Zuyuan, we just looked at the calculator results directly, and then asked where people and families, and inferred. At most, it will be combined with the traditional PCA. However, this method can digitize the location information and trace the source more scientifically.

The following is the flow chart of implementing Ancol PCA:

And the code used to draw the flow chart:

from graphviz import Digraph
dot = Digraph(comment = 'The Round Table')
# fixed point
dot.node('#',' read data ')
dot.node('a', 'Calculator data')
dot.node('b', 'Calculator 2D data')
dot.node('c', 'Data location')
dot.node('d', 'Longitude and latitude')
dot.node('e', 'Radian position information')
dot.node('f', 'One dimensional position')
dot.node('+', 'Form a new array')
dot.node('g', '3D visualization')
# Connect
dot.edge('#', 'a', 'pandas')
dot.edge('a', 'b', 'sklearn(PCA)')
dot.edge('b', '+', 'numpy(concatenate)')
dot.edge('#', 'c', 'pandas')
dot.edge('c', 'd', 'geopy')
dot.edge('d', 'e', 'numpy(deg2rad)')
dot.edge('e', 'f', 'sklearn(PCA)')
dot.edge('f', '+', 'numpy(concatenate)')
dot.edge('+', 'g', 'matplotlib')
# Output PDF
dot.render('ancol_pca.pdf')

thank:

Data provided by: maternal mtDNA ancestral group QQ: 923891525
Programming language available: Python official website: https://www.python.org
Provide modules:
Pandas official website: https://pandas.pydata.org
Numpy official website: https://www.numpy.org/
Matplotlib official website: https://matplotlib.org
sklearn official website: https://scikit-learn.org/stable/
Geopy project website: https://github.com/geopy/geopy
Graphviz official website: http://www.graphviz.org

©️ Yang Haolin
Please indicate the source when reprinting

Keywords: Python data visualization

Added by kid85 on Mon, 10 Jan 2022 22:00:19 +0200