Exploratory Data Analysis EDA (Exploratory Data Analysis) analysis with python

show holy respect to python community, for there dedication and wisdom

Dataset related:

First, UCL wine dataset:

UCI data set is a commonly used standard test data set for machine learning. It is a database for machine learning proposed by the University of California Irvine. UCI data sets are mostly used in the testing of machine learning algorithms. The important thing is the word "standard". The newly compiled machine learning programs can be tested with UCI data sets, and similar machine learning algorithms can also be compared. Its official website address is as follows:
website: UCI Machine Learning Repository

Field related:

Fixed acidity: most wine related acids are either fixed or nonvolatile (not easy to evaporate)
Volatile sour taste: the content of acetic acid in wine is too high, which will produce unpleasant vinegar taste
Citric acid: a small amount of citric acid can increase the freshness and flavor of wine
Residual sugar: the amount of residual sugar after fermentation. There is little wine below 1g per liter, and wine above 45g is considered sweet
Chloride: the content of salt in wine
Free sulfur dioxide: SO2 exists in the equilibrium state between SO2 molecule (as dissolved gas) and bisulfite ion in free form; It can prevent microbial growth and oxidation in wine
Total sulfur dioxide: amount of free and bound SO2; At low concentration, SO2 can hardly be detected in wine, but when the concentration of free SO2 exceeds 50ppm, SO2 will become obvious in the sense of smell and taste of wine
Density of water and sugar
PH value: describes the degree of acidity or alkalinity of wine, from 0 (very acidic) to 14 (very alkaline); Most wines have a pH between 3 and 4
Sulfate: a wine additive that increases the level of sulfur dioxide (SO2) and acts as an antibacterial and antioxidant
Alcohol: the percentage of alcohol in wine
Quality: output variable (score 0 - 10 according to sensory data), with the occupation of special wine appraiser and bartender

Second, the Kaggle Titanic dataset:

The sinking of the Titanic is one of the most well-known shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew on board. This sensational tragedy shocked the international community and promoted the improvement of ship safety regulations.

One of the reasons for the shipwreck is that the passengers and crew did not have enough lifeboats. Although there are some luck factors in surviving the shipwreck, some people are more likely to survive than others, such as women, children and upper class society.

In this challenge, it is required to complete the analysis of who is likely to survive. In particular, machine learning tools are required to predict which passengers will survive the tragedy.

Field related:

passengerid: Passenger ID
Class: class (1 = 1st, 2 = 2nd, 3 = 3rd)**
Name: passenger name
sex: Gender
Age: age
sibsp: number of siblings / spouses on board
parch: number of parents / children on board
Ticket: ticket information
fare: fare
Cabin: cabin
embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
survived: the variable is predicted to be 0 or 1 (where 1 means survival and 0 means death)

Related to drawing tools:

anaconda

Pandas

Numpy

Matplotlib

Seaborn

Bokeh

plotly

#Import related packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Draw nominal variable diagram

def plot_categoricals(x, y, data, annotate = True):
    """Plot counts of two categoricals.
    Size is raw count for each grouping.
    Percentages are for a given value of y."""
    # dict vectorizer
    # Raw counts 
    raw_counts = pd.DataFrame(data.groupby(y)[x].value_counts(normalize = False))
    raw_counts = raw_counts.rename(columns = {x: 'raw_count'})
    
    # Calculate counts for each group of x and y
    counts = pd.DataFrame(data.groupby(y)[x].value_counts(normalize = True))
    
    # Rename the column and reset the index
    counts = counts.rename(columns = {x: 'normalized_count'}).reset_index()
    counts['percent'] = 100 * counts['normalized_count']
    
    # Add the raw count
    counts['raw_count'] = list(raw_counts['raw_count'])
    
    plt.figure(figsize = (14, 10))
    # Scatter plot sized by percent
    plt.scatter(counts[x], counts[y], edgecolor = 'k', color = 'lightgreen',
                s = 100 * np.sqrt(counts['raw_count']), marker = 'o',
                alpha = 0.6, linewidth = 1.5)
    
    if annotate:
        # Annotate the plot with text
        for i, row in counts.iterrows():
            # Put text with appropriate offsets
            plt.annotate(xy = (row[x] - (1 / counts[x].nunique()), 
                               row[y] - (0.15 / counts[y].nunique())),
                         color = 'navy',
                         s = f"{round(row['percent'], 1)}%")
        
    # Set tick marks
    plt.yticks(counts[y].unique())
    plt.xticks(counts[x].unique())
    
    # Transform min and max to evenly space in square root domain
    sqr_min = int(np.sqrt(raw_counts['raw_count'].min()))
    sqr_max = int(np.sqrt(raw_counts['raw_count'].max()))
    
    # 5 sizes for legend
    msizes = list(range(sqr_min, sqr_max,
                        int(( sqr_max - sqr_min) / 5)))
    markers = []
    
    # Markers for legend
    for size in msizes:
        markers.append(plt.scatter([], [], s = 100 * size, 
                                   label = f'{int(round(np.square(size) / 100) * 100)}', 
                                   color = 'lightgreen',
                                   alpha = 0.6, edgecolor = 'k', linewidth = 1.5))
        
    # Legend and formatting
    plt.legend(handles = markers, title = 'Counts',
               labelspacing = 3, handletextpad = 2,
               fontsize = 16,
               loc = (1.10, 0.19))
    
    plt.annotate(f'* Size represents raw count while % is for a given y value.',
                 xy = (0, 1), xycoords = 'figure points', size = 10)
    
    # Adjust axes limits
    plt.xlim((counts[x].min() - (6 / counts[x].nunique()), 
              counts[x].max() + (6 / counts[x].nunique())))
    plt.ylim((counts[y].min() - (4 / counts[y].nunique()), 
              counts[y].max() + (4 / counts[y].nunique())))
    plt.grid(None)
    plt.xlabel(f"{x}"); plt.ylabel(f"{y}"); plt.title(f"{y} vs {x}");

#Import data and view it

df = pd.read_csv('winequality-white.csv', sep=';')
df.head()
df.tail()
df.sample(5)

#Check for missing values:

# Check if any of the following is NULL
df.isnull().any()

#Use the heat map to see how worthwhile it really is

sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')

#Check the number and total number of unique values of some nominal variables

df.quality.unique()
df.quality.nunique()
df.quality.value_counts()

#View data type and column name

df.dtypes
df.columns

#Get statistics of continuous variables

df.describe()

#Draw histogram:

df['fixed acidity'].plot(kind = 'hist',figsize=(20, 7), )

#Draw density map

#Draw box diagram

import seaborn as sns
plt.figure(figsize=(10, 7))

sns.boxplot(x=df['alcohol'])

#Draw the box diagram and erect the image

import seaborn as sns
plt.figure(figsize=(10, 7))
sns.boxplot(data = df,y='alcohol',)

#scatter plot

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(df['volatile acidity'] , df['citric acid'])
ax.set_xlabel('volatile acidity')
ax.set_ylabel('citric acid')
plt.show()

#Regression mapping

plt.figure(figsize=(25, 7))
sns.regplot(x="alcohol", y="density", data=df);

#Histogram drawing

def bar_plot(df,key):
    df[key].value_counts().sort_index().plot.bar(figsize = (12, 5),
                      edgecolor = 'k', linewidth = 2)

    # Formatting
    plt.xlabel(key); 
    plt.ylabel('COUNT'); 
    plt.xticks(rotation = 60)
    plt.title('BAR PLOT for ' + key);
    plt.show()


bar_plot(df,'quality')

#Multi box drawing

def plot_box_plot2(df,key,value):
    import copy
    box = copy.deepcopy(df)
    box[value] = box[value].astype('float')
    sns.set_style('whitegrid',{'font.sans-serif':['SimHei','Arial']})
    sns.set_context("talk")
    fig,axes=plt.subplots(1,1,figsize = (25,7))
    sns.boxplot(data = box, x=key, y=value)
    plt.show()

plot_box_plot2(df,'quality','citric acid')

plt.figure(figsize=(25, 7))
# plt.style.use('seaborn-white')
ax = sns.boxplot(x="quality", y="free sulfur dioxide", data=df)

#Drawing violin

plt.figure(figsize=(25, 7))
sns.set_theme(style="whitegrid")

# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(data=df, x="quality", y="density", 
               split=True, inner="quart", linewidth=1,)
sns.despine(left=True)

#Draw correlation diagram:

plt.figure(figsize=(15,8))
sns.heatmap(df.corr(),cmap='Greens',annot=False)

#Draw a correlation diagram and display the correlation value

plt.figure(figsize=(15,15))
sns.heatmap(df.corr(), color='b', annot=True)

#Draw the statistical chart of nominal variables

plt.figure(figsize=(15,5))
sns.countplot(x='quality', data = df)

#Draw a box diagram of all variables:

plt.figure(figsize=(10,15))

for i, col in enumerate(list(df.columns.values)):
    plt.subplot(4,3,i+1)
    df.boxplot(col)
    plt.grid()
    plt.tight_layout()

#Draw a histogram of all variables

plt.figure(figsize=(20,16))

for i,col in enumerate(list(df.columns.values)):
    plt.subplot(4,3,i+1)
    sns.distplot(df[col], color='b', kde=True, label='data')
    plt.grid()
    plt.tight_layout()

# pair plot

sns.pairplot(data=df, kind='scatter',diag_kind='kde')

#Discretize the variables and construct the relationship diagram of discrete variables

df['alcohol_bin'] = pd.cut(df.alcohol,bins=[7,9,11,13,15],labels=['low','mid_low','mid_high','high'])
# df.insert(5,'Age Group',category)

df['alcohol_value'] = pd.cut(df.alcohol,bins=[7,9,11,13,15],labels=[1,2,3,4])
# df.insert(5,'Age Group',category)


plot_categoricals('alcohol_value', 'quality', df, annotate = True)

#Titanic dataset

#Load data

df=pd.read_excel(titanic.xls")

df.head()

# df.tail()

#df.sample(5)

#df.columns

#df.shape

#df.info()

#Get statistics

df.describe()

#By default, describe only outputs the information of continuous variables, so I want to see the statistics of data of other variable types:

df.describe(include=[bool,object])

#View statistics (single column)

df.fare.mean()

df[df['survived']==1].mean()

df[(df['survived'] == 1) & (df['pclass'] == 1)]['age'].max()

df[df['name'].apply(lambda name: name[0] == 'A')].head()

df[df['name'].apply(lambda name: name[0] == 'A')].head()

#replace function:

x = {1 : 'Class I', 2 : 'Class II', 3:'Class III'}
df_new=df.replace({'pclass': x})
df_new.head()

#Get aggregate information (group by)

#Contingency table information

pd.crosstab(df['survived'], df['pclass'])
pd.crosstab(df['survived'], df['sex'], margins=True)

#pivot table

df.pivot_table(['fare','age'],['survived'],aggfunc='mean')
df.pivot_table(['fare','age'],['survived'],aggfunc='median')

#Data sorting (based on a specific field)

df.sort_values(by=["fare"], ascending=False).head()
df.sort_values(by=["fare"], ascending=False).tail()

#Visualization of missing values

import seaborn as sns
plt.rcParams['figure.dpi'] = 100# the dpi can be set to enhance the resolution of the image
# Congiguring retina format
%config InlineBackend.figure_format = 'retina'
sns.heatmap(df.isnull(), cmap='viridis',yticklabels=False)

#Survival statistics

sns.countplot(x=df.survived)

#Survival of different genders

sns.countplot(data =df, x = 'survived',hue = 'sex')

#Different survival statistics of class registration

sns.countplot(data = df , x = 'survived', hue='pclass')

#Cross table of class and survival

pd.crosstab(df['survived'], df['pclass'], margins=True)

#Class counting statistics

sns.countplot(df.pclass)

#Draw age histogram and density map

plt.figure(figsize=(20, 7))
sns.distplot(df.age, color='purple')

#Draw age histogram and density map (remove missing values)

plt.figure(figsize=(20, 7))
sns.distplot(df['age'].dropna(),color='darkred',bins=40)

#Cost density map

plt.figure(figsize=(20, 7))
sns.distplot(df.fare, color='green')

#Draw the box diagram and violin diagram of the cost:

plt.figure(figsize=(20, 7))

plt.subplot(1,2,1)
sns.boxplot(data = df, y='fare',orient = 'v')
plt.subplot(1,2,2)
sns.violinplot(data = df, y='fare',orient = 'v')
#(Q1−1.5⋅IQR, Q3+1.5⋅IQR)

#Box diagram of cost and age relative to class

plt.figure(figsize=(20, 7))

plt.subplot(1,2,1)
sns.boxplot(x=df.pclass,y=df.fare)
plt.subplot(1,2,2)
sns.boxplot(x=df.pclass, y=df.age)

#Correlation visualization

# Considering only numerical variables
scatter_var = list(set(df.columns)-set(['name', 'survived', 'ticket','cabin','embarked','sex','sibsp','parch']))

# Creating heatmap
corr_matrix = df[scatter_var].corr()
sns.heatmap(corr_matrix,annot=True);

#Scatter plot of age and cost

plt.scatter(df['age'], df['fare'])
plt.title("Age Vs Fare")
plt.xlabel('Age')
plt.ylabel('Fare')

#Scatter chart of class and cost

plt.scatter(df['pclass'], df['fare'])
plt.title("pclass Vs fare")
plt.xlabel('pclass')
plt.ylabel('fare')

# pair plot of variables

#Scatter relation between two variables

sns.pairplot(df[scatter_var])

#Sex, boarding place, class and survival statistics

f, [ax1,ax2,ax3] = plt.subplots(1,3,figsize=(20,5))
sns.countplot(x='sex', hue='survived', data=df, ax=ax1)
sns.countplot(x='pclass', hue='survived', data=df, ax=ax2)
sns.countplot(x='embarked', hue='survived', data=df, ax=ax3)
ax1.set_title('sex feature analysis')
ax2.set_title('pclass feature analysis')
ax3.set_title('embarked feature analysis')
f.suptitle('categorical feature analysis', size=20, y=1.1)

plt.show()

#Cross statistics of boarding place, class and gender

grid = sns.FacetGrid(data = df, col='pclass', hue='sex', palette='seismic', size=4)
grid.map(sns.countplot, 'embarked', alpha=0.8)
grid.add_legend()

#Plot the age density of survival:

f,ax = plt.subplots(figsize=(10,5))
sns.kdeplot(df.loc[(df['survived'] == 0),'age'] , color='gray',shade=True,label='not survived')
sns.kdeplot(df.loc[(df['survived'] == 1),'age'] , color='g',shade=True, label='survived')
plt.title('age feature distribution', fontsize = 15)
plt.xlabel("age", fontsize = 15)
plt.ylabel('frequency', fontsize = 15)

#Age density map under different gender and survival conditions:

def plot_distribution( df , var , target , **kwargs ):
    row = kwargs.get( 'row' , None )
    col = kwargs.get( 'col' , None )
    facet = sns.FacetGrid( df , hue=target , aspect=4 , row = row , col = col )
    facet.map( sns.kdeplot , var , shade= True )
    facet.set( xlim=( 0 , df[ var ].max() ) )
    facet.add_legend()


plot_distribution( df , var = 'age' , target = 'survived' , row = 'sex' )

#Cost map of people with different survival conditions

And calculate the variance and mean and conduct visual analysis

# Fill in missing values
df["fare"].fillna(df["fare"].median(), inplace=True)

df['fare'] = df['fare'].astype(int)

# Get the Fare of the surviving and dead passengers respectively
fare_not_survived = df["fare"][df["survived"] == 0]
fare_survived = df["fare"][df["survived"] == 1]

# The mean and variance of far are obtained
avgerage_fare = pd.DataFrame([fare_not_survived.mean(), fare_survived.mean()])
std_fare = pd.DataFrame([fare_not_survived.std(), fare_survived.std()])

df['fare'].plot(kind='hist', figsize=(15,3),bins=100, xlim=(0,50))

avgerage_fare.index.names = std_fare.index.names = ["survived"]
avgerage_fare.plot(yerr=std_fare,kind='bar',legend=False)

#Statistics of loneliness and starting with family

#The impact of being alone or with your family on survival

df['family'] =  df["parch"] + df["sibsp"]
df['family'].loc[df['family'] > 0] = 1
df['family'].loc[df['family'] == 0] = 0


# Delete Parch and SibSp
df_new = df.drop(['sibsp','parch'], axis=1)

# mapping
fig, (axis1,axis2) = plt.subplots(1,2,sharex=True,figsize=(10,5))

sns.countplot(x='family', data=df, order=[1,0], ax=axis1)

# It can be divided into two situations: taking a boat with your family and taking a boat alone
family_perc = df_new[["family", "survived"]].groupby(['family'],as_index=False).mean()
sns.barplot(x='family', y='survived', data=family_perc, order=[1,0], ax=axis2)

axis1.set_xticklabels(["With Family","Alone"], rotation=0)

# pair plot

g = sns.pairplot(df[[u'survived', u'pclass', u'sex', u'age', u'family', u'fare', u'embarked']], hue='survived', palette = 'seismic',
                 size=4,diag_kind = 'kde',diag_kws=dict(shade=True),plot_kws=dict(s=50) )
g.set(xticklabels=[])

# pandas profiling

#pip install pandas-profiling

# from pandas_profiling import ProfileReport
# EDA_report = ProfileReport(df)
# EDA_report.to_file(output_file='EDA.html')

Reference: kaggle

Reference: Interview Questions on Exploratory Data Analysis (EDA)

Reference: Introduction to Exploratory Data Analysis (EDA)

Reference: How to do Exploratory Data Analysis (EDA) with python?

Reference: Kaggle entry level Title: prediction of Titanic survivors - Data Analysis

Reference: Kaggle entry level Title: prediction of Titanic survivors - Data Mining

Reference: UCI data set sorting (attached with common data sets of papers)

Reference: pandas

Reference: pandas profiling

Keywords: Python Big Data Machine Learning Data Analysis Data Mining

Added by iacataca on Thu, 10 Feb 2022 05:31:49 +0200

Programming VIP

Exploratory Data Analysis EDA (Exploratory Data Analysis) analysis with python

Field related:

Field related:

#Titanic dataset

Popular Keywords