Half an hour a day: ten sets of exercises in introduction Python and Pandas library will take you to learn data analysis

Pandas is an open source, BSD licensed library that must be mastered when starting Python to do data analysis. It provides high-performance, easy-to-use data structures and data analysis tools. The main data structures are Series (one-dimensional data) and DataFrame (two-dimensional data), which are sufficient to deal with most typical use cases in the fields of finance, statistics, social science, engineering and so on. Come and study together today.

Exercise 6 - Statistics

Explore wind speed data

  • Step 1: import the necessary Libraries
# Run the following code
import pandas as pd
import datetime
  • Step 2: import data from the following address
import pandas as pd
# Run the following code
path6 = "../input/pandas_exercise/pandas_exercise/exercise_data/wind.data"  # wind.data
  • Step 3: store the data and set the first three columns as appropriate indexes
import datetime
# Run the following code
data = pd.read_table(path6, sep = "\s+", parse_dates = [[0,1,2]]) 
data.head()

  • Step 4: 2061? Do we really have data for this year? Create a function and use it to fix the bug
# Run the following code
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# apply the function fix_century on the column and replace the values to the right ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)

# data.info()
data.head()

  • Step 5: set the date as the index. Note that the data type should be datetime64[ns]
# Run the following code
# transform Yr_Mo_Dy it to date type datetime64
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])

# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')

data.head()
# data.info()

  • Step 6: how many data values are missing for each location
# Run the following code
data.isnull().sum()

  • Step 7: how many complete data values are there for each location
# Run the following code
data.shape[0] - data.isnull().sum()

  • Step 8: for all data, calculate the average value of wind speed
# Run the following code
data.mean().mean()

10.227982360836924

  • Step 9: create a file named loc_stats data frame to calculate and store the minimum, maximum, average and standard deviation of wind speed at each location
# Run the following code
loc_stats = pd.DataFrame()

loc_stats['min'] = data.min() # min
loc_stats['max'] = data.max() # max 
loc_stats['mean'] = data.mean() # mean
loc_stats['std'] = data.std() # standard deviations

loc_stats

  • Step 10: create a named day_stats data frame to calculate and store the minimum, maximum, average and standard deviation of wind speed at all location s
# Run the following code
# create the dataframe
day_stats = pd.DataFrame()

# this time we determine axis equals to one so it gets each row.
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max 
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations

day_stats.head()

  • Step 11: for each location, calculate the average wind speed in January

Note: January 1961 and January 1962 should be treated differently

# Run the following code
# creates a new column 'date' and gets the values from the index
data['date'] = data.index

# creates a column for each value from date
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)

# gets all value from the month 1 and assign to janyary_winds
january_winds = data.query('month == 1')

# gets the mean from january_winds, using .loc to not print the mean of month, year and day
january_winds.loc[:,'RPT':"MAL"].mean()

  • Step 12: for data records, take samples at the frequency of year
# Run the following code
data.query('month == 1 and day == 1')

  • Step 13: for data records, the sampling frequency shall be monthly
# Run the following code
data.query('day == 1')

Exercise 7 - visualization

Explore Titanic disaster data

  • Step 1: import the necessary Libraries
# Run the following code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline
  • Step 2: import data from the following address
# Run the following code
path7 = '../input/pandas_exercise/pandas_exercise/exercise_data/train.csv'  # train.csv
  • Step 3: name the data frame titanic
# Run the following code
titanic = pd.read_csv(path7)
titanic.head()

  • Step 4: set the PassengerId as the index
# Run the following code
titanic.set_index('PassengerId').head()

  • Step 5: draw a sector chart showing the proportion of male and female passengers
# Run the following code
# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()

# put them into a list called proportions
proportions = [males, females]

# Create a pie chart
plt.pie(
    # using proportions
    proportions,
    
    # with the labels being officer names
    labels = ['Males', 'Females'],
    
    # with no shadows
    shadow = False,
    
    # with colors
    colors = ['blue','red'],
    
    # with one slide exploded out
    explode = (0.15 , 0),
    
    # with the start angle at 90%
    startangle = 90,
    
    # with the percent listed as a fraction
    autopct = '%1.1f%%'
    )

# View the plot drop above
plt.axis('equal')

# Set labels
plt.title("Sex Proportion")

# View the plot
plt.tight_layout()
plt.show()

  • Step 6: draw a scatter diagram showing Fare, age and gender of passengers
# Run the following code
# creates the plot using
lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg=False)

# set title
lm.set(title = 'Fare x Age')

# get the axes object and tweak it
axes = lm.axes
axes[0,0].set_ylim(-5,)
axes[0,0].set_xlim(-5,85)
(-5, 85)

  • Step 7: how many people are still alive?
# Run the following code
titanic.Survived.sum()

342

  • Step 8: draw a histogram showing the ticket price
# Run the following code
# sort the values from the top to the least value and slice the first 5 items
df = titanic.Fare.sort_values(ascending = False)
df

# create bins interval using numpy
binsVal = np.arange(0,600,10)
binsVal

# create the plot
plt.hist(df, bins = binsVal)

# Set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Payed Histrogram')

# show the plot
plt.show()

Exercise 8 - creating a data frame

Explore Pokemon data

  • Step 1: import the necessary Libraries
# Run the following code
import pandas as pd
  • Step 2: create a data dictionary
# Run the following code
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
            "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no','yes','no']                        
            }
  • Step 3: save the data dictionary as a data box named pokemon
# Run the following code
pokemon = pd.DataFrame(raw_data)
pokemon.head()

  • Step 4: the columns of the data frame are sorted alphabetically. Please change the order to name, type, HP, evolution and pokedex
# Run the following code
pokemon = pokemon[['name', 'type', 'hp', 'evolution','pokedex']]
pokemon

  • Step 5: add a column place
# Run the following code
pokemon['place'] = ['park','street','lake','forest']
pokemon

  • Step 6: view the data type of each column
# Run the following code
pokemon.dtypes

Exercise 9 - time series

Explore Apple stock price data

  • Step 1: import the necessary Libraries
# Run the following code
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt

%matplotlib inline
  • Step 2: dataset address
# Run the following code
path9 = '../input/pandas_exercise/pandas_exercise/exercise_data/Apple_stock.csv'   # Apple_stock.csv
  • Step 3: the read data coexists into a data frame called apple
# Run the following code
apple = pd.read_csv(path9)
apple.head()

  • Step 4: view the data type of each column
# Run the following code
apple.dtypes

  • Step 5: convert the Date column to the datetime type
# Run the following code
apple.Date = pd.to_datetime(apple.Date)
apple['Date'].head()

  • Step 6: set Date as index
# Run the following code
apple = apple.set_index('Date')
apple.head()

  • Step 7: is there a duplicate date?
# Run the following code
apple.index.is_unique

True

  • Step 8: set index to ascending order
# Run the following code
apple.sort_index(ascending = True).head()

  • Step 9: find the last business day of each month
# Run the following code
apple_month = apple.resample('BM')
apple_month.head()

  • Step 10 how many days is the difference between the earliest date and the latest date in the dataset?
# Run the following code
(apple.index.max() - apple.index.min()).days

12261

  • Step 11: how many months are there in the data in step 11?
# Run the following code
apple_months = apple.resample('BM').mean()
len(apple_months.index)

404

  • Step 12: visualize Adj Close values in chronological order
# Run the following code
# makes the plot and assign it to a variable
appl_open = apple['Adj Close'].plot(title = "Apple Stock")

# changes the size of the graph
fig = appl_open.get_figure()
fig.set_size_inches(13.5, 9)

Exercise 10 - delete data

Explore Iris iris Iris data

  • Step 1: import the necessary Libraries
# Run the following code
import pandas as pd
  • Step 2: dataset address
# Run the following code
path10 ='../input/pandas_exercise/pandas_exercise/exercise_data/iris.csv'   # iris.csv
  • Step 3: save the data set as the variable iris
# Run the following code
iris = pd.read_csv(path10)
iris.head()

  • Step 4: create the column name of the data frame
iris = pd.read_csv(path10,names = ['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class'])
iris.head()

  • Step 5: is there a missing value in the data frame?
# Run the following code
pd.isnull(iris).sum()

  • Step 6: set the column petal_ Lines 10 to 19 of length are set to missing values
# Run the following code
iris.iloc[10:20,2:3] = np.nan
iris.head(20)

  • Step 7: replace all missing values with 1.0
# Run the following code
iris.petal_length.fillna(1, inplace = True)
iris

  • Step 8: delete column class
# Run the following code
del iris['class']
iris.head()

  • Step 9: set the first three rows of the data frame as missing values
# Run the following code
iris.iloc[0:3 ,:] = np.nan
iris.head()

  • Step 10: delete the row with missing value
# Run the following code
iris = iris.dropna(how='any')
iris.head()

  • Step 11: reset the index
# Run the following code
iris = iris.reset_index(drop = True)
iris.head()

Your support is the driving force for me to keep updating, (like, follow, comment) this article is not over yet. If you want to know about the follow-up, you can follow me and keep updating.

Click to receive 🎁🎁 Q group number: 943192807 (pure technical exchange and resource sharing) take it away by self-service.

① Industry consultation and professional answers ② Python development environment installation tutorial ③ 400 self-study videos ④ common vocabulary of software development ⑤ latest learning roadmap ⑥ more than 3000 Python e-books

Keywords: Python Data Analysis Data Mining pandas

Added by chrispols on Thu, 18 Nov 2021 15:32:18 +0200