Python crawler in practice: crawling the IMDB Top 250 movies with Python and visualizing the data

Preface

Let's use Python to crawl the IMDB Top 250 movies. Without further ado, let's get started.

Development tools

Python version: 3.6.4

Related modules:

requests module;

random module;

bs4 module (with lxml as the parser);

chardet module;

pandas module;

And some Python built-in modules (urllib, time).

Environment setup

Install Python and add it to the PATH environment variable, then use pip to install the required modules.

Why IMDB? First, Douban is the usual starting point for crawler beginners, and the experts have already analyzed it so thoroughly that there is little left to add. Second, with the development of China's film industry, we need to turn our attention to the international market and use data analysis to learn which films foreign audiences are more interested in.

Approach

IMDB top250 home page

IMDB movie details page (1)

IMDB movie details page (2)

Based on the page structure above, we only need to get each film's unique detail-page code from the Top 250 page and then make two "jumps": one to detail page (1) for the country of origin and genre, and one to detail page (2) for the scores and numbers of raters. For ease of understanding, the crawling flow is shown in the following mind map:
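The two jumps come down to building two URLs from one chart link. The snippet below is only a minimal sketch, not the original crawler: the site root `https://www.imdb.com`, the `ratings` suffix for detail page (2) and the helper name `detail_urls` are assumptions made for illustration. It extracts the title code with a regular expression instead of a fixed-width slice, so it does not depend on how many digits the code has.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"            # assumed site root
TITLE_CODE = re.compile(r"/title/(tt\d+)/")  # matches e.g. /title/tt0111161/

def detail_urls(href):
    """Return (title_code, detail_page_1_url, detail_page_2_url) for a chart link."""
    match = TITLE_CODE.search(href)
    if match is None:
        raise ValueError("no title code in link: " + repr(href))
    code = match.group(1)
    page1 = urljoin(BASE_URL, "/title/" + code + "/")
    page2 = urljoin(page1, "ratings")        # assumed location of the per-group scores
    return code, page1, page2

print(detail_urls("/title/tt0111161/?ref_=chttp_tt_1"))
```

Keeping the code itself as a column makes it easy to join the three pages' data back together later.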

Crawler code

IMDB top250 home page

# Import libraries -------------------------------------------
from urllib import request
from chardet import detect
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

# Fetch the page source and build a soup object -------------------------
def getSoup(url):
    with request.urlopen(url) as fp:
        byt = fp.read()
    # Guess the encoding from the raw bytes; fall back to UTF-8 if detection fails
    det = detect(byt)
    encoding = det['encoding'] or 'utf-8'
    # Pause 1 to 4 seconds between requests so the server is not hammered
    time.sleep(random.randrange(1, 5))
    return BeautifulSoup(byt.decode(encoding, errors='replace'), 'lxml')
   
# Parse data -------------------------------------------
# url2 is not defined in the original listing; the site root is assumed here
url2 = 'https://www.imdb.com'

def getData(soup):
    # The whole chart sits inside <tbody class="lister-list">
    ol = soup.find('tbody', attrs={'class': 'lister-list'})
    # Get scores
    score_info = ol.find_all('td', attrs={'class': 'imdbRating'})
    film_scores = [k.text.replace('\n', '') for k in score_info]
    # Get movie titles, directors and actors, release years, and detail-page links
    film_info = ol.find_all('td', attrs={'class': 'titleColumn'})
    film_names = [k.find('a').text for k in film_info]
    film_actors = [k.find('a').attrs['title'] for k in film_info]
    # The span text looks like '(1994)'; characters 1 to 4 are the year
    film_years = [k.find('span').text[1:5] for k in film_info]
    # Keep only the first 17 characters of each href, e.g. '/title/tt0111161/'
    next_nurl = [url2 + k.find('a').attrs['href'][0:17] for k in film_info]
    data = pd.DataFrame({'name': film_names, 'year': film_years, 'score': film_scores, 'actors': film_actors, 'newurl': next_nurl})
    return data
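Every column produced by getData is still a string, so some cleaning is needed before the analysis further down. The following is a minimal sketch under stated assumptions: the link's `title` attribute is assumed to look like `Director (dir.), Actor A, Actor B`, the helper names `split_credits` and `clean_row` are hypothetical, and the sample row is illustrative rather than crawled output.

```python
def split_credits(title_attr):
    """Split 'Director (dir.), Actor A, Actor B' into (directors, actors)."""
    directors, actors = [], []
    for name in (part.strip() for part in title_attr.split(",")):
        if not name:
            continue
        if name.endswith("(dir.)"):
            # Drop the '(dir.)' marker and keep the bare name
            directors.append(name[:-len("(dir.)")].strip())
        else:
            actors.append(name)
    return directors, actors

def clean_row(name, year, score, credits):
    """Convert one row of scraped strings into typed values."""
    directors, actors = split_credits(credits)
    return {
        "name": name.strip(),
        "year": int(year),
        "score": float(score),   # float() tolerates surrounding whitespace
        "directors": directors,
        "actors": actors,
    }

# Illustrative values only
row = clean_row("The Shawshank Redemption", "1994", " 9.2 ",
                "Frank Darabont (dir.), Tim Robbins, Morgan Freeman")
print(row)
```

Separating directors from actors at this stage is what makes a per-director count possible later.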

IMDB top250 movie details page

# Get detail-page data -------------------------------------------
# detail is the soup of detail page (1); detail1 is the soup of detail page (2)
def nextUrl(detail, detail1):
    # Get the movie's country: read the 'Key:value' text blocks into a dict
    detail_list = detail.find('div', attrs={'id': 'titleDetails'}).find_all('div', attrs={'class': 'txt-block'})
    detail_str = [k.text.replace('\n', '') for k in detail_list]
    detail_str = [k for k in detail_str if k.find(':') >= 0]
    # Split on the first colon only, so values that contain a colon stay intact
    detail_dict = {k.split(':', 1)[0].strip(): k.split(':', 1)[1].strip() for k in detail_str}
    country = detail_dict['Country']
    # Get the movie's genre (find('a') returns only the first link in each subtext block)
    detail_list1 = detail.find('div', attrs={'class': 'title_wrapper'}).find_all('div', attrs={'class': 'subtext'})
    detail_str1 = [k.find('a').text for k in detail_list1]
    movie_type = pd.DataFrame({'Type': detail_str1})
    # Get the scores and numbers of raters, broken down by group
    div_list = detail1.find_all('td', attrs={'align': 'center'})
    value = [k.find('div', attrs={'class': 'bigcell'}).text.strip() for k in div_list]
    num = [k.find('div', attrs={'class': 'smallcell'}).text.strip() for k in div_list]
    scores = pd.DataFrame({'value': value, 'num': num})
    return country, movie_type, scores
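Two small details of this step are worth isolating: the text blocks are turned into a dictionary by splitting at a colon, and the rater counts arrive as strings. Here is a minimal sketch of both, assuming the counts use commas as thousands separators; the helper names and the sample strings are hypothetical and not taken from a real page.

```python
def parse_blocks(lines):
    """Turn 'Key: value' strings into a dict, splitting on the first colon only."""
    result = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:                      # skip blocks that contain no colon
            result[key.strip()] = value.strip()
    return result

def to_int(text):
    """Convert a count such as '1,234,567' to an int."""
    return int(text.replace(",", ""))

# Invented sample blocks; the second value contains a colon of its own
blocks = ["Country: USA", "Also Known As: Movie: The Sequel", "See more"]
details = parse_blocks(blocks)
print(details)
print(to_int("1,234,567"))
```

Using `partition` keeps everything after the first colon, so a value with a colon inside it is not cut short.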

Result display

Data analysis

Film genre comparison

First, let's look at the share of each film genre:

The top three genres are comedy, crime and action.

A tense, exciting atmosphere and a light-hearted plot are what best give film fans a memorable viewing experience.
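The genre shares are simply counts divided by the total. A minimal sketch with the standard library follows; the genre list is an invented sample that only shows the shape of the computation, and the helper name `shares` is hypothetical.

```python
from collections import Counter

def shares(labels):
    """Return (label, count, share) tuples sorted by count, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]

# Invented sample, not the crawled data
genres = ["Crime", "Comedy", "Crime", "Action", "Comedy", "Crime", "Drama", "Action"]
result = shares(genres)
for label, n, share in result:
    print("{:<8}{:>3}{:>8.1%}".format(label, n, share))
```

The same helper works unchanged for any other single-valued column, such as country or release year.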

Next, let's compare the scores of the different genres:

By genre, Westerns stand out from the rest, which may be because they have a small audience and their die-hard fans rate them highly. After them, crime, action, adventure, mystery and horror films also tend to score well.
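A per-genre comparison like this is a group-by followed by a mean. The sketch below uses only the standard library and reports the number of films next to each mean, since a genre with very few entries can top the ranking easily. The records are invented and the helper name `mean_score_by` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

def mean_score_by(records, key):
    """Group records by `key`; return (group, mean score, film count), best first."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["score"])
    rows = [(k, round(mean(v), 2), len(v)) for k, v in groups.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Invented sample records
records = [
    {"genre": "Western", "score": 8.8},
    {"genre": "Crime", "score": 8.6},
    {"genre": "Crime", "score": 8.2},
    {"genre": "Comedy", "score": 8.1},
    {"genre": "Comedy", "score": 8.3},
]
ranking = mean_score_by(records, "genre")
for genre, avg, n in ranking:
    print(genre, avg, n)
```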

Year comparison

First, let's look at the release years of the Top 250 films:

Among the Top 250, the years 1957, 1995 and 2014 each contributed many films. After 1975 the number of films on the list rises markedly, which may be related to the growing maturity of the film industry.

As for 1995, readers who know their movies may be aware that it marked the 100th anniversary of world cinema. Countless film geniuses, as if presenting a birthday gift, delivered their great works around this time. The ones we know best include The Shawshank Redemption, Forrest Gump, Pulp Fiction, Four Weddings and a Funeral, Se7en and The Lion King (several of these actually premiered in 1994).

Let's also look at the scores of the films by year:

Comparing scores across release years, there is no obvious upward or downward trend. Film art evidently does not lose its value over time. For a film, technology is not what matters most; emotional resonance carries greater weight. Which movie is the best? The answer lies in each of us.
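One way to put a number on "no obvious trend" is the Pearson correlation between release year and score: a value near 0 suggests no linear relationship. This is only a sketch of such a check, not the method used for the chart; the year and score lists are invented and the helper name `pearson` is hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented sample, only to show the call
years = [1957, 1972, 1994, 1995, 2008, 2014]
scores = [8.9, 9.2, 9.2, 8.6, 9.0, 8.6]
r = pearson(years, scores)
print(round(r, 3))
```

A coefficient close to +1 or -1 would instead point to scores rising or falling steadily with the year.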

Country comparison

Let's look at the share of Top 250 films by country and region:

This data is interesting and a bit like the Nobel Prize: American films take up half of the list, and the other countries share the rest of the cake. The leading ones are the UK, France, Japan and Germany. China's only film on the list is In the Mood for Love.

One might blame mainstream Western values, but neighboring Japan, also a representative of Eastern culture, has 16 films on the list, so Western values cannot be the main reason Chinese films are missing. Although China has produced many high-quality works in recent years, such as Big Fish & Begonia and the newly released The Wandering Earth, their reception in the international market has been lukewarm. I believe that film is a common language and that there really are such things as universal values. How to build an international film industry and tell stories to the whole world is the next question Chinese filmmakers need to explore.

Director comparison

Let's look at the directors who appear most often in the Top 250:

The "Nobel Prize" of the film world has been announced, so let's see which directors made the list. Since readers may not be familiar with the names of foreign directors, a table matching each director to their representative works is provided here. It is worth noting that Ridley Scott, James Cameron and David Fincher directed "Alien", "Aliens" and "Alien 3" respectively. A single franchise has three directors on the list, which shows the influence of the series.

Audience group comparison

First, let's look at the scores given by different groups:

By gender, men are more likely than women to give high scores. By age group, minors of both sexes are the most likely to give high scores. As age increases, the scores become harsher and harsher, and people over 45 give the lowest scores. Is it that a heart which has seen so much is hard to move? Or can a film only be judged fairly and objectively by someone with wide experience? This may be worth studying, for example to find a scientific way of balancing the age groups of film festival juries.

However, besides the scores, we also need to know what share of the raters each group makes up:

Although the older men and women give low scores, there is no need to worry about a film's reputation on that account. The data tell us that a film which satisfies the tastes of young and middle-aged men in the 18-29 and 30-44 age groups will certainly not have a bad reputation. The good word of mouth enjoyed in recent years by war action films such as Wolf Warrior and Operation Red Sea gives a glimpse of how the scoring works.
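The reason a low-scoring group barely moves a film's overall score is that the overall figure is weighted by how many raters each group has. A minimal sketch of that arithmetic follows; the group labels, scores and rater counts are invented, and the helper name `weighted_mean` is hypothetical.

```python
def weighted_mean(groups):
    """groups: list of (label, score, votes). Returns (overall score, share per group)."""
    total_votes = sum(votes for _, _, votes in groups)
    overall = sum(score * votes for _, score, votes in groups) / total_votes
    share = {label: votes / total_votes for label, _, votes in groups}
    return overall, share

# Invented figures for a single film
groups = [
    ("male <18", 9.0, 1000),
    ("male 18-29", 8.6, 40000),
    ("male 30-44", 8.4, 45000),
    ("male 45+", 8.0, 14000),
]
overall, share = weighted_mean(groups)
print(round(overall, 2), {k: round(v, 2) for k, v in share.items()})
```

In this made-up case the 45+ group scores lowest but holds only 14% of the votes, so the overall score stays close to what the two middle groups gave.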

Relationship between genre, age and score

First, let's use a heat map to see how different groups rate different genres:

Different age groups prefer different genres. For example, male and female minors show a strong interest in mystery and Western films, while men and women over 45 love science fiction and film noir respectively.

The scores also need to be analyzed together with each group's share of the raters:

This time we refine the data down to the individual age groups. Combining this with the scores each group gives, we pick recommended films from the Top 250 for every age group.
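A per-group recommendation can be sketched as a filter and a sort: keep one group's ratings, drop films with too few raters in that group, and rank by the group's score. This is an assumed procedure for illustration only; the `min_votes` threshold, the helper name `recommend` and the sample rows are all hypothetical.

```python
def recommend(ratings, group, top_n=3, min_votes=100):
    """Return the top_n film names for `group`, ranked by score, then by votes."""
    rows = [r for r in ratings
            if r["group"] == group and r["votes"] >= min_votes]
    rows.sort(key=lambda r: (r["score"], r["votes"]), reverse=True)
    return [r["film"] for r in rows[:top_n]]

# Invented sample rows: one row per film and group
ratings = [
    {"film": "Film A", "group": "female 18-29", "score": 8.9, "votes": 5000},
    {"film": "Film B", "group": "female 18-29", "score": 9.4, "votes": 40},
    {"film": "Film C", "group": "female 18-29", "score": 8.5, "votes": 9000},
    {"film": "Film A", "group": "male 45+", "score": 8.1, "votes": 3000},
    {"film": "Film C", "group": "male 45+", "score": 8.7, "votes": 2500},
]
print(recommend(ratings, "female 18-29", top_n=2))
print(recommend(ratings, "male 45+", top_n=1))
```

The vote threshold keeps a film with a very high score from only a handful of raters, like Film B here, out of the recommendations.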

Film recommendation

Male, under 18

Male, 18-29

Male, 30-44

Male, 45+

Female, under 18

Female, 18-29

Female, 30-44

Female, 45+

The films above are recommended purely on the basis of the IMDB Top 250 data. Apologies in advance if they do not match your taste; after all, the preferences of American audiences still differ from those of Chinese audiences.

Keywords: Python crawler data visualization requests

Added by stef686 on Fri, 19 Nov 2021 12:54:56 +0200