Crawler series: storing CSV files

This installment explains how to save scraped data to a CSV file.

Comma-separated values (CSV, sometimes called character-separated values, because the separator is not always a comma) is a common file format for storing tabular data. Microsoft Excel and many other applications support CSV because it is so concise. The following is an example of a CSV file:

code,parentcode,level,name,parentcodes,province,city,district,town,pinyin,jianpin,firstchar,tel,zip,lng,lat
110000,100000,1,Beijing,110000,Beijing,,,,Beijing,BJ,B,,,116.405285,39.904989
110100,110000,2,Beijing,"110000,110100",Beijing,Beijing,,,Beijing,BJS,B,010,100000,116.405285,39.904989
110101,110100,3,Dongcheng District,"110000,110100,110101",Beijing,Beijing,Dongcheng District,,Dongcheng,DCQ,D,010,100000,116.418757,39.917544

Like Python, CSV is sensitive to whitespace: each row is separated by a newline character, and columns are separated by commas (hence the name). CSV files can also use a tab or another character as the separator, but those variants are less common.
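For instance, Python's csv module handles tab-separated data simply by changing the delimiter; a minimal sketch, writing to an in-memory buffer:

```python
import csv
import io

# Write tab-separated values by changing the delimiter;
# everything else about the csv module stays the same.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t')
writer.writerow(['code', 'name'])
writer.writerow(['110000', 'Beijing'])

print(buf.getvalue())
```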

If you only want to download a CSV file from a web page to your computer, without modifying or analyzing it, you can skip the rest of this article: just download and save the file as described in the previous installment.

Python's built-in csv module makes it easy to modify CSV files, or even to create one from scratch:

import csv
import os
from os import path


class DataSaveToCSV(object):
    @staticmethod
    def save_data():
        get_path = path.join(os.getcwd(), 'files')
        if not path.exists(get_path):
            os.makedirs(get_path)
        csv_file = open(get_path + '\\test.csv', 'w+', newline='')
        try:
            writer = csv.writer(csv_file)
            writer.writerow(('number', 'number plus 2', 'number times 2'))
            for i in range(10):
                writer.writerow((i, i + 2, i * 2))
        finally:
            csv_file.close()


if __name__ == '__main__':
    DataSaveToCSV().save_data()

If the files folder does not exist, the code creates it. If test.csv already exists, Python overwrites it with the new data. The newline='' argument keeps blank lines from appearing between rows.

After running, you will see a CSV file:

number,number plus 2,number times 2
0,2,0
1,3,2
2,4,4
3,5,6
4,6,8
5,7,10
6,8,12
7,9,14
8,10,16
9,11,18
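Reading the file back is symmetric: csv.DictReader maps each row onto the header fields. A small sketch on an in-memory sample, so it runs on its own:

```python
import csv
import io

# A sample matching the output above, kept in memory for self-containment.
sample = "number,number plus 2,number times 2\n0,2,0\n1,3,2\n"

# DictReader uses the first row as field names and yields one dict per row.
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)
print(rows[0]['number plus 2'])   # every field comes back as a string: '2'
```

Note that csv.reader and csv.DictReader always return fields as strings; convert to int or float yourself if you need numbers.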

The following example scrapes a blog's article list and stores it in a CSV file. The specific code is as follows:

import csv
import os
from os import path

from utils import connection_util
from config import logger_config


class DataSaveToCSV(object):
    def __init__(self):
        self._init_download_dir = 'downloaded'
        self._target_url = 'https://www.scrapingbee.com/blog/'
        self._baseUrl = 'https://www.scrapingbee.com'
        self._init_connection = connection_util.ProcessConnection()
        logging_name = 'write_csv'
        init_logging = logger_config.LoggingConfig()
        self._logging = init_logging.init_logging(logging_name)


    def scrape_data_to_csv(self):
        get_path = path.join(os.getcwd(), 'files')
        if not path.exists(get_path):
            os.makedirs(get_path)
        with open(get_path + '\\article.csv', 'w+', newline='', encoding='utf-8') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(('title', 'Release time', 'Content summary'))
            # Connect to the target website to get content
            get_content = self._init_connection.init_connection(self._target_url)
            if get_content:
                parent = get_content.findAll("section", {"class": "section-sm"})[0]
                get_row = parent.findAll("div", {"class": "col-lg-12 mb-5 mb-lg-0"})[0]
                get_child_item = get_row.findAll("div", {"class": "col-md-4 mb-4"})
                for item in get_child_item:
                    # Get title text
                    get_title = item.find("a", {"class": "h5 d-block mb-3 post-title"}).get_text()
                    # Get publishing time
                    get_release_date = item.find("div", {"class": "mb-3 mt-2"}).findAll("span")[1].get_text()
                    # Get article description
                    get_description = item.find("p", {"class": "card-text post-description"}).get_text()
                    writer.writerow((get_title, get_release_date, get_description))
            else:
                self._logging.warning('No content of the article was obtained, please check!')


if __name__ == '__main__':
    DataSaveToCSV().scrape_data_to_csv()

Most of the code reuses content from the previous articles; the parts worth highlighting are:

    logging_name = 'write_csv'
    init_logging = logger_config.LoggingConfig()
    self._logging = init_logging.init_logging(logging_name)

These lines set the logger's name and instantiate the logger for later use.
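logger_config is a module from earlier in this series; as a rough stand-in, a minimal init_logging built on the standard library might look like this (the function name and format string here are assumptions, not the project's actual code):

```python
import logging


def init_logging(logging_name):
    # Minimal stand-in for the series' logger_config module:
    # a named logger that writes timestamped messages to stderr.
    logger = logging.getLogger(logging_name)
    if not logger.handlers:  # avoid adding duplicate handlers on re-init
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


log = init_logging('write_csv')
log.warning('No content of the article was obtained, please check!')
```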

    with open(get_path + '\\article.csv', 'w+', newline='', encoding='utf-8') as csv_file:

The with statement establishes a runtime context for the block it introduces. It encapsulates the usual try...except...finally pattern, so the file is closed automatically and the code is easier to reuse.
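Roughly, the with line replaces the explicit try...finally used in the first example (the file name here is arbitrary):

```python
import csv

# What "with open(...) as csv_file:" expands to, roughly:
csv_file = open('demo.csv', 'w+', newline='')
try:
    csv.writer(csv_file).writerow(('a', 'b'))
finally:
    csv_file.close()          # runs even if writerow raises

print(csv_file.closed)        # True
```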

newline='' avoids blank lines between rows in the CSV file.
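The reason: csv.writer terminates every row with \r\n by itself. Without newline='', text mode on Windows would translate the \n into \r\n again, producing \r\r\n, which shows up as a blank line after every row. A quick demonstration of the writer's own terminator:

```python
import csv
import io

# The csv writer already emits the CRLF row terminator itself,
# so the file object must not add another newline translation.
buf = io.StringIO()
csv.writer(buf).writerow((1, 2))

print(repr(buf.getvalue()))   # '1,2\r\n'
```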

At the same time, the file encoding is set to utf-8, which prevents garbled output when the content contains Chinese or text in other languages.
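A quick sketch of why this matters, writing and reading back a row that contains Chinese text (the file name is arbitrary, and the file goes to the system temp directory just to keep the example self-contained):

```python
import csv
import tempfile
from os import path

# Write a row containing non-ASCII text; utf-8 handles it reliably,
# whereas a legacy locale encoding could fail or garble the characters.
target = path.join(tempfile.gettempdir(), 'utf8_demo.csv')
with open(target, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerow(('北京', 'Beijing'))

with open(target, newline='', encoding='utf-8') as f:
    print(next(csv.reader(f)))   # ['北京', 'Beijing']
```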

The above covers saving scraped content to a CSV file. All of the example code is hosted on GitHub.

GitHub: https://github.com/sycct/Scra...

If you have any questions, feel free to open a GitHub issue.

Keywords: Python crawler csv

Added by mp96brbj on Thu, 09 Dec 2021 10:50:06 +0200