Illustrated big data | Comprehensive case: mining music album data with Spark

Author: Han Xinzi@ShowMeAI
Tutorial address: http://www.showmeai.tech/tutorials/84
Article address: http://www.showmeai.tech/article-detail/178
Notice: All rights reserved. To reprint, please contact the platform and the author and cite the source.

Introduction

This case uses music album data to walk through processing and analyzing data stored in HDFS with PySpark, and presents the analysis results as visualizations.

1. Experimental environment

(1) Linux: Ubuntu 16.04
(2) Python: 3.8
(3) Hadoop: 3.1.3
(4) Spark: 2.4.0
(5) Web framework: Flask 1.0.3
(6) Visualization: ECharts
(7) Development tool: Visual Studio Code

To support the Python-based visualization, run the following commands to install Flask:

sudo apt-get install python3-pip
pip3 install flask
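
After installation, it can be useful to confirm that the packages are visible to the Python interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of the original tutorial.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the named top-level module can be located for import."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # flask is installed by the commands above; pyspark ships with Spark.
    for name in ("flask", "pyspark"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If a package is reported as missing, check that pip3 and python3 refer to the same Python installation.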

2. Experimental data set

1) Data set description

Data set and source code download
Link: https://pan.baidu.com/s/1C0VI6w679izw1RENyGDXsw
Extraction code: show

The data set for this case comes from the Kaggle platform. The file is named albums.csv and contains data on 100,000 music albums (you can download it from the Baidu Netdisk link above). The main fields are described as follows:

album_title: album name
genre: album type (genre)
year_of_pub: album release year
num_of_tracks: number of singles (tracks) on the album
num_of_sales: album sales
rolling_stone_critic: rating from Rolling Stone
mtv_critic: rating from MTV, the world's largest music television network
music_maniac_critic: rating from Music Maniac
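
To make the field types concrete, here is a minimal sketch that parses rows with these columns using Python's standard csv module. The two sample rows are made up for illustration, and the sketch assumes only the columns listed above; the real file may contain more. The numeric casts mirror what the analysis code does later, since values read from CSV arrive as strings.

```python
import csv
import io

# Two made-up rows using only the columns listed above; the real file may
# contain additional columns, which this sketch would simply leave as strings.
SAMPLE = "\n".join([
    "album_title,genre,year_of_pub,num_of_tracks,num_of_sales,"
    "rolling_stone_critic,mtv_critic,music_maniac_critic",
    "Example One,Indie,2005,12,150000,3.5,4.0,2.5",
    "Example Two,Pop,2010,9,90000,2.0,3.0,4.5",
])

INT_FIELDS = ("year_of_pub", "num_of_tracks", "num_of_sales")
FLOAT_FIELDS = ("rolling_stone_critic", "mtv_critic", "music_maniac_critic")


def parse_albums(text):
    """Parse CSV text into dicts, casting the numeric columns from strings."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = dict(raw)
        for name in INT_FIELDS:
            row[name] = int(row[name])
        for name in FLOAT_FIELDS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


albums = parse_albums(SAMPLE)
print(albums[0]["genre"], albums[0]["num_of_sales"])
```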

2) Upload data to HDFS

(1) Start the HDFS component of Hadoop by running the following command on the command line:

/usr/local/hadoop/sbin/start-dfs.sh

(2) Create the home directory of the hadoop user on HDFS by running the following command on the command line:

hdfs dfs -mkdir -p /user/hadoop

(3) Upload the data set albums.csv from the local file system to the distributed file system HDFS:

hdfs dfs -put albums.csv

3. PySpark data analysis

1) Create the project files

(1) Create a folder named code
(2) Create a project.py file under code
(3) Create a static folder under code to store static files
(4) Create a data directory under the code/static folder to store the JSON data generated by the analysis
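
The four steps above can also be scripted. The following is a minimal sketch using pathlib; the function name create_layout is hypothetical, and the demo runs in a temporary directory so that it does not touch your working directory.

```python
import tempfile
from pathlib import Path


def create_layout(root):
    """Create code/, code/static/data/ and an empty code/project.py under root."""
    code_dir = Path(root) / "code"
    data_dir = code_dir / "static" / "data"
    # parents=True also creates code/ and code/static/ along the way.
    data_dir.mkdir(parents=True, exist_ok=True)
    (code_dir / "project.py").touch(exist_ok=True)
    return code_dir


if __name__ == "__main__":
    # Demonstrate in a throwaway directory; pass "." to create the layout in place.
    with tempfile.TemporaryDirectory() as tmp:
        created = create_layout(tmp)
        print(sorted(p.relative_to(tmp).as_posix() for p in created.rglob("*")))
```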

2) Conduct data analysis

This article performs a series of analyses on the music album data set albums.csv, including:

(1) Count the number of albums of each type
(2) Count the total sales of all types of albums
(3) Count the number of albums and singles released each year over the past 20 years
(4) Analyze the sales volume of the top five album types in each year
(5) Analyze the average score of the top five album types in different scoring systems
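
Before turning to Spark, the first two analyses can be sketched in plain Python to show the underlying group-and-aggregate logic. This is only an illustration on made-up records; it is not the tutorial's implementation, which uses the DataFrame and RDD APIs.

```python
from collections import Counter, defaultdict

# Made-up (genre, num_of_sales) pairs for illustration only.
records = [("Indie", 100), ("Pop", 80), ("Indie", 50), ("Rap", 30), ("Pop", 20)]


def count_by_genre(rows):
    """Analysis (1): number of albums per genre."""
    return dict(Counter(genre for genre, _ in rows))


def sales_by_genre(rows):
    """Analysis (2): total sales per genre."""
    totals = defaultdict(int)
    for genre, sales in rows:
        totals[genre] += sales
    return dict(totals)


counts = count_by_genre(records)
totals = sales_by_genre(records)
print(counts)
print(totals)
```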

3) Code implementation

Get the data set and code → ShowMeAI's official GitHub https://github.com/ShowMeAI-Hub/awesome-AI-cheatsheets
Run the code and learn → online programming environment http://blog.showmeai.tech/python3-compiler

The code of project.py is as follows:

from pyspark import SparkContext
from pyspark.sql import SparkSession
import json

# Count the number of albums of each type (only up to ten types with more than 2000 albums are kept)
def genre(sc, spark, df):
    # Group by the genre field, count the albums of each type, and keep the types with more than 2000 albums
    # Take up to 10 of them for display
    j = df.groupBy('genre').count().filter('count > 2000').take(10)
    # Convert the list to a JSON string and write it to a file in the static/data directory
    with open('static/data/genre.json', 'w') as f:
        f.write(json.dumps(j))

# Compute the total sales of each album type
def genreSales(sc, spark, df):
    # The CSV columns are read as strings, so cast sales to int before summing per type
    j = df.select('genre', 'num_of_sales').rdd\
        .map(lambda v: (v.genre, int(v.num_of_sales)))\
        .reduceByKey(lambda x, y: x + y).collect()
    with open('static/data/genre-sales.json', 'w') as f:
        f.write(json.dumps(j))

# Count the number of albums and singles released each year
def yearTracksAndSales(sc, spark, df):
    # For each year, sum the number of singles and count the albums, then sort by year
    result = df.select('year_of_pub', 'num_of_tracks').rdd\
        .map(lambda v: (int(v.year_of_pub), [int(v.num_of_tracks), 1]))\
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])\
        .sortByKey()\
        .collect()

    # To simplify visualization, store each field as its own list
    ans = {}
    ans['years'] = list(map(lambda v: v[0], result))
    ans['tracks'] = list(map(lambda v: v[1][0], result))
    ans['albums'] = list(map(lambda v: v[1][1], result))
    with open('static/data/year-tracks-and-sales.json', 'w') as f:
        f.write(json.dumps(ans))

# Take the top five album types
# Note: this ranks types by number of albums (count), not by summed num_of_sales
def GenreList(sc, spark, df):
    genre_list = df.groupBy('genre').count()\
        .orderBy('count', ascending=False).rdd.map(lambda v: v.genre).take(5)
    return genre_list


# Analyze the yearly sales of the top five album types
def GenreYearSales(sc, spark, df, genre_list):
    # Keep only albums of the top five types, sum sales per (type, year), and sort by key
    result = df.select('genre', 'year_of_pub', 'num_of_sales').rdd\
        .filter(lambda v: v.genre in genre_list)\
        .map(lambda v: ((v.genre, int(v.year_of_pub)), int(v.num_of_sales)))\
        .reduceByKey(lambda x, y: x + y)\
        .sortByKey().collect()

    # Reshape into [genre, year, sales] rows grouped by type, a format suited to visualization
    result = list(map(lambda v: [v[0][0], v[0][1], v[1]], result))
    ans = {}
    for genre in genre_list:
        ans[genre] = list(filter(lambda v: v[0] == genre, result))
    with open('static/data/genre-year-sales.json', 'w') as f:
        f.write(json.dumps(ans))

# Compute the average score of the top five album types under each scoring system
def GenreCritic(sc, spark, df, genre_list):
    # Keep only albums of the top five types, then average the Rolling Stone, MTV and Music Maniac scores per type
    result = df.select('genre', 'rolling_stone_critic', 'mtv_critic', 'music_maniac_critic').rdd\
        .filter(lambda v: v.genre in genre_list)\
        .map(lambda v: (v.genre, (float(v.rolling_stone_critic), float(v.mtv_critic), float(v.music_maniac_critic), 1)))\
        .reduceByKey(lambda x, y : (x[0] + y[0], x[1] + y[1], x[2] + y[2], x[3] + y[3]))\
        .map(lambda v: (v[0], v[1][0]/v[1][3], v[1][1]/v[1][3], v[1][2]/v[1][3])).collect()

    with open('static/data/genre-critic.json', 'w') as f:
        f.write(json.dumps(result))


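# Program entry point
if __name__ == "__main__":
    sc = SparkContext('local', 'test')
    sc.setLogLevel("WARN")
    spark = SparkSession.builder.getOrCreate()
    file = "albums.csv"  # relative path; the tutorial expects it to resolve to the copy uploaded to HDFS
    df = spark.read.csv(file, header=True)  # DataFrame; without a schema, all columns are read as strings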

    genre_list = GenreList(sc, spark, df)

    genre(sc, spark, df)
    genreSales(sc, spark, df)
    yearTracksAndSales(sc, spark, df)
    GenreYearSales(sc, spark, df, genre_list)
    GenreCritic(sc, spark, df, genre_list)
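
The averaging in GenreCritic relies on a common pattern: each record carries its scores plus a count of 1, reduceByKey adds the tuples element by element, and a final map divides each sum by the count. The sketch below imitates that pattern on a single machine without Spark. The helper reduce_by_key is a hypothetical stand-in for RDD.reduceByKey, and the records are made up.

```python
def reduce_by_key(pairs, func):
    """Merge values that share a key with func, imitating RDD.reduceByKey locally."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Made-up (genre, (rolling_stone, mtv, music_maniac, 1)) records.
pairs = [
    ("Indie", (3.0, 4.0, 2.0, 1)),
    ("Indie", (5.0, 2.0, 4.0, 1)),
    ("Pop", (1.0, 1.5, 2.5, 1)),
]

# Element-wise addition accumulates the three score sums and the record count.
sums = reduce_by_key(pairs, lambda x, y: tuple(a + b for a, b in zip(x, y)))

# Divide each score sum by the count stored in the last position.
averages = {genre: tuple(total / s[3] for total in s[:3]) for genre, s in sums.items()}
print(averages)
```

Carrying the count alongside the sums is what allows the average to be computed in a single pass over the data.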

4) Run the code

(1) In an Ubuntu terminal window, switch to the hadoop user by running su hadoop on the command line and entering the user password.
(2) Enter the directory where the code is located.
(3) To be able to read the albums.csv file in HDFS, start HDFS by running the following on the command line:

/usr/local/hadoop/sbin/start-dfs.sh

(4) Run on the command line:

spark-submit project.py
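
Once the job finishes, project.py should have written five JSON files to static/data. A quick sanity check before moving on to the visualization might look like the following sketch. The helper name missing_outputs is hypothetical; the file names are the ones used in project.py. The demo runs against a temporary directory that holds only one of the five files.

```python
import json
import tempfile
from pathlib import Path

# File names written by project.py.
EXPECTED = [
    "genre.json",
    "genre-sales.json",
    "year-tracks-and-sales.json",
    "genre-year-sales.json",
    "genre-critic.json",
]


def missing_outputs(data_dir):
    """Return the expected file names that are absent or do not contain valid JSON."""
    problems = []
    for name in EXPECTED:
        path = Path(data_dir) / name
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # OSError covers missing or unreadable files; ValueError covers bad JSON.
            problems.append(name)
    return problems


# Demo against a throwaway directory that holds only one of the five files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "genre.json").write_text("[]", encoding="utf-8")
    demo_missing = missing_outputs(tmp)
print(demo_missing)
```

In the project itself, you would call missing_outputs("static/data") from the code directory and expect an empty list.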

4. Visualization

The visualization in this case is based on ECharts, and the resulting pages are served by a web server built on the Flask framework.

1) Related code structure

(1) Create a new VisualizationFlask.py file in the code directory to store the Flask application.
(2) Create a new folder named templates under the code directory to store html files.
(3) Create a new folder named js in the code/static directory to store js files.

2) Create the Flask application

Copy the following code into the VisualizationFlask.py file:

from flask import render_template
from flask import Flask
# from livereload import Server

app = Flask(__name__)

@app.route('/')
def index():
    #Using render_template() method to render the template
    return render_template('index.html')

@app.route('/<filename>')
def req_file(filename):
    return render_template(filename)

if __name__ == '__main__':
    app.debug = True  # Code changes take effect immediately
    app.jinja_env.auto_reload = True  # Template changes take effect immediately
    app.run()  # Run the application on the local development server

3) Download the JS files

(1) Download jQuery from https://cdn.bootcss.com/jquery/3.4.1/jquery.min.js and save it as jquery.min.js in the code/static/js directory.
(2) Download ECharts from the official download page ( https://echarts.apache.org/zh/download.html ) and save it as echarts-gl.min.js in the code/static/js directory.

4) Echarts visualization

(1) Create a new index.html file in the code/templates directory and copy the following code into it:

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Music</title>
</head>

<body>
    <h2>Music album analysis</h2>
    <ul style="line-height: 2em">
        <li><a href="genre.html">Statistical chart of the number of albums of each type</a></li>
        <li><a href="genre-sales.html">Sales statistics of various types of albums</a></li>
        <li><a href="year-tracks-and-sales.html">Statistical chart of the number of albums and singles released each year in the past 20 years</a></li>
        <li><a href="genre-year-sales.html">Sales volume analysis chart of the top five album types in each year</a></li>
        <li><a href="genre-critic.html">Score analysis chart of the top five album types in total sales</a></li>
    </ul>
</body>
</html>

index.html is the main page. It lists links to the pages that hold each statistical chart; click any link to jump to the corresponding page.

(2) Create a new genre.html file in the code/templates directory and copy the following code into it:

<!DOCTYPE html>
<html>

<head>
    <meta charset="utf-8">
    <title>ECharts</title>
    <!-- Load echarts.js -->
    <script src="static/js/echarts-gl.min.js"></script>
    <script src="static/js/jquery.min.js"></script>
</head>

<body>
    <!-- Prepare a DOM element with a defined size (width and height) for ECharts -->
    <a href="/">Return</a>
    <br>
    <br>
    <div id="genre" style="width: 480px;height:500px;"></div>
    <script type="text/javascript">
        $.getJSON("static/data/genre.json", d => {
            _data = d.map(v => ({
                name: v[0],
                value: v[1]
            }))
            // Initialize the ECharts instance on the prepared DOM element
            var myChart = echarts.init(document.getElementById('genre'), 'light');

            // Specify configuration items and data for the chart
            option = {
                title: {
                    text: 'Statistical chart of the number of albums of each type',
                    subtext: 'As the chart shows, the Indie type has the largest number of albums.',
                    // x: 'center'
                    x: 'left'
                },
                tooltip: {
                    trigger: 'item',
                    formatter: "{a} <br/>{b} : {c} ({d}%)"
                },
                legend: {
                    x: 'center',
                    y: 'bottom',
                    data: d.map(v => v[0])
                },
                toolbox: {
                    show: true,
                    feature: {
                        mark: { show: true },
                        dataView: { show: true, readOnly: false },
                        magicType: {
                            show: true,
                            type: ['pie', 'funnel']
                        },
                        restore: { show: true },
                        saveAsImage: { show: true }
                    }
                },
                calculable: true,
                series: [
                    {
                        name: 'Radius mode',
                        type: 'pie',
                        radius: [30, 180],
                        center: ['50%', '50%'],
                        roseType: 'radius',
                        label: {
                            normal: {
                                show: false
                            },
                            emphasis: {
                                show: true
                            }
                        },
                        labelLine: {
                            normal: {
                                show: false
                            },
                            emphasis: {
                                show: true
                            }
                        },
                        data: _data
                    }
                ]
            };

            // Use the configuration item and data you just specified to display the chart.
            myChart.setOption(option);
        })
    </script>
</body>
</html>

This page reads the data in code/static/data/genre.json and draws a rose chart showing the number of albums of each type.
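
To see what the page receives, it helps to look at the shape of genre.json. The sketch below uses plain tuples as a stand-in for the Row objects returned by take(10); PySpark's Row is a tuple subclass, so json.dumps serializes it as a JSON array in the same way. The values are made up. The final list comprehension mirrors the d.map(...) call in the page script.

```python
import json

# Stand-in for the Row objects returned by take(10); values are made up.
rows = [("Indie", 3000), ("Pop", 2500)]

payload = json.dumps(rows)    # the text project.py writes to genre.json
loaded = json.loads(payload)  # the structure $.getJSON hands to the page

# Same reshaping as the JavaScript: [name, count] pairs become {name, value} objects.
chart_data = [{"name": name, "value": value} for name, value in loaded]

print(payload)
print(chart_data)
```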

(3) Create a new genre-sales.html file in the code/templates directory and copy the following code into it:

<!DOCTYPE html>
<html>

<head>
    <meta charset="utf-8">
    <title>ECharts</title>
    <!-- Load echarts.js -->
    <script src="static/js/echarts-gl.min.js"></script>
    <script src="static/js/jquery.min.js"></script>
</head>

<body>
    <a href="/">Return</a>
    <br>
    <br>
    <!-- Prepare a DOM element with a defined size (width and height) for ECharts -->
    <div id="genre-sales" style="width: 1000px;height:550px;"></div>
    <script type="text/javascript">
        $.getJSON("static/data/genre-sales.json", d => {
            console.log(d);

            // Initialize the ECharts instance on the prepared DOM element
            var myChart = echarts.init(document.getElementById('genre-sales'), 'light');

            var dataAxis = d.map(v => v[0]);
            var data = d.map(v => parseInt(v[1])/1e6);

            option = {
                title: {
                    text: 'Sales statistics of various types of albums',
                    subtext: 'This chart shows the total sales of each album type. Indie albums sold the most, nearly 4.7 billion; Pop albums rank second, with about 3.9 billion.',
                    x: 'center',
                    // bottom: 10
                    padding: [0, 0, 15, 0]
                },
                color: ['#3398DB'],
                tooltip: {
                    trigger: 'axis',
                    axisPointer: {            // Axis indicator, axis trigger active
                        type: 'shadow'        // The default is straight line, and the options are: 'line' | 'shadow'
                    }
                },
                grid: {
                    left: '3%',
                    right: '4%',
                    bottom: '3%',
                    containLabel: true
                },
                xAxis: [
                    {
                        type: 'category',
                        data: dataAxis,
                        axisTick: {
                            show: true,
                            alignWithLabel: true,
                            interval: 0
                        },
                        axisLabel: {
                            interval: 0,
                            rotate: 45,
                        }
                    }
                ],
                yAxis: [
                    {
                        type: 'value',
                        name: '# Million Albums',
                        nameLocation: 'middle',
                        nameGap: 50 
                    }
                ],
                series: [
                    {
                        name: 'Direct access',
                        type: 'bar',
                        barWidth: '60%',
                        data: data
                    }
                ]
            };

            // Use the configuration item and data you just specified to display the chart.
            myChart.setOption(option);

        })
    </script>
</body>
</html>

This page reads the data in code/static/data/genre-sales.json and draws a bar chart showing the total sales of each album type.

(4) Create a new year-tracks-and-sales.html file in the code/templates directory and copy the following code into it:

<!DOCTYPE html>
<html>

<head>
    <meta charset="utf-8">
    <title>ECharts</title>
    <!-- Load echarts.js -->
    <script src="static/js/echarts-gl.min.js"></script>
    <script src="static/js/jquery.min.js"></script>
</head>

<body>
    <a href="/">Return</a>
    <br>
    <br>
    <!-- Prepare a DOM element with a defined size (width and height) for ECharts -->
    <div id="canvas" style="width: 1000px;height:550px;"></div>
    <script type="text/javascript">
        $.getJSON("static/data/year-tracks-and-sales.json", d => {
            console.log(d)
            // Initialize the ECharts instance on the prepared DOM element
            var myChart = echarts.init(document.getElementById('canvas'), 'light');

            var colors = ['#5793f3', '#d14a61', '#675bba'];

            option = {
                title: {
                    text: 'Trends in the number of albums and singles in the past 20 years',
                    padding: [1, 0, 0, 15]
                    // subtext: 'The chart shows the trend in the number of albums and singles released from 2000 to 2019. The number of albums changed very little, staying at about 5,000; the number of singles fluctuated slightly, at about 10 times the number of albums.'
                },
                tooltip: {
                    trigger: 'axis'
                },
                legend: {
                    data: ['Number of singles', 'Number of albums'],
                    padding: [2, 0, 0, 0]
                },
                toolbox: {
                    show: true,
                    feature: {
                        dataZoom: {
                            yAxisIndex: 'none'
                        },
                        dataView: { readOnly: false },
                        magicType: { type: ['line', 'bar'] },
                        restore: {},
                        saveAsImage: {}
                    }
                },
                xAxis: {
                    type: 'category',
                    data: d['years'],
                    boundaryGap: ['20%', '20%']
                },
                yAxis: {
                    type: 'value',
                    // type: 'log',
                    axisLabel: {
                        formatter: '{value}'
                    }       
                },
                series: [
                    {
                        name: 'Number of singles',
                        type: 'bar',
                        data: d['tracks'],
                        barWidth: 15,
                    },
                    {
                        name: 'Number of albums',
                        type: 'bar',
                        data: d['albums'],
                        barGap:  '-100%',
                        barWidth: 15,
                    }
                ]
            };

            // Use the configuration item and data you just specified to display the chart.
            myChart.setOption(option);

        })
    </script>
</body>
</html>

This page reads the data in code/static/data/year-tracks-and-sales.json and draws a bar chart showing the number of albums and singles released each year over the past 20 years.

(5) Create a new genre-year-sales.html file in the code/templates directory and copy the following code into it:

<!DOCTYPE html>
<html>

<head>
    <meta charset="utf-8">
    <title>ECharts</title>
    <!-- Load echarts.js -->
    <script src="static/js/echarts-gl.min.js"></script>
    <script src="static/js/jquery.min.js"></script>
</head>

<body>
    <a href="/">Return</a>
    <br>
    <br>
    <!-- Prepare a DOM element with a defined size (width and height) for ECharts -->
    <div id="genre-year-sales" style="width: 1000px;height:550px;"></div>
    <script type="text/javascript">
        $.getJSON("static/data/genre-year-sales.json", d => {
            console.log(d);

            // Initialize the ECharts instance on the prepared DOM element
            var myChart = echarts.init(document.getElementById('genre-year-sales'), 'light');
            option = {
                legend: {},
                tooltip: {
                    trigger: 'axis',
                    showContent: false
                },
                dataset: {
                    source: [
                        ['year', ...d['Indie'].map(v => `${v[1]}`)],
                        ...['Indie', 'Pop', 'Rap', 'Latino', 'Pop-Rock'].map(v => [v, ...d[v].map(v1 => v1[2])])
                    ]
                },
                xAxis: { type: 'category' },
                yAxis: { gridIndex: 0 },
                grid: { top: '55%' },
                series: [
                    { type: 'line', smooth: true, seriesLayoutBy: 'row' },
                    { type: 'line', smooth: true, seriesLayoutBy: 'row' },
                    { type: 'line', smooth: true, seriesLayoutBy: 'row' },
                    { type: 'line', smooth: true, seriesLayoutBy: 'row' },
                    { type: 'line', smooth: true, seriesLayoutBy: 'row' },
                    {
                        type: 'pie',
                        id: 'pie',
                        radius: '30%',
                        center: ['50%', '25%'],
                        label: {
                            formatter: '{b}: {@2000} ({d}%)' //b is the data name and d is the percentage
                        },
                        encode: {
                            itemName: 'year',
                            value: '2000',
                            tooltip: '2000'
                        }
                    }

                ]
            };

            myChart.on('updateAxisPointer', function (event) {
                var xAxisInfo = event.axesInfo[0];
                if (xAxisInfo) {
                    var dimension = xAxisInfo.value + 1;
                    myChart.setOption({
                        series: {
                            id: 'pie',
                            label: {
                                formatter: '{b}: {@[' + dimension + ']} ({d}%)'
                            },
                            encode: {
                                value: dimension,
                                tooltip: dimension
                            }
                        }
                    });
                }
            });

            // Use the configuration item and data you just specified to display the chart.
            myChart.setOption(option);

        })
    </script>
</body>
</html>

This page reads the data in code/static/data/genre-year-sales.json and draws a pie chart and a line chart. The pie chart shows each album type's share of sales in a given year, and the line chart shows how the sales of the top five album types change from year to year.
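
The dataset.source table in this page has a header row of years followed by one row of sales per album type. The sketch below rebuilds that table in Python to make the layout explicit. The function name build_source and the sample values are hypothetical. Like the page script, it assumes every type has entries for the same years in the same order.

```python
def build_source(data, genres):
    """Build a header row of years, then one sales row per genre."""
    # Years are taken from the first genre and rendered as strings,
    # as the page does with its template literal.
    years = [str(entry[1]) for entry in data[genres[0]]]
    table = [["year", *years]]
    for genre in genres:
        table.append([genre, *(entry[2] for entry in data[genre])])
    return table


# Made-up data in the shape written by GenreYearSales: genre -> [[genre, year, sales], ...]
sample = {
    "Indie": [["Indie", 2000, 10], ["Indie", 2001, 12]],
    "Pop": [["Pop", 2000, 7], ["Pop", 2001, 9]],
}

source = build_source(sample, ["Indie", "Pop"])
for row in source:
    print(row)
```

Because each type occupies a row, the line series in the page set seriesLayoutBy to 'row'.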

(6) Create a new genre-critic.html file in the code/templates directory and copy the following code into it:

<!DOCTYPE html>
<html>

<head>
    <meta charset="utf-8">
    <title>ECharts</title>
    <!-- Load echarts.js -->
    <script src="static/js/echarts-gl.min.js"></script>
    <script src="static/js/jquery.min.js"></script>
</head>

<body>
    <a href="/">Return</a>
    <br>
    <br>
    <!-- Prepare a DOM element with a defined size (width and height) for ECharts -->
    <div id="genre-critic" style="width: 1000px;height:550px;"></div>
    <script type="text/javascript">
        $.getJSON("static/data/genre-critic.json", d => {
            console.log(d);

            // Initialize the ECharts instance on the prepared DOM element
            var myChart = echarts.init(document.getElementById('genre-critic'), 'light');
            option = {
                legend: {},
                tooltip: {},
                dataset: {
                    source: [
                        ['genre', ...d.map(v => v[0])],
                        ['rolling_stone_critic', ...d.map(v => v[1])],
                        ['mtv_critic', ...d.map(v => v[2])],
                        ['music_maniac_critic', ...d.map(v => v[3])]
                    ]
                },
                xAxis: [
                    { type: 'category', gridIndex: 0 },
                    { type: 'category', gridIndex: 1 }
                ],
                yAxis: [
                    { gridIndex: 0 , min: 2.7},
                    { gridIndex: 1 , min: 2.7}
                ],
                grid: [
                    { bottom: '55%' },
                    { top: '55%' }
                ],
                series: [
                    // These series are in the first grid.
                    { type: 'bar', seriesLayoutBy: 'row' , barWidth: 30},
                    { type: 'bar', seriesLayoutBy: 'row' , barWidth: 30},
                    { type: 'bar', seriesLayoutBy: 'row' , barWidth: 30 },
                    // These series are in the second grid.
                    { type: 'bar', xAxisIndex: 1, yAxisIndex: 1 , barWidth: 35},
                    { type: 'bar', xAxisIndex: 1, yAxisIndex: 1 , barWidth: 35},
                    { type: 'bar', xAxisIndex: 1, yAxisIndex: 1 , barWidth: 35},
                    { type: 'bar', xAxisIndex: 1, yAxisIndex: 1 , barWidth: 35}
                ]
            };

            // Use the configuration item and data you just specified to display the chart.
            myChart.setOption(option);
        })
    </script>
</body>
</html>

This page reads the data in code/static/data/genre-critic.json and draws a bar chart showing the average score of the top five album types under each scoring system.
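
genre-critic.json stores one row per album type, while the page's dataset.source needs one row per scoring system. That is a transpose, which the page performs with four d.map(...) calls. A minimal Python sketch of the same reshaping, on made-up averages, could use zip:

```python
# Made-up rows in the shape written by GenreCritic: [genre, rolling_stone, mtv, music_maniac]
rows = [
    ["Indie", 3.0, 3.5, 2.5],
    ["Pop", 2.0, 4.0, 3.0],
]

labels = ["genre", "rolling_stone_critic", "mtv_critic", "music_maniac_critic"]

# zip(*rows) yields the columns of the table; each column is prefixed with its label.
source = [[label, *column] for label, column in zip(labels, zip(*rows))]

for row in source:
    print(row)
```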

5) Start the web application

① In another Ubuntu terminal window, switch to the hadoop user by running su hadoop on the command line and entering the user password.
② Enter the directory where the code is located.
③ Run the following command on the command line:

spark-submit VisualizationFlask.py

④ Open http://127.0.0.1:5000/ in a browser to see the following pages:

(1) Statistical chart of the number of albums of each type

As can be seen from the figure, Indie has the largest number of albums.

(2) Sales statistics of various types of albums

The chart shows the total sales of each album type. Indie has the highest sales, nearly 4.7 billion; Pop ranks second, with about 3.9 billion.

(3) Statistical chart of the number of albums and singles released each year in the past 20 years

(4) Sales volume analysis chart of the top five album types in each year

(5) Score analysis chart of the top five album types in total sales

Keywords: Big Data Spark Data Analysis

Added by Spoiler on Tue, 08 Mar 2022 17:26:31 +0200