After analyzing the data obtained from the new energy index of household lottery, I found a secret

About the author: program ape stone (ID: tangleishu), now Alibaba technical expert, Tsinghua xueslag, former Dajiang back-end Leader. Share high-quality technical articles from different perspectives for the purpose of making people gain from each article. Welcome to pay attention, exchange and guidance!

background

These two days, the ranking results of the index points of the first batch of household new energy passenger cars in Beijing were released.

Of course, as expected, unfortunately, I was not shortlisted and ranked around 8W. It seems that if we want to rank the scores of new energy, we have to pay close attention to increasing the number of generations of families. If we have a baby, my score is close to 60, and I still hope to be ranked as soon as possible.

Integral sorting result

Look at the data

Let's see what these data are like?

The website has downloaded pdf, which is not convenient for analysis. Obviously, as a program ape, I am still used to analyzing in the way of programmers. First of all, I'd better turn it into a plain text file, which can be converted into csv. (download address on official website: https://www.bjhjyd.gov.cn/jggb/2020910/1599732631719_1.htm )

  1. There are tools on the Internet, which can upload pdf and transfer to csv.
  2. You can copy it directly and paste plain text. The author adopted this method very quickly.

Based on the previous two articles, the following data can be easily obtained:

  1. 20 + command line artifact to improve development efficiency N times! (with demo)
  2. Didn't expect the shell command to play like this| Shell big data analysis

Because the copy comes out in plain text, the name and ID card are connected together. It needs to be split. It's very simple:

cat car2020.csv | awk '{print substr($3, 1, match($3,/[0-9]/)-1) "\t"  substr($3, match($3,/[0-9]/)) "\t" $4 "\t" $5 "\t" $6 "\t" $7}' > car-2020.csv 

Lottery data set csv

Score analysis

First, the file results are directly in reverse order of scores.

  • The highest score is 228, and the family is composed of 7 people.
  • The lowest score is 72, with most families with 4 people in 2 generations and 4 people in 3 generations.
cat car-2020.csv | awk '{print $5}'  | uniq -c  | sort -k 1 -nr

Integral distribution

Look at the distribution of scores. Most of them are around 70-120 points, accounting for 80%.

Family situation

Nearly 70% of family algebra are three generations.

➜ Downloads cat car-2020.csv | awk '{print $4}'  | sort | uniq -c
6621 2
13379 3
  • There are 9 people in the family. After looking at it, there are three generations of nine people. Guess that both parents have 4 + two 2 + 3 children? Anyway, I think the main family applicants have applied for lottery since the first phase. It's not easy for such a big family without a license plate.
➜  Downloads cat car-2020.csv | awk '{print $3}'  | sort | uniq -c
4245 3
5124 4
6912 5
2284 6
1240 7
 193 8
   2 9

Principal applicant

Too many people have been involved in lottery since the first year.

➜  Downloads cat car-2020.csv | awk '{print substr($6, 1, 4)}'  | sort | uniq -c
6874 2011
7401 2012
3729 2013
1384 2014
 482 2015
 111 2016
  19 2017

Look, there are 1367 from the first wave.

➜  Downloads cat car-2020.csv | awk '{print substr($6, 1, 7)}'  | sort | uniq -c 
1367 2011-01
 797 2011-02
 665 2011-03
 519 2011-04
 473 2011-05
 356 2011-06
 424 2011-07
 448 2011-08
 407 2011-09
 516 2011-10
 434 2011-11
 468 2011-12

Where does the winner come from?

This requires the help of the administrative area corresponding to the ID number. Didn't expect the shell command to play like this| Shell playing big data analysis As in this article, let's draw a conclusion directly.

  • Let's first look at the division according to the provincial level.
➜  Downloads join -1 2 -2 1 <(cat car-2020.csv | awk '{print substr($2, 1, 2)}' |sort | uniq -c | sort -k1 -nr | head -n 30 | sort -k2) id-area.code2.sort.txt | sort -k2 -nr
11 14792 Beijing
13 1101 Hebei Province
37 638 Shandong Province
41 360 Henan Province
14 330 Shanxi Province
23 296 Heilongjiang Province
21 291 Liaoning Province
42 239 Hubei province
15 224 Inner Mongolia Autonomous Region
22 217 Jilin Province
43 175 Hunan Province
61 174 Shaanxi Province
34 169 Anhui Province
32 162 Jiangsu Province
51 161 Sichuan Province
36 135 Jiangxi Province
12 107 Tianjin
62 82 Gansu Province
33 64 Zhejiang Province
35 53 Fujian Province
65 46 Xinjiang Uygur Autonomous Region
64 34 Ningxia Hui Autonomous Region
45 33 Guangxi Zhuang Autonomous Region
50 21 Chongqing City
63 19 Qinghai Province
53 19 Yunnan Province
44 19 Guangdong Province
52 17 Guizhou Province
31 12 Shanghai
46 6 Hainan 
  • The first column: the first two digits of the ID card are basically located in the province;
  • Column 2: corresponding number of people
  • Column 3: corresponding province

It can be seen that old Beijing still accounts for the largest proportion, accounting for 74%. Stone wanted to find a free thermal map generation tool to show it. I didn't find it for a short time, so I gave up first.

  • If you look at the urban level, please refer to the top 6 of your ID card.

Let's take a look at the top 30, which basically revolves around Beijing.

➜  Downloads join -1 2 -2 1 <(cat car-2020.csv | awk '{print substr($2, 1, 6)}' |sort | uniq -c | sort -k1 -nr | head -n 30 | sort -k2) address_code_uniq.csv | sort -k2 -nr
110111 1592 Fangshan District
110223 1531 Tongxian
110224 1439 Daxing County
110108 1109 Haidian District
110105 1108 Chaoyang District
110222 1075 Shunyi County
110229 846 Yanqing County
110106 734 Fengtai District
110221 725 Changping County
110226 610 Pinggu County
110228 599 Miyun County
110102 570 Xicheng District
110227 557 Huairou County
110101 475 Dongcheng District
110104 471 Xuanwu District
110109 415 Mentougou District
110103 382 Chongwen District
110107 332 Shijingshan District
110225 189 Fangshan County
131082 43 Sanhe City
230103 34 Nangang District
140202 28 city proper
110110 25 Yanshan District
220104 24 Chaoyang District
150102 23 New urban area
130102 23 Chang'an District
610103 22 Beilin District
130203 21 Lubei District
420106 19 Wuchang District
130681 18 Zhuozhou 

Where, address_code_uniq.csv from https://raw.githubusercontent.com/jxlwqq/address-code-of-china/master/address_code.csv , there is a hole in the administrative region code data downloaded from the official government website http://www.mca.gov.cn//article/sj/xzqh/2020/ It is the latest. Some administrative region codes have been revoked and are no longer used. (thanks to Shitou's verification and data synthesis, it is found that there is a large difference from the total number of 2000)

But the ID card that has been issued before can't be invalidated. For example, the administrative division code 110223 (Tongxian County, Beijing) has been revoked and will no longer be used in the newly issued ID card.

In addition, it was found that one case did not use ID card number as the ID number. Looks like a passport? CH1HFP ********** I don't understand this. Do you have any friends who know?

That's all for now. In addition, if interested friends need the data set analyzed in this article for exchange and learning, they can reply to "score sorting" to obtain and process Excel and CSV files.

Attached is the comparison table of Yaohao problem solving points

Back to the topic itself, this ㊙️ The secret is: if you want to win the lottery as soon as possible, give birth to a baby. Hahaha, don't hit me.

Added by zachatk1 on Mon, 28 Feb 2022 09:45:32 +0200