| Which course does this assignment belong to | Software Engineering Practice, Fall 2021 |
|---|---|
| What are the requirements for this assignment | The first personal programming assignment of Software Engineering Practice, Fall 2021 - CSDN community |
| The goal of this assignment | Learn to use GitHub; write a program that extracts keywords from C/C++ code at different levels |
| Student number | 031902218 |
GitHub usage
-
GitHub repository address
-
Screenshot of the repository
-
Screenshot of commits
Code specification
PSP table
| PSP2.1 | Personal Software Process Stages | Estimated time (minutes) | Actual time (minutes) |
|---|---|---|---|
| Planning | Planning | 60 | 60 |
| Estimate | Estimate how long this task will take | 30 | 20 |
| Development | Development | 300 | 360 |
| Analysis | Requirements analysis (including learning new technologies) | 60 | 60 |
| Design Spec | Generate design documents | 120 | 180 |
| Design Review | Design review | 20 | 25 |
| Coding Standard | Coding standard | 15 | 20 |
| Design | Detailed design | 120 | 120 |
| Coding | Coding | 300 | 360 |
| Code Review | Code review | 60 | 40 |
| Test | Test (self-test, modify code, submit changes) | 100 | 120 |
| Reporting | Reporting | 90 | 120 |
| Test Report | Test report | 60 | 60 |
| Size Measurement | Estimate workload | 15 | 15 |
| Postmortem & Process Improvement Plan | Summarize the work and propose a process improvement plan | 60 | 90 |
| Total | | 1410 | 1650 |
Problem-solving approach
-
First impressions of the problem
When I first read the problem it felt quite approachable 😂; at least I could fully understand it, and there were no terms I had to look up just to get started. The task mainly involves string operations on text: enough information has to be extracted from a C/C++ file to judge the structure of the code, and ultimately the keywords are extracted. The main difficulty is choosing an appropriate way to pull that information out of the file.
-
Language selection
After reading the problem, my first thought was to use regular expressions, because the task involves a large number of string operations on potentially long text with many distracting items, and regular expressions can greatly reduce the workload. Python has a mature regular expression library, `re`, which I had used before, so I chose Python for this assignment.
-
Keyword statistics
Before doing the keyword statistics I reviewed regular expression syntax and collected some references; regex101 lets you see the effect of a regular expression in real time, which is very handy.
To count keywords, a regular expression can extract all strings consisting only of letters, and the count function can then traverse the keyword list and tally every matching string. Before this idea can work, however, the data has to be cleaned. By the rules of C/C++, strings that match a keyword inside string literals or comments are not keywords and must not be counted; likewise, a keyword embedded in a variable name must not be counted. So the following cases have to be excluded before counting:
- `//` line comments
- `/* */` block comments
- `/*` unterminated block comments (a single opening symbol comments out all the following lines)
- `'...'` single-quoted strings
- `"..."` double-quoted strings
- keywords embedded in variable names (for example `int1qq`)
Regular expressions can remove the first five cases above. The last case could also be handled with a regular expression, but since the keyword-statistics stage collects every letters-only string into a list and then counts exact matches against the keyword list, it has no effect on the result there.
The regular expressions actually used are:
- All comments: `(//.*)|(/\*[\s\S]*?\*/)|(/\*[\s\S]*)`
- Single-quoted strings: `(\'[\s\S]*?\')|(\'[\s\S]*)`
- Double-quoted strings: `(\"[\s\S]*?\")|(\"[\s\S]*)`
- All strings that may be keywords: `[a-zA-Z]{2,}`
P.S. While testing in VS Code I found that a single unmatched single quote ' or double quote " can also swallow all the following lines, so the string patterns above account for that case as well (the second alternative in each pattern).
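To sanity-check these patterns, here is a small illustrative snippet (not one of my real test files) run through Python's `re` module; the cleaning order, comments first and then strings, mirrors the description above:

```python
import re

# Illustrative source line: a string literal, a line comment, and a block comment
src = 'printf("if else"); // if (x) else\n/* while\n   if */ int ifx = 0;\n'

comments = r'(//.*)|(/\*[\s\S]*?\*/)|(/\*[\s\S]*)'
dq_string = r'(\"[\s\S]*?\")|(\"[\s\S]*)'

# Blank out comments and strings, then collect candidate keywords
cleaned = re.sub(dq_string, ' ', re.sub(comments, ' ', src))
print(re.findall(r'[a-zA-Z]{2,}', cleaned))
# ['printf', 'int', 'ifx'] -- the if/else/while inside the comments and the string are gone
```

Note that `ifx` survives the cleaning because it is a variable name; it is harmless at the counting stage, since counting only matches whole keywords.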
-
-
Switch case statistics
This part can directly reuse the keyword list obtained from the C/C++ source in the keyword-statistics step: count the number of case keywords between each switch and the next one (the last switch is counted up to the end of the list). It is fairly simple on top of the previous step, but note that a switch without any case inside must be excluded (even though this may rarely happen).
-
If else statistics
The idea here is also fairly clear: extract all if and else tokens from the text with a regular expression and put them in a list. Since there are only two states, if and else, a stack can be used in the same way as in the bracket-matching problem; this also handles redundant, unmatched if tokens correctly.
This method has two preconditions. First, the text must not contain if-(else if)-else structures. Second, interference must be excluded: variable names such as qifsss and aelset would otherwise be picked up as if/else. These are first screened out with `[\w](if|else)[\w]`, and then `(if|else)` is used to extract the real if/else tokens. A simplified sketch of the stack idea is shown below.
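Here is a minimal sketch of that bracket-matching idea, operating on an already-extracted token list rather than on raw text; the function name and the sample sequence are illustrative only, not taken from my actual code:

```python
# Simplified sketch: match each 'else' with the nearest open 'if'
def count_if_else(tokens):
    stack, pairs = [], 0
    for kw in tokens:
        if kw == 'if':
            stack.append(kw)
        elif kw == 'else' and stack:   # an else closes the nearest open if
            stack.pop()
            pairs += 1
    return pairs

print(count_if_else(['if', 'if', 'else', 'else', 'if']))   # -> 2 (the last if stays unmatched)
```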
-
if-(else if)-else statistics
This is the hard part of the problem. If stack matching is used the same way as in the if-else statistics, there is a problem: if-(else if)-else has three states, while a stack only offers two operations, push and pop, so the answer cannot be obtained from stack matching alone. Extra information is needed, and I could think of two kinds:
1. The brace information that may accompany each branch of the selection statement
2. The number of leading spaces in front of each if / else
Using brace information has to account for the fact that a single-statement branch in C/C++ does not need braces, so a case analysis would be required, and it is not clear at the analysis stage whether it can be made to work. Using a regular expression to record the number of leading spaces in front of each keyword as auxiliary information works as long as the sample keeps its code style, i.e. keywords belonging to the same selection statement are aligned. So I chose to gamble on that: I use a regular expression to obtain the leading spaces in front of each keyword and combine them with stack matching to solve this problem.
Design and implementation process
Overall code framework
Key function flow chart
-
clean_data()
-
if_elseif_else_count()
Description of the stack operations based on leading-space counts
- When a keyword is taken from the extracted keyword list, first compare its leading-space count with that of the keyword on top of the stack: if the top of the stack has more leading spaces than the current keyword, pop the top; repeat until this condition no longer holds (or the stack is empty)
- If the current keyword is if, push it directly onto the stack
- If the current keyword is else if, push it onto the stack only if the stack is not empty and its top is if
- If the current keyword is else: if the top of the stack is if, pop it and add 1 to the if-else count; if the top is else if, pop all consecutive else if entries and then the if below them, and add 1 to the if-(else if)-else count
- Repeat the above steps until the extracted keyword list has been fully traversed
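A minimal, self-contained sketch of the rules above, working on (keyword, leading-space) pairs instead of raw text; the names and the sample sequence are illustrative, not taken from my actual code:

```python
# Sketch of the indentation-based stack matching described above
def count_structures(tokens):
    stack = []                       # indices of pending 'if' / 'else if'
    if_else = if_elseif_else = 0
    for i, (kw, indent) in enumerate(tokens):
        # pop keywords that are indented deeper than the current one
        while stack and tokens[stack[-1]][1] > indent:
            stack.pop()
        if kw == 'if':
            stack.append(i)
        elif kw == 'else if':
            if stack and tokens[stack[-1]][0] == 'if':
                stack.append(i)
        elif kw == 'else':
            if not stack:
                continue
            if tokens[stack[-1]][0] == 'if':
                stack.pop()
                if_else += 1
            else:
                while stack and tokens[stack[-1]][0] == 'else if':
                    stack.pop()
                stack.pop()          # the matching 'if'
                if_elseif_else += 1
    return if_else, if_elseif_else

# An if-(else if)-else at indent 0 containing a plain if-else at indent 4:
tokens = [('if', 0), ('if', 4), ('else', 4), ('else if', 0), ('else', 0)]
print(count_structures(tokens))      # -> (1, 1)
```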
Code description
-
data processing
```python
import re


# Remove interference (comments and string literals) from the text
def clean_data(text):
    # Match all comments
    pattern_notes = r'(//.*)|(/\*[\s\S]*?\*/)|(/\*[\s\S]*)'
    # Match single-quoted strings
    pattern_str1 = r'(\'[\s\S]*?\')|(\'[\s\S]*)'
    # Match double-quoted strings
    pattern_str2 = r'(\"[\s\S]*?\")|(\"[\s\S]*)'
    text = re.sub(pattern_notes, lambda x: generate_str(x.group()), text, flags=re.MULTILINE)
    text = re.sub(pattern_str1, lambda x: generate_str(x.group()), text, flags=re.MULTILINE)
    text = re.sub(pattern_str2, lambda x: generate_str(x.group()), text, flags=re.MULTILINE)
    return text


# Generate a string of spaces with the same length as the matched text,
# used as the replacement
def generate_str(str):
    temp = ""
    for i in range(len(str)):
        temp += " "
    return temp
```
Regular expressions are used to remove the interference (comments and string literals) from the text.
P.S. Since step 4 relies on the spaces in front of each if / else if / else, the preprocessing must be handled carefully. A lambda expression is used here to replace every interference item with a run of spaces of equal length, so that the cleaning cannot disturb the statistics of if_elseif_else_count().
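A quick, hypothetical check of clean_data() showing that the overall length (and therefore the leading spaces of later lines) is preserved:

```python
src = 'int a = 1; // set a\nif (a) {}\n'
out = clean_data(src)
print(len(out) == len(src))   # True: the comment becomes an equal-length run of spaces
print(repr(out))              # the "if" on the second line keeps its position
```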
-
Keyword statistics
```python
# Count the keywords in the text
# key_list is the list of C/C++ keywords defined at module level
def key_count(text):
    pattern_num = r'[a-zA-Z]{2,}'
    key_data = re.findall(pattern_num, text)
    num = 0
    for key in key_list:
        num += key_data.count(key)
    return key_data, num
```
`[a-zA-Z]{2,}` is used here because the shortest keyword has length 2, so only strings of length at least 2 that consist solely of letters are collected.
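key_list is defined at module level in my code; for reference, a hypothetical sketch of what it could look like (the exact contents of my list may differ, e.g. it may include additional C++ keywords):

```python
# Hypothetical sketch of key_list (the standard C keywords); the real list
# in keyword_recognition.py may differ
key_list = [
    'auto', 'break', 'case', 'char', 'const', 'continue', 'default', 'do',
    'double', 'else', 'enum', 'extern', 'float', 'for', 'goto', 'if',
    'int', 'long', 'register', 'return', 'short', 'signed', 'sizeof',
    'static', 'struct', 'switch', 'typedef', 'union', 'unsigned', 'void',
    'volatile', 'while',
]
```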
-
Switch case statistics
```python
# Count the switch-case structures
def switch_case_count(key_data):
    case_num = []
    switch_num = 0
    temp_case = 0
    for value in key_data:
        if value == "switch":
            if switch_num > 0:
                case_num.append(temp_case)
                temp_case = 0
            switch_num += 1
        if value == "case":
            temp_case += 1
    if switch_num > 0:  # record the case count of the last switch, if any
        case_num.append(temp_case)
    # Handle switches that contain no case
    num = case_num.count(0)
    for i in range(num):
        case_num.remove(0)
    switch_num -= num
    return switch_num, case_num
```
Note that switches without any case must be excluded from the result.
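For example, given three switch blocks where the middle one contains no case, the empty one is dropped and the function above should return two switches with [2, 1] cases:

```python
# A 'switch' with 2 cases, an empty 'switch', then a 'switch' with 1 case
key_data = ['switch', 'case', 'case', 'switch', 'switch', 'case']
print(switch_case_count(key_data))   # (2, [2, 1])
```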
-
If else statistics
```python
# Count only if-else structures
def if_else_count(text):
    pattern_out = r'[\w](if|else)[\w]'
    pattern_key = r'(if|else)'
    # Exclude interference from variable names
    text = re.sub(pattern_out, ' ', text, flags=re.MULTILINE)
    key_data = re.findall(pattern_key, text)
    stack = []
    if_else_num = 0
    for index, values in enumerate(key_data):
        if values == 'if':
            stack.append(index)
        else:
            if len(stack) == 0:
                continue
            stack.pop()
            if_else_num += 1
    return if_else_num
```
Since this function is designed under the assumption that the text contains no if-(else if)-else structures, stack matching is used directly to count the if-else pairs; a small usage example follows.
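For example, three if tokens but only two else tokens yield two complete if-else structures:

```python
src = "if (a) x = 1; else x = 2;\nif (b) { if (c) y = 1; else y = 2; }\n"
print(if_else_count(src))   # 2: the outer if (b) has no matching else
```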
-
if-(else if)-else statistics
```python
# Count if-else and if-(else if)-else structures
def if_elseif_else_count(text):
    pattern_out = r'[\w](else if|if|else)[\w]'
    pattern_key = r'(else if|if|else)'
    # Exclude interference from variable names
    text = re.sub(pattern_out, ' ', text, flags=re.MULTILINE)
    key_data = re.findall(pattern_key, text)
    # Count the leading spaces in front of each if / else if / else
    pattern_front_space = r'\n( *)(?=if|else if|else)'
    space_data = re.findall(pattern_front_space, text)
    space_data = [len(i) for i in space_data]
    stack = []
    if_else_num = 0
    if_elseif_else_num = 0
    for index, values in enumerate(key_data):
        # Pop keywords on the stack that are indented deeper than the current one
        while len(stack) > 0:
            if space_data[index] < space_data[stack[len(stack) - 1]]:
                stack.pop()
            else:
                break
        if values == 'if':
            stack.append(index)
        elif values == 'else if':
            if len(stack) == 0:
                continue
            if key_data[stack[len(stack) - 1]] == 'if':
                stack.append(index)
        else:
            if len(stack) == 0:
                continue
            if key_data[stack[len(stack) - 1]] == 'if':
                # Top of the stack is an 'if': a plain if-else structure
                if_else_num += 1
                stack.pop()
            else:
                # Pop all consecutive 'else if' entries, then the matching 'if'
                while len(stack) > 0:
                    if key_data[stack[len(stack) - 1]] == 'else if':
                        stack.pop()
                    else:
                        break
                stack.pop()
                if_elseif_else_num += 1
    return if_else_num, if_elseif_else_num
```
The stack operations based on leading-space counts are used here; they were already described in the flow-chart section above and are not repeated. A quick usage example is shown below.
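As a quick check, and assuming the aligned-indentation convention holds, the snippet below should be reported as one if-(else if)-else structure and no plain if-else:

```python
src = """int f(int x){
    if(x > 0){
    }
    else if(x == 0){
    }
    else{
    }
}
"""
print(if_elseif_else_count(src))   # (0, 1)
```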
unit testing
-
Test text
Three test files were constructed for the tests; the following is one of them.
```c
#include <stdio.h>
void test(){
    printf("asdfasdf xxd dsff");
    printf('d');
}
int main(){
    int i=1;
    double j=0;
    long f;//sssasddd dcad
    //asdad
    switch(i){
        case 0: break;
        case 1: break;
        case 2: break;
        default: break;
    }/*ddsfsfsfd*/
    switch(i){
        case 0: break;
        case 1: break;
        default: break;
    }
    if(i<0){
        if(i<-1){}
        else{
            if(i==2);
        }
    }
    else if(i>0){
        if (i>2){
            if(i!=1);
            else if(i>1);
        }
        else if (i==2) {
            if(i<-1){}
            else if(i==1){}
            else{}
        }
        else if (i>1) {}
        else { }
    }
    else{
        if(j!=0){}
        else{}
    }
    switch(i){}
    return 0;
}
/*ds asdasdasd asdsadasd as?
```
-
Test code
```python
import unittest
import sys
sys.path.insert(0, "../code")
from keyword_recognition import *


class MyTestCase(unittest.TestCase):
    def test_clean_data(self):
        # clean_data() replaces comments and strings with equal-length runs of spaces
        str1 = "/*sss*/ 'dd'"
        str_ans1 = " " * len(str1)
        str2 = "/*ddddd \n*/"
        str_ans2 = " " * len(str2)
        str3 = '"xxsubtxxxy"'
        str_ans3 = " " * len(str3)
        self.assertEqual(clean_data(str1), str_ans1)
        self.assertEqual(clean_data(str2), str_ans2)
        self.assertEqual(clean_data(str3), str_ans3)

    def test_key_count(self):
        text = read_file("../data/key.c")
        temp, num = key_count(text)
        self.assertEqual(num, 35)

    def test_switch_case_count(self):
        key_data = ['include', 'stdio', 'int', 'main', 'int', 'double', 'long',
                    'switch', 'case', 'break', 'case', 'break', 'case', 'break',
                    'default', 'break', 'switch', 'case', 'break', 'case', 'break',
                    'default', 'break', 'if', 'if', 'else', 'else', 'if', 'if',
                    'else', 'if', 'else', 'if', 'else', 'else', 'if', 'else', 'return']
        switch_num, case_num = switch_case_count(key_data)
        self.assertEqual(switch_num, 2)
        self.assertEqual(case_num, [3, 2])

    def test_if_else_count(self):
        text = read_file("../data/text.c")
        if_else_num = if_else_count(text)
        self.assertEqual(if_else_num, 4)

    def test_if_elseif_else_count(self):
        text = read_file("../data/key.c")
        temp, if_elseif_else_num = if_elseif_else_count(text)
        self.assertEqual(if_elseif_else_num, 2)


if __name__ == '__main__':
    unittest.main()
```
The unit test results show that all functions behave as expected.
-
Screenshot of coverage test
Since only a few key functions are tested here, the coverage is only 58%
-
performance testing
Because a single run over the file is too short to show meaningful results, the file is processed 10,000 times.
```python
import sys
sys.path.insert(0, "../code")
from code.keyword_recognition import start

if __name__ == '__main__':
    for i in range(10000):
        start(filepath='../data/key.c', level=4)
```
The performance test shows that, because I implemented the level selection in a nested way, the level functions account for a very large share of the running time. In the future the levels could be redesigned and decoupled to improve efficiency.
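One way to see which functions dominate (not necessarily how the screenshot above was produced) is Python's built-in cProfile; the import path below mirrors the unit tests and may need adjusting to the actual repository layout:

```python
import cProfile
import pstats
import sys

sys.path.insert(0, "../code")
from keyword_recognition import start   # hypothetical import path

# Profile a smaller number of runs and list the ten most expensive calls
cProfile.run("for _ in range(100): start(filepath='../data/key.c', level=4)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)
```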
Difficulty description and solution
-
Problem description: during data cleaning, a /* */ comment can span multiple lines, but the newline character \n was not recognized when replacing with re.sub.
Solution: after searching for information, I found that the parameter flags=re.MULTILINE can be passed to re.sub to enable multi-line recognition.
-
Problem description: during the if-(else if)-else statistics, I found a hole in the earlier data cleaning. I originally replaced every interference item with a single space " ", which invalidates the leading-space counts.
Solution: I considered dropping re.sub and instead using finditer to locate all interference items and replace them one by one, but that is complex to implement. After searching and some thought, I found that re.sub can take a function of the match object, so I simply pass a lambda expression that returns a string of spaces with the same length as the matched text, as in the small demo below.
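A minimal demonstration of that trick, independent of my own code:

```python
import re

src = "a = 1; /* note */ if (a) {}"
# The lambda receives the match object and returns spaces of the same length
out = re.sub(r'/\*[\s\S]*?\*/', lambda m: ' ' * len(m.group()), src)
print(len(out) == len(src))   # True: character positions are unchanged
```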
summary
-
Through this assignment I learned more about regular expressions and gained a deeper understanding of string handling. I was also exposed to unit testing, coverage measurement, and performance testing for the first time, got a better feel for how to approach a problem with code, and learned how to write Python in a more standard-conforming way.
-
Through this assignment I also came to appreciate the following:
1. Choosing an approach and planning matter more than raw effort: if the overall workflow is planned in advance and the approximate workload is estimated, the work has direction; blind effort is likely to be a waste of time.
2. Learning by doing mobilizes the knowledge you have already learned and are currently learning, deepens your understanding, and the process of practicing keeps you motivated and lets you improve while enjoying it.
3. Although the deadline is the primary productive force, you should still start early; extra time makes everything more relaxed (and lets you sleep more QAQ).
All in all, this assignment was a rewarding experience!