The first personal programming operation of software engineering practice in autumn 2021

Which course does this assignment belong to Soft engineering practice in autumn 2021
What are the requirements for this assignment The first personal programming operation of software engineering practice in autumn 2021 - CSDN community
The goal of this assignment Use of github; Write programs to extract keywords at different levels of C/C + + code
Student number 031902218

GitHub usage

Code specification

Python code specification

PSP table

PSP2.1 Personal Software Process Stages Estimated time (minutes) Practice time (minutes)
Planning plan 60 60
Estimate Estimate how long this task will take 30 20
Development development 300 360
Analysis Needs analysis (including learning new technologies) 60 60
Design Spec Generate design documents 120 180
Design Review design review 20 25
Coding Standard Code specification 15 20
Design Specific design 120 120
Coding Specific coding 300 360
Code Review Code Review 60 40
Test Test (self test, modify code, submit modification) 100 120
Reporting report 90 120
Test Report Test report 60 60
Size Measurement Calculation workload 15 15
Postmortem & Process Improvement Plan Summarize afterwards and put forward process improvement plan 60 90
total 1410 1650

Description of problem solving ideas

  • Title first impression

    At first, I felt very approachable when I saw this problem 😂, At least be able to fully understand the topic, and there are no words that people need to search to understand at the beginning; The topic mainly involves the string operation of text. You need to extract enough information from C/C + + files to judge the structure of the code, and finally achieve the purpose of keyword extraction; The main difficulty of the topic is to select an appropriate way to extract information from the document

  • Language selection

    After reading this question, I first thought of using regular expression to solve the problem, because this question may involve a large number of string operations, and there may be a large amount of text information and more interference items. Using regular expression processing can greatly reduce the workload; Python has a relatively mature regular library re, and I have been in contact with some before, so I choose to use Python to solve this problem

  • Keyword statistics

    Before making keyword statistics, I learned the usage of regular again and found some information related to regular, regex101 You can view the effect of regular expressions in real time, which is still very easy to use.

    To search for keywords, you can use regular expressions to extract all strings containing only letters, and then use the count function to traverse the keyword list and count all qualified strings. However, if you want to realize the above ideas, you must first solve the data cleaning. According to the knowledge of C/C + +, the strings in the text and in the comments that meet the conditions do not belong to keywords and do not need to be counted; In addition, if a string with the same keyword appears in the variable name, it does not need to be counted, so we have to exclude the following situations before counting:

    • //Notes

    • /** / Notes

    • /*Comment (a single symbol will comment out all the following lines)

    • '' single quote string

    • ”Double quoted string

    • Keyword in variable name (for example int1qq)

    Regular expressions can be used to remove the first five above. In fact, the last one can also be used. However, in the keyword statistics phase, all strings containing only letters will be put in the list and then counted, so it will not have an impact

    The following is the regular expression corresponding to the actual use

    • All comments: (/ /. *) | (/ \ * [\ s\S]*?\*/)|(/\*[\s\S] *)

    • Single quote string: (\ '[\ s \ S] *? \' | (\ '[\ s \ S] *)

    • Double quoted string: (\ "[\ s\S] *? \") | (\ "[\ s\S] *)

    • All strings that may be Keywords: [a-zA-Z]{2,}

    PS. during the actual measurement in VSCode, it is found that a single single quotation mark "or a single double quotation mark" may also include all the following lines, so this part will also be taken into account during regularization

  • Switch case statistics

    In this part, you can directly use the keyword list obtained from the C/C + + program code in the previous keyword statistics, and then calculate the number of cases from each switch to the next switch (the last switch is counted to the end). It is relatively simple on the basis of the previous keyword statistics. However, it should be noted that switches without internal case should be excluded (although this may not necessarily happen)

  • If else statistics

    The idea of this part is also relatively clear. You can extract all if and else in the text through regular expressions and put them in the list. Then, since there are only two states of if else, you can use the stack in a way similar to the bracket matching problem. In this way, even if there are redundant and single if, you can handle it.

    But this method also has preconditions: first, using this method, the text cannot contain if else if else structure; Secondly, it is also necessary to exclude interference items. For example, the variable names of qifsss and aelset will interfere with the statistics of if else. First screen out these interference items through [\ w](if|else)[\w], and then use (if|else) to screen out if else

  • if-(else if)-else statistics

    This is the difficulty of this problem. First, if you use stack matching in the same way as if else statistics, there will be a problem: if - (else, if) - else has three states, and the stack has only two states: in stack and out of stack. You can't get the answer only through stack matching. This is very clear. If you want to use stack matching to solve problems, you need additional information, There are two kinds of information I can think of here:

    1. Select the possible bracket information in the statement

    2. Space length information in front of all if / else

    If the bracket information is used, it should be considered that the statements in C/C + + do not need brackets if there is only a single line. Therefore, the use of bracket information needs classified discussion, and whether it can be solved cannot be determined in the analysis stage; Using regularization to obtain the space length in front of the keyword as auxiliary judgment information, as long as the code style of the sample can be maintained, that is, the keywords belonging to the same selection statement are aligned, this problem can be solved. Therefore, here I choose to be a gambling dog, so here I choose to use regularization to obtain the space length in front of the keyword, Then it can be used together with stack matching to solve this problem

Design and implementation process

Overall code framework

Key function flow chart

  • clean_data()

  • if_elseif_else_count()

    Stack operation description based on the number of forward spaces

    1. After extracting the keyword from the extracted keyword list, first judge the space position relationship between the keyword and the keyword at the top of the stack. If the number of forward spaces of the keyword at the top of the stack is greater than the keyword, the top of the stack will pop up; Repeat the above process until the judgment conditions are not met
    2. If the keyword extracted from the extracted keyword list is if, it is directly put on the stack
    3. If the keyword extracted from the extracted keyword list is else if, if the stack is not empty and the top of the stack is if, it will be put into the stack
    4. If the keyword extracted from the extracted keyword list is else, if the stack top is if, the stack top pops up, if_else structure number plus 1; If the top of the stack is else if, all else if will pop up from the top of the stack, and finally another if, if will pop up_ elseif_ Else structure number plus 1
    5. Cycle the above steps until the traversal of the extracted keyword list is completed

Code description

  • data processing

    # Remove text interference components (notes, strings)
    def clean_data(text):
        # Match all comments
        pattern_notes = r'(//.*)|(/\*[\s\S]*?\*/)|(/\*[\s\S]*)'
        # Match single quote string
        pattern_str1 = r'(\'[\s\S]*?\')|(\'[\s\S]*)'
        # Match double quoted string
        pattern_str2 = r'(\"[\s\S]*?\")|(\"[\s\S]*)'
        text = re.sub(pattern_notes, lambda x: generate_str(, text, flags=re.MULTILINE)
        text = re.sub(pattern_str1, lambda x: generate_str(, text, flags=re.MULTILINE)
        text = re.sub(pattern_str2, lambda x: generate_str(, text, flags=re.MULTILINE)
        return text
    # Generate an equal length empty string when replacing with
    def generate_str(str):
        temp = ""
        for i in range(len(str)):
            temp += " "
        return temp

    Use regular to remove interference components (notes, character strings) in text

    PS. since the spaces in front of if / else and if / else are used in step 4, you should pay great attention in the output preprocessing stage. Here, lambda expression is used to replace the interference item with spaces of equal length to prevent interference with if_ elseif_ else_ Statistics of count()

  • Keyword statistics

    # Get the number of key words
    def key_count(text):
        pattern_num = r'[a-zA-Z]{2,}'
        key_data = re.findall(pattern_num, text)
        num = 0
        for key in key_list:
            num += key_data.count(key)
        return key_data, num

    [a-zA-Z]{2} is used here because the length of the minimum keyword is 2, so only strings with a length greater than 2 and containing only letters are filtered

  • Switch case statistics

    # Get switch_ Number of case structures
    def switch_case_count(key_data):
        case_num = []
        switch_num = 0
        temp_case = 0
        for value in key_data:
            if value == "switch":
                if switch_num > 0:
                    temp_case = 0
                switch_num += 1
            if value == "case":
                temp_case += 1
        # Handle switch without case
        num = case_num.count(0)
        for i in range(num):
        switch_num -= num
        return switch_num, case_num

    It should be noted that the switch without case should be excluded

  • If else statistics

    # Only if_else
    def if_else_count(text):
        pattern_out = r'[\w](if|else)[\w]'
        pattern_key = r'(if|else)'
        # Exclude variable name interference
        text = re.sub(pattern_out, ' ', text, flags=re.MULTILINE)
        key_data = re.findall(pattern_key, text)
        # print(key_data)
        stack = []
        if_else_num = 0
        for index, values in enumerate(key_data):
            if values == 'if':
                if len(stack) == 0:
                if_else_num += 1
        return if_else_num

    Since the if else if else structure is not considered in the design of this function, the stack matching operation is directly used to count the number of if else

  • if-(else if)-else statistics

    # If else and if else
    def if_elseif_else_count(text):
        pattern_out = r'[\w](else if|if|else)[\w]'
        pattern_key = r'(else if|if|else)'
        # Exclude variable name interference
        text = re.sub(pattern_out, ' ', text, flags=re.MULTILINE)
        key_data = re.findall(pattern_key, text)
        # Count if / else forward spaces
        pattern_front_space = r'\n( *)(?=if|else if|else)'
        space_data = re.findall(pattern_front_space, text)
        space_data = [len(i) for i in space_data]
        # 1 for if/ 2 for else if/ 3 for else/
        stack = []
        if_else_num = 0
        if_elseif_else_num = 0
        for index, values in enumerate(key_data):
            while len(stack) > 0:
                if space_data[index] < space_data[stack[len(stack) - 1]]:
            if values == 'if':
            elif values == 'else if':
                if len(stack) == 0:
                if key_data[stack[len(stack) - 1]] == 'if':
                if len(stack) == 0:
                if key_data[stack[len(stack) - 1]] == 'if':
                    if_else_num += 1
                    while len(stack) > 0:
                        if key_data[stack[len(stack) - 1]] == 'else if':
                    if_elseif_else_num += 1
        return if_else_num, if_elseif_else_num

    The stack operation based on the number of forward spaces is used here. It has been described in the flowchart stage and will not be repeated

unit testing

  • Test text

    Three test texts have been constructed for the test, and the following is one of them

    #include <stdio.h>
    void test(){
        printf("asdfasdf  xxd dsff");
    int main(){
        int i=1;
        double j=0;
        long f;//sssasddd dcad
            case 0:
            case 1:
            case 2:
            case 0:
            case 1:
        else if(i>0){
            if (i>2){
                else if(i>1);
            else if (i==2) {
                else if(i==1){}
            else if (i>1) {}
            else {
        return 0;
  • Test code

    import unittest
    import sys
    sys.path.insert(0, "../code")
    from keyword_recognition import *
    class MyTestCase(unittest.TestCase):
        def test_clean_data(self):
            str1 = "/*sss*/ 'dd'"
            str_ans1 = "            "
            str2 = "/*ddddd \n*/"
            str_ans2 = "           "
            str3 = '"xxsubtxxxy"'
            str_ans3 = "            "
            self.assertEqual(clean_data(str1), str_ans1)
            self.assertEqual(clean_data(str2), str_ans2)
            self.assertEqual(clean_data(str3), str_ans3)
        def test_key_count(self):
            text = read_file("../data/key.c")
            temp, num = key_count(text)
            self.assertEqual(num, 35)
        def test_switch_case_count(self):
            key_data = ['include', 'stdio', 'int', 'main', 'int', 'double', 'long', 'switch',
                        'case', 'break', 'case', 'break', 'case', 'break', 'default', 'break',
                        'switch', 'case', 'break', 'case', 'break', 'default', 'break', 'if',
                        'if', 'else', 'else', 'if', 'if', 'else', 'if', 'else', 'if', 'else',
                        'else', 'if', 'else', 'return']
            switch_num, case_num = switch_case_count(key_data)
            self.assertEqual(switch_num, 2)
            self.assertEqual(case_num, [3, 2])
        def test_if_else_count(self):
            text = read_file("../data/text.c")
            if_else_num = if_else_count(text)
            self.assertEqual(if_else_num, 4)
        def test_if_elseif_else_count(self):
            text = read_file("../data/key.c")
            temp, if_elseif_else_num = if_elseif_else_count(text)
            self.assertEqual(if_elseif_else_num, 2)
    if __name__ == '__main__':

    The unit test results show that the function functions normally

  • Screenshot of coverage test

    Since only a few key functions are tested here, the coverage is only 58%

  • performance testing

    Because the single run time of the file is too short to present the test results well, the file will be run 10000 times

    import sys
    sys.path.insert(0, "../code")
    from code.keyword_recognition import start
    if __name__ == '__main__':
        for i in range(10000):
            start(filepath='../data/key.c', level=4)

    Through the performance test, we can see that since I adopt the nested method when selecting the mode level, the level type function here takes a very high time. In the future, we can consider redesigning and decoupling the level to improve efficiency

Difficulty description and solution

  • Problem Description: during data cleaning, the / * * / annotation can span multiple lines, but only re A newline character was not recognized when replacing \ nsub

    Solution: after searching the data, it is found that re The parameter flags = RE can be set in sub Multiline for multiline recognition

  • Problem Description: during if-(else if)-else statistics, it is found that there are loopholes in the previous way of cleaning data. I originally used to directly change all interference items into "" single spaces, but this will invalidate the counted forward spaces

    Solution: I've thought about changing the use of functions and no longer using re Sub uses a more complex method to find all interference items with the finder and then replace them one by one, but the implementation is very complex; After searching for information and thinking, I found re The match object of sub operation can use lambda expression, so lambda expression is directly used here to return an empty string with the same length as the matched string


  • Through this assignment, I further learned the use of regular expressions and had a deeper understanding of string operation; At the same time, I was also exposed to unit testing, coverage and performance testing for the first time. I had a deeper understanding of how to better use coding to complete a problem. In addition, I also learned how to write python code more standardized.

  • Through this assignment, I also understand the following principles:

    1. Selection and planning are more important than efforts: if the overall workflow is planned in advance and the approximate workload is estimated, it will be more directional. Blind efforts are likely to be a waste of time

    2.Learning by doing can well mobilize the knowledge you have learned and are learning, increase your understanding of knowledge, and use the process of practice can increase your sense of excitement and let you improve yourself in happiness

    3. Although deadline is the primary productive force, you should start early. More time can make you more relaxed (you can sleep more QAQ)

Generally speaking, this assignment is full of harvest!

Keywords: Software Engineering

Added by jimbo_head on Wed, 12 Jan 2022 22:31:30 +0200