SCAU 1109 (comprehensive experiment: file operation and character processing)

Description

The current directory contains a text file named "case1.in" (note that "case" is followed by the digit 1, not the letter l; a submission that gets the file name wrong will be judged wrong).
Its content is an English article (terminated by EOF). Read the text file and count the number of occurrences of each word in the article,
then output the five most frequent words together with their counts (in descending order of count; words with the same count are output in dictionary order).
If there are fewer than 5 words, output all of them in that order. Pay attention to the following details in your program:
(1)	Spaces, punctuation marks and carriage returns separate words.
(2)	There may be a hyphen at the end of a line of the article. When a hyphen appears, the string at the end of the line and the string that appears first in the next line form a word;
(3)	Noun abbreviations count as a word;
(4)	Numbers are not words;
(5)	Words are not case sensitive;
(6)	Output all words in lowercase;


#include "stdio.h" 
#include "math.h" 
#include "string.h" 
#include "stdlib.h" 

_______________________ 

main() 

         _______________________ 

Input format

The file case1.in contains an English article with multiple paragraphs, no more than 10000 words in total and no more than 20 characters per word

 

Output format

Output the answer as required by the problem statement

 

 

sample input

(the content of case1.in is as follows)
I am a student. My school is SCAU. It is a beau-
tiful university. I like it.

 

 

sample output

a 2
i 2
is 2
it 2
am 1

 

 

author

admin

 

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This problem is not very difficult. I had just finished a course project (an item auction system), so I felt confident taking it on @#@. Seriously, though, if I had tried this problem last semester, I might not have been able to solve it.

Read the problem first. Roughly, we have to read characters from a file, assemble them into words according to the rules, and print the result. When submitting the code, you will find that the header files are already fixed by the template, which means C++ cannot be used. We just have to write the helper functions ourselves.

As someone who has been through it, here are the pitfalls of this problem:

1. At first I assumed the judge would play by the rules, that is, test my program with English articles that follow standard grammar. After submitting I found that I had been far too optimistic: the test data does not follow normal grammar at all!! (the test data it gave me is attached at the end of the article)

2. The hyphen. At first I assumed a hyphen simply joins the letters on both sides into one word, but the program I wrote on that assumption failed one group of test data, which puzzled me at the time. After comparing the outputs repeatedly I worked out the actual rule: "a-'\n'b" ('\n' is the newline character) counts as the single word "ab", whereas "a-b" counts as two words, namely "a" and "b".

3. Most importantly, the file name must be exactly right. I recommend copying and pasting it directly.
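
The hyphen rule from pitfall 2 can be sketched on an in-memory string, which makes it easy to try out small cases. This is only an illustration, not the code I submitted: the names split_words and MAX_LEN are made up for this sketch, and it assumes that every character other than a letter or an end-of-line hyphen separates words.

```c
#include <ctype.h>
#include <string.h>

#define MAX_LEN 40   /* assumed buffer size; the statement caps words at 20 characters */

/* Hypothetical helper: splits `text` into lowercase words using the hyphen
   rule described above and stores up to `max_words` of them in `out`.
   Returns the number of words stored. */
int split_words(const char *text, char out[][MAX_LEN + 1], int max_words)
{
    char buf[MAX_LEN + 1];
    int len = 0, count = 0;
    size_t i;

    for (i = 0; ; i++) {
        unsigned char c = (unsigned char)text[i];

        if (isalpha(c)) {
            if (len < MAX_LEN)
                buf[len++] = (char)tolower(c);
            continue;
        }
        if (c == '-' && text[i + 1] == '\n') {
            i++;        /* hyphen at end of line: skip the newline, keep the word open */
            continue;
        }
        /* anything else (a mid-line hyphen, a digit, the final '\0') ends the word */
        if (len > 0 && count < max_words) {
            buf[len] = '\0';
            strcpy(out[count++], buf);
        }
        len = 0;
        if (c == '\0')
            break;
    }
    return count;
}
```

With this rule "a-\nb" gives the single word "ab", "a-B" gives "a" and "b", and "t-t-t-t" gives four separate occurrences of "t", which agrees with the "t 4" line in the answer to test 2 at the end of the article.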

 

Problem solving ideas:

1. At first I wanted to keep things simple and use a three-dimensional character array (one dimension for the word index, one for the word itself, and one for its number of occurrences), but that turned out to be awkward to work with, so I dropped it. Instead I borrowed from the course project code I had just written: a structure holds a word and its number of occurrences, and a global array of that structure stores the data.

2. I use the fgetc function to read the file one character at a time. A letter is appended to the NewWord array; when a separator is encountered, the word held in NewWord is copied into the structure array (or its count is incremented if the word is already stored). For the hyphen, once I understood the rule, I improved this by reading one more character and deciding based on it.

3. Later I found that if the last character of the data is a letter rather than a punctuation mark, the characters in NewWord are never copied into the structure array, so the last word is lost. I thought of two fixes: ① delete the NewWord array and store the characters directly in the structure array; ② keep the NewWord array and copy its content once more after the reading loop ends. Scheme ① simplifies the code and runs faster, but requires changing many lines; scheme ② just copies the block above to the end of the loop, which is easy but makes the code hard to read and very repetitive. Since this is an ordinary exercise I decided it did not matter and took the lazy route, scheme ②.

4. Sorting. The problem asks for the five most frequent words and their counts, with equal counts ordered alphabetically. I define a global structure array max_time and a separate structure MAX (used to hold the best candidate found so far during comparison). I originally wanted to use an algorithm such as bubble sort or bisection, but found it unnecessary because only five words are printed: bubble sort is too slow, bisection is overkill (mainly because I'm not familiar with it), and handling both comparison criteria at once is a hassle. Instead I traverse the structure array and let any word with a higher count (or the same count and an earlier dictionary position) overwrite MAX; after each pass I store MAX into the max_time array, reset that word's count to 0, and repeat 5 times. Finally I print the max_time array. The actual running time was 343 ms and the memory usage 544068 K. It's trading space for time.
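
For comparison, the ordering rule in step 4 (higher count first, dictionary order on ties) can also be written as a comparator for the standard qsort function from stdlib.h. This is a different technique from the five-pass scan used in my code, shown only as a sketch; the names Entry, by_count_then_word and rank_top5 are made up for the illustration.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char word[40];
    int  time;      /* number of occurrences, named as in the post */
} Entry;            /* hypothetical name for this sketch */

/* qsort comparator: higher count first, ties broken by dictionary order */
static int by_count_then_word(const void *a, const void *b)
{
    const Entry *x = (const Entry *)a;
    const Entry *y = (const Entry *)b;

    if (x->time != y->time)
        return (x->time < y->time) ? 1 : -1;
    return strcmp(x->word, y->word);
}

/* Sorts the whole table in place and returns how many entries
   should be printed (at most 5, fewer if the table is smaller). */
int rank_top5(Entry *table, int n)
{
    qsort(table, (size_t)n, sizeof(Entry), by_count_then_word);
    return n < 5 ? n : 5;
}
```

After the call, the first rank_top5(table, n) entries of the table are the lines to print. Sorting the whole table does more work than picking five maxima, but both criteria live in one small function and the "fewer than 5 words" case is handled by the return value.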

Code: (this is a beginner's code that aims to work rather than to be readable. Experts passing by, please go easy on me @#@)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

typedef struct
{
    char word[40];
    int time;
}WORD;

int sum=0;
WORD word[10005];
WORD max_time[5];
WORD MAX;//Holds the word with the highest count found so far

//memset(NewWord,'\0',sizeof(NewWord));// format

int main()
{
    char NewWord[40];
    int ch;//fgetc returns an int; a plain char cannot be compared with EOF reliably
    int pos=0,i=0,j=0,k=0;
    int flag=0;
    FILE *fp;
    fp = fopen("case1.in","r");
    if(fp == NULL)
    {
        return 0;
    }
    memset(NewWord,'\0',sizeof(NewWord));
    while((ch=fgetc(fp)) !=EOF)
    {
        if((ch>='A'&&ch<='Z')||(ch>='a'&&ch<='z'))
        {
            if(ch>='A'&&ch<='Z')
            {
                ch+=32;
            }
            NewWord[pos] = ch;
            pos++;
        }
        else if(ch!='-')//Any character other than a letter or a hyphen (space, punctuation, digit, newline, tab) ends the current word
        {
            if(NewWord[0]!='\0')//If the first letter is empty, it is not a word
            {
                flag = 0;
                for(i=0;i<sum;i++)
                {
                    if(strcmp(word[i].word,NewWord)==0)
                    {
                        flag = 1;//Find the same word
                        word[i].time++;
                        memset(NewWord,'\0',sizeof(NewWord));
                        pos = 0;
                        break;
                    }
                }
                if(flag == 0)
                {
                    strcpy(word[sum].word,NewWord);
                    word[sum].time = 1;
                    memset(NewWord,'\0',sizeof(NewWord));
                    sum++;
                    pos = 0;
                }
            }
        }
        else if(ch=='-')
        {
            ch = fgetc(fp);//Look at the next character. If it is not a newline, the hyphen acts like a space (a lesson learned the hard way!!!)
            if(ch == '\n')
            {
                continue;//Hyphen at the end of a line: skip the newline so the word continues on the next line
            }
            else//Not a line break, so the current word is over. Start a new word
            {
                if(NewWord[0]!='\0')//If the first letter is empty, it is not a word
                {
                    flag = 0;
                    for(i=0;i<sum;i++)
                    {
                        if(strcmp(word[i].word,NewWord)==0)
                        {
                            flag = 1;//Find the same word
                            word[i].time++;
                            memset(NewWord,'\0',sizeof(NewWord));
                            pos = 0;
                            break;
                        }
                    }
                    if(flag == 0)
                    {
                        strcpy(word[sum].word,NewWord);
                        word[sum].time = 1;
                        memset(NewWord,'\0',sizeof(NewWord));
                        sum++;
                        pos = 0;
                    }
                }
                if((ch>='A'&&ch<='Z')||(ch>='a'&&ch<='z'))//If the new character is a letter, record it even when no word was pending (e.g. " -a")
                {
                    if(ch>='A'&&ch<='Z')
                    {
                        ch+=32;
                    }
                    NewWord[pos] = ch;
                    pos++;
                }
            }
        }
    }
    if(NewWord[0]!='\0')//Process the last word
    {
        flag = 0;
        for(i=0;i<sum;i++)
        {
            if(strcmp(word[i].word,NewWord)==0)
            {
                flag = 1;//Find the same word
                word[i].time++;
                memset(NewWord,'\0',sizeof(NewWord));
                pos = 0;
                break;
            }
        }
        if(flag == 0)
        {
            strcpy(word[sum].word,NewWord);
            word[sum].time = 1;
            memset(NewWord,'\0',sizeof(NewWord));
            sum++;
            pos = 0;
        }
    }
    fclose(fp);
    MAX.time = 0;
    memset(MAX.word,'\0',sizeof(MAX.word));
    for(k=0;k<5;k++)//Pick the top 5 words one at a time and store them in max_time
    {
        for(i=0;i<sum;i++)
        {
            if(word[i].time>MAX.time)
            {
                MAX.time = word[i].time;
                strcpy(MAX.word,word[i].word);
                j = i;//Record location
            }
            else if(word[i].time == MAX.time)
            {
                if(strcmp(word[i].word,MAX.word)<0)
                {
                    MAX.time = word[i].time;
                    strcpy(MAX.word,word[i].word);
                    j = i;
                }
            }
        }
        strcpy(max_time[k].word,MAX.word);
        max_time[k].time = MAX.time;
        MAX.time = 0;
        memset(MAX.word,'\0',sizeof(MAX.word));
        word[j].time = 0;//Zero the count so this word is not picked again
    }
    for(k=0;k<5&&k<sum;k++)//Output the words; fewer than 5 lines when the article has fewer than 5 distinct words
    {
        printf("%s %d\n",max_time[k].word,max_time[k].time);
    }
    return 0;
}
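
The "look the word up, then increment or append" block appears three times in the listing above (separator branch, hyphen branch, and after the loop). As a sketch of how it could be folded into one function, assuming the same linear search and the same structure layout, something like the following would do. The names WordCount, table, table_size and add_word are made up for this illustration and are not part of the submitted code.

```c
#include <string.h>

#define MAX_WORDS 10005

typedef struct {
    char word[40];
    int  time;
} WordCount;                      /* hypothetical name; same layout as WORD above */

static WordCount table[MAX_WORDS];
static int table_size = 0;

/* Records one occurrence of `w`: bumps the count if the word is already in
   the table, otherwise appends it with a count of 1. Empty strings are
   ignored. Returns the updated count of `w`, or 0 if nothing was recorded. */
int add_word(const char *w)
{
    int i;

    if (w[0] == '\0')
        return 0;
    for (i = 0; i < table_size; i++) {
        if (strcmp(table[i].word, w) == 0)
            return ++table[i].time;
    }
    if (table_size >= MAX_WORDS)
        return 0;                 /* table full: drop the word */
    strncpy(table[table_size].word, w, sizeof(table[0].word) - 1);
    table[table_size].word[sizeof(table[0].word) - 1] = '\0';
    table[table_size].time = 1;
    table_size++;
    return 1;
}
```

Each of the three places would then shrink to a call such as add_word(NewWord) followed by clearing NewWord and resetting pos to 0.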

Part of test data

test 1:

I am a student. My school is SCAU. It is a beau-
tiful university. I like it.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

test 2:

I am a student. -1 = 1 - 2. q=a-m. z = i - w. My school is SCAU. It is a beau-
tiful university. t-t-t-t, 123 123. I like it.

Standard output answer:

t 4
a 3
i 3
is 2
it 2

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 

test 3:

I am a student. My school is SCAU. It is a beau-
tiful university. I like it.
I am a student. My school is SCAU. It is a i-
s university. I like it.
1 2 3 4 5 6 7 8 9
12 12 12 12
II I

Standard output answer:

i 5
is 5
a 4
it 4
am 2

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Finally, this is the first blog post I have written, so some parts may not be well done; feel free to point out any problems. Thank you for reading. If you like it, please give it a like @_@

 

 

Added by edmore on Wed, 09 Feb 2022 23:54:41 +0200