2022-03-05 application of symbol table (blacklist, whitelist, CSV file, construction of symbol table, construction of query index symbol table, construction of file query index symbol table)

Symbol table

Symbol table is set and map. In c + +, there are map and set based on red black tree and unordered based on hash function_ map,unordered_set.

The symbol table used in this code is my own implementation, according to the implementation method of algorithm 4.

Blacklist

The blacklist is to filter out what is on the list and output what is not on the list.

The white list is to filter out those that are on the output list but not on the list.

CSV file construction symbol table

csv file is a formatted storage file with "," as separator. We can construct a corresponding symbol table by specifying two columns in csv file.

Query index symbol table

The symbol table is not necessarily one-to-one, but also one to many. For example, fruits can correspond to: apples, bananas, pears, pineapples, litchi, watermelon, etc. The symbol table according to this rule is the query index table.

File query index symbol table

This is to treat the file content as a key and the file name as a val. when searching for a similar file, use the file content to view the file containing this content.

Implementation difficulties

The most easy difficulty here is the separator "," of CSV file, because the stream of C + + is truncated by space or newline by default, and C + + does not
The split () method of string is fortunately provided with the getline() function, which can specify the separator.

The more difficult problem I encounter is to make an index symbol table. Its structure is hashmap < string, vector < string > >. In my hashmap implementation, the reference of vector cannot be returned, because C + + has no null type. When hashmap does not contain a key, the get() function returns {false, val()}, and this right value of val() cannot be converted into a reference.

Cannot insert element without reference. Therefore, only headache pointers can be introduced. It becomes the structure of HashMap < string, vector < string > * >, which solves the problem. The only thing is that the function finally delete s all the new memory.

Performance issues

The test encountered a txt file with the size of 3M. When building the query index symbol table, it takes 3 or 4 seconds to complete. The slow makes me wonder whether I use C + +. If there are only tens of thousands or hundreds of thousands of data, it's slow. Why do I use C + +?

After one-step debugging, it is found that the problem occurs in the hashmap growth method, which needs continuous growth, resulting in continuous copy reconstruction. Like the growth problem of vector, it can be solved if we can predict in advance how many elements there are in hashmap. The annotation part of hashmap construction in lookUpIndex() function is reflected.

The following is the reference code

    //Output stream de duplication
    void DeDup()
    {
        ST::hashSet<std::string> set;
        std::string keys;
        while (std::cin >> keys)
        {
            if (!set.contains(keys))
            {
                set.add(keys);
                std::cout << keys << " ";
            }
        }
    }
    // White list
    void whiteFilter(const std::string &file)
    {
        ST::hashSet<std::string> set;
        std::string keys;
        std::ifstream fileword(file);
        while (fileword >> keys)
        {
            set.add(keys);
        }
        fileword.close();
        while (std::cin >> keys)
        {
            if (set.contains(keys))
            {
                std::cout << keys << " ";
            }
        }
    }

    // blacklist
    void blackFilter(const std::string &file)
    {
        ST::hashSet<std::string> set;
        std::string keys;
        std::ifstream fileword(file);
        while (fileword >> keys)
        {
            set.add(keys);
        }
        fileword.close();
        while (std::cin >> keys)
        {
            if (!set.contains(keys))
            {
                std::cout << keys << " ";
            }
        }
    }

    // CSV file construction symbol table, parameters: CSV file name, subscript of key, subscript of value
    void lookUpCsv(const std::string &file, const std::string &kField, const std::string &vField)
    {
        int keyField = std::stoi(kField);
        int valField = std::stoi(vField);
        std::ifstream files(file);
        std::stringstream words;
        ST::LPHashST<std::string, std::string> st;
        std::vector<std::string> tokens;
        tokens.reserve(8);
        std::string line;
        std::string word;
        while (std::getline(files, line))
        {
            // std::stringstream words(line);
            words.str(line);
            while (std::getline(words, word, ','))
            {
                tokens.push_back(word);
            }
            st.put(tokens[keyField], tokens[valField]);
            tokens.clear();
            words.clear();
        }
        files.close();
        while (std::cin >> word)
        {
            std::pair<bool, std::string> temp = st.get(word);
            if (temp.first)
            {
                std::cout << temp.second << "\n";
            }
        }
    }

    //Build the query index symbol table. The parameters are data file name and separator
    void lookUpIndex(const std::string &files, const std::string &split)
    {
        std::ifstream file(files);
        std::stringstream line;
        std::vector<std::string> keyVec;
        keyVec.reserve(32);
        char sp = split.at(0);
        ST::LPHashST<std::string, std::vector<std::string> *> st; //(12289);
        ST::LPHashST<std::string, std::vector<std::string> *> ts; //(393241);
        std::string words;
        std::string word;
        std::string val;
        std::string key;
        while (std::getline(file, words))
        {
            line.str(words);
            while (std::getline(line, word, sp))
            {
                keyVec.push_back(word);
            }
            key = keyVec[0];
            for (int i = 1; i != keyVec.size(); ++i)
            {
                val = keyVec[i];
                if (!st.contains(key))
                {
                    st.put(key, new std::vector<std::string>());
                    st.get(key).second->reserve(16);
                }
                if (!ts.contains(val))
                {
                    ts.put(val, new std::vector<std::string>());
                    ts.get(val).second->reserve(16);
                }
                st.get(key).second->push_back(val);
                ts.get(val).second->push_back(key);
            }
            keyVec.clear();
            line.clear();
        }
        file.close();
        std::string query;
        while (std::getline(std::cin, query))
        {
            if (st.contains(query))
            {
                for (auto &&i : *st.get(query).second)
                {
                    std::cout << " " << i << "\n";
                }
            }
            if (ts.contains(query))
            {
                for (auto &&i : *ts.get(query).second)
                {
                    std::cout << " " << i << "\n";
                }
            }
        }
        auto delstVec = st.keyVec();
        auto deltsVec = st.keyVec();
        for (auto &&i : delstVec)
        {
            delete st.get(i).second;
        }
        for (auto &&i : deltsVec)
        {
            delete ts.get(i).second;
        }
    }

    //Build the query index symbol table. The parameters are the number of data file names - 1 and the character group of data file names.
    void fileIndex(int argc, char *argv[])
    {
        ST::LPHashST<std::string, ST::hashSet<std::string> *> st;
        std::string filename;
        std::string word;
        std::fstream file;
        for (int i = 1; i != argc; ++i)
        {
            file.open(argv[i]);
            while (file >> word)
            {
                if (!st.contains(word))
                {
                    st.put(word, new ST::hashSet<std::string>());
                }
                st.get(word).second->add(argv[i]);
            }
            file.close();
        }
        std::string query;
        while (std::cin >> query)
        {
            if (st.contains(query))
            {
                auto prtvec = st.get(query).second->keyVec();
                for (auto &&i : prtvec)
                {
                    std::cout << " " << i << "\n";
                }
            }
        }
        auto delvec = st.keyVec();
        for (auto &&i : delvec)
        {
            delete st.get(i).second;
        }
    }

Keywords: C++ data structure HashMap

Added by jofield on Sat, 05 Mar 2022 09:54:29 +0200