MongoDB - M001 Chapter 5 - index and aggregation pipeline

Aggregation framework

  • Aggregation framework: just another way to query data in MongoDB. Aggregate operation placed in []

  • Aggregation framework syntax: in the form of pipes, where phases are executed in the order they are listed.

  • MongoDB Aggregation Framework ($group) (calculate & reshape data) > mongodb query language (MQL) (filter & update data)

  • Difference between MQL and aggregation framework (Eg: find all documents with Wifi as one of the conveniences. Only include price and address in the result cursor)

    • // MQL(MongoDB Query Language: )
      db.listingsAndReviews.find({ "amenities": "Wifi" },
                                 { "price": 1, "address": 1, "_id": 0 }).pretty()
      
      // Aggregation framework:
      db.listingsAndReviews.aggregate([
                                        { "$match": { "amenities": "Wifi" } },
                                        { "$project": { "price": 1,
                                                        "address": 1,
                                                        "_id": 0 }}]).pretty()
      
  • Reason for using aggregation instead of MQL lookup:

    1. Because sometimes aggregation may be used: such as grouping operation to modify data in some way.

    2. It is not always necessary to filter out the correct documents.

    3. You can perform operations other than finding and projecting data, such as using aggregation for calculation.

    4. Its ability to calculate, reshape and reorganize data exceeds the filtering ability of MQL. (the aggregation framework allows us to calculate and reshape data by using stages such as \ (group, \) sum.)

  • Pipeline operation sequence

    • The aggregation framework is used as a pipeline, and the sequence of operations in the pipeline is important
  • Aggregate pipeline data

    • How is the data processed in the pipeline?

      • We provide data to the pipeline at one end of the pipeline - > we describe how the pipeline will process our data using the aggregation phase - > the converted data appears at the end of the pipeline.
    • The data in the aggregation pipeline exists in the pipeline, and the original data will not be modified in essence

Aggregation framework example

  • Eg1: each_ IDs are counted according to different types of prices: "total": {"$sum": "$price"}

  • Eg2: find a document in the collection and only include the address field in the result cursor.

    • db.listingsAndReviews.findOne({ },{ "address": 1, "_id": 0 })
  • Eg3: only project the address field value of each document, then group all documents into one document with each address.country value, and count one for each document in each group.

    • db.listingsAndReviews.aggregate([
                                        { "$project": { "address": 1, "_id": 0 }},
                                        { "$group": { "_id": "$address.country",
                                                      "count": { "$sum": 1 } } }
                                      ])
      
  • Eg4: What room types are present in the sample_airbnb.listingsAndReviews collection? Field: room_type:"xxx"

    • db.listingsAndReviews.aggregate([
                                        { "$project": { "room_type": 1}},
                                        { "$group": { "_id": "$room_type",
                                                      "count": { "$sum": 1 } } }
                                      ])
      
      Answer: db.listingsAndReviews.aggregate([ { "$group": { "_id": "$room_type" } }])
      
      output:
      ...                               
      [
        { _id: 'Entire home/apt', count: 3489 },
        { _id: 'Private room', count: 1983 },
        { _id: 'Shared room', count: 83 }
      ]
      
  • Eg5: which of the following commands will return sample_ Name and year of establishment of the five oldest companies in the training.companies collection?

    • // wrong 
      db.companies.find({}, { "name": 1, "founded_year": 1 }). sort({ "founded_year": 1 }).limit(5)
      // The command is missing search criteria, which means that null values will be included. Excluding null values is not always necessary, because storing these types of values in MongoDB is not a best practice anyway. So many collections have no null values. However, this collection contains null values and we need to exclude them from our cursor, otherwise they will be our minimum value when we sort.
      
      // yes
      db.companies.find({ "founded_year": {"$ne":null}},{"name":1,"founded_year": 1}).sort({ "found_year": 1 }).limit(5)
      // We must first filter out the documents whose establishment year is not empty, and then project the field we want to find, that is, the name, which is founded in this case_ year.  We then sort the cursors in ascending order, so the first result will have found_ The minimum value of the year field. Finally, we limit the results to the first five documents in the cursor to get the five oldest companies in this collection.
      
      // yes
      db.companies.find({ "founded_year": {"$ne":null}},{"name":1,"founded_year": 1}).limit(5).sort({ "found_year": 1 })
      // Although the limit() and sort() methods are not listed in the correct order, MongoDB will flip their order when executing the query, providing the result that the problem prompt is looking for.
      
      // wrong
      db.companies.find({ "name":1,"founded_year": 1}).sort({ "founded_year": 1 }).limit(5)
      // It seems that the query filter has the value that the projection part of the find() command should have, and projection is not used in this command. So this command will return all name s equal to 1 and created_year is also equal to 1. Unfortunately, it missed the requirement of question prompt.
      

sort() & limit()

  • Sort() & limit() is a cursor method. Other cursor methods are: pretty(), count().

    • Cursor method does not apply to: data stored in the database, it applies to the result set located in the cursor.

    • The two are generally used in combination. First execute sort and then limit. If ` ` cursor.limit().sort(), the internal execution is still cursor.sort().limit() '.

  • Sort sort syntax: db.COLLECTION_NAME.find().sort({KEY:1,key2:-1});

    • Where 1 is in ascending order and - 1 is in descending order
  • Eg1:

    • // Sort according to the order of increasing population, and return the first document = take the document with the least population
      db.zips.find().sort({ "pop": 1 }).limit(1)  
      
      // Sort according to the increasing and decreasing order of population, and return the first document = take the document with the largest population
      db.zips.find().sort({ "pop": -1 }).limit(1) 
      
      // Multiple sorting
      db.zips.find().sort({ "pop": 1, "city": -1 }) 
      
  • Eg2: sample_ When was the youngest cyclist in the training.trips series born?

    • wrong: db.trips.find( {"birth year":{"$ne":null} } ).sort({"birth year":-1}).limit(1)  Notice what the first curly bracket says
      
      wrong: db.trips.find({ "birth year": {"$ne":null}},{"birth year": 1}).sort({ "birth year": -1 }).limit(1)
      
      yes:  db.trips.find({ "birth year": { "$ne":"" } },
                           { "birth year": 1 }).sort({ "birth year": -1 }).limit(1)
      // Filter empty strings instead of null
      

index

  • What is it?

    • An index is a special data structure that stores a small part of a collection data set in an easy to traverse form
  • Function: make query efficient

    • There are many ways to optimize query speed, but the most effective way is to add indexes
  • Summary: in short, index is a data structure that optimizes query speed

  • Eg:

    • db.trips.find({ "birth year": 1989 })
      
      // Create a single field index single field index = single field index
      db.trips.createIndex({ "birth year": 1 })
      
      //If you encounter db.trips.find ({"start station ID": 476}). Sort ({"birth year": 1})
      
      // To make the second query more efficient, you need to use composite indexes
      db.trips.createIndex({ "start station id": 1, "birth year": 1 })
      

Data modeling

  • What data modeling is: a way to organize fields in documents to support application performance and query capabilities.

  • By default, MongoDB does not enforce the default data organization.

  • So how do you decide what structure to use to store data? Where should I create a sub file? Where should I use numeric arrays? At what point should data get its own set? Making these decisions about the shape and structure of data is called data modeling. Data modeling is a way to organize fields in files to support application performance and query capabilities.

  • Data is stored in a way that uses. When using MongoDB for data modeling, the data accessed together should be stored together.

  • As applications change and evolve, your data model should also evolve. MongoDB is built for rapid data model change and evolution.

Upsert

  • What is Upsert: update and insert blends.

  • Syntax: db.collection_ name.updateOne( {<query>}, {<update>}, {"upsert":true}) .

    • By default, upsert is false. If true, update / insert operations can be performed. Update if any, insert if none.

Keywords: Database

Added by gamber on Sat, 04 Dec 2021 01:44:17 +0200