PostgreSQL source code analysis - storage management - external storage management

2021SC@SDUSC

Summary

In the previous article, PostgreSQL's TOAST mechanism was analyzed. TOAST is PostgreSQL's automatic mechanism for handling variable-length (large) values; it cannot handle BLOBs and CLOBs, the large objects that users manipulate explicitly. For such large objects, PostgreSQL provides another mechanism for dealing with large data: large object storage.
According to the official documentation, the large object storage mechanism supports the following types:

  1. BLOB, a binary large object, used to store pictures, video, mixed media, etc.
  2. CLOB, a character large object, used to store large single-byte character set data, such as documents.
  3. DBCLOB, a double-byte character large object, used to store large double-byte character set data, such as variable-length double-byte graphic strings.

Source code analysis

Storage of large objects

All large objects are stored in a system table called pg_largeobject. The structure of this system table is as follows:

  Attribute name   Attribute type   Description
  loid             oid              Identifier of the large object that contains this page
  pageno           int4             Page number of this page within its large object (counting from 0)
  data             bytea            Data actually stored in the large object; never more than LOBLKSIZE bytes, and may be less

The pg_largeobject catalog holds the data that makes up "large objects". A large object is assigned an OID when it is created. Each large object is broken into segments, or "pages" (each tuple is one page), so that it can be stored conveniently as rows in pg_largeobject. The amount of data per page is defined as LOBLKSIZE (currently BLCKSZ/4, or 2 kB).

The page size is set to 2 kB for the following reasons:

  1. Updating a tuple does not take up too much space.
  2. The TOAST mechanism also limits tuples to 2 kB, so TOAST can be triggered to reduce the tuple size.

Data structure of large objects

The source code is located in src/include/storage/large_object.h. This data structure stores a currently open large object, and it is the object operated on by the large object management functions analyzed below. It is named LargeObjectDesc, short for "large object descriptor". Its structure is:

typedef struct LargeObjectDesc
{
	Oid			id;				//OID of the large object
	Snapshot	snapshot;		//Snapshot of the large object, used to check data visibility when reading and writing it
	SubTransactionId subid;		//ID of the subtransaction that opened the large object
	uint64		offset;			//Current read/write position, like the seek offset of a file opened in C
	int			flags;			//Flag bits, see below

//Definition of the flag bits
#define IFS_RDLOCK 		 (1 << 0) 	// Read lock: the large object is open for reading
#define IFS_WRLOCK 		 (1 << 1) 	// Write lock: the large object is open for writing

} LargeObjectDesc;

Snapshot technology

PostgreSQL provides developers with a rich set of tools for managing concurrent access to data. Internally, data consistency is maintained using a multiversion model (Multiversion Concurrency Control, MVCC). This means that each SQL statement sees only a snapshot of the data (a database version) as of some moment, regardless of the current state of the underlying data. This protects statements from seeing inconsistent data produced by other concurrent transactions updating the same rows, and provides transaction isolation for each database session. MVCC avoids the traditional locking approach of database systems and minimizes lock contention, allowing reasonable performance in multi-user environments.

Management of large objects

The source file is located in src/backend/storage/large_object/inv_api.c, which contains the user-level large object management functions.
The "inv" prefix is short for Inversion, the name of the large object file system in the original Berkeley POSTGRES, not inversion of control.

Create large objects

Creating a large object is implemented by the inv_create function. The parameter passed in is:

  • lobjId, the OID of the new large object to be created; if an invalid OID is passed, an OID is generated automatically

The whole process of creating a large object is:
Call the LargeObjectCreate function to create an empty large object (by inserting a metadata tuple). This function examines the input parameter: if it is an invalid OID (no OID specified), an OID is allocated automatically; if an OID was specified, the metadata system table is checked for that OID, and if it already exists an error is reported.

Oid
inv_create(Oid lobjId)
{
	Oid			lobjId_new;

	lobjId_new = LargeObjectCreate(lobjId);//Create a large object with empty data
	//Analyzed in detail below: the input OID decides whether to generate an OID automatically or use the given one

	recordDependencyOnOwner(LargeObjectRelationId,
							lobjId_new, GetUserId());//Register the current user as the owner of the new large object
	//This function is not the focus of the analysis, so its source is not shown

	InvokeObjectPostCreateHook(LargeObjectRelationId, lobjId_new, 0);
	
	CommandCounterIncrement();//Make the new tuple visible to later operations

	return lobjId_new;//Return the OID of the newly created large object
}

The LargeObjectCreate function
Located in the src/backend/catalog/pg_largeobject.c file.
Creating a large object with the OID given by the input parameter really just inserts an entry into pg_largeobject_metadata, with no data, so an empty (size 0) large object is created.

Oid
LargeObjectCreate(Oid loid)
{
	Relation	pg_lo_meta;//pg_largeobject_metadata table
	HeapTuple	ntup;//A temporary variable that temporarily holds tuples
	Oid			loid_new;//OID used to create large objects
	Datum		values[Natts_pg_largeobject_metadata];//Column values for the new metadata tuple
	bool		nulls[Natts_pg_largeobject_metadata];//Which columns of the tuple are null

	pg_lo_meta = table_open(LargeObjectMetadataRelationId,
							RowExclusiveLock);//Open the metadata system table with a row-exclusive lock

	//Initialize the values and nulls arrays for the new metadata tuple
	memset(values, 0, sizeof(values));
	memset(nulls, false, sizeof(nulls));

	if (OidIsValid(loid))//If the OID to create a new large object is given
		loid_new = loid;//Direct assignment
	else//There is no OID for the new large object
		loid_new = GetNewOidWithIndex(pg_lo_meta,
									  LargeObjectMetadataOidIndexId,
									  Anum_pg_largeobject_metadata_oid);//Automatically generate an OID

	values[Anum_pg_largeobject_metadata_oid - 1] = ObjectIdGetDatum(loid_new);//Set the OID of the large object
	values[Anum_pg_largeobject_metadata_lomowner - 1]
		= ObjectIdGetDatum(GetUserId());//Set the owner of the large object
	nulls[Anum_pg_largeobject_metadata_lomacl - 1] = true;//No ACL yet, so the column is null

	ntup = heap_form_tuple(RelationGetDescr(pg_lo_meta),
						   values, nulls);//Get tuple

	CatalogTupleInsert(pg_lo_meta, ntup);//Insert the tuple into the catalog and update its indexes
	//To locate large objects quickly, PostgreSQL keeps an index on the catalog, so lookups can go through the index.

	heap_freetuple(ntup);//Free memory occupied by tuples

	table_close(pg_lo_meta, RowExclusiveLock);//Close open system tables

	return loid_new;//Returns the OID of the created large object
}

About Datum

Datum is a generic type that holds the internal representation of any value that can be stored in a PostgreSQL table.
It can be converted to a specific data type using the DatumGetXXX macros.

The details are in the src\include\postgres.h file.
Definition of Datum data type:

typedef uintptr_t Datum;

uintptr_t (defined in <stdint.h>) is an unsigned integer type wide enough to hold a pointer.

Open large object

Opening an existing large object is implemented by the inv_open function.
The parameters passed in are:

  • lobjId OID of large object
  • flags flag bit, indicating whether to open large objects in read-only or read-write form
  • mcxt memory context

It returns a large object descriptor (which was analyzed in detail above).
Basic process:

  1. Initialize some basic attributes (descriptor pointer, flag bits, snapshot); derive the flag bits from the input flags and report an error if they are invalid.
  2. Obtain a snapshot according to the flag bits.
  3. Check read/write permission using the snapshot and related information; if it does not match the requested flags, report an error.
  4. Once the checks pass, allocate memory and fill in the large object descriptor.
  5. Return the large object descriptor.

The detailed analysis is as follows:

LargeObjectDesc *
inv_open(Oid lobjId, int flags, MemoryContext mcxt)
{
	LargeObjectDesc *retval;//Used to store the generated large object descriptor
	Snapshot	snapshot = NULL;//snapshot
	int			descflags = 0;//Large object descriptor flag bit, initialized to 0.

The INV flag bits are defined as follows:

#define INV_WRITE 		 0x00020000 	// open for writing
#define INV_READ 		 0x00040000 	// open for reading
//INV_WRITE | INV_READ (0x00060000) therefore means open for reading and writing
//Testing with the & operation handles all of these combinations

These flag bits are used in the judgments below:

	//Test the flag bits with & rather than ==, so that INV_READ | INV_WRITE (open for both) is handled correctly
	if (flags & INV_WRITE)//If the write bit is set
		descflags |= IFS_WRLOCK | IFS_RDLOCK; //A descriptor opened for writing may also be read: INV_WRITE implies read access
		//Since descflags starts at 0, the or-assignment sets the write (and read) bits
		
	if (flags & INV_READ)//If the read bit is set
		descflags |= IFS_RDLOCK;
		//Since descflags starts at 0, the or-assignment sets the read bit
		
	if (descflags == 0)//After the above assignment, if descflags is still 0, the incoming flags are abnormal
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("invalid flags for opening a large object: %d",
						flags)));//Error reporting termination

	if (descflags & IFS_WRLOCK)//If the write bit is set
		snapshot = NULL;//A NULL snapshot denotes an instantaneous (up-to-date) view, needed for writing
	else//Read-only
		snapshot = GetActiveSnapshot();//Use the active snapshot

	if (!myLargeObjectExists(lobjId, snapshot))//If the large object does not exist under this snapshot
		ereport(ERROR,
				(errcode(ERRCODE_UNDEFINED_OBJECT),
				 errmsg("large object %u does not exist", lobjId)));//Error reporting termination

	//Permission check
	if ((descflags & IFS_RDLOCK) != 0)//If the flag bit is read
	{
		if (!lo_compat_privileges &&
			pg_largeobject_aclcheck_snapshot(lobjId,
											 GetUserId(),
											 ACL_SELECT,
											 snapshot) != ACLCHECK_OK)//Check the SELECT permission and the specified snapshot
			ereport(ERROR,
					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
					 errmsg("permission denied for large object %u",
							lobjId)));//If you do not have permission, an error is reported and access is not allowed
	}
	if ((descflags & IFS_WRLOCK) != 0)//If the flag bit is write
	{
		if (!lo_compat_privileges &&
			pg_largeobject_aclcheck_snapshot(lobjId,
											 GetUserId(),
											 ACL_UPDATE,
											 snapshot) != ACLCHECK_OK)//Check the UPDATE permission and the specified snapshot
			ereport(ERROR,
					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
					 errmsg("permission denied for large object %u",
							lobjId)));//If you do not have permission, an error is reported and access is not allowed
	}

	//After checking that there is no problem, you can create a large object descriptor
	retval = (LargeObjectDesc *) MemoryContextAlloc(mcxt,
													sizeof(LargeObjectDesc));//Create a LargeObjectDesc object using the memory context
	retval->id = lobjId;//Assignment OID
	retval->subid = GetCurrentSubTransactionId();//Assign sub transaction ID
	retval->offset = 0;//Assignment read / write pointer offset
	retval->flags = descflags;//Assignment flag bit

	if (snapshot)//If the snapshot is not null
		snapshot = RegisterSnapshotOnOwner(snapshot,
										   TopTransactionResourceOwner);//Register the snapshot according to the owner of the transaction resource, because the snapshot must remain active until the large object is closed.
	retval->snapshot = snapshot;//Assignment snapshot

	return retval;//Returns the constructed large object descriptor
}

Read large objects

The contents of a large object are read by calling the inv_read function.
Input parameters:

  • obj_desc, the large object descriptor, which identifies the large object to read and whose flags determine whether reading is permitted
  • buf buffer
  • nbytes number of bytes to read

Output results:
An integer that is the number of bytes successfully read.

Main process:
In a while loop, each time a tuple (page) is fetched, n, the number of bytes that can be read from the current page, is computed from the read/write pointer offset and the page's position and size. It is compared with (nbytes - nread), the number of bytes requested minus the number already read, to decide how many bytes to read this time, and that many bytes are copied into the buffer. If a page is missing, the same bookkeeping is done, except that the corresponding bytes are set to 0. The loop terminates once the requested number of bytes has been read.
The detailed analysis is as follows:

int
inv_read(LargeObjectDesc *obj_desc, char *buf, int nbytes)
{
	int			nread = 0;//Stores the number of bytes read
	int64		n;//Byte length read at a time
	int64		off;//In page offset
	int			len;//Length of data field in tuple (page)
	int32		pageno = (int32) (obj_desc->offset / LOBLKSIZE);//Page (block) number, calculated by read / write pointer offset / block size
	uint64		pageoff;//Page offset is the byte position of the page in the whole file.
	ScanKeyData skey[2];
	SysScanDesc sd;
	HeapTuple	tuple;

	Assert(PointerIsValid(obj_desc));//Determine whether it is a valid large object descriptor
	Assert(buf != NULL);//If the buffer pointer is NULL, an error is reported

	if ((obj_desc->flags & IFS_RDLOCK) == 0)//Test with & as described above: if the read bit is not set
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("permission denied for large object %u",
						obj_desc->id)));//Report an error and refuse to read

	if (nbytes <= 0)//If the number of bytes to be read is < = 0, no read is returned directly
		return 0;

	open_lo_relation();//Call the open_lo_relation function to open the large object relation

	//The following statements set up the scan keys and start an ordered index scan of pg_largeobject; not the focus of this analysis.
	ScanKeyInit(&skey[0],
				Anum_pg_largeobject_loid,
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(obj_desc->id));
	ScanKeyInit(&skey[1],
				Anum_pg_largeobject_pageno,
				BTGreaterEqualStrategyNumber, F_INT4GE,
				Int32GetDatum(pageno));
	sd = systable_beginscan_ordered(lo_heap_r, lo_index_r,
									obj_desc->snapshot, 2, skey);

	while ((tuple = systable_getnext_ordered(sd, ForwardScanDirection)) != NULL)//If a tuple is read
	{
		Form_pg_largeobject data;//Storing large object table data
		bytea	   *datafield;//Pointer to the storage data field
		bool		pfreeit;//Whether to release memory is determined by getdatafield function.

		if (HeapTupleHasNulls(tuple))//If the tuple is empty
			elog(ERROR, "null field found in pg_largeobject");
		data = (Form_pg_largeobject) GETSTRUCT(tuple);//Get large object
		 
		pageoff = ((uint64) data->pageno) * LOBLKSIZE;//Calculate the page offset, that is, page number * page size
		if (pageoff > obj_desc->offset)//If the current page starts beyond the read/write pointer, there is a hole (missing pages) before it
		{
			n = pageoff - obj_desc->offset;//Byte distance from the read/write pointer to the current page
			n = (n <= (nbytes - nread)) ? n : (nbytes - nread);//Clamp n to the number of bytes still wanted (nbytes - nread); the ternary operator keeps the code short
			MemSet(buf + nread, 0, n);//Zero-fill n bytes of memory starting at buf + nread, since the hole reads back as zeroes
			nread += n;//n more bytes have been read
			obj_desc->offset += n;//Advance the read/write pointer by n, bringing it up to the page start
		}
		}

		if (nread < nbytes)//If fewer bytes have been read than requested
		{
			Assert(obj_desc->offset >= pageoff);//The read/write pointer must not be before the page start; otherwise terminate with an error
			off = (int) (obj_desc->offset - pageoff);//off is the in-page offset: the read/write pointer minus the page start gives how far into the page we are
			Assert(off >= 0 && off < LOBLKSIZE);//off must lie within the page, otherwise an error is reported

			getdatafield(data, &datafield, &len, &pfreeit);//Gets the data field of the corresponding tuple (page)
			//len is the length of the data field
			if (len > off)//If the length of the data field is greater than the in page offset
			{
				n = len - off;//Number of bytes available in this page from the current position
				n = (n <= (nbytes - nread)) ? n : (nbytes - nread);//Clamp as above
				memcpy(buf + nread, VARDATA(datafield) + off, n);//Call the memcpy function to copy the data into buf
				nread += n;//More bytes have been read n
				obj_desc->offset += n;//Read / write pointer forward n
			}
			if (pfreeit)//Free up memory if necessary
				pfree(datafield);//Freeing memory for datafield
		}

		if (nread >= nbytes)//If the number of bytes read is greater than or equal to the number of bytes required
			break;//Terminate cycle
	}

	systable_endscan_ordered(sd);

	return nread;//Returns the number of bytes that have been read
}

Write large object

Data is written to a large object by calling the inv_write function.
The parameters passed in are:

  • obj_desc large object descriptor
  • buf data buffer to write
  • nbytes the length of data to write

The returned result is the number of bytes successfully written

The main process is:
Similar to reading large objects above, the work happens in a while loop that continues as long as the number of bytes written is less than the number requested. First, it decides whether the next existing page needs to be fetched, and fetches it if so; otherwise it keeps working with the current page. If the current position falls on the fetched page, any hole before the write position is first filled with zeroes and the data is then inserted at the correct position; if not, a completely new page is formed, with the same zero-filling and insertion. Finally, the number of bytes written is returned.
The detailed analysis is as follows:

int
inv_write(LargeObjectDesc *obj_desc, const char *buf, int nbytes)
{
	int			nwritten = 0;//Number of bytes written
	int			n;//Length of one write
	int			off;//In page offset
	int			len;//Length of data field
	int32		pageno = (int32) (obj_desc->offset / LOBLKSIZE);//The current page number to be written is calculated in the same way as reading in large objects
	ScanKeyData skey[2];
	SysScanDesc sd;
	HeapTuple	oldtuple;//Store old tuples
	Form_pg_largeobject olddata;//Store old data
	bool		neednextpage;//Need next page
	bytea	   *datafield;//Data domain
	bool		pfreeit;//Need to free memory
	union//A union used as a properly aligned work buffer
	{
		bytea		hdr;
		char		data[LOBLKSIZE + VARHDRSZ];//Enough space for one page of data plus the varlena header
		int32		align_it;//Forces alignment of the buffer
	}			workbuf;//Work buffer for building page contents
	char	   *workb = VARDATA(&workbuf.hdr);
	HeapTuple	newtup;
	Datum		values[Natts_pg_largeobject];
	bool		nulls[Natts_pg_largeobject];
	bool		replace[Natts_pg_largeobject];
	CatalogIndexState indstate;
	
	//Check the descriptor and buffer, which are the same as the read data above
	Assert(PointerIsValid(obj_desc));
	Assert(buf != NULL);

	if ((obj_desc->flags & IFS_WRLOCK) == 0)//Make sure you have write permission
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("permission denied for large object %u",
						obj_desc->id)));//If there is no write permission, an error is reported

	if (nbytes <= 0)//Expected write byte < = 0, directly return 0 (not written)
		return 0;

	//Check whether writing nbytes starting at the read/write pointer would exceed the maximum large object size
	//The addition cannot overflow, because nbytes is an int32 while offset is a uint64
	if ((nbytes + obj_desc->offset) > MAX_LARGE_OBJECT_SIZE)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("invalid large object write request size: %d",
						nbytes)));//If it exceeds, an error is reported

	open_lo_relation();

	indstate = CatalogOpenIndexes(lo_heap_r);

	//Set up the scan keys and start an ordered scan, as in inv_read; not the focus of this analysis.
	ScanKeyInit(&skey[0],
				Anum_pg_largeobject_loid,
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(obj_desc->id));
	ScanKeyInit(&skey[1],
				Anum_pg_largeobject_pageno,
				BTGreaterEqualStrategyNumber, F_INT4GE,
				Int32GetDatum(pageno));
	sd = systable_beginscan_ordered(lo_heap_r, lo_index_r,
									obj_desc->snapshot, 2, skey);
	//Initialize the following parameters
	oldtuple = NULL;
	olddata = NULL;
	neednextpage = true;

	while (nwritten < nbytes)//The number of bytes written is less than the number of bytes expected to be written, and the cycle continues
	{
		if (neednextpage)//If necessary, first obtain an existing page to store data (not necessarily used)
		{
			if ((oldtuple = systable_getnext_ordered(sd, ForwardScanDirection)) != NULL)//Get tuple (page)
			{
				if (HeapTupleHasNulls(oldtuple))//If the tuple is empty
					elog(ERROR, "null field found in pg_largeobject");//report errors
				olddata = (Form_pg_largeobject) GETSTRUCT(oldtuple);
				Assert(olddata->pageno >= pageno);//The scan returns pages in order, so the fetched page number must not be less than the current one
			}
			neednextpage = false;//After getting the next page, set the variable to false
		}

		if (olddata != NULL && olddata->pageno == pageno)//Judge whether to get the next page and whether it is the page number to be written
		{
			getdatafield(olddata, &datafield, &len, &pfreeit);//get data
			memcpy(workb, VARDATA(datafield), len);//Store data in workb buffer
			if (pfreeit)//Release if necessary
				pfree(datafield);//Release data field

			//Fill any hole inside the page
			off = (int) (obj_desc->offset % LOBLKSIZE);//In-page offset of the read/write pointer
			if (off > len)//If the in-page offset is beyond the page's existing data, there is a hole at the end of the page
				MemSet(workb + len, 0, off - len);//Zero-fill the missing bytes

			n = LOBLKSIZE - off;//Space still available in this page
			n = (n <= (nbytes - nwritten)) ? n : (nbytes - nwritten);//Clamp n to the number of bytes still to be written
			memcpy(workb + off, buf + nwritten, n);//Write the data at the corresponding position of the buffer to the workb buffer
			nwritten += n;//The number of bytes written increases by n
			obj_desc->offset += n;//Read / write pointer position increases by n
			off += n;//Intra page offset increment n
			len = (len >= off) ? len : off;//Calculate the new length of the page after writing.
			SET_VARSIZE(&workbuf.hdr, len + VARHDRSZ);

			 //The purpose of the following operation is to generate and insert the updated tuple
			memset(values, 0, sizeof(values));
			memset(nulls, false, sizeof(nulls));
			memset(replace, false, sizeof(replace));
			values[Anum_pg_largeobject_data - 1] = PointerGetDatum(&workbuf);
			replace[Anum_pg_largeobject_data - 1] = true;
			newtup = heap_modify_tuple(oldtuple, RelationGetDescr(lo_heap_r),
									   values, nulls, replace);//Build the new tuple from the old tuple's data
			CatalogTupleUpdateWithInfo(lo_heap_r, &newtup->t_self, newtup,
									   indstate);//Update the catalog with the generated tuple
			heap_freetuple(newtup);//Free the memory occupied by the new tuple

			//The old tuple has been replaced by the new one, so it is no longer needed
			oldtuple = NULL;
			olddata = NULL;
			neednextpage = true;
		}
		else//If not for the page you want to write
		{
			//A new page will be written
			off = (int) (obj_desc->offset % LOBLKSIZE);//Calculate in page offset
			if (off > 0)//If the in page offset is not 0
				MemSet(workb, 0, off);//Set the first off bytes of the buffer to 0 to fill the missing.

			n = LOBLKSIZE - off;//Space still available in this page
			n = (n <= (nbytes - nwritten)) ? n : (nbytes - nwritten);//Clamp n to the number of bytes still to be written
			memcpy(workb + off, buf + nwritten, n);//Write the data at the corresponding position of the buffer to the workb buffer
			nwritten += n;//The number of bytes written increases by n
			obj_desc->offset += n;//Read / write pointer position increases by n
			len = off + n;//Calculate the length of the new page after inserting data
			SET_VARSIZE(&workbuf.hdr, len + VARHDRSZ);

			//Generate and insert a new tuple; unlike above, there is no old tuple to base it on or to free
			memset(values, 0, sizeof(values));
			memset(nulls, false, sizeof(nulls));
			values[Anum_pg_largeobject_loid - 1] = ObjectIdGetDatum(obj_desc->id);
			values[Anum_pg_largeobject_pageno - 1] = Int32GetDatum(pageno);
			values[Anum_pg_largeobject_data - 1] = PointerGetDatum(&workbuf);
			newtup = heap_form_tuple(lo_heap_r->rd_att, values, nulls);
			CatalogTupleInsertWithInfo(lo_heap_r, newtup, indstate);
			heap_freetuple(newtup);
		}
		pageno++;//Advance to the next page number
	}

	systable_endscan_ordered(sd);
	CatalogCloseIndexes(indstate);
	CommandCounterIncrement();

	return nwritten;//Returns the number of bytes successfully written
}

Delete large objects

An existing large object is deleted by calling the inv_drop function.
Incoming parameters:

  • lobjId, the OID of the large object to be deleted

The detailed analysis is as follows:

int
inv_drop(Oid lobjId)
{
	ObjectAddress object;//Used to locate large objects

	 //Delete dependencies related to large objects
	object.classId = LargeObjectRelationId;//OID of the table where the large object is located
	object.objectId = lobjId;//OID of large object
	object.objectSubId = 0;
	performDeletion(&object, DROP_CASCADE, 0);//Call performDeletion for a cascaded deletion
	//The 0 passed in is a flags argument; the possible flag bits are described below.
	
	return 1;//For historical reasons, a return of 1 indicates that the deletion was successful
}

The flag bits accepted by performDeletion:

  • PERFORM_DELETION_INTERNAL: the deletion is an internal system call, not a user request.
  • PERFORM_DELETION_CONCURRENTLY: drop concurrently; only takes effect for dropped indexes.
  • PERFORM_DELETION_QUIETLY: lower the level of the reported messages.
  • PERFORM_DELETION_SKIP_ORIGINAL: do not delete the specified object itself, only its dependents.
  • PERFORM_DELETION_SKIP_EXTENSIONS: do not drop extensions, even if a deleted object is part of one; used when deleting temporary objects.
  • PERFORM_DELETION_CONCURRENT_LOCK: perform a normal deletion, but take the lock as a concurrent drop would.

Definition in source code:

#define PERFORM_DELETION_INTERNAL			0x0001	
#define PERFORM_DELETION_CONCURRENTLY		0x0002	
#define PERFORM_DELETION_QUIETLY			0x0004	
#define PERFORM_DELETION_SKIP_ORIGINAL		0x0008	
#define PERFORM_DELETION_SKIP_EXTENSIONS	0x0010	
#define PERFORM_DELETION_CONCURRENT_LOCK	0x0020	

Summary

Through this source code analysis, we have seen PostgreSQL's other mechanism for storing large data: large object storage. It is completely different from the mechanism analyzed in the previous article and is used explicitly by users. It also involves snapshot technology, the core of PostgreSQL's MVCC mechanism.
Overall, we learned how large objects are stored in the system tables, the data structure that describes an open large object, and the core operations on large objects.

Keywords: Database PostgreSQL

Added by TheTitans on Mon, 08 Nov 2021 08:37:40 +0200