How to parse container images in Go code

Author: Mu Qi
Source: Ali technical official account

I. Background

Container images play an extremely important role in our daily development work. Usually, we package an application into a container image and push it to an image registry, then pull it in the production environment. A container runtime such as docker/containerd then starts the image and runs the application. For some operations platforms, however, scanning and analyzing the image artifact itself is the real focus. This article briefly introduces how to parse a container image in code.

II. go-containerregistry

go-containerregistry is an open source project from Google. It provides an interface for working with images. The resources behind this interface can be remote images in a registry, image tarballs, or even the Docker daemon. Let's briefly introduce how to use this project to achieve our goal: parsing images in code.

Besides the packages that can be imported as a library, the project also provides crane (a client for interacting with remote images) and gcrane (a client for interacting with GCR).

III. Basic interfaces

1. Basic image concepts

Before introducing the specific interfaces, here are a few simple concepts:

  • ImageIndex: according to the OCI specification, a data structure that exists to support multi-architecture (amd64, arm64) images. Multiple images can be associated in one ImageIndex. With the same image tag, the client (docker, ctr) pulls the image that matches the architecture of the machine the client runs on
  • Image Manifest: basically corresponds to one image and contains the digests of all its layers. When pulling an image, the client usually fetches the manifest first and then pulls each layer (tar+gzip) listed in it
  • Image Config: has a one-to-one correspondence with the Image Manifest. It mainly contains basic configuration of the image, such as the creation time, the author, the architecture, and the DiffIDs of the layers (hashes of the uncompressed changesets, from which the ChainIDs are derived). The IMAGE ID shown by docker images on the host is generally the hash of the Image Config.
  • Layer: an image layer. A layer carries no runtime information (environment variables and so on), only file system data. An image is composed of the bottom rootfs layer plus the changeset of each subsequent layer (files added, updated, or deleted relative to the layers below).
  • Layer DiffID: the hash of the uncompressed layer. It is what you usually see in a local environment, because the client downloads the Image Config, and the Image Config references layers by DiffID.
  • Layer digest: the hash of the compressed layer. It is what you usually see in an image registry, because the manifest references layers by digest.
  • There is no direct conversion between the two. At present, the only way to relate them is by position: the n-th layer in the manifest corresponds to the n-th DiffID in the config.
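
The difference between a DiffID and a digest can be reproduced with the standard library alone. The following is a minimal sketch, not how any particular client computes them internally: it builds a one-file layer in memory, then hashes the uncompressed tar (DiffID) and the gzip-compressed blob (digest). The helper names buildLayer, gzipBytes and sha256Hex, as well as the file name and content, are made up for illustration.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// buildLayer returns an uncompressed tar archive holding a single regular file.
func buildLayer(name, content string) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	hdr := &tar.Header{
		Typeflag: tar.TypeReg,
		Name:     name,
		Mode:     0o644,
		Size:     int64(len(content)),
		ModTime:  time.Unix(0, 0), // fixed timestamp keeps the archive reproducible
	}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write([]byte(content)); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gzipBytes compresses b with gzip, as is done for layer blobs in a registry.
func gzipBytes(b []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(b); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// sha256Hex formats a hash in the "sha256:<hex>" form used by image tooling.
func sha256Hex(b []byte) string {
	sum := sha256.Sum256(b)
	return "sha256:" + hex.EncodeToString(sum[:])
}

func main() {
	raw, err := buildLayer("etc/hostname", "demo\n")
	if err != nil {
		panic(err)
	}
	compressed, err := gzipBytes(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("DiffID (uncompressed tar):", sha256Hex(raw))
	fmt.Println("digest (tar+gzip blob):   ", sha256Hex(compressed))
}
```

Because the digest is taken over the compressed bytes, it depends on how the layer was compressed, while the DiffID depends only on the tar content. This is why one cannot be derived from the other without the layer data itself.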

[Figure: summary of the image concepts above]

// ImageIndex defines the interface to interact with OCI ImageIndex
type ImageIndex interface {
  // Returns the MediaType of the current imageIndex
  MediaType() (types.MediaType, error)

  // Returns the sha256 value of this ImageIndex manifest.
  Digest() (Hash, error)

  // Returns the size of this ImageIndex manifest
  Size() (int64, error)

  // Returns the manifest structure of this ImageIndex
  IndexManifest() (*IndexManifest, error)

  // Returns the array of manifest bytes of this ImageIndex
  RawManifest() ([]byte, error)

  // Returns the Image referenced by this ImageIndex
  Image(Hash) (Image, error)

  // Returns the ImageIndex referenced by this ImageIndex
  ImageIndex(Hash) (ImageIndex, error)
}

// Image defines the interface to interact with OCI Image
type Image interface {
  // Returns all layers of the current image. The oldest / base layer comes first in the array, and the topmost / latest layer comes last
  Layers() ([]Layer, error)

  // Returns the MediaType of the current image
  MediaType() (types.MediaType, error)

  // Returns the size of the Image manifest
  Size() (int64, error)

  // Returns the hash value of the ConfigFile of the image, which is also the ImageID of the image
  ConfigName() (Hash, error)

  // Returns the ConfigFile of this image
  ConfigFile() (*ConfigFile, error)

  // Returns the byte array of ConfigFile for this image
  RawConfigFile() ([]byte, error)

  // Returns the sha256 value of this Image Manifest
  Digest() (Hash, error)

  // Return this Image Manifest
  Manifest() (*Manifest, error)

  // Returns the bytes array of ImageManifest
  RawManifest() ([]byte, error)

  // Return a layer in the image and find it according to digest (compressed hash value)
  LayerByDigest(Hash) (Layer, error)

  // Return a layer in the image and find it according to diffid (uncompressed hash value)
  LayerByDiffID(Hash) (Layer, error)
}

// Layer defines the interface to access the specific layer of OCI Image
type Layer interface {
  // Returns the sha256 value of the compressed layer
  Digest() (Hash, error)

  // Returns the sha256 value of the uncompressed layer
  DiffID() (Hash, error)

  // Returns the compressed layer
  Compressed() (io.ReadCloser, error)

  // Returns the uncompressed layer
  Uncompressed() (io.ReadCloser, error)

  // Returns the size of the compressed layer
  Size() (int64, error)

  // Returns the MediaType of the current layer
  MediaType() (types.MediaType, error)
}

The methods of each interface are described in the comments above and are not repeated here.
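
The data behind Manifest() and ConfigFile() is plain JSON, so the "match by position" rule for digests and DiffIDs can be shown without the library. The sketch below assumes the OCI field names layers[].digest and rootfs.diff_ids; the type names, the pairLayers helper and the sample hash strings are hypothetical placeholders, not real image data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// manifest and configFile model only the fields needed for this example.
type manifest struct {
	Layers []struct {
		Digest string `json:"digest"`
	} `json:"layers"`
}

type configFile struct {
	RootFS struct {
		DiffIDs []string `json:"diff_ids"`
	} `json:"rootfs"`
}

// layerPair ties the compressed digest of a layer to its uncompressed DiffID.
type layerPair struct {
	Digest string
	DiffID string
}

// pairLayers matches manifest layers and config diff_ids by their position.
func pairLayers(rawManifest, rawConfig []byte) ([]layerPair, error) {
	var m manifest
	if err := json.Unmarshal(rawManifest, &m); err != nil {
		return nil, fmt.Errorf("decode manifest: %w", err)
	}
	var c configFile
	if err := json.Unmarshal(rawConfig, &c); err != nil {
		return nil, fmt.Errorf("decode config: %w", err)
	}
	if len(m.Layers) != len(c.RootFS.DiffIDs) {
		return nil, fmt.Errorf("manifest has %d layers but config has %d diff_ids",
			len(m.Layers), len(c.RootFS.DiffIDs))
	}
	pairs := make([]layerPair, len(m.Layers))
	for i, l := range m.Layers {
		pairs[i] = layerPair{Digest: l.Digest, DiffID: c.RootFS.DiffIDs[i]}
	}
	return pairs, nil
}

// Placeholder documents; the hash strings are not real hashes.
const sampleManifest = `{"schemaVersion":2,"layers":[{"digest":"sha256:digest-0"},{"digest":"sha256:digest-1"}]}`
const sampleConfig = `{"rootfs":{"type":"layers","diff_ids":["sha256:diffid-0","sha256:diffid-1"]}}`

func main() {
	pairs, err := pairLayers([]byte(sampleManifest), []byte(sampleConfig))
	if err != nil {
		panic(err)
	}
	for i, p := range pairs {
		fmt.Printf("layer %d: digest=%s diffID=%s\n", i, p.Digest, p.DiffID)
	}
}
```

With go-containerregistry the same inputs would presumably come from RawManifest() and RawConfigFile() on an Image.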

IV. Obtaining image metadata

Let's use the remote mode (pulling a remote image) as an example.

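package main

// Note: types, env and token are project-specific packages that are not shown in this article.
import (
  "context"
  "crypto/tls"
  "fmt"
  "net/http"
  "time"

  "github.com/google/go-containerregistry/pkg/authn"
  "github.com/google/go-containerregistry/pkg/name"
  v1 "github.com/google/go-containerregistry/pkg/v1"
  "github.com/google/go-containerregistry/pkg/v1/remote"
)

func main() {
  // "xxx" is a placeholder for the image reference
  ref, err := name.ParseReference("xxx")
  if err != nil {
    panic(err)
  }

  option, err := GetDockerOption()
  if err != nil {
    panic(err)
  }

  img, _, err := tryRemote(context.TODO(), ref, option)
  if err != nil {
    panic(err)
  }

  // do stuff with img
  _ = img
}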

type DockerOption struct {
  // Auth
  UserName string
  Password string

  // RegistryToken is a bearer token to be sent to a registry
  RegistryToken string

  // ECR
  AwsAccessKey    string
  AwsSecretKey    string
  AwsSessionToken string
  AwsRegion       string

  // GCP
  GcpCredPath string

  InsecureSkipTLSVerify bool
  NonSSL                bool
  SkipPing              bool // this is ignored now
  Timeout               time.Duration
}

func GetDockerOption() (types.DockerOption, error) {
  cfg := DockerConfig{}
  if err := env.Parse(&cfg); err != nil {
    return types.DockerOption{}, fmt.Errorf("unable to parse environment variables: %w", err)
  }

  return types.DockerOption{
    UserName:              cfg.UserName,
    Password:              cfg.Password,
    RegistryToken:         cfg.RegistryToken,
    InsecureSkipTLSVerify: cfg.Insecure,
    NonSSL:                cfg.NonSSL,
  }, nil
}

func tryRemote(ctx context.Context, ref name.Reference, option types.DockerOption) (v1.Image, extender, error) {
  var remoteOpts []remote.Option
  if option.InsecureSkipTLSVerify {
    t := &http.Transport{
      TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
    }
    remoteOpts = append(remoteOpts, remote.WithTransport(t))
  }

  domain := ref.Context().RegistryStr()
  auth := token.GetToken(ctx, domain, option)

  if auth.Username != "" && auth.Password != "" {
    remoteOpts = append(remoteOpts, remote.WithAuth(&auth))
  } else if option.RegistryToken != "" {
    bearer := authn.Bearer{Token: option.RegistryToken}
    remoteOpts = append(remoteOpts, remote.WithAuth(&bearer))
  } else {
    remoteOpts = append(remoteOpts, remote.WithAuthFromKeychain(authn.DefaultKeychain))
  }

  desc, err := remote.Get(ref, remoteOpts...)
  if err != nil {
    return nil, nil, err
  }

  img, err := desc.Image()
  if err != nil {
    return nil, nil, err
  }

  // Return v1.Image if the image is found in Docker Registry
  return img, remoteExtender{
    ref:        implicitReference{ref: ref},
    descriptor: desc,
  }, nil
}

After tryRemote returns, you have an instance of the Image object and can operate on it. A few key points:

  • remote.Get() only fetches the manifest list or manifest of the image, not the entire image.
  • desc.Image() checks the media type returned by remote.Get(). If it is an image manifest, an Image is returned directly. If it is a manifest list, the entry for the requested platform is resolved and the corresponding image is returned (go-containerregistry defaults to linux/amd64 unless a platform is passed with remote.WithPlatform). The image layers are not pulled here either.
  • All data is loaded lazily: it is fetched only when you need it.
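
To make the manifest-list case more concrete, here is a minimal standard-library sketch of choosing an entry from an OCI image index by platform. It illustrates the idea only and is not the library's implementation. The types, the selectManifest helper and the sample document (including its placeholder digests) are assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// platform and descriptor model the subset of the OCI image index used here.
type platform struct {
	Architecture string `json:"architecture"`
	OS           string `json:"os"`
}

type descriptor struct {
	MediaType string    `json:"mediaType"`
	Digest    string    `json:"digest"`
	Size      int64     `json:"size"`
	Platform  *platform `json:"platform,omitempty"`
}

type imageIndex struct {
	SchemaVersion int          `json:"schemaVersion"`
	Manifests     []descriptor `json:"manifests"`
}

// selectManifest returns the first manifest whose platform matches osName/arch.
func selectManifest(rawIndex []byte, osName, arch string) (descriptor, error) {
	var idx imageIndex
	if err := json.Unmarshal(rawIndex, &idx); err != nil {
		return descriptor{}, fmt.Errorf("decode index: %w", err)
	}
	for _, m := range idx.Manifests {
		if m.Platform != nil && m.Platform.OS == osName && m.Platform.Architecture == arch {
			return m, nil
		}
	}
	return descriptor{}, fmt.Errorf("no manifest for platform %s/%s", osName, arch)
}

// sampleIndex is a made-up index; the digests are placeholders, not real hashes.
const sampleIndex = `{
  "schemaVersion": 2,
  "manifests": [
    {"mediaType": "application/vnd.oci.image.manifest.v1+json",
     "digest": "sha256:placeholder-amd64", "size": 100,
     "platform": {"architecture": "amd64", "os": "linux"}},
    {"mediaType": "application/vnd.oci.image.manifest.v1+json",
     "digest": "sha256:placeholder-arm64", "size": 100,
     "platform": {"architecture": "arm64", "os": "linux"}}
  ]
}`

func main() {
	d, err := selectManifest([]byte(sampleIndex), "linux", "arm64")
	if err != nil {
		panic(err)
	}
	fmt.Println("selected manifest:", d.Digest)
}
```

Only the small index document is needed to make this choice, which fits the point above that no layers are pulled at this stage.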

V. Reading system software information from the image

From the interface definitions above, image.LayerByDiffID(Hash) (Layer, error) returns a Layer object. With the layer in hand, calling layer.Uncompressed() returns an io.ReadCloser over the uncompressed layer, that is, a tar stream.

// tarOnceOpener reads the file only once and shares the content so that multiple analyzers can use the same data
func tarOnceOpener(r io.Reader) func() ([]byte, error) {
  var once sync.Once
  var b []byte
  var err error

  return func() ([]byte, error) {
    once.Do(func() {
      b, err = ioutil.ReadAll(r)
    })
    if err != nil {
      return nil, xerrors.Errorf("unable to read tar file: %w", err)
    }
    return b, nil
  }
}

// WalkLayerTar traverses the whole tar stream. For each entry it first parses the file path (directory and file name), then calls analyzeFn to parse the contents of the file.
func WalkLayerTar(layer io.Reader, analyzeFn WalkFunc) ([]string, []string, error) {
  var opqDirs, whFiles []string
  result := new(AnalysisResult) // allocated so that analyzers can merge their results into it
  tr := tar.NewReader(layer)
  opq := ".wh..wh..opq"
  wh := ".wh."
  for {
    hdr, err := tr.Next()
    if err == io.EOF {
      break
    }
    if err != nil {
      return nil, nil, xerrors.Errorf("failed to extract the archive: %w", err)
    }

    filePath := hdr.Name
    filePath = strings.TrimLeft(filepath.Clean(filePath), "/")
    fileDir, fileName := filepath.Split(filePath)

    // e.g. etc/.wh..wh..opq
    if opq == fileName {
      opqDirs = append(opqDirs, fileDir)
      continue
    }
    // etc/.wh.hostname
    if strings.HasPrefix(fileName, wh) {
      name := strings.TrimPrefix(fileName, wh)
      fpath := filepath.Join(fileDir, name)
      whFiles = append(whFiles, fpath)
      continue
    }

    if hdr.Typeflag == tar.TypeSymlink || hdr.Typeflag == tar.TypeLink || hdr.Typeflag == tar.TypeReg {
      err = analyzeFn(filePath, hdr.FileInfo(), tarOnceOpener(tr), result)
      if err != nil {
        return nil, nil, xerrors.Errorf("failed to analyze file: %w", err)
      }
    }
  }

  return opqDirs, whFiles, nil
}

// analyzeFn calls every interested driver to parse the same file.
// Note: drivers, limit (a weighted semaphore), ctx and dir are assumed to be defined
// in the enclosing scope; they are not shown in this excerpt.
func analyzeFn(filePath string, info os.FileInfo, opener analyzer.Opener, result *AnalysisResult) error {
    if info.IsDir() {
        return nil
    }

    var wg sync.WaitGroup
    for _, d := range drivers {
      // filepath extracted from tar file doesn't have the prefix "/"
      if !d.Required(strings.TrimLeft(filePath, "/"), info) {
        continue
      }
      b, err := opener()
      if err != nil {
        return xerrors.Errorf("unable to open a file (%s): %w", filePath, err)
      }

      if err = limit.Acquire(ctx, 1); err != nil {
        return xerrors.Errorf("semaphore acquire: %w", err)
      }
      wg.Add(1)

      go func(a analyzer, target AnalysisTarget) {
        defer limit.Release(1)
        defer wg.Done()

        ret, err := a.Analyze(target)
        if err != nil && !xerrors.Is(err, aos.AnalyzeOSError) {
          // A goroutine cannot return a value to the caller, so the error is only logged
          log.Logger.Debugf("Analysis error: %s", err)
          return
        }
        result.Merge(ret)
      }(d, AnalysisTarget{Dir: dir, FilePath: filePath, Content: b})
    }

    // Wait until all drivers have finished with this file
    wg.Wait()
    return nil
}

// An example driver: parses the list of installed Alpine (apk) packages from a file in the tar archive
func (a alpinePkgAnalyzer) Analyze(target analyzer.AnalysisTarget) (*analyzer.AnalysisResult, error) {
  scanner := bufio.NewScanner(bytes.NewBuffer(target.Content))
  var pkgs []types.Package
  var pkg types.Package
  var version string
  for scanner.Scan() {
    line := scanner.Text()

    // check package if paragraph end
    if len(line) < 2 {
      if analyzer.CheckPackage(&pkg) {
        pkgs = append(pkgs, pkg)
      }
      pkg = types.Package{}
      continue
    }

    switch line[:2] {
    case "P:":
      pkg.Name = line[2:]
    case "V:":
      version = string(line[2:])
      if !apkVersion.Valid(version) {
        log.Printf("Invalid Version Found : OS %s, Package %s, Version %s", "alpine", pkg.Name, version)
        continue
      }
      pkg.Version = version
    case "o:":
      origin := line[2:]
      pkg.SrcName = origin
      pkg.SrcVersion = version
    }
  }
  // in case of last paragraph
  if analyzer.CheckPackage(&pkg) {
    pkgs = append(pkgs, pkg)
  }

  parsedPkgs := a.uniquePkgs(pkgs)

  return &analyzer.AnalysisResult{
    PackageInfos: []types.PackageInfo{
      {
        FilePath: target.FilePath,
        Packages: parsedPkgs,
      },
    },
  }, nil
}

The code above centers on the Analyze(target analyzer.AnalysisTarget) method. Before introducing it, two special files need a brief explanation. As we all know, images are layered, and every layer is read-only. When a container is started from an image, the files of all image layers are combined into the rootfs of the container. When we commit the container into a new image, the file changes made in the container are stacked on top of the original image as a new layer. Deletions are recorded in that layer with two kinds of special files:

  • .wh..wh..opq: an opaque whiteout. It indicates that the contents of the directory containing this file that come from lower layers are hidden, as if the directory had been deleted and recreated
  • .wh.: a file whose name starts with this prefix is a whiteout. It indicates that the file with the rest of the name has been deleted in the current layer

In other words, deleting a file in a container does not really remove it from the lower layers; the deletion is only recorded as a marker. This is why the WalkLayerTar method records these two kinds of files and skips parsing them.
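
To show what these markers mean when layers are stacked, here is a small sketch that flattens a list of layers into the set of files visible in the final rootfs. It works on path lists instead of real tar streams to stay short. The flatten function and the sample paths are made up for illustration, and the sketch does not cover every rule of the OCI layer specification.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

const (
	opaqueMarker   = ".wh..wh..opq" // hides lower-layer content of its directory
	whiteoutPrefix = ".wh."         // marks a single deleted path
)

// flatten applies the layers from lowest to highest and returns the visible
// file paths in sorted order. Each layer is given as a list of tar entry paths.
func flatten(layers [][]string) []string {
	visible := map[string]bool{}
	for _, entries := range layers {
		var added []string
		for _, p := range entries {
			dir, base := path.Split(p)
			switch {
			case base == opaqueMarker:
				// Hide everything below dir that came from lower layers.
				for f := range visible {
					if strings.HasPrefix(f, dir) {
						delete(visible, f)
					}
				}
			case strings.HasPrefix(base, whiteoutPrefix):
				// Hide the named path and anything beneath it.
				target := path.Join(dir, strings.TrimPrefix(base, whiteoutPrefix))
				for f := range visible {
					if f == target || strings.HasPrefix(f, target+"/") {
						delete(visible, f)
					}
				}
			default:
				added = append(added, p)
			}
		}
		// Files of the current layer are added after its markers were applied,
		// so markers only affect content from lower layers.
		for _, p := range added {
			visible[p] = true
		}
	}
	out := make([]string, 0, len(visible))
	for f := range visible {
		out = append(out, f)
	}
	sort.Strings(out)
	return out
}

func main() {
	layers := [][]string{
		{"etc/hostname", "etc/passwd", "var/log/a.log", "var/log/b.log"},
		{"etc/.wh.hostname", "var/log/.wh..wh..opq", "var/log/c.log"},
	}
	fmt.Println(flatten(layers)) // prints [etc/passwd var/log/c.log]
}
```

The opqDirs and whFiles slices returned by WalkLayerTar carry the information that such a merge step needs.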

1. Analyze(target analyzer.AnalysisTarget)

  • First we call the Scan() method of bufio.Scanner, which reads the file line by line. When it returns false, the scan has stopped, either at the end of the file or because of an error. If no error occurred during scanning, the scanner's Err() method returns nil
  • We obtain each scanned line with scanner.Text(), take the first two characters of the line as the field tag, and extract the package name and package version of the apk package.
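
The scanning logic can be tried in isolation with a self-contained sketch. It assumes the same "P:" (name), "V:" (version) and "o:" (origin) tags that the Analyze method above reads, with a blank line between package records. The apkPackage type, the parseAPKInstalled function and the sample data are made up for this example, and version validation is left out.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// apkPackage holds the fields extracted from one package record.
type apkPackage struct {
	Name    string
	Version string
	Origin  string
}

// parseAPKInstalled scans the content line by line and collects one
// apkPackage per record. Records are separated by blank lines.
func parseAPKInstalled(content []byte) ([]apkPackage, error) {
	var pkgs []apkPackage
	var cur apkPackage

	// flush stores the current record if it is complete, then resets it.
	flush := func() {
		if cur.Name != "" && cur.Version != "" {
			pkgs = append(pkgs, cur)
		}
		cur = apkPackage{}
	}

	scanner := bufio.NewScanner(bytes.NewReader(content))
	for scanner.Scan() {
		line := scanner.Text()
		if len(line) < 2 { // end of a record
			flush()
			continue
		}
		switch line[:2] {
		case "P:":
			cur.Name = line[2:]
		case "V:":
			cur.Version = line[2:]
		case "o:":
			cur.Origin = line[2:]
		}
	}
	// Scan returning false can also mean an error, so check Err explicitly.
	if err := scanner.Err(); err != nil {
		return nil, fmt.Errorf("scan installed database: %w", err)
	}
	flush() // the last record may not be followed by a blank line
	return pkgs, nil
}

// sampleInstalled is made-up data in the assumed format.
const sampleInstalled = "P:musl\nV:1.2.2-r7\no:musl\n\nP:busybox\nV:1.34.1-r3\no:busybox\n"

func main() {
	pkgs, err := parseAPKInstalled([]byte(sampleInstalled))
	if err != nil {
		panic(err)
	}
	for _, p := range pkgs {
		fmt.Printf("%s %s (origin %s)\n", p.Name, p.Version, p.Origin)
	}
}
```

Unlike the excerpt above, this sketch checks scanner.Err() after the loop, so a read error is not mistaken for the end of the file.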

VI. Reading Java application information from the image

Now let's look at how to read the dependency information of Java applications, including application dependencies and jar package dependencies. First, we use the method above to read the files of a given layer.

  • If the file is found to be a jar package:
      • Initialize the zip reader and start reading the contents of the jar package
      • Resolve the name and version of the artifact from the jar file name, for example: spring-core-5.3.4-SNAPSHOT.jar => spring-core, 5.3.4-SNAPSHOT
      • Read each compressed file from the zip reader and determine its type:
          • If it is pom.properties: parse groupId, artifactId and version from the properties file and put this information into the libs object
          • If it is MANIFEST.MF: parse the manifest from the file and keep it
          • If it is a jar/war/ear file: call parseArtifact recursively and put the returned innerLibs into the libs object
      • If the artifactId or groupId is not found: query the corresponding package information by the hash of the jar (the code calls searchBySHA1); if a match is found, return it directly
      • Return the resolved libs
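
One of the steps above, locating and parsing pom.properties inside a jar, can be sketched with the standard library alone. This is an illustration under assumptions rather than the parsePomProperties used in the code that follows: the pomProps type, the buildJar and readPomProperties helpers and the org.example coordinates are hypothetical, and only simple key=value lines are handled.

```go
package main

import (
	"archive/zip"
	"bufio"
	"bytes"
	"fmt"
	"path"
	"strings"
)

// pomProps holds the Maven coordinates found in a pom.properties file.
type pomProps struct {
	GroupID, ArtifactID, Version string
}

// buildJar creates an in-memory zip archive with one file, standing in for a jar.
func buildJar(name, content string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create(name)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write([]byte(content)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readPomProperties returns the first pom.properties found in the jar.
// The boolean is false when the jar contains no such file.
func readPomProperties(jar []byte) (pomProps, bool, error) {
	zr, err := zip.NewReader(bytes.NewReader(jar), int64(len(jar)))
	if err != nil {
		return pomProps{}, false, fmt.Errorf("open zip: %w", err)
	}
	for _, f := range zr.File {
		if path.Base(f.Name) != "pom.properties" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return pomProps{}, false, fmt.Errorf("open %s: %w", f.Name, err)
		}
		defer rc.Close() // the function returns right after this file is parsed

		var p pomProps
		sc := bufio.NewScanner(rc)
		for sc.Scan() {
			line := strings.TrimSpace(sc.Text())
			if line == "" || strings.HasPrefix(line, "#") {
				continue // skip blank lines and comments
			}
			kv := strings.SplitN(line, "=", 2)
			if len(kv) != 2 {
				continue
			}
			val := strings.TrimSpace(kv[1])
			switch strings.TrimSpace(kv[0]) {
			case "groupId":
				p.GroupID = val
			case "artifactId":
				p.ArtifactID = val
			case "version":
				p.Version = val
			}
		}
		if err := sc.Err(); err != nil {
			return pomProps{}, false, fmt.Errorf("read %s: %w", f.Name, err)
		}
		return p, true, nil
	}
	return pomProps{}, false, nil
}

func main() {
	jar, err := buildJar("META-INF/maven/org.example/demo/pom.properties",
		"#Generated\ngroupId=org.example\nartifactId=demo\nversion=1.0.0\n")
	if err != nil {
		panic(err)
	}
	p, found, err := readPomProperties(jar)
	if err != nil {
		panic(err)
	}
	fmt.Printf("found=%v props=%+v\n", found, p)
}
```
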
func parseArtifact(c conf, fileName string, r io.ReadCloser) ([]types.Library, error) {
  defer r.Close()
  b, err := ioutil.ReadAll(r)
  if err != nil {
    return nil, xerrors.Errorf("unable to read the jar file: %w", err)
  }
  zr, err := zip.NewReader(bytes.NewReader(b), int64(len(b)))
  if err != nil {
    return nil, xerrors.Errorf("zip error: %w", err)
  }

  fileName = filepath.Base(fileName)
  fileProps := parseFileName(fileName)

  var libs []types.Library
  var m manifest
  var foundPomProps bool

  for _, fileInJar := range zr.File {
    switch {
    case filepath.Base(fileInJar.Name) == "pom.properties":
      props, err := parsePomProperties(fileInJar)
      if err != nil {
        return nil, xerrors.Errorf("failed to parse %s: %w", fileInJar.Name, err)
      }
      libs = append(libs, props.library())
      if fileProps.artifactID == props.artifactID && fileProps.version == props.version {
        foundPomProps = true
      }
    case filepath.Base(fileInJar.Name) == "MANIFEST.MF":
      m, err = parseManifest(fileInJar)
      if err != nil {
        return nil, xerrors.Errorf("failed to parse MANIFEST.MF: %w", err)
      }
    case isArtifact(fileInJar.Name):
      fr, err := fileInJar.Open()
      if err != nil {
        return nil, xerrors.Errorf("unable to open %s: %w", fileInJar.Name, err)
      }

      // Recursively parse nested jar/war/ear files
      innerLibs, err := parseArtifact(c, fileInJar.Name, fr)
      if err != nil {
        return nil, xerrors.Errorf("failed to parse %s: %w", fileInJar.Name, err)
      }
      libs = append(libs, innerLibs...)
    }
  }

  // If a matching pom.properties file was found, return the libs object directly
  if foundPomProps {
    return libs, nil
  }
  // If no matching pom.properties file was found, fall back to the MANIFEST.MF file
  manifestProps := m.properties()
  if manifestProps.valid() {
    // Even if an artifactId or groupId is found here, it may be invalid. The Maven repository (or a similar one) is queried here to confirm that the jar package really exists
    if ok, _ := exists(c, manifestProps); ok {
      return append(libs, manifestProps.library()), nil
    }
  }
  p, err := searchBySHA1(c, b)
  if err == nil {
    return append(libs, p.library()), nil
  } else if !xerrors.Is(err, ArtifactNotFoundErr) {
    return nil, xerrors.Errorf("failed to search by SHA1: %w", err)
  }
  return libs, nil
}
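
parseArtifact calls parseFileName, which is not shown in this article. As a rough idea of what such a helper might do, the sketch below splits a jar file name at the first hyphen that is followed by a digit. The function name splitJarName and this splitting rule are assumptions made for illustration, and real-world file names can be less regular than this rule expects.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// splitJarName derives an artifact name and a version from a jar file name.
// It strips the directory and a .jar/.war/.ear extension, then splits at the
// first '-' that is directly followed by a digit. If there is no such
// position, the whole base name is returned with an empty version.
func splitJarName(fileName string) (artifactID, version string) {
	base := path.Base(fileName)
	for _, ext := range []string{".jar", ".war", ".ear"} {
		base = strings.TrimSuffix(base, ext)
	}
	for i := 0; i+1 < len(base); i++ {
		if base[i] == '-' && base[i+1] >= '0' && base[i+1] <= '9' {
			return base[:i], base[i+1:]
		}
	}
	return base, ""
}

func main() {
	for _, name := range []string{
		"spring-core-5.3.4-SNAPSHOT.jar",
		"lib/log4j-1.2.17.jar",
		"tools.jar",
	} {
		artifactID, version := splitJarName(name)
		fmt.Printf("%s => artifact=%q version=%q\n", name, artifactID, version)
	}
}
```

Since a file name can be renamed freely, a result obtained this way is only a guess, which is consistent with parseArtifact cross-checking it against pom.properties before trusting it.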

With that, we have covered how to read information from a container image.

This article is the original content of Alibaba cloud and cannot be reproduced without permission.

Added by veryconscious on Fri, 07 Jan 2022 12:46:28 +0200