Intensive computing with WASM: Introduction

In the previous article, "Getting started with WebAssembly using Docker and Golang quickly", I introduced how to build a general-purpose WASM module that conforms to the WASI interface standard, and how to call it in several different scenarios.

This article picks up where that one left off and looks at how WASM can make Golang and Node.js more efficient at intensive computing.

Before we begin

Some readers will say: if you are already using Golang, why bring in Node.js at all? The performance of the two is not even in the same order of magnitude, so the comparison cannot be objective. In fact, there is no need for that kind of language rivalry. Beyond the absolute performance gap between the two languages, there are other things worth paying attention to:

For the same computing scenario, if the business logic is implemented in either language and combined with WASM, can we get meaningful improvements in execution efficiency, the security of distributed program artifacts, and the plugin ecosystem?

And we can let our imagination run further: could the two languages be integrated through WASM, so that the rich Node/NPM ecosystem, flexible JavaScript applications, and rigorous, efficient Golang produce a chemical reaction together?

Before diving in, let me share a few observations that surprised me (for the concrete data, see the table at the end of the article):

  • Would you be surprised to learn that Go running as WASM is generally not much slower than "native", and in individual scenarios even runs faster than native?
  • If a plain Node multi-process/thread implementation is 200% slower than a Golang program with the same function, simply introducing WASM, without any extreme optimization, can narrow the gap to 100%.
  • In some cases you will even find that the gap between Node + WASM and Go's best performance shrinks from roughly 300% to nearly 30%.

If you are surprised and curious, follow along and try it yourself.

To simulate the worst case, each test program is launched once from the CLI and its result collected, which minimizes the runtime benefits of Go's GC tuning, Node's JIT warm-up, and bytecode caching.

Preparation

Before setting off on this adventure, we first need to prepare the environment. To keep the runs and test results as objective as possible, everything uses the same machine and the same container environment, with a cap on the CPU resources the container may use and some headroom reserved for the host, so that the machine under test never runs short of CPU.

For how to build the program and container environment, see the "environment preparation" section of the previous article; to save space, I won't repeat it here.

Here are the versions of the various dependencies and components:

# go version
go version go1.17.3 linux/amd64

# node --version
v17.1.0

# wasmer --version
wasmer 2.0.0

# tinygo version
tinygo version 0.21.0 linux/amd64 (using go version go1.17.3 and LLVM version 11.0.0)

Preparing the WASM program

To make the comparison meaningful, we need a "job" with a long calculation time. The Fibonacci sequence is a good choice, especially the "basic version" without any dynamic-programming optimization.

Defining the simulated compute-intensive scenario

Here, to make the programs easy to compare, we have every program compute the 40th term of the Fibonacci sequence (an integer just over 100 million).

To make sure the CPUs in the container are fully used and the results stay relatively objective, the programs perform the calculation in parallel.
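
To get a feel for how heavy this job is, a rough back-of-envelope: the naive recursion makes calls(n) = calls(n−1) + calls(n−2) + 1 function calls, which works out to calls(n) = 2 × fib(n+1) − 1. For n = 40, that is 2 × 165,580,141 − 1 ≈ 331 million calls per task, so eight parallel tasks add up to well over 2.6 billion calls.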

Using Go to write a WASM program with a standard WASI interface

Translated into code, the requirement above is just the Fibonacci calculation itself. Implemented in Go without any algorithmic optimization, the purely computational part looks like this:

package main

// main is required by the compiler; for this module it can stay empty.
func main() {}

// The //export directive tells TinyGo to expose the function
// as a WASM export named "fibonacci".
//export fibonacci
func fibonacci(n uint) uint {
    if n == 0 {
        return 0
    } else if n == 1 {
        return 1
    } else {
        return fibonacci(n-1) + fibonacci(n-2)
    }
}

Save the above content as wasm.go, then, following the general WASI build from the previous article, run tinygo build --no-debug -o module.wasm -wasm-abi=generic -target=wasi wasm.go.

This gives us a compiled, general-purpose WASM program for the tests that follow.
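
As an optional sanity check before wiring the module into a host program, the wasmer CLI can invoke a named export directly, e.g. wasmer module.wasm -i fibonacci 40 (the exact flag may vary between wasmer releases; check wasmer run --help if it differs).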

Writing the Go benchmark program

Although we now have the WASM program, an intuitive comparison still needs a baseline version implemented entirely in Go:

package main

import (
    "fmt"
    "time"
)

func fibonacci(n uint) uint {
    if n == 0 {
        return 0
    } else if n == 1 {
        return 1
    } else {
        return fibonacci(n-1) + fibonacci(n-2)
    }
}

func main() {
    start := time.Now()
    n := uint(40)
    fmt.Println(fibonacci(n))
    fmt.Printf("🎉 It's all done. Time: %v\n", time.Since(start).Milliseconds())
}

Save the above code as base.go, then execute go run base.go, and you will see the following results:

102334155
🎉 It's all done. Time: 574

If you are after absolute performance, you can run go build base.go and execute ./base. Because the code is so simple, though, the output shows little performance difference between the two.

With the basic computing function done, let's adjust the code slightly so that it can compute in parallel:

package main

import (
    "fmt"
    "os"
    "runtime"
    "strconv"
    "time"
)

func fibonacci(n uint) uint {
    if n == 0 {
        return 0
    } else if n == 1 {
        return 1
    } else {
        return fibonacci(n-1) + fibonacci(n-2)
    }
}

func calc(n uint, ch chan string) {
    ret := fibonacci(n)
    msg := strconv.FormatUint(uint64(ret), 10)
    fmt.Printf("📦 Received result %s\n", msg)
    ch <- msg
}

func main() {
    numCPUs := runtime.NumCPU()
    n := uint(40)
    ch := make(chan string, numCPUs)

    fmt.Printf("🚀 Main process online #ID %v\n", os.Getpid())

    start := time.Now()

    for i := 0; i < numCPUs; i++ {
        fmt.Printf("👷 Distribute computing tasks #ID %v\n", i)
        go calc(n, ch)
    }

    for i := 0; i < numCPUs; i++ {
        <-ch
    }

    fmt.Printf("🎉 It's all done. Time: %v\n", time.Since(start).Milliseconds())
}

Save the code as full.go, then execute go run full.go, and you will see the following results:

🚀 Main process online #ID 2248
👷 Distribute computing tasks #ID 0
👷 Distribute computing tasks #ID 1
👷 Distribute computing tasks #ID 2
👷 Distribute computing tasks #ID 3
👷 Distribute computing tasks #ID 4
👷 Distribute computing tasks #ID 5
👷 Distribute computing tasks #ID 6
👷 Distribute computing tasks #ID 7
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
🎉 It's all done. Time: 658

Can you feel the cost-effectiveness? A single calculation came in close to 600 milliseconds; this time, eight parallel calculations (capped by the Docker container's resource limits) finished in about 650 milliseconds.

Calling the WASM program from Go
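
This host program relies on the wasmer-go binding (the module path is visible in the import below); if you're following along, initializing a module with go mod init and running go get github.com/wasmerio/wasmer-go/wasmer should fetch it. The calling code then looks like this: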

package main

import (
    "fmt"
    "io/ioutil"
    "os"
    "runtime"
    "time"

    wasmer "github.com/wasmerio/wasmer-go/wasmer"
)

func main() {
    wasmBytes, err := ioutil.ReadFile("module.wasm")
    check(err)

    store := wasmer.NewStore(wasmer.NewEngine())
    module, err := wasmer.NewModule(store, wasmBytes)
    check(err)

    wasiEnv, err := wasmer.NewWasiStateBuilder("wasi-program").Finalize()
    check(err)

    importObject, err := wasiEnv.GenerateImportObject(store, module)
    check(err)

    instance, err := wasmer.NewInstance(module, importObject)
    check(err)

    wasmInitial, err := instance.Exports.GetWasiStartFunction()
    check(err)
    wasmInitial()

    fibonacci, err := instance.Exports.GetFunction("fibonacci")
    check(err)

    numCPUs := runtime.NumCPU()
    ch := make(chan string, numCPUs)

    fmt.Printf("🚀 Main process online #ID %v\n", os.Getpid())

    start := time.Now()

    for i := 0; i < numCPUs; i++ {
        fmt.Printf("👷 Distribute computing tasks #ID %v\n", i)

        calc := func(n uint, ch chan string) {
            ret, err := fibonacci(n)
            check(err)
            msg := fmt.Sprintf("%d", ret)
            fmt.Printf("📦 Received result %s\n", msg)
            ch <- msg
        }

        go calc(40, ch)
    }

    for i := 0; i < numCPUs; i++ {
        <-ch
    }

    fmt.Printf("🎉 It's all done. Time: %v\n", time.Since(start).Milliseconds())
}

func check(e error) {
    if e != nil {
        panic(e)
    }
}

Save the code as wasi.go, execute go run wasi.go, and you will get the following results:

🚀 Main process online #ID 3198
👷 Distribute computing tasks #ID 0
👷 Distribute computing tasks #ID 1
👷 Distribute computing tasks #ID 2
👷 Distribute computing tasks #ID 3
👷 Distribute computing tasks #ID 4
👷 Distribute computing tasks #ID 5
👷 Distribute computing tasks #ID 6
👷 Distribute computing tasks #ID 7
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
🎉 It's all done. Time: 595

Run it a few more times and you will find that most executions are faster than the pure-Go implementation, and in the worst case about as fast. There's that cost-effectiveness again. (To keep things objective, aggregated results over many runs are attached at the end of the article.)

Writing the Node benchmark program (Cluster mode)

Implementing the same logic in Node.js takes a few more lines of code.

The file can be named anything, but to keep the CLI invocation short (node .), call it index.js:

const cluster = require('cluster');
const { cpus } = require('os');
const { exit, pid } = require('process');

function fibonacci(num) {
  if (num === 0) {
    return 0;
  } else if (num === 1) {
    return 1;
  } else {
    return fibonacci(num - 1) + fibonacci(num - 2);
  }
}


const numCPUs = cpus().length;

if (cluster.isPrimary) {
  let dataStore = [];
  const readyChecker = () => {
    if (dataStore.length === numCPUs) {
      console.log(`🎉 It's all done. Time: ${new Date - start}`);
      exit(0);
    }
  }

  console.log(`🚀 Main process online #ID ${pid}`);
  const start = new Date();

  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on('online', (worker) => {
    console.log(`👷 Distribute computing tasks #ID ${worker.id}`);
    worker.send(40);
  });

  const messageHandler = function (msg) {
    console.log("📦 Results received", msg.ret)
    dataStore.push(msg.ret);
    readyChecker()
  }

  for (const id in cluster.workers) {
    cluster.workers[id].on('message', messageHandler);
  }

} else {
  process.on('message', (msg) => {
    process.send({ ret: fibonacci(msg) });
  });
}
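
A small compatibility note: cluster.isPrimary was introduced in Node.js 16; on older runtimes, the same check is spelled cluster.isMaster.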

Save the file and run node .; the program output will look similar to this:

🚀 Main process online #ID 2038
👷 Distribute computing tasks #ID 1
👷 Distribute computing tasks #ID 2
👷 Distribute computing tasks #ID 3
👷 Distribute computing tasks #ID 4
👷 Distribute computing tasks #ID 7
👷 Distribute computing tasks #ID 5
👷 Distribute computing tasks #ID 6
👷 Distribute computing tasks #ID 8
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
🎉 It's all done. Time: 1747

At this point the execution time is almost three times that of the Go version. There should, however, be plenty of room for improvement.

Writing the Node benchmark program (Worker Threads)

Of course, whenever Node.js Cluster comes up, someone inevitably points out that Cluster's data exchange is inefficient and cumbersome, so let's also test a Worker Threads version.

Node has in fact shipped Worker Threads since Node.js 12.x LTS. For a step-by-step guide to Worker Threads and SharedArrayBuffer, see the article "How to work with worker threads in NodeJS"; this article won't cover the basics. Since we are simulating the worst case, there is no need to pool threads with a third-party library such as Node Worker Threads Pool (in actual measurement it made almost no difference).
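
One thing worth keeping in mind while reading the code: each Worker gets its own V8 isolate and event loop, so spawning eight fresh workers per run includes their startup cost in our timings, which is consistent with the worst-case setup described earlier.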

With small adjustments to the Cluster version above, we get the following code. The difference is that, to keep the code simple, we split it into two files this time:

const { Worker } = require("worker_threads");
const { cpus } = require('os');
const { exit } = require('process');

let dataStore = [];

const readyChecker = () => {
    if (dataStore.length === numCPUs) {
        console.log(`🎉 It's all done. Time: ${new Date - start}`)
        exit(0);
    }
}

const num = [40];
const sharedBuffer = new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT * num.length);
const sharedArray = new Int32Array(sharedBuffer);
Atomics.store(sharedArray, 0, num[0]);

const numCPUs = cpus().length;

console.log(`🚀 Main process online #ID ${process.pid}`);

const start = new Date();

for (let i = 0; i < numCPUs; i++) {
    const worker = new Worker("./worker.js");

    worker.on("message", msg => {
        console.log("📦 Results received", msg.ret)
        dataStore.push(msg.ret);
        readyChecker()
    });

    console.log(`👷 Distribute computing tasks`);
    worker.postMessage({ num: sharedArray });
}
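
Strictly speaking, a single read-only integer could simply be posted as a plain number; SharedArrayBuffer plus Atomics earns its keep when threads need to share mutable state without copying. It is used here mainly to mirror the referenced tutorial.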

You may want a separate directory for these files. Save the above as index.js, then let's complete worker.js:

const { parentPort } = require("worker_threads");

function fibonacci(num) {
    if (num === 0) {
        return 0;
    } else if (num === 1) {
        return 1;
    } else {
        return fibonacci(num - 1) + fibonacci(num - 2);
    }
}

parentPort.on("message", data => {
    parentPort.postMessage({ num: data.num, ret: fibonacci(data.num) });
});

After saving both files, run node . and you will see output like this:

🚀 Main process online #ID 2190
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
🎉 It's all done. Time: 1768

Presumably because the amount of data exchanged is small, the execution time is about the same as the Cluster version.

Calling the WASM program from Node (Cluster)

With simple adjustments to the Cluster-mode code above, we get a program that performs the calculation through WASM:

const { readFileSync } = require('fs');
const { WASI } = require('wasi');
const { argv, env } = require('process');

const cluster = require('cluster');
const { cpus } = require('os');
const { exit, pid } = require('process');

const numCPUs = cpus().length;

if (cluster.isPrimary) {
    let dataStore = [];
    const readyChecker = () => {
        if (dataStore.length === numCPUs) {
            console.log(`🎉 It's all done. Time: ${new Date - start}`);
            exit(0);
        }
    }

    console.log(`🚀 Main process online #ID ${pid}`);
    const start = new Date();

    for (let i = 0; i < numCPUs; i++) {
        cluster.fork();
    }

    cluster.on('online', (worker) => {
        console.log(`👷 Distribute computing tasks #ID ${worker.id}`);
        worker.send(40);
    });

    const messageHandler = function (msg) {
        console.log("📦 Results received", msg.ret)
        dataStore.push(msg.ret);
        readyChecker()
    }

    for (const id in cluster.workers) {
        cluster.workers[id].on('message', messageHandler);
    }

} else {

    process.on('message', async (msg) => {
        const wasi = new WASI({ args: argv, env });
        const importObject = { wasi_snapshot_preview1: wasi.wasiImport };
        const wasm = await WebAssembly.compile(readFileSync("./module.wasm"));
        const instance = await WebAssembly.instantiate(wasm, importObject);
        wasi.start(instance);
        const ret = await instance.exports.fibonacci(msg)
        process.send({ ret });
    });
}

After saving the content as index.js, run node --experimental-wasi-unstable-preview1 . and you will see output like this:

🚀 Main process online #ID 2338
(node:2338) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
👷 Distribute computing tasks #ID 1
👷 Distribute computing tasks #ID 3
👷 Distribute computing tasks #ID 2
👷 Distribute computing tasks #ID 5
👷 Distribute computing tasks #ID 6
👷 Distribute computing tasks #ID 8
👷 Distribute computing tasks #ID 4
👷 Distribute computing tasks #ID 7
(node:2345) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2346) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2360) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2350) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2365) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2377) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2371) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2354) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
🎉 It's all done. Time: 808

There's that familiar sense of cost-effectiveness again. With very little effort, a simple code adjustment cuts execution time to less than half of the non-WASM version. And if you keep optimizing, and keep the program running rather than starting it cold for every task, it may come ever closer to the performance of the Go implementation.
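
To illustrate that last point, here is a minimal sketch of the worker branch of the Cluster version above, reworked so that the module is compiled and instantiated only once per worker process and then reused across messages. It follows the structure and names of the code above, but treat it as an untested sketch rather than a measured variant:

const { readFileSync } = require('fs');
const { WASI } = require('wasi');
const { argv, env } = require('process');

let instancePromise = null;

// Compile and instantiate module.wasm at most once per worker process.
function getInstance() {
    if (instancePromise === null) {
        instancePromise = (async () => {
            const wasi = new WASI({ args: argv, env });
            const importObject = { wasi_snapshot_preview1: wasi.wasiImport };
            const wasm = await WebAssembly.compile(readFileSync('./module.wasm'));
            const instance = await WebAssembly.instantiate(wasm, importObject);
            wasi.start(instance); // runs the module's (empty) main; valid once per instance
            return instance;
        })();
    }
    return instancePromise;
}

process.on('message', async (msg) => {
    const instance = await getInstance();
    process.send({ ret: instance.exports.fibonacci(msg) });
});

With a long-lived worker serving many tasks, the compile-and-instantiate cost is paid once instead of on every message.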

Calling the WASM program from Node (Worker Threads)

To keep the code simple, we split the program into three parts: the entry program, the Worker, and the WASI loader. Let's implement the entry program, index.js, first:

const { Worker } = require("worker_threads");
const { cpus } = require('os');
const { exit, pid } = require('process');

let dataStore = [];

const readyChecker = () => {
    if (dataStore.length === numCPUs) {
        console.log(`🎉 It's all done. Time: ${new Date - start}`);
        exit(0);
    }
}

const num = [40];
const sharedBuffer = new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT * num.length);
const sharedArray = new Int32Array(sharedBuffer);
Atomics.store(sharedArray, 0, num[0]);

const numCPUs = cpus().length;
console.log(`🚀 Main process online #ID ${pid}`);

const start = new Date();

for (let i = 0; i < numCPUs; i++) {
    const worker = new Worker("./worker.js");

    worker.on("message", msg => {
        console.log("📦 Results received", msg.ret)
        dataStore.push(msg.ret);
        readyChecker()
    });

    console.log(`👷 Distribute computing tasks`);
    worker.postMessage({ num: sharedArray });
}

Then implement the worker.js part:

const { parentPort } = require("worker_threads");
const wasi = require("./wasi");

parentPort.on("message", async (msg) => {
    const instance = await wasi();
    const ret = await instance.exports.fibonacci(msg.num)
    parentPort.postMessage({ ret });
});

Finally, implement the newly added wasi.js part:

const { readFileSync } = require('fs');
const { WASI } = require('wasi');
const { argv, env } = require('process');

module.exports = async () => {
    const wasi = new WASI({ args: argv, env });
    const importObject = { wasi_snapshot_preview1: wasi.wasiImport };
    const wasm = await WebAssembly.compile(readFileSync("./module.wasm"));
    const instance = await WebAssembly.instantiate(wasm, importObject);
    wasi.start(instance);
    return instance;
};
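
One detail that makes this setup work: by default, a Worker inherits the parent's process.execArgv, so the --experimental-wasi-unstable-preview1 flag passed on the command line below also applies inside worker.js.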

After saving all the files, run node --experimental-wasi-unstable-preview1 . and you will see output similar to the Cluster version above:

🚀 Main process online #ID 2927
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
👷 Distribute computing tasks
(node:2927) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2927) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2927) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2927) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2927) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2927) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2927) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:2927) ExperimentalWarning: WASI is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
📦 Received result 102334155
🎉 It's all done. Time: 825

Execution time is on par with the Cluster + WASM version, mirroring what we saw when comparing the two plain benchmark implementations.

Batch-testing the programs above

To make the results more objective, let's write a small harness that runs each program above 100 times, discards the best and worst cases, and takes the average.

The following simple program takes over our manual work: it runs the given command in each program directory and collects data from 100 runs:

const { execSync } = require('child_process');
const { writeFileSync } = require('fs');
const sleep = (ms) => new Promise((res) => setTimeout(res, ms));

const cmd = './base';
const cwd = '../go-case';
const file = './go-base.json';

let data = [];

(async () => {
    for (let i = 1; i <= 100; i++) {
        console.log(`⌛️ Collecting run ${i}`);
        const stdout = execSync(cmd, { cwd }).toString();
        const words = stdout.split('\n').filter(n => n && n.includes('done'))[0];
        const time = Number(words.match(/\d+/)[0]);
        data.push(time)
        await sleep(100);
    }
    writeFileSync(file, JSON.stringify(data))
})();
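
To collect all six data sets, point cmd, cwd and file at each variant in turn; for example, cmd might become 'go run wasi.go' for Go + WASM, or 'node --experimental-wasi-unstable-preview1 .' for the Node + WASM versions (the directory and file names here are just placeholders for my local layout; adjust them to yours).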

The output after program execution will be similar to the following:

⌛️ Collecting run 1
⌛️ Collecting run 2
⌛️ Collecting run 3
...
⌛️ Collecting run 100

With the data collected, let's write an even simpler program to look at the extremes, and at the average after the extremes are removed. (Strictly speaking we should also look at the distribution, but I'll be lazy here; interested readers can try it, and there will be surprises.)

const raw = require('./data.json').sort((a, b) => a - b);
const body = raw.slice(1, -1); // drop the single best and worst samples

const sum = body.reduce((x, y) => x + y, 0);
const average = Math.floor(sum / body.length);

const best = raw[0];
const worst = raw[raw.length - 1];

console.log('best', best);
console.log('worst', worst);
console.log('average', average);
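
Pointed at the Go baseline data (e.g. by renaming go-base.json to data.json), it prints the three figures that feed the Go row of the table below: best 574, worst 632, average 586.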

Test results

As mentioned above, these figures should be read as worst-case data for the programs above under very constrained conditions.

Because I used local containers on my laptop this time rather than a cloud server, the device heats up under sustained intensive computation, which inflates the numbers. On cloud hardware the results should look better.

Type                   Best    Worst   Average
Go                      574      632       586
Go + WASM               533      800       610
Node Cluster           1994     3054      2531
Node Threads           1997     3671      2981
Node Cluster + WASM     892     1694      1305
Node Threads + WASM     887     2160      1334

(All times in milliseconds.)

Overall, though, the trend is clear, and it may well be enough to justify adopting WASM + Node or WASM + Go hybrid development in suitable scenarios.

Finally

If the previous article aimed to get readers who want to tinker with WASM started quickly, this one is for readers who are hesitating over whether to adopt a lightweight heterogeneous solution, offering the simplest possible practical code.

For many practical reasons, WASM's development and adoption may be as slow as Docker's once was; and at the end of 2021, we have seen the enormous energy Docker eventually unleashed.

If you are willing to keep an open mind and treat WASM the way you might have treated Docker seven or eight years ago, I believe this lightweight, standardized, heterogeneous solution can make a real difference to you and your projects.
